The add_column function adds a new column. The supported types are string, boolean, byte, short, int, long, float, double, decimal, date, and timestamp.
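A minimal sketch of how add_column might appear in the functions list, in the same YAML style as the unpivot example below; the column, value, and type keys are assumptions, not confirmed parameter names:

```yaml
functions:
  - name: add_column
    column: load_flag     # hypothetical name of the new column
    value: "Y"            # fixed value to populate it with
    type: string          # any of the supported types listed above
```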
The any_date function converts a string to date type, parsing the string according to the rule provided. Using the default rule, it converts a date column given in any of 18 different formats, such as yyyy.mm.dd, dd.mm.yyyy, dd/MM/yy, and yyyy/mm/dd, into the yyyy-mm-dd format. The rule should be a regular expression.
You can provide explicit regex rules for the column passed in the any_date function. Without an explicit regex, the function will use the default rule.
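A sketch of an any_date configuration, assuming a column key for the input column and a rule key for the optional regex:

```yaml
functions:
  - name: any_date
    column: order_date            # hypothetical source column
    rule: '\d{2}/\d{2}/\d{4}'     # optional explicit regex; omit to use the default rule
```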
The any_timestamp function converts a string to timestamp type, parsing the string according to the rule provided. The timestamp is produced according to the specified timezone. The rule should be a regular expression.
You can provide explicit regex rules for the column passed in the any_timestamp function. Without an explicit regex, the function will use the default rule.
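A sketch of an any_timestamp configuration; the timezone key for the target timezone is an assumption:

```yaml
functions:
  - name: any_timestamp
    column: event_time                       # hypothetical source column
    timezone: "Asia/Kolkata"                 # assumed key for the target timezone
    rule: '\d{4}-\d{2}-\d{2} \d{2}:\d{2}'    # optional explicit regex
```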
The change_case function changes the case of column values to lowercase, uppercase, or title case, depending on whether lower, upper, or title is supplied as the case.
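A sketch of a change_case configuration, assuming a case key and a columns list:

```yaml
functions:
  - name: change_case
    case: lower          # one of lower/upper/title
    columns:
      - city             # hypothetical column to transform
```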
The cleanse_column_names function sanitizes column names, following these rules:
• Trims leading and trailing spaces
• Lowercases the column name
• Replaces any character that is not one of [A-Z][a-z][0-9] or _ with an underscore (_)
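A sketch of the function in use; the before/after comment is illustrative:

```yaml
functions:
  - name: cleanse_column_names
# e.g. a column named " Order-ID " becomes "order_id"
```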
The fill_null_or_empty function fills a column value with a fixed value if it is either null or empty (""). If the column does not exist, the function fails. The defaultValue can only be of type string.
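A sketch of a fill_null_or_empty configuration; mapping columns to their default values under a columns key is an assumption:

```yaml
functions:
  - name: fill_null_or_empty
    columns:
      city: "n/a"          # column name: defaultValue (string only)
      state: "unknown"
```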
The flatten function separates the elements of a repeated field into individual records. This function is useful for the flexible exploration of repeated data. To maintain the association between each flattened value and the other fields in the record, the flatten function copies all of the other columns into each new record. Note: Use flatten_outer when the array has null values and you want records of the root with null in the flattened columns.
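A sketch of a flatten configuration; the column and asColumn keys are assumptions:

```yaml
functions:
  - name: flatten
    column: line_items     # hypothetical repeated (array) column
    asColumn: line_item    # assumed key for the flattened output column
```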
The hash function generates a message digest. The column is replaced with the digest created using the supplied algorithm. The resulting column is of type string.
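A sketch of a hash configuration; the algo key for the digest algorithm is an assumption:

```yaml
functions:
  - name: hash
    column: email
    algo: SHA-256          # assumed key; e.g. MD5, SHA-1, SHA-256
```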
The mask_number function applies substitution masking on the column values. The column specifies the name of an existing column to be masked, and the pattern is the substitution pattern used to mask the column values. Substitution masking is generally used for masking credit card or social security numbers. This is fixed masking, where the pattern is applied to a fixed-length string. The pattern follows these rules:
• Use # to include the digit at that position.
• Use x (or any other character) to mask the digit at that position.
• For example, for the SSN '000-00-0000' and the pattern 'XXX-XX-####', the output would be XXX-XX-0000.
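A sketch of a mask_number configuration using the column and pattern parameters described above:

```yaml
functions:
  - name: mask_number
    column: ssn
    pattern: "XXX-XX-####"   # masks the first five digits, keeps the last four
```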
The merge function merges the values of two or more columns into a new column, specified as asColumn, which is inserted into each row. The values from the source columns are joined with the specified separator.
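A sketch of a merge configuration; the columns and separator keys are assumptions, while asColumn is named above:

```yaml
functions:
  - name: merge
    columns:               # assumed key for the source columns
      - first_name
      - last_name
    separator: " "         # assumed key for the delimiter
    asColumn: full_name
```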
The parse_as_json function parses a JSON object. The function can operate on String or JSON Object types. It requires the Spark schema, in JSON form, to parse the JSON back into DataFrame columns.
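A sketch of a parse_as_json configuration; the sparkSchema and asColumn keys are assumptions based on the prose above:

```yaml
functions:
  - name: parse_as_json
    column: payload            # hypothetical column holding the JSON string
    asColumn: parsed_payload   # assumed key for the output column
    sparkSchema: '{"type":"struct","fields":[{"name":"id","type":"long","nullable":true,"metadata":{}}]}'
```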
The parse_html function converts an HTML-coded string to a normal string without any HTML tags. Here, asColumn is an optional parameter, in case you wish to create a separate column for the processed data; otherwise, the processed data replaces the original column on which the function is performed. The function works using the jsoup library. More details about this library can be found here: https://github.com/jhy/jsoup
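A sketch of a parse_html configuration, assuming a column key for the input:

```yaml
functions:
  - name: parse_html
    column: description_html
    asColumn: description_text   # optional; omit to overwrite the original column
```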
The pivot function pivots/rotates data from one DataFrame/Dataset column into multiple columns (transforming rows into columns). Here, values and approach are optional parameters. The aggregate_expression requires an alias with the same name as the column used inside the aggregate function (sum, count, avg, etc.). values can be used to keep only the needed values of column after the pivot. approach can be set to "two-phase" to run an optimized version of the query on large datasets.
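A sketch of a pivot configuration; the groupBy key is an assumption, while column, values, aggregate_expression, and approach are named above:

```yaml
functions:
  - name: pivot
    groupBy:                     # assumed key for the grouping columns
      - "Product"
    column: "Country"            # column whose distinct values become new columns
    values:                      # optional; keep only these pivoted values
      - "USA"
      - "Mexico"
    aggregate_expression: "sum(Amount) as Amount"   # alias matches the aggregated column
    approach: "two-phase"        # optional; optimized for large datasets
```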
The split_email function splits/parses an email ID into its two constituent parts: account and domain. After splitting the email address stored in the column property, the directive creates two new columns, named by appending to the original column name: column_account and column_domain. If the email address cannot be parsed correctly, the additional columns are still generated, but they are set to null for the parts that could not be parsed.
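A sketch of a split_email configuration using the column property described above:

```yaml
functions:
  - name: split_email
    column: email
# produces email_account and email_domain
```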
The split_url function splits a URL into protocol, authority, host, port, path, filename, and query. The function parses the URL into its constituents. Upon splitting the URL, the directive creates seven new columns by appending to the original column name:
• column_protocol
• column_authority
• column_host
• column_port
• column_path
• column_filename
• column_query
If the URL cannot be parsed correctly, an exception is thrown. If the URL column does not exist, columns with a null value are added to the record.
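A sketch of a split_url configuration using the column property:

```yaml
functions:
  - name: split_url
    column: url
# produces url_protocol, url_authority, url_host, url_port,
# url_path, url_filename, and url_query
```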
The trim function trims whitespace from both sides, the left side, or the right side of the string values it is applied to. Supply the method as trim, ltrim, or rtrim.
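A sketch of a trim configuration; the column key is an assumption, while method is named above:

```yaml
functions:
  - name: trim
    column: city
    method: ltrim          # one of trim/ltrim/rtrim
```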
The unpivot function works as the reverse of the pivot function: it rotates column values back into row values.
```yaml
functions:
  - name: unpivot
    columns:            # use "*" to select all remaining columns other than the pivot columns
      - "{{USA}}"
    pivotColumns:       # use "*" to select all remaining columns other than columns
      - "{{Col1}}"
      - "{{Col2}}"
      - "{{Col3}}"
    keyColumnName: "{{Country}}"
    valueColumnName: "{{Amount}}"
```
Make sure the columns in the provided list have the same datatype, or the function will throw an error. Do not forget to cleanse the column names (for example, with cleanse_column_names) before passing "*" in columns if the column names contain hyphens or spaces.