Transformer Library
This page details the different transformations available in the Scuba Transformer Library.
Extract your data
bunzip2
Decompress bz2 files.
Example
Decompress test.json.bz2
Configuration | Input | Output |
---|---|---|
[ | test.json.bz2 | {"a": 1, "b": 2} |
decode
Decode file using specified character encoding.
Parameter | Type | Required? | Description |
---|---|---|---|
encoding | string | No | The character encoding to use. This parameter is optional, and defaults to utf-8. |
Example
Decode a file.
Configuration | Input | Output |
---|---|---|
[ | n/a | n/a |
gunzip
Decompress gzip files.
Example
Decompress test.json.gz
Configuration | Input | Output |
---|---|---|
[ | test.json.gz | {"a": 1, "b": 2} |
head
Read the beginning of each input file.
Parameter | Type | Required? | Description |
---|---|---|---|
sample_size | int | Yes | Number of lines to read from the beginning of each file |
Example
Take the first two lines of input files.
Configuration | Input | Output |
---|---|---|
[ | {"a": 1, "b": 1} | {"a": 1, "b": 1} |
regex_replace_line
Perform regular expression replacement on the line. The replacement string uses Python-style backreferences.
Parameter | Type | Required? | Description |
---|---|---|---|
regex | string | Yes | Regular expression to match |
repl_string | string | Yes | Replacement string to substitute |
Example
Fix unquoted JSON key (user) before the JSON load step.
Configuration | Input | Output |
---|---|---|
[ | {user: "u1", "trans": "x1", "val": 12} | {"user": "u1", "trans": "x1", "val": 12} |
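In effect, this step applies a Python `re.sub` to each raw line. The configuration in the table above is truncated, so the `regex` and `repl_string` values in the sketch below are hypothetical, chosen to reproduce the example:

```python
import re

# Hypothetical regex/repl_string values; the configuration shown in the
# table above is truncated, so these are illustrative only.
regex = r'\{(\w+):'       # capture a bare (unquoted) leading key
repl_string = r'{"\1":'   # re-emit it quoted, using a backreference

line = '{user: "u1", "trans": "x1", "val": 12}'
print(re.sub(regex, repl_string, line))
# {"user": "u1", "trans": "x1", "val": 12}
```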
sample
Take a random sample of each input file.
Parameter | Type | Required? | Description |
---|---|---|---|
sample_size | int | Yes | Number of lines to sample |
random_seed | int | Yes | Seed for the random number generator, used to sample the same events across multiple runs. |
Example
Sample two lines from three total.
Configuration | Input | Output |
---|---|---|
[ | {"a": 1, "b": 1} | {"a": 3, "b": 3} |
tail
Read the end of each input file.
Parameter | Type | Required? | Description |
---|---|---|---|
sample_size | int | Yes | Number of lines to read from the end of each file |
Example
Take the last two lines of input files.
Configuration | Input | Output |
---|---|---|
[ | {"a": 1, "b": 1} | {"a": 2, "b": 2} |
unpack_json_array
Split array of objects into individual objects. This is useful when event logs have multiple events in a single JSON array.
Parameter | Type | Required? | Description |
---|---|---|---|
startdepth | int | No | Generate objects starting at the specified depth. By default, this attempts a |
Example
Unpack three JSON objects within a single JSON array into line-separated objects.
Configuration | Input | Output |
---|---|---|
[ | [{"a": 1, "b": 2}, {"c": 3, "d": 4}, {"e": 1, "f": 2}] | {"a": 1, "b": 2} |
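Conceptually, one input line holding a JSON array becomes one output line per element. A minimal sketch of that behaviour (ignoring `startdepth`, and not the library's implementation):

```python
import json

line = '[{"a": 1, "b": 2}, {"c": 3, "d": 4}, {"e": 1, "f": 2}]'

# Emit one JSON object per line for each element of the array.
for obj in json.loads(line):
    print(json.dumps(obj))
# {"a": 1, "b": 2}
# {"c": 3, "d": 4}
# {"e": 1, "f": 2}
```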
Load your data
csv_load
Load CSV data into dictionaries and calculate a unique token that Scuba uses to deduplicate lines. The token is placed in a column named `__ia__unique__token__`.
By default, `csv_load` assumes the first line of the file is a primary header row specifying field names. If the CSV file does not contain a header row, or you wish to override the existing header, you can configure a custom "primary header" row.
In the case where one or more CSV lines do not match the number of fields in the primary header, `csv_load` uses a best-effort algorithm:
- If a row of the CSV file has fewer fields than the header, `csv_load` fills as many fields as possible in order.
- If a row of the CSV file has more fields than the header, `csv_load` generates extra header names like `field_N`, `field_N+1`, starting with the number of fields already present in the header.
If you want more control over handling non-standard CSV lines, you can specify additional header rows (each with a distinct number of fields), in which case `csv_load` will attempt to find a header with exactly the same number of fields as the current line and use that one. If neither the primary header nor any alternate header matches the number of fields in the current line, `csv_load` will use the primary header and fall back to its best-effort algorithm.
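The header-matching and best-effort rules above can be pictured with a short sketch. This mirrors the documented rules only; it is not the library's implementation:

```python
def choose_header(primary_header, alternate_headers, row):
    """Pick the header whose field count matches the row, else the primary."""
    for header in alternate_headers:
        if len(header) == len(row):
            return header
    return primary_header

def best_effort_load(header, row):
    """Pair fields with header names, generating field_N names as needed."""
    if len(row) > len(header):
        header = header + ["field_%d" % i for i in range(len(header), len(row))]
    # zip() stops at the shorter sequence, so short rows fill what they can, in order.
    return dict(zip(header, row))

primary = ["User", "Event", "Value"]
print(best_effort_load(primary, ["John", "Login"]))
# {'User': 'John', 'Event': 'Login'}
print(best_effort_load(primary, ["John", "Login", "0", "extra"]))
# {'User': 'John', 'Event': 'Login', 'Value': '0', 'field_3': 'extra'}
print(choose_header(primary, [["colA", "colB"]], ["x", "y"]))
# ['colA', 'colB']
```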
Parameter | Type | Required? | Description |
---|---|---|---|
unique_columns | list | No | List of columns to use to generate unique token. If not specified, the original line will be used. |
delimiter | string | No | Character to use as delimiter, e.g. a comma or a tab. |
headers | list | No | List of lists of headers. A header is chosen by the number of elements in each CSV line. |
header_row | string | No | One of If If no |
primary_header | list | No | Headers to use if no other headers match. |
doublequote | boolean | No | Controls how instances of quotechar appearing inside a field should themselves be quoted. When The default value is |
escapechar | string | No | Removes any special meaning from the following character. It defaults to |
quotechar | string | No | A one-character string used to quote fields containing special characters, such as the delimiter or quotechar, or which contain new-line characters. It defaults to |
skipinitialspace | boolean | No | When The default is |
Example
Load CSV data into a dictionary and dump it to JSON.
Configuration | Input | Output |
---|---|---|
[ | User,Event,Value | {"Event":"Login","Value":"0","User":"John","__ia__unique__token__":6739147714641650777} |
json_load
Deserialize JSON and calculate a unique token that Scuba uses to deduplicate lines.
Parameter | Type | Required? | Description |
---|---|---|---|
unique_columns | list | No | List of columns to use to generate a unique token. This parameter is optional. If not specified, the original line will be used. |
Example
Specify columns `user` and `trans` to generate a unique token.
Configuration | Input | Output |
---|---|---|
[ | {"user": "u1", "trans": "x1", "val": 12} | {"user":"u1","__ia__unique__token__":6073597307750309666,"trans":"x1","val":12} |
unique_token
Create a unique token that Scuba uses to deduplicate events, based on `columns`.
Parameter | Type | Required? | Description |
---|---|---|---|
columns | list | Yes | List of columns to use to generate a unique token. |
Example
Specify columns `user` and `trans` to generate a unique token.
Configuration | Input | Output |
---|---|---|
[ | {"user": "u1", "trans": "x1", "val": 12} | {"user":"u1","__ia__unique__token__":6073597307750309666,"trans":"x1","val":12} |
Transform your data
add_file_date
Set `column` to the datestamp of the file the event is from. If specifying input files as arguments on the command line, this requires passing the `--file-date` flag in `%Y-%m-%dT%H:%M:%S.%fZ` format.
Parameter | Type | Required? | Description |
---|---|---|---|
column | string | Yes | Column to add the file date to. |
Example
Add file date to events.
Configuration | Input | Output |
---|---|---|
[ | {"user": "Sam", "amt": 1} | {"amt":1,"filedate":"2016-05-05T12:12:12.123Z","user":"Sam","__ia__unique__token__":6792123659669767925} |
add_filename
Set `column` to the name of the file the event is from.
Parameter | Type | Required? | Description |
---|---|---|---|
column | string | Yes | Column to add the file name to. |
Example
Add filename `in.json` to events.
Configuration | Input | Output |
---|---|---|
[ | {"user": "Sam", "amt": 1} | {"__ia__unique__token__":6792123659669767925,"filename":"in.json","user":"Sam","amt":1} |
add_label
Set a column to the value `label` for every event.
Parameter | Type | Required? | Description |
---|---|---|---|
column | string | Yes | Name of column to put label in. |
label | string | Yes | Value to put in column. |
Example
Add `category`: `electronics` to every event.
Configuration | Input | Output |
---|---|---|
[ | {"user": "Sam", "amt": 1} | {"__ia__unique__token__":6792123659669767925,"category":"electronics","user":"Sam","amt":1} |
anonymize
Calculate a 52-bit SHA-1 hash.
Parameter | Type | Required? | Description |
---|---|---|---|
column | string | Yes | The column to anonymize |
output_column | string | No | The column to add the anonymized data to. The default value is |
Example
Anonymize `username`.
Configuration | Input | Output |
---|---|---|
[ | {"username": "BenHogan", "event": "putt", "dist": 3} | {"event":"putt","username":2528885808796705,"dist":3,"__ia__unique__token__":4345827873645831930} |
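As a rough illustration, a 52-bit value can be derived by keeping 52 bits of a SHA-1 digest. How the library encodes and truncates the digest is not specified here, so this sketch will not reproduce the value in the example output:

```python
import hashlib

def anonymize_52bit(value):
    # Keep the low 52 bits of the SHA-1 digest; the exact truncation and
    # encoding used by the library are assumptions for illustration.
    digest = int(hashlib.sha1(str(value).encode("utf-8")).hexdigest(), 16)
    return digest & ((1 << 52) - 1)

event = {"username": "BenHogan", "event": "putt", "dist": 3}
event["username"] = anonymize_52bit(event["username"])
print(event)  # username replaced with a 52-bit integer
```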
code_snippet
Write custom Python code to transform each event.
Parameter | Type | Required? | Description |
---|---|---|---|
code | string | Yes | Python statements operating on the variable |
import_list | list | No | Module names to import for use in this code snippet. |
Example
Use `code_snippet` to copy the `event` column to the `action` column.
Configuration | Input | Output |
---|---|---|
[ | {"user": "John", "event": "click"} | {"user":"John","event":"click","__ia__unique__token__":8820840440868449619,"action":"click"} |
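The description of the `code` parameter is truncated above, so the name of the per-event variable is an assumption here (shown as `event`). A sketch of what the snippet in this example might look like, run by hand for illustration:

```python
# Hypothetical value for the code parameter; the per-event variable name
# `event` is an assumption, since the description above is truncated.
code = "event['action'] = event['event']"

event = {"user": "John", "event": "click"}
exec(code)    # running the configured statements against one event
print(event)
# {'user': 'John', 'event': 'click', 'action': 'click'}
```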
convert_to_array
Split `column` on `separator` into a list of values in `output_column`.
Parameter | Type | Required? | Description |
---|---|---|---|
column | string | Yes | The column to operate on. |
output_column | string | No | The column to add the converted data to. The default is |
separator | string | Yes | String to split on; e.g., |
Example
Split `blurb` into an array containing individual words in `blurb`.
Configuration | Input | Output |
---|---|---|
[ | {"user": "John", "blurb": "I like trains"} | {"__ia__unique__token__":8715638442104830642,"user":"John","blurb":["I","like","trains"]} |
deep_copy
Make a complete copy of a column.
Parameter | Type | Required? | Description |
---|---|---|---|
column | string | Yes | Column name to copy from. |
new_name | string | Yes | Column name to copy to. |
Example
Copy `user` values to new column `usercopy`.
Configuration | Input | Output |
---|---|---|
[ | {"user": "Sam", "amt": 1} | {"amt":1,"user":"Sam","__ia__unique__token__":6792123659669767925,"usercopy":"Sam"} |
dictionary_replace
Apply dictionary replacement to a column.
Parameter | Type | Required? | Description |
---|---|---|---|
mapping | dict | Yes | Dictionary mapping strings to replacements. |
column | string | Yes | The column to operate on. |
output_column | string | No | Column name to put the output into. Defaults to |
Example
Use `dictionary_replace` to change product codes into product names.
Configuration | Input | Output |
---|---|---|
[ | {"first": "Sam", "product": "xkdruw67", "event": "purchase"} | {"__ia__unique__token__":6913206265407409391,"event":"purchase","product":"mp3player","first":"Sam"} |
divide
Perform floating point division.
- If `divisor` is a string, treat it as a column name.
- If the `divisor` column is null, the result will be null.
Parameter | Type | Required? | Description |
---|---|---|---|
column | string | Yes | Column to use as the dividend |
divisor | string | Yes | Value or column to use as divisor |
output_column | string | No | Column name to put the output into. Defaults to |
Example
Divide the `CPM` column by one million.
Configuration | Input | Output |
---|---|---|
[ | {"first": "Sam", "event": "viewad", "CPM": 2385} | {"event":"viewad","actual_cost":0.002385,"first":"Sam","CPM":2385,"__ia__unique__token__":2322298538924664253} |
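A sketch of the divisor rules above, mirroring the documented behaviour rather than the library's code:

```python
def divide(event, column, divisor, output_column):
    # If divisor is a string, treat it as a column name.
    value = event.get(divisor) if isinstance(divisor, str) else divisor
    # If the divisor column is null, the result is null.
    event[output_column] = None if value is None else float(event[column]) / float(value)
    return event

event = {"first": "Sam", "event": "viewad", "CPM": 2385}
print(divide(event, column="CPM", divisor=1000000, output_column="actual_cost"))
# {'first': 'Sam', 'event': 'viewad', 'CPM': 2385, 'actual_cost': 0.002385}
```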
flatten_dict
Flatten a dictionary by concatenating successive keys with a `.` ("dot") character.
Parameter | Type | Required? | Description |
---|---|---|---|
columns | list | No | List of column names to flatten. Defaults to all columns. |
depth_limit | int | No | Recursion limit. Defaults to no limit. |
Example
Flatten the `user` column one level.
Configuration | Input | Output |
---|---|---|
[ | {"user": {"name": "John", "info": {"phone": 5551234, "city": "Redwood City"}}, "event": "login"} | {"user.name":"John","event":"login","__ia__unique__token__":4975147593049633177,"user.info":{"phone":5551234,"city":"Redwood City"}} |
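A sketch of dot-flattening with a depth limit. The exact meaning of `depth_limit` is inferred from the example above (a limit of 1 flattens one level and leaves `user.info` nested); this is illustrative only:

```python
def flatten(d, depth_limit=None, _prefix=""):
    """Flatten nested dicts by joining keys with '.', up to depth_limit levels."""
    out = {}
    for key, value in d.items():
        name = _prefix + key
        if isinstance(value, dict) and (depth_limit is None or depth_limit > 0):
            next_limit = None if depth_limit is None else depth_limit - 1
            out.update(flatten(value, next_limit, name + "."))
        else:
            out[name] = value
    return out

event = {"user": {"name": "John",
                  "info": {"phone": 5551234, "city": "Redwood City"}},
         "event": "login"}
print(flatten(event, depth_limit=1))
# {'user.name': 'John',
#  'user.info': {'phone': 5551234, 'city': 'Redwood City'},
#  'event': 'login'}
```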
hash
Calculate a non-cryptographic MurmurHash3 hash.
Parameter | Type | Required? | Description |
---|---|---|---|
column | string | Yes | The column to hash |
output_column | string | No | The column to put the hash value in. The default value is |
Example
Hash `username`.
Configuration | Input | Output |
---|---|---|
[ | {"username": "BenHogan", "event": "putt", "dist": 3} | {"username":8545539730387850342,"dist":3,"__ia__unique__token__":4345827873645831930,"event":"putt"} |
if_then_otherwise
Run steps conditionally. If `condition` evaluates to `True` for an individual event, the `then` list of steps will be run. Otherwise, the `otherwise` list of steps will be run (if provided).
Parameter | Type | Required? | Description |
---|---|---|---|
condition | step | Yes | Condition to determine which steps to run. |
then | list | Yes | Steps to use if |
otherwise | list | No | Steps to use if |
Conditions
Condition | Parameter 1 | Type | Parameter 2 | Type |
---|---|---|---|---|
regex_match | column | string | pattern | string |
regex_search | column | string | pattern | string |
contains | column | string | | |
starts | column | string | prefix | string |
ends | column | string | suffix | string |
in_list | column | string | value | string |
not_in_list | column | string | value | string |
equals | column | string | value | string |
not_equals | column | string | value | string |
lt (less than) | column | string | value | string |
gt (greater than) | column | string | value | string |
lte (less than or equal) | column | string | value | string |
gte (greater than or equal) | column | string | value | string |
bool_eval | code | string | | |
and_ | conditions | list | | |
or_ | conditions | list | | |
Example
If the URL path indicates a payment, add the `page` = `pay` label.
Configuration | Input | Output |
---|---|---|
[ | {"user": "John", "url.path": "/pay/addcard/"} | {"user":"John","__ia__unique__token__":6826495190851080943,"page":"pay","url.path":"/pay/addcard/"} |
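In plain Python, the example behaves like the sketch below: a `regex_search`-style condition on the `url.path` column guards an `add_label`-style `then` step. The pattern shown is hypothetical, since the configuration in the table is truncated:

```python
import re

def process(event):
    # condition: regex_search on the "url.path" column (pattern is assumed)
    if re.search(r"^/pay/", str(event.get("url.path", ""))):
        # then: add_label-style step
        event["page"] = "pay"
    else:
        # otherwise: optional steps would run here
        pass
    return event

print(process({"user": "John", "url.path": "/pay/addcard/"}))
# {'user': 'John', 'url.path': '/pay/addcard/', 'page': 'pay'}
```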
join
Join `columns` together with `separator` between each. Null columns are skipped.
Parameter | Type | Required? | Description |
---|---|---|---|
columns | list | Yes | List of columns names to join |
output_column | string | Yes | Column in which to put the resulting string |
separator | string | No | String to put between each column. Defaults to the empty string. |
Example
Join the `first` and `last` name columns into a `fullname` column.
Configuration | Input | Output |
---|---|---|
[ | {"first": "Sam", "last": "Ryan"} | {"__ia__unique__token__":6631810923594382537,"fullname":"Sam Ryan","first":"Sam","last":"Ryan"} |
json_load_column
Deserialize a column containing a JSON encoded string. This is useful for CSV files containing JSON encoded strings.
Parameter | Type | Required? | Description |
---|---|---|---|
column | string | Yes | The column to operate on. |
output_column | string | No | The column to put the output into. The default value is |
Example
Load the value of `embed` as JSON.
Configuration | Input | Output |
---|---|---|
[ | {"user": "John", "embed": "{\"action\": \"click\"}"} | {"__ia__unique__token__":673996973727761320,"embed":{"action":"click"},"user":"John"} |
keep
Keep only the specified columns, dropping all others. Does not drop the Scuba deduplication token.
Parameter | Type | Required? | Description |
---|---|---|---|
columns | list | Yes | List of column names to keep. |
Example
Keep only the `user` and `event` columns.
Configuration | Input | Output |
---|---|---|
[ | {"user": "Sam", "event": "login", "blah": "blah"} | {"event":"login","user":"Sam","__ia__unique__token__":8468468302936289505} |
lowercase
Convert all or some column names to lowercase strings. If you do not specify `columns`, a lowercased column name will be created and set to null.
This is different from the `lowercase_value` step.
Parameter | Type | Required? | Description |
---|---|---|---|
columns | list | No | Columns to be renamed lowercase. Default is all columns. |
Example
Change uppercase characters in `USERNAME` to lowercase.
Configuration | Input | Output |
---|---|---|
[ | {"USERNAME": "BenHogan", "event": "putt", "dist": 3} | {"dist":3,"username":"BenHogan","__ia__unique__token__":8497066893068604652,"event":"putt"} |
lowercase_value
Convert a column to a lowercase string.
Parameter | Type | Required? | Description |
---|---|---|---|
column | string | Yes | Column that contains values to be lowercased. |
output_column | string | No | Column to put the output into. Defaults to |
Example
Convert values of the `text` field to lowercase.
Configuration | Input | Output |
---|---|---|
[ | {"user": "Sam", "text": "Wet Paint"} | {"__ia__unique__token__":2233568130467114876,"text":"wet paint","user":"Sam"} |
merge_keys
Merge `column_1` and `column_2` into `output_column` using SHA-1. Combine two mutually exclusive shard key columns into one, allowing them to share a table copy.
- If `column_1` is `None` and `column_2` is not `None`, `output_column` will be set to the digest of `column_2` with LSB set to `0`.
- If `column_2` is `None` and `column_1` is not `None`, `output_column` will be set to the digest of `column_1` with LSB set to `1`.
- If both are set, or both are `None`, `output_column` will be set to `None`.
Parameter | Type | Required? | Description |
---|---|---|---|
column_1 | string | Yes | Column to merge and set LSB to 0. |
column_2 | string | Yes | Column to merge and set LSB to 1. |
output_column | string | Yes | Column to put the result into. |
Example
Merge columns `web1` and `web2` into column `merge`.
Configuration | Input | Output |
---|---|---|
[ | {"web1": null, "web2": "w1", "val": 12} | {"val":12,"__ia__unique__token__":7297373526901799213,"merge":724008111296322,"web2":"w1","web1":null} |
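The rules listed above can be sketched in a few lines. How the SHA-1 digest is truncated to an integer is an assumption here, so the value will not match the example output exactly:

```python
import hashlib

def merge_keys(event, column_1, column_2, output_column):
    def digest(value):
        # Truncation of the SHA-1 digest is assumed for illustration.
        return int(hashlib.sha1(str(value).encode("utf-8")).hexdigest(), 16) % (1 << 52)

    a, b = event.get(column_1), event.get(column_2)
    if a is None and b is not None:
        event[output_column] = digest(b) & ~1   # LSB set to 0
    elif b is None and a is not None:
        event[output_column] = digest(a) | 1    # LSB set to 1
    else:
        event[output_column] = None             # both set, or both None
    return event

print(merge_keys({"web1": None, "web2": "w1", "val": 12},
                 "web1", "web2", "merge"))
# {'web1': None, 'web2': 'w1', 'val': 12, 'merge': <digest of "w1" with LSB 0>}
```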
multiply
Perform floating point multiplication.
- If `multiplier` is a string, treat it as a column name.
- If the `multiplier` column is null, the result will be null.
Parameter | Type | Required? | Description |
---|---|---|---|
column | string | Yes | Column to use as multiplicand |
multiplier | string | Yes | Value or column to use as multiplier |
output_column | string | No | Column name to put the output into. Defaults to |
Example
Multiply the `price` column by the `quantity` column and put the results in the `total` column.
Configuration | Input | Output |
---|---|---|
[ | {"first": "Sam", "event": "purchase", "price": 2.13, "quantity": 4} | {"event":"purchase","quantity":4,"price":2.13,"first":"Sam","total":8.52,"__ia__unique__token__":2308855450660799997} |
name_value_extract
Extract a list of generic properties into one object.
Parameter | Type | Required? | Description |
---|---|---|---|
name_field | string | Yes | Name of field to treat as generic property name. |
value_field | string | Yes | Name of field to treat as generic property value. |
column | string | Yes | The list column to extract values from. |
output_column | string | No | Column to put the output into. Defaults to |
Example
Extract `price` and `category` info from the `items` field.
Configuration | Input | Output |
---|---|---|
[ | {"user": "John", "items": [{"label":"tab.abc","category":"abc","price":17.80},{"label":"tab.f","category":"food","price":41.00}]} | {"items":{"food":41.0,"abc":17.8},"user":"John","__ia__unique__token__":1918698498990518500} |
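A sketch of the example above: each element of the list column contributes one name-to-value pair, taken from `name_field` and `value_field` respectively. Illustrative only:

```python
def name_value_extract(event, column, name_field, value_field, output_column):
    # Build one object from the generic name/value pairs in the list column.
    event[output_column] = {item[name_field]: item[value_field]
                            for item in event[column]}
    return event

event = {"user": "John",
         "items": [{"label": "tab.abc", "category": "abc", "price": 17.80},
                   {"label": "tab.f", "category": "food", "price": 41.00}]}
print(name_value_extract(event, column="items", name_field="category",
                         value_field="price", output_column="items"))
# {'user': 'John', 'items': {'abc': 17.8, 'food': 41.0}}
```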
omit
Omit the specified columns.
Parameter | Type | Required? | Description |
---|---|---|---|
columns | list | Yes | List of column names to omit |
Example
Omit column `blah` from events.
Configuration | Input | Output |
---|---|---|
[ | {"user": "Sam", "amt": 1, "blah": "blah"} | {"user":"Sam","__ia__unique__token__":3893872012624762980,"amt":1} |
regex_array_extract
For each element in an array, extract matching regular expression subgroups into new columns.
Parameter | Type | Required? | Description |
---|---|---|---|
column | string | Yes | Name of the column to extract from. |
regex | string | Yes | Regular expression to match. |
output_columns | list | Yes | List of column names to put regex subgroups into, in order. |
Example
Extract categories from list of products.
Configuration | Input | Output |
---|---|---|
[ | {"user": "John", "products": ["produce-hdy438ed", "cleaning-or944dus", "seafood-pd33sk2q"]} | {"user":"John","__ia__unique__token__":3040505961674288811,"categories":["produce","cleaning","seafood"],"products":["produce-hdy438ed","cleaning-or944dus","seafood-pd33sk2q"]} |
regex_array_replace
Perform regular expression replacement on each element in an array. Replacement string uses Python-style backreferences.
This will do nothing if the column is not an array.
Parameter | Type | Required? | Description |
---|---|---|---|
column | string | Yes | Column to perform replacement with. |
regex | string | Yes | Regular expression to match. |
repl_string | string | Yes | Replacement string to substitute. |
output_column | string | No | Column name to put the output into. Defaults to |
Example
Use `regex_array_replace` to remove trailing identifiers from items in the `products` list.
Configuration | Input | Output |
---|---|---|
[ | {"user": "John", "products": ["produce-hdy438ed", "cleaning-or944dus", "seafood-pd33sk2q"]} | {"user":"John","products":["produce","cleaning","seafood"],"__ia__unique__token__":3040505961674288811} |
regex_extract
Extract matching regular expression subgroups into new columns.
Parameter | Type | Required? | Description |
---|---|---|---|
column | string | Yes | Name of column to extract from. |
regex | string | Yes | Regular expression to match with. Regex must match the entire string. |
output_columns | list | Yes | List of column names to put regex subgroups into in order. |
Example
Extract the `plan` query parameter from the non-URL field `uri`.
Configuration | Input | Output |
---|---|---|
[ | {"user": "John", "uri": "/channel/videoplay/?vidid=7246927612831078230&plan=pro&view=dark&loggedin=yes"} | {"user":"John","uri":"/channel/videoplay/?vidid=7246927612831078230&plan=pro&view=dark&loggedin=yes","__ia__unique__token__":6723934658596588009,"plan":"pro"} |
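A sketch of subgroup extraction for the example above. The configured regex is not shown in the (truncated) table, so the one below is hypothetical; note the documented requirement that the regex match the entire string, with one subgroup per entry in `output_columns`:

```python
import re

# Hypothetical regex; it must match the entire string, with one subgroup
# per entry in output_columns.
regex = r".*[?&]plan=([^&]*).*"
output_columns = ["plan"]

event = {"user": "John",
         "uri": "/channel/videoplay/?vidid=7246927612831078230&plan=pro&view=dark&loggedin=yes"}
match = re.fullmatch(regex, event["uri"])
if match:
    event.update(zip(output_columns, match.groups()))
print(event["plan"])  # pro
```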
regex_replace
Perform regular expression replacement on `column`. The replacement string uses Python-style backreferences.
Parameter | Type | Required? | Description |
---|---|---|---|
column | string | Yes | Column to perform the replacement with. |
regex | string | Yes | Regular expression to match. |
repl_string | string | Yes | Replacement string to substitute. |
output_column | string | No | Column name to put the output into. Defaults to |
Example
Use `regex_replace` to prefix the `uri` field with a scheme and host.
Configuration | Input | Output |
---|---|---|
[ | {"user": "John", "uri": "/channel/videoplay/?vidid=7246927612831078230&plan=pro&view=dark&loggedin=yes"} | {"user":"John","__ia__unique__token__":6723934658596588009,"uri":"https://john.scuba.io/channel/vi...k&loggedin=yes"} |
remove_punctuation
Remove all punctuation from a column. Assumes unicode characters.
Parameter | Type | Required? | Description |
---|---|---|---|
column | string | Yes | The column to remove punctuation from. |
output_column | string | No | Column name to put the output into. Defaults to |
Example
Strip punctuation from the `message` column.
Configuration | Input | Output |
---|---|---|
[ | {"user": "Punctuation", "message": "!@#%&*()_-{}[]:;,.?"} | {"message":"","user":"Punctuation","__ia__unique__token__":1267513633946236610} |
rename
Rename a column.
Parameter | Type | Required? | Description |
---|---|---|---|
column | string | Yes | Column to rename |
new_name | string | Yes | New name for the column |
Example
Rename column `blah` to `event_type`.
Configuration | Input | Output |
---|---|---|
[ | {"user": "Sam", "amt": 1, "blah": "blah"} | {"amt":1,"event_type":"blah","user":"Sam","__ia__unique__token__":3893872012624762980} |
shred_array
Change array of objects into an object of arrays.
Parameter | Type | Required? | Description |
---|---|---|---|
column | string | Yes | The column to operate on |
output_column | string | No | The column to add the output into. Defaults to |
Example
Shred the array `exp`.
Configuration | Input | Output |
---|---|---|
[ | {"user": "John", "exp": [{"var": "a", "val": "1"}, {"var": "b", "val": "4"}]} | {"__ia__unique__token__":7370096177795160827,"user":"John","exp":{"val":["1","4"],"var":["a","b"]}} |
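The array-of-objects to object-of-arrays reshaping in the example above can be sketched as follows (illustrative only):

```python
def shred_array(event, column, output_column):
    # Collect each key's values across all objects in the array.
    shredded = {}
    for obj in event[column]:
        for key, value in obj.items():
            shredded.setdefault(key, []).append(value)
    event[output_column] = shredded
    return event

event = {"user": "John",
         "exp": [{"var": "a", "val": "1"}, {"var": "b", "val": "4"}]}
print(shred_array(event, column="exp", output_column="exp"))
# {'user': 'John', 'exp': {'var': ['a', 'b'], 'val': ['1', '4']}}
```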
split_array
Split a column containing an array into multiple columns.
Parameter | Type | Required? | Description |
---|---|---|---|
column | string | Yes | Split this column into new columns if it is an array. |
output_columns | string | No | Column names to use. Defaults to |
Example
Split values in array `exp` into three new columns.
Configuration | Input | Output |
---|---|---|
[ | {"user": "John", "exp": ["cat","dog","hamster"]} | {"exp":["cat","dog","hamster"],"animal1":"cat","animal3":"hamster","user":"John","__ia__unique__token__":2032956903660321772,"animal2":"dog"} |
time_convert
Reformat a time column, or do nothing if none of the read formats match. Uses Python-style time format strings, or one of `epoch_seconds`, `epoch_millis`, `epoch_micros`.
If testing on the command line, the `max_time_ago` and `max_time_hence` parameters require passing the `--file-date` flag.
Parameter | Type | Required? | Description |
---|---|---|---|
column | string | Yes | The column to perform time conversion on. |
output_column | string | No | The column to put the result in. Defaults to |
read_formats | list | Yes | List of datetime format strings to try, in order. |
write_format | string | No | Datetime format string for output. Defaults to %Y-%m-%dT%H:%M:%S.%fZ |
max_time_ago | string | No | Events older than the file date minus |
max_time_hence | string | No | Events more recent than the file date plus |
max_abs_time | string | No | Events stamped after this ISO-8601 timestamp will be silently dropped. |
min_abs_time | string | No | Events stamped before this ISO-8601 timestamp will be silently dropped. |
Example
Convert a variety of time formats into %Y-%m-%d %H:%M:%S.
Configuration | Input | Output |
---|---|---|
[ | {"ts": "Tuesday, March 22 2016"} | {"__ia__unique__token__":1033586384520328482,"ts":"2016-03-22 00:00:00"} |
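A sketch of the read/write format handling: try each read format in order and rewrite with the write format, or leave the value alone if nothing matches. This mirrors the documented behaviour, not the library's code; the `read_formats` shown are hypothetical since the configuration above is truncated:

```python
from datetime import datetime

# Hypothetical read formats; the configured list is truncated above.
read_formats = ["%A, %B %d %Y", "%Y-%m-%d %H:%M:%S", "%d/%b/%Y:%H:%M:%S"]
write_format = "%Y-%m-%d %H:%M:%S"

def time_convert(value):
    for fmt in read_formats:
        try:
            return datetime.strptime(value, fmt).strftime(write_format)
        except ValueError:
            continue
    return value  # no format matched: do nothing

print(time_convert("Tuesday, March 22 2016"))  # 2016-03-22 00:00:00
```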
url_parse
Parse a URL into a dictionary.
Parameter | Type | Required? | Description |
---|---|---|---|
column | string | Yes | Column containing the URL to be parsed |
output_column | string | No | The column to put the output into. Defaults to |
Example
Parse the `referrer` URL into a dictionary.
Configuration | Input | Output |
---|---|---|
[ | {"user": "John", "referrer": "john.scuba.io:80/videoplay?docid=-7246927612831078230&hl=en&view=dark&loggedin=yes#00h02m30s"} | {"__ia__unique__token__":7912114277290934867,"referrer":{"fragment":"00h02m30s","query":"docid=-7246927612831078230&hl=en&view=dark&loggedin=yes","scheme":"https","path":"/videoplay","params":"","port":80,"hostname":"john.scuba.io"},"user":"John"} |
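Python's standard `urllib.parse` produces a similar breakdown into scheme, host, port, path, params, query, and fragment. The example output shows a default scheme of `https` for the scheme-less input, so the sketch below prepends one as an assumption; the library's exact behaviour may differ:

```python
from urllib.parse import urlparse

referrer = "john.scuba.io:80/videoplay?docid=-7246927612831078230&hl=en&view=dark&loggedin=yes#00h02m30s"

# Prepending "https://" is an assumption so urlparse can find the host and port.
parts = urlparse("https://" + referrer)
parsed = {
    "scheme": parts.scheme,
    "hostname": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "params": parts.params,
    "query": parts.query,
    "fragment": parts.fragment,
}
print(parsed)
# {'scheme': 'https', 'hostname': 'john.scuba.io', 'port': 80, 'path': '/videoplay',
#  'params': '', 'query': 'docid=-7246927612831078230&hl=en&view=dark&loggedin=yes',
#  'fragment': '00h02m30s'}
```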
url_query_string_parse
Parse a URL query string of the form `key1=value1&key2=value2`.
Parameter | Type | Required? | Description |
---|---|---|---|
column | string | Yes | The column containing the query string to parse. |
output_column | string | No | Column to put result in. The default value is |
keep_parameters | list | Yes | Only keep parameters in the query string from this list. |
Example
Further parse `referrer.query` into selected parameters.
Configuration | Input | Output |
---|---|---|
[ | {"user": "John", "referrer": "john.scuba.io:80/videoplay?docid=-7246927612831078230&hl=en&view=dark&loggedin=yes#00h02m30s"} | {"user":"John","__ia__unique__token__":7912114277290934867,"referrer.port":80,"referrer.params":"","referrer.path":"/videoplay","referrer.fragment":"00h02m30s","referrer.scheme":"https","referrer.hostname":"john.scuba.io","referrer.query":{"hl":"en","view":"dark","loggedin":"yes"}} |
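A sketch of parsing `key1=value1&key2=value2` and keeping only the listed parameters, using the standard library. Keeping single values as plain strings (rather than lists) is inferred from the example output:

```python
from urllib.parse import parse_qs

query = "docid=-7246927612831078230&hl=en&view=dark&loggedin=yes"
keep_parameters = ["hl", "view", "loggedin"]

# parse_qs returns lists of values; take the first of each, as the
# example output suggests single values are kept as plain strings.
parsed = {key: values[0]
          for key, values in parse_qs(query).items()
          if key in keep_parameters}
print(parsed)
# {'hl': 'en', 'view': 'dark', 'loggedin': 'yes'}
```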
whitelist_string_values
Remove values from a string column that are not in the given list.
Parameter | Type | Required? | Description |
---|---|---|---|
column | string | Yes | Column to operate on |
whitelist | list | Yes | List of allowed string values |
replacement | string | No | The string used to replace values not in |
Example
Allow only accepted values in the `source` column.
Configuration | Input | Output |
---|---|---|
[ | {"user": "Sam", "event": "click", "source": "frontend"} | {"event":"click","user":"Sam","__ia__unique__token__":6897873284472877507,"source":"frontend"} |
Complete the data transformation process
json_dump
Return a dictionary as a JSON-encoded object. This is the only dictionary-to-line step, and it should generally be the last step.
Example
Transform a Python dictionary to JSON for ingest.
Configuration | Input | Output |
---|---|---|
["json_dump"] | n/a | n/a |