This page details the different transformations available in the Scuba Transformer Library.

Extract your data 

bunzip2 

Decompress bz2 files.

Example 

Decompress test.json.bz2

Configuration

Input

Output

[
["bunzip2"],
["decode"]
]

test.json.bz2

{"a": 1, "b": 2}
{"c": 3, "d": 4}
{"e": 1, "f": 2}

decode 

Decode file using specified character encoding.

Parameter

Type

Required?

Description

encoding

string

No

The character encoding to use. This parameter is optional, and defaults to utf-8.

Example 

Decode a file.

Configuration

Input

Output

[
["decode"]
]

n/a

n/a
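
If the file is not UTF-8, pass the encoding explicitly. A minimal sketch, assuming a Latin-1 encoded file (the encoding value is illustrative):

[
["decode", {"encoding": "latin-1"}]
]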

gunzip 

Decompress gzip files.

Example 

Decompress test.json.gz

Configuration

Input

Output

[
["gunzip"],
["decode"]
]

test.json.gz

{"a": 1, "b": 2}
{"c": 3, "d": 4}
{"e": 1, "f": 2}

head 

Read the beginning of each input file.

Parameter

Type

Required?

Description

sample_size

int

Yes

Number of lines to read from the beginning of each file

Example 

Take the first two lines of input files.

Configuration

Input

Output

[
["head", {"sample_size": 2}],
["decode"]
]

{"a": 1, "b": 1}
{"a": 2, "b": 2}
{"a": 3, "b": 3}

{"a": 1, "b": 1}
{"a": 2, "b": 2}

regex_replace_line 

Perform regular expression replacement on the line. The replacement string uses Python-style backreferences.

Parameter

Type

Required?

Description

regex

string

Yes

Regular expression to match

repl_string

string

Yes

Replacement string to substitute

Example 

Fix unquoted JSON key (user) before the JSON load step.

Configuration

Input

Output

[
["decode"],
["regex_replace_line", {"regex": "user:", "repl_string": "\"user\":"}]
]

{user: "u1", "trans": "x1", "val": 12}
{user: "u1", "trans": "x2", "val": 9}
{user: "u2", "trans": "x1", "val": 21}

{"user": "u1", "trans": "x1", "val": 12}
{"user": "u1", "trans": "x2", "val": 9}
{"user": "u2", "trans": "x1", "val": 21}

sample 

Take a random sample of each input file.

Parameter

Type

Required?

Description

sample_size

int

Yes

Number of lines to sample

random_seed

int

Yes

Seed for the random number generator. Use it to sample the same events across multiple runs.

Example 

Sample two lines from three total.

Configuration

Input

Output

[
["sample", {"sample_size": 2}],
["decode"]
]

{"a": 1, "b": 1}
{"a": 2, "b": 2}
{"a": 3, "b": 3}

{"a": 3, "b": 3}
{"a": 2, "b": 2}

tail 

Read the end of each input file.

Parameter

Type

Required?

Description

sample_size

int

Yes

Number of lines to read from the end of each file

Example 

Take the last two lines of input files.

Configuration

Input

Output

[
["tail", {"sample_size": 2}],
["decode"]
]

{"a": 1, "b": 1}
{"a": 2, "b": 2}
{"a": 3, "b": 3}

{"a": 2, "b": 2}
{"a": 3, "b": 3}

unpack_json_array 

Split array of objects into individual objects. This is useful when event logs have multiple events in a single JSON array.

Parameter

Type

Required?

Description

startdepth

int

No

Generate objects starting at the specified depth.

By default, this attempts a startdepth of 1 and 2.

Example 

Unpack three JSON objects within a single JSON array into line-separated objects.

Configuration

Input

Output

[
["unpack_json_array"],
["decode"]
]

[{"a": 1, "b": 2}, {"c": 3, "d": 4}, {"e": 1, "f": 2}]

{"a": 1, "b": 2}
{"c": 3, "d": 4}
{"e": 1, "f": 2}

Load your data 

csv_load 

Load CSV data into dictionaries and calculate a unique token that Scuba uses to deduplicate lines. The token is placed in a column named __ia__unique__token__.

By default, csv_load assumes the first line of the file is a primary header row specifying field names. If the CSV file does not contain a header row, or you wish to override the existing header, you can configure a custom "primary header" row.  

In the case where one or more CSV lines do not match the number of fields in the primary header, csv_load uses a best-effort algorithm:

  • If a row of the CSV file has fewer fields than the header, csv_load fills as many fields as possible in order.

  • If a row of the CSV file has more fields than the header, csv_load generates extra header names like field_N, field_N+1, and so on, starting with N equal to the number of fields already present in the header.

If you want more control over handling non-standard CSV lines, you can specify additional header rows (each with a distinct number of fields), in which case csv_load will attempt to find a header with exactly the same number of fields as the current line and use that one. If neither the primary header nor any alternate header matches the number of fields in the current line, csv_load will use the "primary header" and fall back to its best-effort algorithm.
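
As an illustration, the following sketch supplies a hypothetical primary header plus one alternate header for lines that carry an extra field; csv_load picks whichever header matches the field count of each line, as described above. The column names are placeholders:

[
["decode"],
["csv_load", {"primary_header": ["user", "event", "value"], "headers": [["user", "event", "value", "extra"]]}],
["json_dump"]
]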

Parameter

Type

Required?

Description

unique_columns

list

No

List of columns to use to generate unique token. If not specified, the original line will be used.

delimiter

string

No

Character to use as delimiter, e.g. a comma or a tab.

headers

list

No

List of header lists. A header is chosen by matching the number of fields in each CSV line.

header_row

string

No

One of use, skip, or no_header (the default value).

If use, the header row in the file is used if no other headers match.

If no primary_header is set, the file header is used as the primary header.

primary_header

list

No

Headers to use if no other headers match.

doublequote

boolean

No

Controls how instances of quotechar appearing inside a field should themselves be quoted. When True, the character is doubled. When False, the escapechar is used as a prefix to the quotechar.

The default value is True.

escapechar

string

No

Removes any special meaning from the following character. It defaults to None, which disables escaping.

quotechar

string

No

A one-character string used to quote fields containing special characters, such as the delimiter or quotechar, or which contain new-line characters. It defaults to '"'.

skipinitialspace

boolean

No

When True, whitespace immediately following the delimiter is ignored.

The default is False.

Example 

Load CSV data into dictionary, dump into JSON.

Configuration

Input

Output

[
["decode"],
["csv_load"],
["json_dump"]
]

User,Event,Value
John,Login,0
Todd,Logout,1

{"Event":"Login","Value":"0","User":"John","__ia__unique__token__":6739147714641650777}
{"Event":"Logout","Value":"1","User":"Todd","__ia__unique__token__":311502209663632137}

json_load 

Deserialize JSON and calculate a unique token that Scuba uses to deduplicate lines.

Parameter

Type

Required?

Description

unique_columns

list

No

List of columns to use to generate a unique token.

This parameter is optional. If not specified, the original line will be used.

Example 

Specify columns user and trans to generate a unique token.

Configuration

Input

Output

[
["decode"],
["json_load", {"unique_columns": ["user", "trans"]}],
["json_dump"]
]

{"user": "u1", "trans": "x1", "val": 12}
{"user": "u1", "trans": "x2", "val": 9}
{"user": "u2", "trans": "x1", "val": 21}

{"user":"u1","__ia__unique__token__":6073597307750309666,"trans":"x1","val":12}
{"user":"u1","__ia__unique__token__":8385417071786987290,"trans":"x2","val":9}
{"user":"u2","__ia__unique__token__":2558069477487871788,"trans":"x1","val":21}

  

unique_token 

Create a unique token that Scuba uses to deduplicate events based on columns.

Parameter

Type

Required?

Description

columns

list

Yes

List of columns to use to generate a unique token.

Example 

Specify columns user and trans to generate a unique token.

Configuration

Input

Output

[
["decode"],
["unique_token", {"user", "trans"}],
["json_dump"]
]

{"user": "u1", "trans": "x1", "val": 12}
{"user": "u1", "trans": "x2", "val": 9}
{"user": "u2", "trans": "x1", "val": 21}

{"user":"u1","__ia__unique__token__":6073597307750309666,"trans":"x1","val":12}
{"user":"u1","__ia__unique__token__":8385417071786987290,"trans":"x2","val":9}
{"user":"u2","__ia__unique__token__":2558069477487871788,"trans":"x1","val":21}

Transform your data 

add_file_date 

Set a column to the datestamp of the file the event came from. If you specify input files as arguments on the command line, you must also pass the --file-date flag in %Y-%m-%dT%H:%M:%S.%fZ format.

Parameter

Type

Required?

Description

column

string

Yes

Column to add the file date to.

Example 

Add file date to events.

Configuration

Input

Output

[
["decode"],
["json_load"],
["add_file_date", {"column": "filedate"}],
["json_dump"]
]

{"user": "Sam", "amt": 1}
{"user": "John", "amt": 2}
{"user": "Elyse", "amt": 3}

{"amt":1,"filedate":"2016-05-05T12:12:12.123Z","user":"Sam","__ia__unique__token__":6792123659669767925}
{"amt":2,"filedate":"2016-05-05T12:12:12.123Z","user":"John","__ia__unique__token__":371460198962001743}
{"amt":3,"filedate":"2016-05-05T12:12:12.123Z","user":"Elyse","__ia__unique__token__":2941371116376991439}

add_filename 

Set column to the name of the file the event is from.

Parameter

Type

Required?

Description

column

string

Yes

Column to add the file name to.

Example 

Add filename in.json to events.

Configuration

Input

Output

[
["decode"],
["json_load"],
["add_filename", {"column": "filename"}],
["json_dump"]
]

{"user": "Sam", "amt": 1}
{"user": "John", "amt": 2}
{"user": "Elyse", "amt": 3}

{"__ia__unique__token__":6792123659669767925,"filename":"in.json","user":"Sam","amt":1}
{"__ia__unique__token__":371460198962001743,"filename":"in.json","user":"John","amt":2}
{"__ia__unique__token__":2941371116376991439,"filename":"in.json","user":"Elyse","amt":3}

add_label 

Set a column to the given label value for every event.

Parameter

Type

Required?

Description

column

string

Yes

Name of column to put label in.

label

string

Yes

Value to put in column.

Example 

Add category : electronics to every event.

Configuration

Input

Output

[
["decode"],
["json_load"],
["add_label", {"column": "category", "label": "electronics"}],
["json_dump"]
]

{"user": "Sam", "amt": 1}
{"user": "John", "amt": 2}
{"user": "Elyse", "amt": 3}

{"__ia__unique__token__":6792123659669767925,"category":"electronics","user":"Sam","amt":1}
{"__ia__unique__token__":371460198962001743,"category":"electronics","user":"John","amt":2}
{"__ia__unique__token__":2941371116376991439,"category":"electronics","user":"Elyse","amt":3}

anonymize 

Calculate a 52-bit SHA-1 hash of a column.

Parameter

Type

Required?

Description

column

string

Yes

The column to anonymize

output_column

string

No

The column to add the anonymized data to. The default value is column.

Example 

Anonymize username.

Configuration

Input

Output

[
["decode"],
["json_load"],
["anonymize", {"column": "username"}],
["json_dump"]
]

{"username": "BenHogan", "event": "putt", "dist": 3}
{"username": "ArnoldPalmer", "event": "chip", "dist": 15}
{"username": "JackNicklaus", "event": "drive", "dist": 267}

{"event":"putt","username":2528885808796705,"dist":3,"__ia__unique__token__":4345827873645831930}
{"event":"chip","username":1922980560516075,"dist":15,"__ia__unique__token__":7955354759641172706}
{"event":"drive","username":1542261379125033,"dist":267,"__ia__unique__token__":3709088310454520588}

code_snippet 

Write custom Python code to transform each event.

Parameter

Type

Required?

Description

code

string

Yes

Python statements operating on the variable line.

import_list

list

No

Module names to import for use in this code snippet.

Example 

Use code_snippet to copy the event column to the action column.

Configuration

Input

Output

[
["decode"],
["json_load"],
["code_snippet", {"code": '''
line['action'] = line['event']
'''}],
["json_dump"]
]

{"user": "John", "event": "click"}
{"user": "John", "event": "purchase"}

{"user":"John","event":"click","__ia__unique__token__":8820840440868449619,"action":"click"}
{"user":"John","event":"purchase","__ia__unique__token__":1537361993526075803,"action":"purchase"}

convert_to_array 

Split column on separator into a list of values in output_column.

Parameter

Type

Required?

Description

column

string

Yes

The column to operate on.

output_column

string

No

The column to add the converted data to. The default is column.

separator

string

Yes

String to split on, for example a comma or a space.

Example 

Split blurb into an array containing individual words in blurb.

Configuration

Input

Output

[
["decode"],
["json_load"],
["convert_to_array", {"column": "blurb", "separator": " "}],
["json_dump"]
]

{"user": "John", "blurb": "I like trains"}
{"user": "Sam", "blurb": "I hate pizza"}
{"user": "Dave", "blurb": "I like pizza"}

{"__ia__unique__token__":8715638442104830642,"user":"John","blurb":["I","like","trains"]}
{"__ia__unique__token__":3445418821193718572,"user":"Sam","blurb":["I","hate","pizza"]}
{"__ia__unique__token__":7497843545051763820,"user":"Dave","blurb":["I","like","pizza"]}

deep_copy 

Make a complete copy of a column.

Parameter

Type

Required?

Description

column

string

Yes

Column name to copy from.

new_name

string

Yes

Column name to copy to.

Example 

Copy user values to new column usercopy.

Configuration

Input

Output

[
["decode"],
["json_load"],
["deep_copy", {"column": "user", "new_name": "usercopy"}],
["json_dump"]
]

{"user": "Sam", "amt": 1}
{"user": "John", "amt": 2}
{"user": "Elyse", "amt": 3}

{"amt":1,"user":"Sam","__ia__unique__token__":6792123659669767925,"usercopy":"Sam"}
{"amt":2,"user":"John","__ia__unique__token__":371460198962001743,"usercopy":"John"}
{"amt":3,"user":"Elyse","__ia__unique__token__":2941371116376991439,"usercopy":"Elyse"}

dictionary_replace 

Apply dictionary replacement to a column.

Parameter

Type

Required?

Description

mapping

dict

Yes

Dictionary mapping strings to replacements.

column

string

Yes

The column to operate on.

output_column

string

No

Column name to put the output into. Defaults to column.

Example 

Use dictionary_replace to change product codes into product names.

Configuration

Input

Output

[
["decode"],
["json_load"],
["dictionary_replace", {"mapping": {"xkdruw67": "mp3player", "skr885jc": "dvdcleaner", "kk284jds": "cartridge"},"column": "product"}],
["json_dump"]
]

{"first": "Sam", "product": "xkdruw67", "event": "purchase"}
{"first": "John", "product": "skr885jc", "event": "purchase"}
{"first": "Elyse", "product": "kk284jds", "event": "purchase"}

{"__ia__unique__token__":6913206265407409391,"event":"purchase","product":"mp3player","first":"Sam"}
{"__ia__unique__token__":496410646764207962,"event":"purchase","product":"dvdcleaner","first":"John"}
{"__ia__unique__token__":9151368650616356276,"event":"purchase","product":"cartridge","first":"Elyse"}

divide 

Perform floating point division.

  • If divisor is a string, treat it as a column name.

  • If divisor column is null, result will be null.

Parameter

Type

Required?

Description

column

string

Yes

Column to use as the dividend

divisor

string

Yes

Value or column to use as divisor

output_column

string

No

Column name to put the output into. Defaults to column.

Example 

Divide CPM column by one million.

Configuration

Input

Output

[
["decode"],
["json_load"],
["divide", {"column": "CPM", "divisor": 1000000, "output_column": "actual_cost"}],
["json_dump"]
]

{"first": "Sam", "event": "viewad", "CPM": 2385}
{"first": "John", "event": "viewad", "CPM": 4729}
{"first": "Elyse", "event": "viewad", "CPM": 1122}

{"event":"viewad","actual_cost":0.002385,"first":"Sam","CPM":2385,"__ia__unique__token__":2322298538924664253}
{"event":"viewad","actual_cost":0.004729,"first":"John","CPM":4729,"__ia__unique__token__":3128606018585001222}
{"event":"viewad","actual_cost":0.001122,"first":"Elyse","CPM":1122,"__ia__unique__token__":8493916172843408209}

flatten_dict 

Flatten dictionary by concatenating successive keys with a . ("dot") character.

Parameter

Type

Required?

Description

columns

list

No

List of column names to flatten. Defaults to all columns.

depth_limit

int

No

Recursion limit. Defaults to no limit.

Example 

Flatten user column one level.

Configuration

Input

Output

[
["decode"],
["json_load"],
["flatten_dict", {"columns": ["user"], "depth_limit": 1}],
["json_dump"]
]

{"user": {"name": "John", "info": {"phone": 5551234, "city": "Redwood City"}}, "event": "login"}

{"user.name":"John","event":"login","__ia__unique__token__":4975147593049633177,"user.info":{"phone":5551234,"city":"Redwood City"}}

hash 

Calculate non-cryptographic murmurhash3.

Parameter

Type

Required?

Description

column

string

Yes

The column to hash

output_column

string

No

The column to put the hash value in. The default value is column.

Example 

Hash username

Configuration

Input

Output

[
["decode"],
["json_load"],
["hash", {"column": "username"}],
["json_dump"]
]

{"username": "BenHogan", "event": "putt", "dist": 3}
{"username": "ArnoldPalmer", "event": "chip", "dist": 15}
{"username": "JackNicklaus", "event": "drive", "dist": 267}

{"username":8545539730387850342,"dist":3,"__ia__unique__token__":4345827873645831930,"event":"putt"}
{"username":4800646705760021970,"dist":15,"__ia__unique__token__":7955354759641172706,"event":"chip"}
{"username":1304984085549885728,"dist":267,"__ia__unique__token__":3709088310454520588,"event":"drive"}

if_then_otherwise 

Run steps conditionally. If condition evaluates to True for an individual event, the then list of steps will be run. Otherwise, the otherwise list of steps will be run (if provided).

Parameter

Type

Required?

Description

condition

step

Yes

Condition to determine which steps to run.

then

list

Yes

Steps to use if condition is True.

otherwise

list

No

Steps to use if condition is False.

Conditions

Condition | Parameter 1 | Type | Parameter 2 | Type
regex_match | column | string | pattern | string
regex_search | column | string | pattern | string
contains | column | string | n/a | n/a
starts | column | string | prefix | string
ends | column | string | suffix | string
in_list | column | string | value | string
not_in_list | column | string | value | string
equals | column | string | value | string
not_equals | column | string | value | string
lt (less than) | column | string | value | string
gt (greater than) | column | string | value | string
lte (less than or equal) | column | string | value | string
gte (greater than or equal) | column | string | value | string
bool_eval | code | string | n/a | n/a
and_ | conditions | list | n/a | n/a
or_ | conditions | list | n/a | n/a

Example 

If the URL path indicates a payment, add the page = pay label.

Configuration

Input

Output

[
["decode"],
["json_load"],
["if_then_otherwise", {"condition": ["regex_match", {"column": "url.path", "pattern": "^\/pay\/.*"}], "then": [["add_label", {"column": "page", "label": "pay"}]]}],
["json_dump"]
]

{"user": "John", "url.path": "/pay/addcard/"}
{"user": "John", "url.path": "/dashboard/configure/"}

{"user":"John","__ia__unique__token__":6826495190851080943,"page":"pay","url.path":"/pay/addcard/"}
{"user":"John","__ia__unique__token__":4857886812416484204,"url.path":"/dashboard/configure/"}

join 

Join columns together with separator between each. Null columns are skipped.

Parameter

Type

Required?

Description

columns

list

Yes

List of column names to join

output_column

string

Yes

Column in which to put the resulting string

separator

string

No

String to put between each column. Defaults to the empty string.

Example 

Join first and last name columns into a fullname column.

Configuration

Input

Output

[
["decode"],
["json_load"],
["join", {"columns": ["first", "last"], "output_column": "fullname", "separator": " "}],
["json_dump"]
]

{"first": "Sam", "last": "Ryan"}
{"first": "John", "last": "Johnson"}
{"first": "Elyse", "last": "Jackson"}

{"__ia__unique__token__":6631810923594382537,"fullname":"Sam Ryan","first":"Sam","last":"Ryan"}
{"__ia__unique__token__":520329767215318629,"fullname":"John Johnson","first":"John","last":"Johnson"}
{"__ia__unique__token__":3271708847906864678,"fullname":"Elyse Jackson","first":"Elyse","last":"Jackson"}

json_load_column 

Deserialize a column containing a JSON encoded string. This is useful for CSV files containing JSON encoded strings.

Parameter

Type

Required?

Description

column

string

Yes

The column to operate on.

output_column

string

No

The column to put the output into. The default value is column.

Example 

Load value of embed as JSON.

Configuration

Input

Output

[
["decode"],
["json_load"],
["json_load_column", {"column": "embed"}],
["json_dump"]
]

{"user": John, "embed": "{\"action\": \"click\"}"}

{"__ia__unique__token__":673996973727761320,"embed":{"action":"click"},"user":"John"}

keep 

Keep only the specified columns, dropping all others. Does not drop the Scuba deduplication token.

Parameter

Type

Required?

Description

columns

list

Yes

List of column names to keep. 

Example 

Keep only user and event columns.

Configuration

Input

Output

[
["decode"],
["json_load"],
["keep", {"columns": ["user", "event"]}],
["json_dump"]
]

{"user": "Sam", "event": "login", "blah": "blah"}
{"user": "John", "event": "expand", "blah": "blah2", "junk": 1}
{"user": "Elyse", "event": "sendmsg", "blah": "blah3"}

{"event":"login","user":"Sam","__ia__unique__token__":8468468302936289505}
{"event":"expand","user":"John","__ia__unique__token__":2346607208574296553}
{"event":"sendmsg","user":"Elyse","__ia__unique__token__":7644083907636303077}

lowercase 

Convert all or some column names to lowercase strings. If you do not specify columns, all column names are lowercased; if a specified column does not exist, a lowercased column name is created and set to null.

This is different from the lowercase_value step, which lowercases column values rather than column names.

Parameter

Type

Required?

Description

columns

list

No

Columns to be renamed lowercase. Default is all columns.

Example 

Change uppercase characters in USERNAME to lowercase.

Configuration

Input

Output

[
["decode"],
["json_load"],
["lowercase"],
["json_dump"]
]

{"USERNAME": "BenHogan", "event": "putt", "dist": 3}
{"USERNAME": "ArnoldPalmer", "event": "chip", "dist": 15}
{"USERNAME": "JackNicklaus", "event": "drive", "dist": 267}

{"dist":3,"username":"BenHogan","__ia__unique__token__":8497066893068604652,"event":"putt"}
{"dist":15,"username":"ArnoldPalmer","__ia__unique__token__":587546423006921598,"event":"chip"}
{"dist":267,"username":"JackNicklaus","__ia__unique__token__":5859163929152062153,"event":"drive"}

lowercase_value 

Convert a column to a lowercase string.

Parameter

Type

Required?

Description

column

string

Yes

Column that contains values to be lowercased.

output_column

string

No

Column to put the output into. Defaults to column.

Example 

Convert values of text field to lowercase.

Configuration

Input

Output

[
["decode"],
["json_load"],
["lowercase_value", {"column": "text"}],
["json_dump"]
]

{"user": "Sam", "text": "Wet Paint"}
{"user": "John", "text": "STOP"}
{"user": "Elyse", "text": "Do Not Enter"}

{"__ia__unique__token__":2233568130467114876,"text":"wet paint","user":"Sam"}
{"__ia__unique__token__":4048437015924438854,"text":"stop","user":"John"}
{"__ia__unique__token__":5175482217519506709,"text":"do not enter","user":"Elyse"}

merge_keys 

Merge column_1 and column_2 into output_column using sha1.

Combine two mutually exclusive shard key columns into one, allowing them to share a table copy.

  • If column_1 is None and column_2 is not None, output_column will be set to the digest of column_2 with the LSB set to 0.

  • If column_2 is None and column_1 is not None, output_column will be set to the digest of column_1 with the LSB set to 1.

  • If both are set, or both are None, output_column will be set to None.

Parameter

Type

Required?

Description

column_1

string

Yes

Column to merge; its digest has the LSB set to 1.

column_2

string

Yes

Column to merge; its digest has the LSB set to 0.

output_column

string

Yes

Column to put the result into.

Example 

Merge columns web1 and web2 into column merge.

Configuration

Input

Output

[
["decode", {"encoding": "utf8"}],
["json_load"],
["merge_keys", {"column_1": "web1", "column_2": "web2", "output_column": "merge"}],
["json_dump"]
]

{"web1": null, "web2": "w1", "val": 12}
{"web1": "u1", "web2": null, "val": 9}
{"web1": "u2", "web2": "w1", "val": 10}

{"val":12,"__ia__unique__token__":7297373526901799213,"merge":724008111296322,"web2":"w1","web1":null}
{"val":9,"__ia__unique__token__":7187483173479303532,"merge":3481771430954869,"web2":null,"web1":"u1"}
{"val":10,"__ia__unique__token__":2818846474846197375,"merge":null,"web2":"w1","web1":"u2"}

multiply 

Perform floating point multiplication.

  • If multiplier is a string, treat it as a column name.

  • If multiplier column is null, result will be null.

Parameter

Type

Required?

Description

column

string

Yes

Column to use as multiplicand

multiplier

string

Yes

Value or column to use as multiplier

output_column

string

No

Column name to put the output into. Defaults to column.

Example 

Multiply the price column by the quantity column and put the results in the total column.

Configuration

Input

Output

[
["decode"],
["json_load"],
["multiply", {"column": "price", "multiplier": "quantity", "output_column": "total"}],
["json_dump"]
]

{"first": "Sam", "event": "purchase", "price": 2.13, "quantity": 4}
{"first": "John", "event": "purchase", "price": 5, "quantity": 2}
{"first": "Elyse", "event": "purchase", "price": 8.56, "quantity": 12}

{"event":"purchase","quantity":4,"price":2.13,"first":"Sam","total":8.52,"__ia__unique__token__":2308855450660799997}
{"event":"purchase","quantity":2,"price":5,"first":"John","total":10.0,"__ia__unique__token__":2479674756660365726}
{"event":"purchase","quantity":12,"price":8.56,"first":"Elyse","total":102.72,"__ia__unique__token__":5658422012269400335}

name_value_extract 

Extract a list of generic name/value properties into a single object.

Parameter

Type

Required?

Description

name_field

string

Yes

Name of field to treat as generic property name.

value_field

string

Yes

Name of field to treat as generic property value.

column

string

Yes

The list column to extract values from.

output_column

string

No

Column to put the output into. Defaults to column.

Example 

Extract price and category info from the items field.

Configuration

Input

Output

[
["decode"],
["json_load"],
["name_value_extract", {"column": "items", "name_field": "category", "value_field": "price"}],
["json_dump"]
]

{"user": "John", "items": [{"label":"tab.abc","category":"abc","price":17.80},{"label":"tab.f","category":"food","price":41.00}]}

{"items":{"food":41.0,"abc":17.8},"user":"John","__ia__unique__token__":1918698498990518500}

omit 

Omit the specified columns.

Parameter

Type

Required?

Description

columns

list

Yes

List of column names to omit

Example 

Omit column blah from events.

Configuration

Input

Output

[
["decode"],
["json_load"],
["omit", {"columns": ["blah"]}],
["json_dump"]
]

{"user": "Sam", "amt": 1, "blah": "blah"}
{"user": "John", "amt": 2, "blah": "blah2"}
{"user": "Elyse", "amt": 3, "blah": "blah3"}

{"user":"Sam","__ia__unique__token__":3893872012624762980,"amt":1}
{"user":"John","__ia__unique__token__":4673211942423344710,"amt":2}
{"user":"Elyse","__ia__unique__token__":4791908289683167877,"amt":3}

regex_array_extract 

For each element in an array, extract matching regular expression subgroups into new columns.

Parameter

Type

Required?

Description

column

string

Yes

Name of the column to extract from.

regex

string

Yes

Regular expression to match.

output_columns

list

Yes

List of column names to put regex subgroups into, in order.

Example 

Extract categories from list of products.

Configuration

Input

Output

[
["decode"],
["json_load"],
["regex_array_extract", {"column": "products", "regex": "^([a-z]*)-.*", "output_columns": ["categories"]}],
["json_dump"]
]

{"user": "John", "products": ["produce-hdy438ed", "cleaning-or944dus", "seafood-pd33sk2q"]}

{"user":"John","__ia__unique__token__":3040505961674288811,"categories":["produce","cleaning","seafood"],"products":["produce-hdy438ed","cleaning-or944dus","seafood-pd33sk2q"]}

regex_array_replace 

Perform regular expression replacement on each element in an array. Replacement string uses Python-style backreferences.

This will do nothing if the column is not an array.

Parameter

Type

Required?

Description

column

string

Yes

Column to perform replacement with.

regex

string

Yes

Regular expression to match.

repl_string

string

Yes

Replacement string to substitute.

output_column

string

No

Column name to put the output into. Defaults to column.

Example 

Use regex_array_replace to remove trailing identifiers from items in the products list.

Configuration

Input

Output

[
["decode"],
["json_load"],
["regex_array_replace", {"column": "products", "regex": "-[\S]*", "repl_string": ""}],
["json_dump"]
]

{"user": "John", "products": ["produce-hdy438ed", "cleaning-or944dus", "seafood-pd33sk2q"]}

{"user":"John","products":["produce","cleaning","seafood"],"__ia__unique__token__":3040505961674288811}

regex_extract 

Extract matching regular expression subgroups into new columns.

Parameter

Type

Required?

Description

column

string

Yes

Name of column to extract from.

regex

string

Yes

Regular expression to match with. Regex must match the entire string.

output_columns

list

Yes

List of column names to put regex subgroups into, in order.

Example 

Extract plan query parameter from non-URL field uri.

Configuration

Input

Output

[
["decode"],
["json_load"],
["regex_extract", {"column": "uri", "output_columns": ["plan"], "regex": "(?:(?:.*)plan=([^&]*)?(?:.*)|.*)"}],
["json_dump"]
]

{"user": "John", "uri": "/channel/videoplay/?vidid=7246927612831078230&plan=pro&view=dark&loggedin=yes"}

{"user":"John","uri":"/channel/videoplay/?vidid=7246927612831078230&plan=pro&view=dark&loggedin=yes","__ia__unique__token__":6723934658596588009,"plan":"pro"}

regex_replace 

Perform regular expression replacement on column. The replacement string uses Python-style backreferences.

Parameter

Type

Required?

Description

column

string

Yes

Column to perform the replacement with.

regex

string

Yes

Regular expression to match.

repl_string

string

Yes

Replacement string to substitute.

output_column

string

No

Column name to put the output into. Defaults to column.

Example 

Use regex_replace to prefix the uri field with scheme and host.

Configuration

Input

Output

[
["decode"],
["json_load"],
["regex_replace", {"column": "uri", "regex": "^(\/)", "repl_string": "https://john.scuba.io"}],
["json_dump"]
]

{"user": "John", "uri": "/channel/videoplay/?vidid=7246927612831078230&plan=pro&view=dark&loggedin=yes"}

{"user":"John","__ia__unique__token__":6723934658596588009,"uri":"https://john.scuba.io/channel/vi...k&loggedin=yes"}

remove_punctuation 

Remove all punctuation from a column. Assumes unicode characters.

Parameter

Type

Required?

Description

column

string

Yes

The column to remove punctuation from.

output_column

string

No

Column name to put the output into. Defaults to column.

Example 

Strip punctuation from the message column.

Configuration

Input

Output

[
["decode"],
["json_load"],
["remove_punctuation", {"column": "message"}],
["json_dump"]
]

{"user": "Punctuation", "message": "!@#%&*()_-{}[]:;,.?"}
{"user": "NotPunctuation", "message": "$^+=<>"}
{"user": "Combined", "message": "!@#%&*()_-{}[]:;,.?$^+=<>"}

{"message":"","user":"Punctuation","__ia__unique__token__":1267513633946236610}
{"message":"$^+=<>","user":"NotPunctuation","__ia__unique__token__":1684850124204466571}
{"message":"$^+=<>","user":"Combined","__ia__unique__token__":4132463954692381155}

rename 

Rename a column.

Parameter

Type

Required?

Description

column

string

Yes

Column to rename

new_name

string

Yes

New name for the column

Example 

Rename column blah to event_type.

Configuration

Input

Output

[
["decode"],
["json_load"],
["rename", {"column": "blah", "new_name": "event_type"}],
["json_dump"]
]

{"user": "Sam", "amt": 1, "blah": "blah"}
{"user": "John", "amt": 2, "blah": "blah2"}
{"user": "Elyse", "amt": 3, "blah": "blah3"}

{"amt":1,"event_type":"blah","user":"Sam","__ia__unique__token__":3893872012624762980}
{"amt":2,"event_type":"blah2","user":"John","__ia__unique__token__":4673211942423344710}
{"amt":3,"event_type":"blah3","user":"Elyse","__ia__unique__token__":4791908289683167877}

shred_array 

Change array of objects into an object of arrays.

Parameter

Type

Required?

Description

column

string

Yes

The column to operate on

output_column

string

No

The column to add the output into. Defaults to column.

Example 

Shred array exp.

Configuration

Input

Output

[
["decode"],
["json_load"],
["shred_array", {"column": "exp"}],
["json_dump"]
]

{"user": "John", "exp": [{"var": "a", "val": "1"}, {"var": "b", "val": "4"}]}
{"user": "Fred", "exp": [{"var": "b", "val": "5"}, {"var": "c", "val": "4"}]}

{"__ia__unique__token__":7370096177795160827,"user":"John","exp":{"val":["1","4"],"var":["a","b"]}}
{"__ia__unique__token__":2876036297403557858,"user":"Fred","exp":{"val":["5","4"],"var":["b","c"]}}

split_array 

Split a column containing an array into multiple columns.

Parameter

Type

Required?

Description

column

string

Yes

Split this column into new columns if it is an array.

output_columns

list

No

Column names to use. Defaults to column suffixed with .N.

Example 

Split values in array exp into three new columns.

Configuration

Input

Output

[
["decode"],
["json_load"],
["split_array", {"column": "exp", "output_columns": ["animal1", "animal2", "animal3"]}],
["json_dump"]
]

{"user": "John", "exp": ["cat","dog","hamster"]}
{"user": "Fred", "exp": ["lion","tiger","bear"]}

{"exp":["cat","dog","hamster"],"animal1":"cat","animal3":"hamster","user":"John","__ia__unique__token__":2032956903660321772,"animal2":"dog"}
{"exp":["lion","tiger","bear"],"animal1":"lion","animal3":"bear","user":"Fred","__ia__unique__token__":4045256004748629181,"animal2":"tiger"}

time_convert 

Reformat a time column, or do nothing if none of the read formats match. Uses Python-style time format strings, or one of epoch_seconds, epoch_millis, or epoch_micros.

If testing on the command line, the max_time_ago and max_time_hence parameters require passing the --file-date flag.

Parameter

Type

Required?

Description

column

string

Yes

The column to perform time conversion on.

output_column

string

No

The column to put the result in. Defaults to column.

read_formats

list

Yes

List of datetime format strings to try, in order.

write_format

string

No

Datetime format string for output. Defaults to %Y-%m-%dT%H:%M:%S.%fZ

max_time_ago

string

No

Events older than the file date minus max_time_ago hours will be silently dropped.

max_time_hence

string

No

Events more recent than the file date plus max_time_hence hours will be silently dropped.

max_abs_time

string

No

Events stamped after this ISO-8601 timestamp will be silently dropped.

min_abs_time

string

No

Events stamped before this ISO-8601 timestamp will be silently dropped.


Example 

Convert a variety of time formats into %Y-%m-%d %H:%M:%S.

Configuration

Input

Output

[
["decode"],
["json_load"],
["time_convert", {"read_formats": ["%A, %B %d %Y", "%m/%d/%Y", "%b %d, %Y", "%Y-%m-%d %H:%M:%S"], "write_format": "%Y-%m-%d %H:%M:%S", "column": "ts"}],
["json_dump"]
]

{"ts": "Tuesday, March 22 2016"}
{"ts": "3/22/2016"}
{"ts": "Mar 22, 2016"}
{"ts": "2016-03-22 12:43:30"}

{"__ia__unique__token__":1033586384520328482,"ts":"2016-03-22 00:00:00"}
{"__ia__unique__token__":6737880881032506082,"ts":"2016-03-22 00:00:00"}
{"__ia__unique__token__":803264224078215379,"ts":"2016-03-22 00:00:00"}
{"__ia__unique__token__":2708765703165006780,"ts":"2016-03-22 12:43:30"}

url_parse 

Parse a URL into a dictionary.

Parameter

Type

Required?

Description

column

string

Yes

Column containing the URL to be parsed

output_column

string

No

The column to put the output into. Defaults to column.

Example 

Parse the referrer URL into dictionary.

Configuration

Input

Output

[
["decode"],
["json_load"],
["url_parse", {"column": "referrer", "default_scheme": "https"}],
["json_dump"]
]

{"user": "John", "referrer": "john.scuba.io:80/videoplay?docid=-7246927612831078230&hl=en&view=dark&loggedin=yes#00h02m30s"}

{"__ia__unique__token__":7912114277290934867,"referrer":{"fragment":"00h02m30s","query":"docid=-7246927612831078230&hl=en&view=dark&loggedin=yes","scheme":"https","path":"/videoplay","params":"","port":80,"hostname":"john.scuba.io"},"user":"John"}

url_query_string_parse 

Parse a URL query string of the form key1=value1&key2=value2.

Parameter

Type

Required?

Description

column

string

Yes

The column containing the query string to parse.

output_column

string

No

Column to put result in. The default value is column.

keep_parameters

list

Yes

Only keep parameters in the query string from this list.

Example 

Further parse referrer.query into selected parameters.

Configuration

Input

Output

[
["decode"],
["json_load"],
["url_parse", {"column": "referrer", "default_scheme": "https"}],
["flatten_dict", {"columns": ["referrer"]}],
["url_query_string_parse", {"column": "referrer.query", "keep_parameters": ["hl", "view", "loggedin"]}],
["json_dump"]
]

{"user": "John", "referrer": "john.scuba.io:80/videoplay?docid=-7246927612831078230&hl=en&view=dark&loggedin=yes#00h02m30s"}

{"user":"John","__ia__unique__token__":7912114277290934867,"referrer.port":80,"referrer.params":"","referrer.path":"/videoplay","referrer.fragment":"00h02m30s","referrer.scheme":"https","referrer.hostname":"john.scuba.io","referrer.query":{"hl":"en","view":"dark","loggedin":"yes"}}

whitelist_string_values 

Replace values of a string column that are not in the given whitelist (with null, or the configured replacement).

Parameter

Type

Required?

Description

column

string

Yes

Column to operate on

whitelist

list

Yes

List of allowed string values

replacement

string

No

The string used to replace values not in whitelist. Defaults to null.

Example 

Allow only accepted values in the source column.

Configuration

Input

Output

[
["decode"],
["json_load"],
["whitelist_string_values", {"whitelist": ["frontend", "backend"], "column": "source"}],
["json_dump"]
]

{"user": "Sam", "event": "click", "source": "frontend"}
{"user": "John", "event": "purchase", "source": "fronted"}
{"user": "John", "event": "load", "source": "backend"}
{"user": "Elyse", "event": "flush", "source": "backeend"}

{"event":"click","user":"Sam","__ia__unique__token__":6897873284472877507,"source":"frontend"}
{"event":"purchase","user":"John","__ia__unique__token__":5406739099689257143,"source":null}
{"event":"load","user":"John","__ia__unique__token__":7635827136580490088,"source":"backend"}
{"event":"flush","user":"Elyse","__ia__unique__token__":2880053427446368098,"source":null}

Complete the data transformation process 

json_dump 

Serialize the event dictionary as a JSON-encoded line. This is the only dictionary-to-line step, and it should generally be the last step in the pipeline.

Example 

Transform Python dictionary to JSON for ingest.

Configuration

Input

Output

["json_dump"]

n/a

n/a