This reference guide enumerates each of the data types supported by Scuba, and explains how Scuba determines which data type to assign at ingest time.

Data Types

These are the core data types you will see in the Scuba query UI.

Data Type

Aggregations

Grouping (Split By)

Filter Operators

Value Typeahead

Identifier

Count Unique, First, Last

Yes

is one of, is not one of

No

Integer* / Decimal

Count Unique, First, Last, Min, Max, Sum, Average, Median, Percentile 

Defaults to No, can be configured to Yes by admin

is one of, is not one of, is less than (<), is greater than (>), is less than or equal to (<=), is greater than or equal to (>=)

No

Integer Set / Decimal Set


Defaults to No, can be configured to Yes by admin

set contains, set does not contain

No

String

Count Unique, First, Last

Yes

is one of, is not one of, text contains, starts with, ends with

Yes

String Set


Yes

set contains, set does not contain

Yes

Time

Count Unique, First, Last, Min, Max, Sum, Average, Median, Percentile 

No

is one of, is not one of, is less than (<), is greater than (>), is less than or equal to (<=), is greater than or equal to (>=)

No

*Integers in Scuba are stored as 64 bit signed ints and so can only be a value from -(2^63) to 2^63 - 1.

Expansion Types

You won't see these as being first class data types in the Scuba query UI, but at ingest time we apply expansion rules based on these data types. 

Expansion Type

Data Type of Main Column

Generated Subcolumns

IP Address

String

city, region, country, continent

URL

N/A (not loaded into Scuba by default)

scheme, hostname, path, filename, query, params, fragment

User Agent

N/A (not loaded into Scuba)

browser, device, platform, browser_majorver, browser_minorver, browser_patchver

Ingest time auto-transformations

At ingest time, we automatically perform the following transformations on the raw JSON data before applying any data type recognition rules or expansions.

Transformation Rule

Original JSON

Resulting JSON

Flatten Nested JSON Objects

{"column": {"a": 1, "b": "xxx"}}

{"column.a": 1, "column.b": "xxx"}

Shred Arrays of JSON Objects

{"column": [{"a": 1,"b": "zzz"}, {"a": 2,"b": "yyy"}]}

{"column.a": [1,2], "column.b": ["zzz","yyy"]}

 String column transformations

If you have a regular string column that includes only the letters a through h, Scuba will auto-detect it as an integer and convert it to base 10. For example, "funding_series" which has values like (a, b, c, d, e) might fall into this scenario.

Original data

Transformed into...

"1234"

Integer values

"abc123"

Integer (converted to base 10), for values up to 16 hex digits

"abcdef-124"

Identifier, for values of 17 digits or more. The value can include hyphens. 

Ingest Time Data Type Recognition Rules For JSON Number Columns

At ingest time, when we see a new column for the first time (and it is a JSON Number) we detect the data type of the new column using the following matching rules, in the precedence order listed below:

Parsing Rule

Raw Data

Data Type

Rule Details

Detect Time

{"abc" : 1448933490}

Time

Scuba will attempt to interpret JSON ints as epoch timestamps in one of (microseconds, milliseconds, seconds). 

Detect Integer

{"abc" : 12345}

Integer

Simple JSON ints are interpreted as Integers by Scuba.

Detect Decimal

{"abc" : 12345.98}

Decimal


Detect Integer Set

{"abc" : [12345 245 99834]}

Integer Set


Detect Decimal Set

{"abc" : [12345.98 245.2 99834]}

Decimal Set


Based on the order of precedence, if there is ambiguity about whether a column value can be interpreted as an epoch timestamp or an int, Scuba will interpret it as an epoch time value. 

Ingest Time Data Type Recognition Rules For JSON String Columns

At ingest time, when we see a new column for the first time (and it is a JSON String) we detect the data type of the new column using the following matching rules, in the precedence order listed below:

Parsing Rule

Raw Data

Data Type

Rule Details

Detect Time From JSON String

{"abc" : "2015-11-30 08:09:12"}

Time

Scuba will attempt to interpret JSON strings as timestamps using approximately 40 different format strings (including ISO-8601).  

Detect Identifier from JSON String

{"abc" : "e41249ed-2398-4c29-a6fa-ee81116dd302"}

Identifier

Scuba will attempt to interpret JSON strings containing 16 or more characters as hexadecimal identifiers, including some common uuid formats. Note that non-hex characters (like hyphens or dots) are stripped out of the resulting data.

Detect Integer From JSON String

{"abc" : "12345"}

Integer


Detect Decimal From JSON String

{"abc" : "12345.98"}

Decimal


Detect Decimal From JSON String With Dollar Sign

{"abc" : "$12,345.98"}

Decimal

Note that dollar signs and commas are stripped out of the resulting data.

Detect String Set

{"abc" : ["hello" "goodbye" "nice" "to" "see" "you"]}

String Set


Detect URL

{"abc" : "http://www.site.com/landing/"}

URL


Detect IP Address

{"abc" : "127.0.0.1"}

IP Address


Detect User Agent

{"abc" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0)"}

User Agent

Scuba will attempt to interpret JSON strings as User Agents using a regex matching scheme.

Data Type Detection Settings For New Columns

Scuba allows administrators to change the default data type detection settings using the Scuba CLI. Note that these settings only affect how Scuba interprets new columns; once a column has been scanned by Scuba and its data type has been assigned, these settings will not modify the column. The main command is:

ia settings update purifier <Setting> <Value>
CODE

and the exact settings available are:

Setting

Value

Effect

strict_number_detection

1

Do not interpret JSON strings as Integer or Decimal. 

force_url_to_string

1

Do not interpret JSON strings as URLs.

add_full_and_parsed_url

1

In addition to expanding the pieces of the URL, also store the original full URL as a String column. (This takes extra space.)

force_geo_to_string

1

Do not interpret JSON strings as IP addresses.

force_useragent_to_string

1

Do not interpret JSON strings as User Agents.

force_hexn_to_string

1

Do not interpret JSON strings as hex identifiers. 

Note that the settings "add_full_and_parsed_url" and "force_url_to_string" are mutually exclusive.

More Information