Data types reference
This reference guide enumerates each of the data types supported by Scuba, and explains how Scuba determines which data type to assign at ingest time.
Data Types
These are the core data types you will see in the Scuba query UI.
Data Type | Aggregations | Grouping (Split By) | Filter Operators | Value Typeahead |
---|---|---|---|---|
Identifier | Count Unique, First, Last | Yes | matches/does not match | No |
Integer* / Decimal | Count Unique, First, Last, Min, Max, Sum, Average, Median, Percentile | Defaults to No, can be configured to Yes by admin | is one of/is not one of | No |
Integer Set / Decimal Set | Defaults to No, can be configured to Yes by admin | matches/does not match [column/value] | No | |
String | Count Unique, First, Last | Yes | matches/does not match | Yes |
String Set | Count Unique | Yes | string value matches/does not match | Yes |
Time | Count Unique, First, Last, Min, Max, Sum, Average, Median, Percentile | No | is one of, is not one of, is less than (<), is greater than (>), is less than or equal to (<=), is greater than or equal to (>=) | No |
*Integers in Scuba are stored as 64 bit signed ints and so can only be a value from -(2^63) to 2^63 - 1.
Expansion Types
You won't see these as being first class data types in the Scuba query UI, but at ingest time we apply expansion rules based on these data types.
Expansion Type | Data Type of Main Column | Generated Subcolumns |
---|---|---|
IP Address | String | city, region, country, continent |
URL | N/A (not loaded into Scuba by default) | scheme, hostname, path, filename, query, params, fragment |
User Agent | N/A (not loaded into Scuba) | browser, device, platform, browser_majorver, browser_minorver, browser_patchver |
Ingest time auto-transformations
At ingest time, we automatically perform the following transformations on the raw JSON data before applying any data type recognition rules or expansions.
Transformation Rule | Original JSON | Resulting JSON |
---|---|---|
Flatten Nested JSON Objects | {"column": {"a": 1, "b": "xxx"}} | {"column.a": 1, "column.b": "xxx"} |
Shred Arrays of JSON Objects | {"column": [{"a": 1,"b": "zzz"}, {"a": 2,"b": "yyy"}]} | {"column.a": [1,2], "column.b": ["zzz","yyy"]} |
String column transformations
If you have a regular string column that includes only the letters a through h, Scuba will auto-detect it as an integer and convert it to base 10. For example, "funding_series" which has values like (a, b, c, d, e) might fall into this scenario.
Original data | Transformed into... |
---|---|
"1234" | Integer values |
"abc123" | Integer (converted to base 10), for values up to 16 hex digits |
"abcdef-124" | Identifier, for values of 17 digits or more. The value can include hyphens. |
Ingest Time Data Type Recognition Rules For JSON Number Columns
At ingest time, when we see a new column for the first time (and it is a JSON Number) we detect the data type of the new column using the following matching rules, in the precedence order listed below:
Parsing Rule | Raw Data | Data Type | Rule Details |
---|---|---|---|
Detect Time | {"abc" : 1448933490} | Time | Scuba will attempt to interpret JSON ints as epoch timestamps in one of (microseconds, milliseconds, seconds). |
Detect Integer | {"abc" : 12345} | Integer | Simple JSON ints are interpreted as Integers by Scuba. |
Detect Decimal | {"abc" : 12345.98} | Decimal | |
Detect Integer Set | {"abc" : [12345, 245, 99834]} | Integer Set | |
Detect Decimal Set | {"abc" : [12345.98, 245.2, 99834]} | Decimal Set |
Based on the order of precedence, if there is ambiguity about whether a column value can be interpreted as an epoch timestamp or an int, Scuba will interpret it as an epoch time value.
Ingest Time Data Type Recognition Rules For JSON String Columns
At ingest time, when we see a new column for the first time (and it is a JSON String) we detect the data type of the new column using the following matching rules, in the precedence order listed below:
Parsing Rule | Raw Data | Data Type | Rule Details |
---|---|---|---|
Detect Time From JSON String | {"abc" : "2015-11-30 08:09:12"} | Time | Scuba will attempt to interpret JSON strings as timestamps using approximately 40 different format strings (including ISO-8601). |
Detect Identifier from JSON String | {"abc" : "e41249ed-2398-4c29-a6fa-ee81116dd302"} | Identifier | Scuba will attempt to interpret JSON strings containing 16 or more characters as hexadecimal identifiers, including some common uuid formats. Note that non-hex characters (like hyphens or dots) are stripped out of the resulting data. |
Detect Integer From JSON String | {"abc" : "12345"} | Integer | |
Detect Decimal From JSON String | {"abc" : "12345.98"} | Decimal | |
Detect Decimal From JSON String With Dollar Sign | {"abc" : "$12,345.98"} | Decimal | Note that dollar signs and commas are stripped out of the resulting data. |
Detect String Set | {"abc" : ["hello", "goodbye", "nice", "to", "see", "you"]} | String Set | |
Detect URL | {"abc" : "http://www.site.com/landing/"} | URL | |
Detect IP Address | {"abc" : "127.0.0.1"} | IP Address | |
Detect User Agent | {"abc" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0)"} | User Agent | Scuba will attempt to interpret JSON strings as User Agents using a regex matching scheme. |
Data Type Detection Settings For New Columns
Scuba allows administrators to change the default data type detection settings using the Scuba CLI. Note that these settings only affect how Scuba interprets new columns; once a column has been scanned by Scuba and its data type has been assigned, these settings will not modify the column. The main command is:
ia settings update purifier <Setting> <Value>
and the exact settings available are:
Setting | Value | Effect |
---|---|---|
strict_number_detection | 1 | Do not interpret JSON strings as Integer or Decimal. |
force_url_to_string | 1 | Do not interpret JSON strings as URLs. |
add_full_and_parsed_url | 1 | In addition to expanding the pieces of the URL, also store the original full URL as a String column. (This takes extra space.) |
force_geo_to_string | 1 | Do not interpret JSON strings as IP addresses. |
force_useragent_to_string | 1 | Do not interpret JSON strings as User Agents. |
force_hexn_to_string | 1 | Do not interpret JSON strings as hex identifiers. |
Note that the settings "add_full_and_parsed_url" and "force_url_to_string" are mutually exclusive.