Data Set Requirement Checklist
Although Scuba places very few requirements on the semi-structured data you can import and analyze, your data must still meet a small number of them. Please refer to our Scuba Lite Data Import document for more information, and read the resources below before importing data. When reading our documentation, note that the terms ACTOR and SHARD KEY are synonymous, as are ROW and EVENT.
Scuba imports semi-structured data: each row requires a minimum of two structured columns and may also include unstructured data. Compared with other rows in the same table, a row can contain entirely unique unstructured columns and values; however, the structured columns must always be included.
Scuba also highly recommends the inclusion of a state column (commonly: event, eventType, eventName, action etc.) with values identifying the type of action or event that occurs at the time of the timestamp in that row. The state column name can follow any naming convention you desire.
Required Columns:
Timestamp Column
Timestamp column is in epoch seconds or milliseconds (any non-epoch time that is in UNIX format will need to have its date-time type outlined in the config.json time_column_format field as well as in the conversion_params field inside the columns object)
Minimum of one Shard Key Column
Recommended Columns:
State Column
Example of bare minimum row-level requirements, including the recommended state column:
{"userid":"foo","timestamp":1051565134,"event":"Start"}
{"userid":"foo","timestamp":1051543123,"event":"Stop"}
{"userid":"bar","timestamp":1053400240,"event":"Start"}
{"userid":"zen","timestamp":1063334324,"event":"Start"}
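The required columns can be checked before import with a few lines of code. The following is a minimal sketch, not part of Scuba itself: the function name check_row is hypothetical, and the column names userid and timestamp are assumptions taken from the example rows above. It only verifies that a shard key is present and that the timestamp is numeric (epoch) time.

```python
import json

def check_row(line, shard_key="userid", time_column="timestamp"):
    """Return a list of problems found in one JSON row (empty list means OK).

    shard_key and time_column default to the names used in the example rows;
    they are assumptions, so pass your own column names as needed.
    """
    row = json.loads(line)
    problems = []
    if shard_key not in row:
        problems.append(f"missing shard key column '{shard_key}'")
    ts = row.get(time_column)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(ts, bool) or not isinstance(ts, (int, float)):
        problems.append(f"timestamp column '{time_column}' is missing or not numeric")
    return problems

print(check_row('{"userid":"foo","timestamp":1051565134,"event":"Start"}'))  # prints []
print(check_row('{"event":"Start"}'))  # prints two problems
```

The state column is not checked here because it is recommended rather than required.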
Example of bare minimum row level requirements including semi-structured columns and values:
{"userid":"foo","timestamp":1051565134,"event":"Start","region":"united states","age":"49"}
{"userid":"foo","timestamp":1051543123,"event":"Stop","region":"united states","age":"49","campaign_code":"abc_123_xyz"}
{"userid":"bar","timestamp":1053400240,"event":"Start","device_type":"iPhone","url":"https://www.mypage.com"}
{"userid":"zen","timestamp":1063334324,"event":"Start","region":"china","campaign_code":"qrs_456_efg","device_type":"Android"}
Note in the above example how the userid, timestamp, and event keys are structured and present in every row, whereas region, age, campaign_code, device_type, and url are unstructured and present only when relevant to the row. Each row should attempt to define what happened at a particular point in time, referencing a single timestamp of when that event occurred, a shard key denoting who or what generated the event, and a state column outlining exactly what that event is. The additional unstructured columns then further describe the event by providing descriptive data points such as the region or age of the shard key/actor who generated the event.
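The distinction between structured and unstructured columns can be seen by comparing the keys across rows. The sketch below is illustrative only (split_columns is a hypothetical helper, not a Scuba function): it treats keys present in every row as structured and all remaining keys as unstructured, using the four example rows above.

```python
import json

EXAMPLE_ROWS = """
{"userid":"foo","timestamp":1051565134,"event":"Start","region":"united states","age":"49"}
{"userid":"foo","timestamp":1051543123,"event":"Stop","region":"united states","age":"49","campaign_code":"abc_123_xyz"}
{"userid":"bar","timestamp":1053400240,"event":"Start","device_type":"iPhone","url":"https://www.mypage.com"}
{"userid":"zen","timestamp":1063334324,"event":"Start","region":"china","campaign_code":"qrs_456_efg","device_type":"Android"}
""".strip().splitlines()

def split_columns(rows):
    """Return (keys present in every row, keys present in only some rows)."""
    key_sets = [set(row) for row in rows]
    in_every_row = set.intersection(*key_sets)
    seen_anywhere = set.union(*key_sets)
    return sorted(in_every_row), sorted(seen_anywhere - in_every_row)

rows = [json.loads(line) for line in EXAMPLE_ROWS]
structured, unstructured = split_columns(rows)
print(structured)    # ['event', 'timestamp', 'userid']
print(unstructured)  # ['age', 'campaign_code', 'device_type', 'region', 'url']
```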
In row 4 of the above example, we can infer that user zen performed a Start event at 1063334324 (note this 10-digit value is epoch time in seconds; the millisecond equivalent would have 13 digits) while in china, on an Android device, and coming from the qrs_456_efg campaign.
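Whether an epoch value is in seconds or milliseconds can usually be told from its magnitude. The following sketch is a heuristic under a stated assumption, not Scuba's own detection logic: the helper name epoch_to_utc and the threshold of 10**11 are hypothetical choices, and millisecond timestamps earlier than March 1973 would be misread as seconds.

```python
from datetime import datetime, timezone

# Assumption: values at or above this threshold are milliseconds.
# 10**11 seconds lies thousands of years in the future, while
# 10**11 milliseconds falls in March 1973.
MS_THRESHOLD = 10**11

def epoch_to_utc(value):
    """Convert an epoch timestamp in seconds or milliseconds to a UTC datetime."""
    seconds = value / 1000 if abs(value) >= MS_THRESHOLD else value
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

print(epoch_to_utc(1063334324).isoformat())     # 2003-09-12T02:38:44+00:00
print(epoch_to_utc(1063334324000).isoformat())  # 2003-09-12T02:38:44+00:00
```

Reading the timestamp from row 4 as seconds places the event in September 2003, which matches the other rows in the example.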