Although Scuba places very few requirements on the semi-structured data you can import and analyze, your data must still meet a small number of them. Please refer to our Scuba Lite Data Import document for more information, and read the resources below before importing data. Note that in our documentation the terms ACTOR and SHARD KEY are synonymous, as are ROW and EVENT.

Scuba imports semi-structured data: each row requires a minimum of two structured columns and may also include unstructured data. The unstructured columns and values in a row can be entirely different from those in other rows of the same table, but the structured columns must be present in every row.

Scuba also highly recommends including a state column (commonly: event, eventType, eventName, action, etc.) whose values identify the type of action or event that occurred at the timestamp in that row. The state column name can follow any naming convention you desire.

Required Columns:

  1. Timestamp Column, in epoch seconds or milliseconds (any non-epoch time that is in UNIX format will need to have its date-time type specified in the config.json time_column_format field as well as in the conversion_params field inside the columns object)

  2. Minimum of one Shard Key Column

Recommended Columns:

  1. State Column
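
The required and recommended columns above can be checked before import with a small script. The following is a minimal sketch and not part of Scuba itself: the function name check_row and the column names timestamp, userid, and event are assumptions borrowed from the examples in this document, so substitute your own. It only handles epoch timestamps; a non-epoch time string would be flagged even though Scuba can accept one when it is configured.

```python
import json

# Assumed column names, taken from the examples in this document.
TIME_COLUMN = "timestamp"
SHARD_KEYS = ("userid",)   # at least one of these must be present
STATE_COLUMN = "event"     # recommended, not required


def check_row(line):
    """Return a list of problems found in one JSON row (empty if none)."""
    row = json.loads(line)
    problems = []

    ts = row.get(TIME_COLUMN)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(ts, bool) or not isinstance(ts, (int, float)):
        problems.append("missing or non-numeric timestamp")

    # A shard key counts as present if it has a non-empty value.
    if not any(row.get(key) not in (None, "") for key in SHARD_KEYS):
        problems.append("missing shard key")

    if STATE_COLUMN not in row:
        problems.append("missing recommended state column")

    return problems


if __name__ == "__main__":
    sample = '{"userid":"foo","timestamp":1051565134,"event":"Start"}'
    print(check_row(sample))
```

Running the check over each line of a file before import gives an early warning for rows that lack a structured column.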

Example of bare minimum row level requirements including recommended state column:


Example of bare minimum row level requirements including semi-structured columns and values:

{"userid":"foo","timestamp":1051565134,"event":"Start","region":"united states","age":"49"}
{"userid":"foo","timestamp":1051543123,"event":"Stop","region":"united states","age":"49","campaign_code":"abc_123_xyz"}

Take note in the above example of how the userid, timestamp, and event keys are structured and present in every row, whereas region, age, campaign_code, device_type, and url are unstructured and present only when relevant to the row. Each row should attempt to define what happened at a particular point in time, referencing a single timestamp of when that event occurred, a shard key denoting who or what generated the event, and a state column outlining exactly what that event is. The additional unstructured columns then further describe the event by providing descriptive data points such as the region or age of the shard key/actor who generated it.
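
To make the structured/unstructured distinction concrete, here is a minimal sketch that separates the two kinds of columns in the two example rows shown above. The helper name split_columns and the STRUCTURED set are illustrative assumptions, not Scuba functionality.

```python
import json

# Assumed structured columns, matching the example rows in this document.
STRUCTURED = {"userid", "timestamp", "event"}

rows = [
    '{"userid":"foo","timestamp":1051565134,"event":"Start","region":"united states","age":"49"}',
    '{"userid":"foo","timestamp":1051543123,"event":"Stop","region":"united states","age":"49","campaign_code":"abc_123_xyz"}',
]


def split_columns(line):
    """Split one JSON row into (structured, unstructured) dictionaries."""
    row = json.loads(line)
    structured = {k: v for k, v in row.items() if k in STRUCTURED}
    unstructured = {k: v for k, v in row.items() if k not in STRUCTURED}
    return structured, unstructured


if __name__ == "__main__":
    for line in rows:
        structured, unstructured = split_columns(line)
        print(sorted(structured), sorted(unstructured))
```

Both rows yield the same three structured keys, while the unstructured keys differ: only the second row carries campaign_code.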

In row 4 of the above example, we can infer that user zen performed a start event at 1063334324 (note that this 10-digit value is epoch seconds, not milliseconds) while in China, on an Android device, and coming from the qrs_456_efg campaign.
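
Because the same integer means very different dates depending on its unit, it can help to sanity-check a timestamp before import. The sketch below uses only the Python standard library; the helper name as_utc is hypothetical, and it says nothing about how Scuba itself parses the value.

```python
from datetime import datetime, timezone


def as_utc(value, unit):
    """Interpret an epoch value as seconds ("s") or milliseconds ("ms")."""
    seconds = value / 1000 if unit == "ms" else value
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


if __name__ == "__main__":
    ts = 1063334324
    # Read as seconds, the value falls in September 2003.
    print(as_utc(ts, "s").isoformat())
    # Read as milliseconds, the same value falls in January 1970.
    print(as_utc(ts, "ms").isoformat())
```

A result in early 1970 is a common sign that a seconds value is being read as milliseconds.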