Before importing data into Scuba, it is important to know the types of data that work best with Scuba. Please read our Data Set Requirements Checklist, and refer to our Scuba Lite Data Import document and the resources below for more information, before moving forward with importing data. For any questions, please reach out through Talk to an Expert.

Below is a sample config.json file, which you will need during the import process for implementing Scuba Lite. Copy the code snippet below and save it locally as config.json. The file name can be anything you like, but it must end in config.json for the import to work successfully (example: myDataSet_config.json). For any clarification on the terminology mentioned in this guide, please visit our guides and glossary.

Read on to learn how to edit a config.json file for Scuba Lite import.

{
    "table": {
        "name": "myTableName",
        "type": "Event",
        "time_column": "timeStamp",
        "time_column_format": "%Y-%m-%d %H:%M:%S.%f",
        "shard_keys": ["user_id"],
        "columns": [
            {
                "name": "timeStamp",
                "type": "datetime",
                "conversion_params": "%Y-%m-%d %H:%M:%S.%f",
                "attributes": [
                    "filterable"
                ]
            },
            {
                "name": "user_id",
                "type": "string",
                "conversion_params": "",
                "attributes": [
                    "filterable",
                    "Aggregable",
                    "Groupable"
                ]
            }
        ]
    },
    "ingest": [{
        "name": "import-pipeline",
        "data_source_type": "azure_blob",
        "table_name": "myTableName",
        "description": "My pipeline!",
        "pipeline_type": "import",
        "data_source_parameters": {
            "storage_account": "scubaStorageAccount",
            "storage_key": "ueTm1rUmiIOFN7zrY5BGg5cg5wzrxJjEMvLcT5ytjBGO9PWMAvWge8z14T+Vvr6XSeol3xy1PasrNqqMQp4OqA==",
            "container": "scubaImportContainer",
            "file_pattern": "data_directory/*"
        },
        "data_transformations": [
            ["decode"],
            ["csv_load"],
            ["json_dump"]
        ]
    }]
}

Editing A Config File For Scuba Import

Creating A Table

You begin the ingest process by creating a table. In Scuba, a table refers to the contents of a database table, where each column represents a particular variable and each row corresponds to a single record in the dataset.

The necessary configuration parameters to create a table with the config.json file are:

  1. “name”

    1. The name you assign to a table is how it will appear in the Scuba UI, and is used to differentiate it from other tables.

  2. “type”

    1. Scuba can ingest both Event and Lookup tables.

  3. “time_column”

    1. This input will be the exact name of the primary timestamp column in the Event table. Every table imported into Scuba requires a timestamp.

  4. “time_column_format”

    1. If your timestamp is already in epoch seconds or milliseconds, input "time_column_format": "seconds" or "time_column_format": "milliseconds", respectively (a sketch using an epoch-milliseconds timestamp appears after the example below).

    2. If your timestamp is not already in epoch milliseconds or seconds, Scuba requires that you format your timestamp data according to the ISO-8601 standard. For example, 2015-10-05T14:48:00.000Z, which has a format string of %Y-%m-%dT%H:%M:%S.%fZ.

      If your timestamps do not follow the ISO-8601 standard or you cannot reformat your timestamps to follow the standard, Scuba also supports Unix time plus a variety of common strptime() time format strings. Note that for a given column in your data, the time format must remain consistent for every event.

      Scuba does not support dates prior to January 1, 1970 (the beginning of Unix epoch time).

  5. “shard_keys”

    1. A shard key is a column in your dataset that represents an important entity that the event is about; this is the data on which you want to perform your analysis. When you initially create your table, the only columns defined are the time column and the shard key columns.

  6. “columns”

    1. The columns object identifies key parameters about the timestamp and shard key columns entered in the table object in steps 3 and 5 above.

    2. In the example below, we identify timeStamp as our primary timestamp column and user_id from the data set as our shard key. The columns object is where you specify each column's name, data type, and any conversion parameters. In this case no conversion is necessary for the timestamp, since it already matches the time_column_format of the original data, so we input the same format string in the conversion_params field; this field cannot be omitted, even when no conversion is needed (as with user_id, where it is an empty string). Below is a list of the arguments.

      1. name — column name in your data; this is the name displayed in the UI unless display_name is set

      2. type — data type

      3. display_name — the name of the column displayed in the UI.

      4. conversion_params — Used for date_to_milli and identifier columns.

      5. attributes — Sets column attributes. 

    3. When setting attributes, make your timestamp column filterable, and make your shard key columns filterable, Aggregable, and Groupable.

{
    "table": {
        "name": "myTableName",
        "type": "Event",
        "time_column": "timeStamp",
        "time_column_format": "%Y-%m-%d %H:%M:%S.%f",
        "shard_keys": ["user_id"],
        "columns": [
            {
                "name": "timeStamp",
                "type": "datetime",
                "conversion_params": "%Y-%m-%d %H:%M:%S.%f",
                "attributes": [
                    "filterable"
                ]
            },
            {
                "name": "user_id",
                "type": "string",
                "conversion_params": "",
                "attributes": [
                    "filterable",
                    "Aggregable",
                    "Groupable"
                ]
            }
        ]
    },
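
For reference, here is a hedged sketch of how the same table block might look if the source timestamps were already in epoch milliseconds and you wanted user_id to display in the UI as "User ID". The table and column names are carried over from the example above; the display_name value is purely illustrative, and repeating "milliseconds" in the timestamp column's conversion_params is an assumption that mirrors how the original example repeats the time_column_format string in that field.

{
    "table": {
        "name": "myTableName",
        "type": "Event",
        "time_column": "timeStamp",
        "time_column_format": "milliseconds",
        "shard_keys": ["user_id"],
        "columns": [
            {
                "name": "timeStamp",
                "type": "datetime",
                "conversion_params": "milliseconds",
                "attributes": [
                    "filterable"
                ]
            },
            {
                "name": "user_id",
                "type": "string",
                "display_name": "User ID",
                "conversion_params": "",
                "attributes": [
                    "filterable",
                    "Aggregable",
                    "Groupable"
                ]
            }
        ]
    },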

Creating A Pipeline

In Scuba, an ingest pipeline is how you move data from Azure Blob Storage or AWS S3 into your table. The ingest process requires a set of data source parameters in order to connect to the data source, discover the files, and import them. Below is a list of the required parameters:

  1. “name”

    1. The name assigned to the ingest pipeline can be anything you want. This will be used to refer to the pipeline in the future.

  2. “data_source_type”

    1. This is where you denote which data source you wish to connect to. The example below uses azure_blob to connect to Azure Blob Storage.

  3. “table_name”

    1. The table name determines which table the imported data is written to. It must be either the name of an already existing table or the name defined in the table creation parameters of the same config.json file.

  4. “pipeline_type”

    1. This can be left as import for all cases.

  5. “file_pattern”

    1. File pattern describes the folder structure where your data is stored in Azure Blob Storage. We strongly recommend that you use the following format: mydata/{year}/{month}/{day}/{hour}/. We also recommend you bucket the timestamped data accordingly, meaning that files with timestamped values like “2020-01-17 05:24:35.55” should live in the folder structure mydata/2020/01/17/05/ (a sketch of a bucketed file_pattern appears after the example below).

  6. “storage_account”

    1. The name of the Azure storage account where the data you wish to import resides.

    2. If you have not already set up a storage account, refer to the Azure Storage documentation on creating a storage account.

  7. “storage_key”

    1. The storage key of the Azure storage account where the data you wish to import lives.

    2. If you have not already located your storage keys, refer to the Azure Storage documentation on accessing storage account access keys.

  8. “container”

    1. The name of the container within the Azure storage account where the data you wish to import lives.

    2. If you have not already created a container, refer to the Azure Storage documentation on creating a storage container.

    "ingest": [{
        "name": "import-pipeline",
        "data_source_type": "azure_blob",
        "table_name": "myTableName",
        "description": "My pipeline!",
        "pipeline_type": "import",
        "data_source_parameters": {
            "storage_account": "scubaStorageAccount",
            "storage_key": "ueTm1bNqzAJEN7zrY5BGg5cg5wzrxYxEmVLcT5ytjBGO9PWMAvWge8z14T+Vvr6XSeol3xy1PasrNqqMQp4OqA==",
            "container": "scubaImportContainer",
            "file_pattern": "data_directory/*"
    },
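
As a companion to the file_pattern guidance in step 5, here is a hedged sketch of the data_source_parameters block for data bucketed into the recommended mydata/{year}/{month}/{day}/{hour}/ folder structure. The storage account and container values are the placeholders from the example above and the storage key is elided; the pattern shown assumes the same * wildcard used in the original example can be pointed at a single hour folder (here, hour 05 of 2020-01-17). Adjust the pattern to cover however much of your bucketed data the pipeline should pick up.

        "data_source_parameters": {
            "storage_account": "scubaStorageAccount",
            "storage_key": "<your storage key>",
            "container": "scubaImportContainer",
            "file_pattern": "mydata/2020/01/17/05/*"
        },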

Selecting Data Transformations

JSON is the preferred format for Scuba. Other formats are also supported. However, all other formats require transformation into JSON before being ingested into Scuba. The format and organization of your data can affect how long it takes to process (transform). The more transformations that are required, the longer the ingest process will be. For more information, see Best practices for formatting data for ingest.

Scuba users can also perform a vast array of transformations on data import, such as column renames, flattening nested JSON, or adding labels. Please refer to our Transformer Library as a reference to perform any data transformations, or reach out to help@scuba.io.

"data_transformations": [
            ["decode"],
            ["csv_load"],
            ["json_dump"]
        ]
    }]
}