
Get data in using the CLI

This document is for Scuba 4.x. For Scuba 2.x, see the Scuba CLI reference and Scuba CLI ingest Quick Start.

This document explains the Scuba ingest process and provides CLI walkthrough instructions for creating ingest jobs and pipelines.

About ingesting data

Ingest is the process of getting data into a Scuba cluster, whether it's a single file, a batch of files, or a dynamic stream of live data. There are three types of ingest:

  • one-time import — ingest of static data (data-at-rest), such as a file or batch of files, at a prescribed time.
  • continuous ingest — a pipeline, like a cron job, that pushes content into the cluster at a steady (data-in-motion) rate.
  • streaming ingest — a dynamic flow of live data, a high volume of data received via HTTP from a cloud source. 

The structure and configuration of the data are important for a successful ingest and subsequent behavioral analysis. Data, table, and file types all play important roles.

The tasks outlined in the ingest walkthrough are intended for one-time and continuous ingest jobs. For information on ingesting live data, see Streaming ingest Quick Start.

A config file quick reference is provided at the end of this document.

Table types

Scuba stores raw event data in named datasets with a wide, flat schema known as event tables. You can join other external data sources to a dataset with lookup tables. For an initial ingest, first create an event table. You can create lookup tables at a later time. For more information, see the Scuba CLI Reference.

File formats

Scuba recommends the following data file types:

  • Newline delimited JSON — Scuba's preferred file format, fully supported for both event data and lookup tables.
  • CSV/TSV — Fully supported for lookup tables, but not recommended for event data. CSV/TSV does not easily handle changes to logging over time, unless the logging pipeline provides CSV/TSV headers in every file and begins writing a new file (with an updated header) every time the log format changes. CSV/TSV is also cumbersome for sparse data.
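The schema-evolution point above is easy to see in a small Python sketch: each newline-delimited JSON event carries its own keys, so a new or missing field never requires rewriting a header. The field names here are hypothetical, purely for illustration.

```python
import json

# Events logged at different times can carry different fields; NDJSON
# needs no shared header, so the logging format can evolve freely.
events = [
    {"ts": 1476057600, "model": "X100", "serial_number": 42},
    {"ts": 1476057660, "model": "X200"},                       # sparse: no serial_number
    {"ts": 1476057720, "model": "X200", "firmware": "2.1.0"},  # new field appears
]

# One JSON object per line.
ndjson = "\n".join(json.dumps(e) for e in events)

# Reading it back: each line is parsed independently of the others.
decoded = [json.loads(line) for line in ndjson.splitlines()]
```

Doing the same with CSV would require emitting a new header (and starting a new file) the moment `firmware` first appeared.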

Ingest walkthrough

To import data:

  1. Create a table.
  2. Create an ingest pipeline.
  3. Create an ingest job.
  • An ingest pipeline is a shared configuration that is used to create ingest jobs. It specifies the data source and its location, and can authenticate the data source. An ingest pipeline streamlines the process of ingesting data on a frequent basis.
  • An ingest job is the execution of an ingest pipeline that runs over a set time period, ingesting data from the data source specified by the ingest pipeline.
  • Example: Create a pipeline for an external Kafka cluster.

The following table provides a list of tasks for pipelines and jobs. Generally, CRUD operations are available for both pipelines and jobs. Click a link to jump directly to step-by-step instructions.

Ingest pipelines                            Ingest jobs
Prerequisite: Create an initial table.
1. Create an ingest pipeline.               1. Create an ingest job.
2. Update or delete an ingest pipeline.     2. Update ingest jobs.
3. View ingest pipeline status.             3. Pause, resume, and delete jobs.
4. Clone an ingest pipeline.                4. Display a list of jobs and job statistics.
5. Export an ingest pipeline.               5. Export table and job configurations.

Tip: Enable CLI autocomplete to make entering the following commands easier.

Create an initial table

You begin the ingest process—whether for a pipeline or an ingest job—by creating an initial table. There are two ways of creating a table:  

In Scuba 4.x, you must specify the shard key type for an event table at the time a table is created.

You can create a new table by specifying the table name, time column, shard key, and shard key type, or by using a configuration file. You can add and delete shard keys after you create a table. For more information, see the Scuba CLI Reference.

Using a config file to create a table allows for greater customization, with the ability to specify columns and specify ingest pipelines, among other things. Lookup tables can only be created with a config file at this time. The config file must be a JSON object or Python dictionary, and at a minimum, you must provide the table configuration as a dictionary under the table key. For more information, see the table configuration parameters.
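Before handing a config file to the CLI, it can help to check that it has the minimal shape described above: a dictionary with the table configuration as a dictionary under the table key. The helper below is not part of the Scuba CLI; it is an illustrative sketch of that structural check.

```python
def validate_table_config(config):
    """Check the minimal structure described in the docs: a dict with a
    'table' key whose value is itself a dict naming the table and type."""
    if not isinstance(config, dict):
        raise ValueError("config must be a JSON object / dict")
    table = config.get("table")
    if not isinstance(table, dict):
        raise ValueError("config must contain a 'table' dictionary")
    if "name" not in table or "type" not in table:
        raise ValueError("'table' must define at least 'name' and 'type'")
    return True

# A minimal event-table config passes the check.
config = {"table": {"name": "mycompany", "type": "Event"}}
assert validate_table_config(config)
```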

You should plan for cluster resize growth in your shard key layout, as well as for the current state of the cluster. For more information, see Planning your Scuba deployment.

To create an event table with CLI parameters, use the following command:
ia table create event <table_name> <time_column> <time_column_format> [-s <shard_key> <shard_key_type> ...]

The following examples show various options for creating event tables.

ia table create event mycompany date '%Y-%m-%d' model -s serial_number int
ia table create event import_logs __time__ milliseconds -s transaction_id string -s shard_key string
ia table create event music ts milliseconds -s userId string
To create a lookup table with CLI parameters, use the following command:
ia table create lookup <table_name> <event_table_name> <lookup_column_name> <event_column_name>

The following examples show various options for creating lookup tables.

ia table create lookup model_info mycompany model model
ia table create lookup customers import_logs customer_id shard_key
ia table create lookup users music id user_id
To create a table with a config file, use the following command:
ia table create config-file <config_file_path>

The following is an example of an event table.

    {
        "table": {
            "name": "mycompany",
            "type": "Event",  # Event or Lookup
            "time_column": "date",
            "time_column_format": "%Y-%m-%d",
            "shard_keys": ["model", "serial_number"],
            # "columns" is optional, but it allows you to create columns upon table
            # creation, instead of waiting for the columns to appear in your data
            "columns": [
                {
                    "name": "model",
                    "display_name": "model",
                    "type": "string",
                    "conversion_params": "",
                    "attributes": []
                },
                {
                    "name": "serial_number",
                    "display_name": "serial_number",
                    "type": "int",
                    "conversion_params": "",
                    "attributes": []
                }
            ]
        }
    }

The following is an example of a lookup table.

    {
        "table": {
            "name": "events_lookup",
            "type": "Lookup",
            "event_table_name": "events",
            "event_table_column_name": "shard_key",
            "lookup_column_name": "id"
        }
    }

Specify the ingest job in the table as a list of dictionaries under the ingest key, as shown in the following example.

    {
        "table": { ... },
        "ingest": [{
            "name": "mycompany_import",
            "data_source_type": "aws",
            "table_name": "mycompany",
            "continuous": 0,
            "start": "today",
            "end": "today",
            "data_source_parameters": {
                "file_pattern": "datasets/mycompany/csv/",
                "s3_bucket": "mycompany-logs",
                "s3_access_key": "AKIAIOSFODNN7EXAMPLE",
                "s3_secret_access_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
            },
            "data_transformations": []
        }]
    }

Ingest pipelines

A pipeline is a configuration you can use for multiple jobs, streamlining the ingest process. A pipeline specifies the data source, its location, and can authenticate the data source. 


Before you begin creating an ingest pipeline, you must create an initial table.

1. Create an ingest pipeline

You can create an ingest pipeline with the CLI or with a config file.

Create a pipeline with the CLI

Use the ia pipeline create command to create an ingest pipeline with the CLI. You can manually specify the pipeline parameters on the command line, or you can use a config file.

For Amazon S3 and Azure Blob data sources, it is NOT necessary to include a trailing * character as a wildcard (blob store semantics) in the pipeline. File system data sources require a trailing * character. See the following examples of pipelines that use each of these data sources.
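The two matching semantics can be illustrated in Python. This is only a sketch of prefix versus glob matching under the stated assumptions (the paths reuse the file-system example below); it is not how the CLI actually discovers files.

```python
import fnmatch

files = [
    "/mnt/filer2/tank/pool/POC_data_set_cleansed_001.json",
    "/mnt/filer2/tank/pool/POC_data_set_cleansed_002.json",
]

# File-system sources use glob semantics: without a trailing '*',
# the pattern is treated as an exact path and matches nothing here.
exact = [f for f in files
         if fnmatch.fnmatch(f, "/mnt/filer2/tank/pool/POC_data_set_cleansed_")]
globbed = [f for f in files
           if fnmatch.fnmatch(f, "/mnt/filer2/tank/pool/POC_data_set_cleansed_*")]

# Blob stores (S3, Azure) list objects by key prefix, so every key
# starting with the pattern matches and no wildcard is needed.
prefixed = [f for f in files
            if f.startswith("/mnt/filer2/tank/pool/POC_data_set_cleansed_")]
```

Hence the rule above: add the trailing `*` for file-system sources, omit it for S3 and Azure Blob.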

To create a pipeline manually with the CLI, use the following command:

ia pipeline create <pipeline_name> <table_name> <data_source_type>
[--transformation <transformer_config_path>] -p <parameter_name> <value>

For a list of valid data source types and parameters, see Required ingest parameters.

The following is a file system data source pipeline example.

ia pipeline create events event file_system --transformation eventconfig -p file_pattern '/mnt/filer2/tank/pool/POC_data_set_cleansed_*'

The following is an Amazon S3 data source example that creates a my_import_stats pipeline to an import_stats table.

The aws data source is used, the my_transformer_config.txt transformation config is applied, and the s3_access_key, s3_secret_access_key, and s3_bucket parameters are set.

ia pipeline create my_import_stats import_stats aws --transformation my_transformer_config.txt -p s3_access_key XXXXXXXXXXXXXXXXX -p s3_secret_access_key XXXXXXXXXXXXXXXXX -p s3_bucket Scuba-logs -p file_pattern logs/{year}/{month:02d}/{day:02d}/

The example produces the following results.

ID  Name             Table         Data Source  File Pattern
1   my_import_stats  import_stats  aws          logs/{year}/{month:02d}/{day:02d}/
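The {year}/{month:02d}/{day:02d} placeholders in file_pattern follow Python str.format conventions; for each ingest date they expand into a concrete object prefix. A sketch of that expansion (not the CLI's actual code):

```python
from datetime import date

pattern = "logs/{year}/{month:02d}/{day:02d}/"

def expand(pattern, d):
    # Fill the date placeholders the way str.format does; :02d
    # zero-pads month and day to two digits.
    return pattern.format(year=d.year, month=d.month, day=d.day)

assert expand(pattern, date(2017, 5, 1)) == "logs/2017/05/01/"
```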
Create a pipeline with a config file

Use the ia pipeline create command to create an ingest pipeline with a config file. The config format is the same as that used to create ingest pipelines with ia table create. The config itself must be a dictionary, and the ingest configs must be a list of dictionaries under the ingest key.

Use the following command to create a pipeline with a config file:

ia pipeline create --config-file <config_file_path>
Sample config


   {
       "ingest": [{
           "name": "my-company_import_stats",
           "data_source_type": "aws",
           "table_name": "import_stats",
           "data_source_parameters": {
               "file_pattern": "my-company1/{year:04d}-{month:02d}-{day:02d}/{hour:02d}/",
               "s3_bucket": "Scuba-logs",
               "s3_access_key": "AKIAIOSFODNN7EXAMPLE",
               "s3_secret_access_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
           },
           "advanced_parameters": {
               "minimum_disk_space": 6000000000,
               "concat_file_size": 500000000,
               "wait_seconds": 30
           },
           "data_transformations": [
               ["merge_keys", {"column_1": "batch_id", "column_2": "query_api_id", "output_column": "transaction_id"}],
               ["merge_keys", {"column_1": "pipeline_shard_id", "column_2": "user_id", "output_column": "shard_key"}],
               ["add_label", {"column": "customer", "label": "my-company"}]
           ]
       }]
   }

Config example

The following example creates a pipeline using the import_stats_pipeline_configs.txt file that includes output data for the import_stats table.

ia pipeline create --config-file import_stats_pipeline_configs.txt
ID  Name                 Table         Data Source  File Pattern
1   my-co1_import_stats  import_stats  aws          my-co1/{year:04d}-{month:02d}-{day:02d}/import
2   my-co2_import_stats  import_stats  aws          my-co2/{year:04d}-{month:02d}-{day:02d}/{hour:02d}/
3   my-co3_import_stats  import_stats  aws          my-co3/{year:04d}-{month:02d}-{day:02d}/{hour:02d}/


Create a pipeline for an external Kafka cluster

You can pipe your event data through Kafka to other downstream systems where it is stored. This section demonstrates how to create a pipeline that ingests directly from an existing production Kafka cluster. Specify the following parameters:

  • data source type: kafka
  • topic_name: Kafka_topic
  • zookeeper_hosts: Comma separated list of zookeeper IP addresses and ports (host:port) of the zookeeper nodes managing the Kafka cluster.
  • from_offset — earliest or latest; consume events from the beginning of the Kafka retention window (earliest) or start consuming events now (latest).

Use the following syntax to create a pipeline for an external Kafka cluster.

ia pipeline create <pipeline_name> <table_name> kafka -p topic_name <topic_name> -p zookeeper_hosts <host:port,host:port,...> -p from_offset <earliest|latest>

The following example creates a pipeline to the zookeeper host on port 200 to consume events from the beginning (earliest) of the Kafka retention window.

ia pipeline create Kafka_topic kafka -p zookeeper_hosts <host>:200
-p from_offset earliest

2. Update or delete an ingest pipeline

This section explains how to update a pipeline and how to delete a pipeline.

Update a pipeline

You can update a pipeline with the CLI or with a config file.

Use the CLI to update a pipeline

Use the following command to update a pipeline with the CLI. For a list of valid parameters, see Required ingest parameters.

ia pipeline update <pipeline_name> [--transformation-config <transformer_config_path>] 
[-p <parameter_name> <value> ...]


ia pipeline update my-company_import_stats -p wait_seconds 50 -p concat_file_size 250000000 --transformation-config my-company_transformer_config.txt

Successfully updated my-company_import_stats

Use a config file to update a pipeline

Use the following command to update pipelines with a config file.

ia pipeline update --config-file <config_file_path>

You can specify select parameters for each pipeline you want to update. The format is the same as that used for creating an ingest job.

    {
        "ingest": [
            {
                "name": "example_import_stats",
                "advanced_parameters": {
                    "minimum_disk_space": 1100000000,
                    "concat_file_size": 550000000,
                    "wait_seconds": 49
                }
            },
            {
                "name": "example2_import_stats",
                "data_source_parameters": {
                    "s3_region": "us-west-1"
                }
            }
        ]
    }

The following example updates pipelines with the import_stats_update_pipelines.txt config file and shows output for the updated pipelines.

ia pipeline update --config-file import_stats_update_pipelines.txt

Updated Pipelines


Delete a pipeline

The ia pipeline delete command removes specified pipelines. Dry-run mode is the default. Use the -r or --run option to execute the command. Specify multiple pipelines by name in a space-separated list to delete multiple pipelines at once.

ia pipeline delete [<pipeline_name1> <pipeline_name2> ...] [--table <table_name>] [-r/--run]


The following example deletes pipeline company1_import_stats and company2_import_stats.

ia pipeline delete company1_import_stats company2_import_stats --run

Successfully deleted pipelines: company1_import_stats, company2_import_stats.

3. View ingest pipeline status

You can display a list of pipelines in a table or show details for a specific pipeline.

Display a list of pipelines

Use the ia pipeline list command to display a list of available pipelines for a specific table. 

ia pipeline list [--table <table_name>]


The following example displays a list of pipelines in the import_stats table.

ia pipeline list --table import_stats

ID  Name                    Data Source  Table
1   finance_import_stats    aws          import_stats
2   marketing_import_stats  aws          import_stats
3   sales_import_stats      aws          import_stats
Show pipeline details

Use the ia pipeline show command to view detailed information about a specific pipeline.

ia pipeline show <pipeline_name>


The following example shows the details for the music-azure-pipeline.

ia pipeline show music-azure-pipeline

Key Value
Name music-azure-pipeline
Data Source azure-blob
Table music
Data Source Parameters

{u'storage_key': u'ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890/EIEIO==', u'sas_key': None, u'storage_account': u'Scubadevfiles', u'container': u'integration-test-files', u'file_pattern': u'Music'}

Advanced Parameters

{u'wait_seconds': 60, u'minimum_disk_space': 20000000000, u'concat_file_size': 1000000000}


4. Clone an ingest pipeline

Use the ia pipeline clone command to create a cloned copy of an existing ingest pipeline, with the ability to set new parameters for the cloned version of the pipeline.

ia pipeline clone <pipeline_to_clone> <new_pipeline_name> [-p/--parameter <parameter_name> <value> ...]


In the following example, the finance17_import_stats pipeline is cloned to finance17_import_stats_cloned, specifying new user and hostname parameters.

ia pipeline clone finance17_import_stats finance17_import_stats_cloned -p user StarAdmin -p hostname BigHome

5. Export an ingest pipeline

Use the ia pipeline export command to export a pipeline configuration to a file. If an output file is not specified, the config is exported to the current directory in the <pipeline_name>_pipeline_config.txt format.

ia pipeline export <pipeline_name> [--output-file <output_file_path>]


In the following example the finance17_pipeline is exported to finance17_year-end_export.txt.

ia pipeline export finance17_pipeline --output-file finance17_year-end_export.txt

Ingest jobs

Each ingest job has a set of required parameters that must be set, as well as optional parameters. For a complete list of valid config file options and parameters, see the Config file quick reference.


Before you begin creating an ingest job you must create an initial table and create a pipeline.

1. Create an ingest job

Use the ia job command to create an ingest job for a pipeline.

For a list of valid parameters, see Required ingest parameters.

You can override specific parameters in a pipeline with the -o/--override option. You can specify to override multiple pipeline parameters by preceding each parameter/value pair with -o option in a space-separated list: -o <parameter> <value> -o <parameter> <value >

Use the following command to create an ingest job with an existing pipeline. 

ia job create <pipeline_name|pipeline_ID> <type: continuous|onetime> <start> <end> [--paused] [-n/--nodes <running_import_nodes>] [-o/--override <parameter_name> <value> ...]


In the following example, a continuous ingest job is created with the my-company1_import_stats pipeline, starting yesterday and ending today.

ia job create my-company1_import_stats continuous yesterday today

In the following example, we create a onetime ingest job with the my-company1_import_stats pipeline, starting on 2017-05-01 and ending on 2017-05-02. The job is created paused, runs on the import node import000, and overrides the copy_id parameter with 123456789.

ia job create my-company1_import_stats onetime 2017-05-01 2017-05-02
--paused --running-import-nodes import000 --override copy_id 123456789

2. Update ingest jobs

You can update one or many parameters for a single job.

For continuous ingest jobs, you can specify a start date as an integer that translates to number of days. For example
--start 2 scans for data from 2 days before the end date, or current date if no --end is specified. Likewise, you can specify --end as an integer that translates to number of days after the start date, or current date if no --start is specified.

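The integer forms described above resolve to concrete dates relative to the other endpoint, or the current date when an endpoint is omitted. A sketch of that resolution logic, assuming the offsets mean whole days (this is an illustration, not the CLI's implementation):

```python
from datetime import date, timedelta

def resolve_range(start, end, today):
    """Resolve --start/--end where either may be an int day offset.
    An int start counts back from the end date (or today); an int end
    counts forward from the start date; a missing end defaults to today."""
    if end is None:
        end = today
    if isinstance(start, int):
        start = (end if isinstance(end, date) else today) - timedelta(days=start)
    if isinstance(end, int):
        end = start + timedelta(days=end)
    return start, end

# --start 2 with no --end: scan the 2 days leading up to today.
assert resolve_range(2, None, date(2017, 5, 3)) == (date(2017, 5, 1), date(2017, 5, 3))
```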


Update a single job with the following command:

ia job update <job_ID> [--start <start> --end <end> -o/--set-override <parameter_name> <value> --remove-override 
<parameter_name> -n/--running-import-nodes <running-import-node1 running-import-node2 ...>]


In the following example, we update job_ID 1 to start 2 days before the current day, set an override for the minimum disk space parameter, restrict the job to the import node import000, and remove the override for the copy_id parameter.

ia job update 1 --start 2 --set-override minimum_disk_space 10000000000
--running-import-nodes import000 --remove-override copy_id

3. Pause, resume, and delete jobs

You can pause, resume, or delete jobs, specifying multiple jobs in a space-separated list if desired. Use the --all option to perform the action on all applicable jobs. You can also perform the action on the applicable jobs in a given --table or --pipeline, or of a specific --type. The following guidelines apply:

  • The --all and --type options can be used together, but not with --table or --pipeline.
  • The --table and --pipeline options can be used with each other, but not with --all or --type.
  • Jobs with a done status cannot be deleted.

The pause, resume, and delete commands all have a similar syntax.

To pause a job, use this command:

ia job pause <job_ID> [--table <table_name> --pipeline <pipeline_name> --type <type> --all]

The following example pauses job_ID 1, 2, and 3.

ia job pause 1 2 3

The following example pauses all running jobs.

ia job pause --all

To resume a paused job, use this command:

ia job resume <job_ID> [--table <table_name> --pipeline <pipeline_name> --type <type> --all]

The following example resumes all the paused jobs for the import_stats table.

ia job resume --table import_stats

To delete a job, use this command:

ia job delete <job_ID> [--table <table_name> --pipeline <pipeline_name> --type <type> --all] -r/--run

The following example deletes the pipeline my-company1_import_stats.

ia job delete --pipeline my-company1_import_stats -r

4. Display a list of jobs and job statistics

You can display a list of all jobs, or specify options to tailor the list. Running the command without any options displays all jobs.

You can also display job statistics for a specific job, and optionally specify a time range over which to scan the data.

Display a list of jobs

Use the ia job list command to display a list of all jobs. You can choose from the options shown in the following command syntax to tailor the output of the list.

ia job list [--status <status: paused, running, done>] [--table <table_name>] [--pipeline <pipeline_name>] [--type <type>]

The following example displays all jobs.

ia job list

Job ID  Pipeline ID  Pipeline Name    Table         Type        Start      End    Status   Overrides  Running Import Nodes
1       1            c1_import_stats  import_stats  continuous  yesterday  today  Running             All
2       2            c2_import_stats  import_stats  continuous  yesterday  today  Running             All
3       3            c3_import_stats  import_stats  continuous  yesterday  today  Running             All
4       4            c4_import_stats  import_stats  continuous  yesterday  today  Running             All
Display job statistics

Use the ia job stats command to display statistics for a specific job over a specified time period. Stats are grouped by iteration_date and status, and include the number of files, total raw file size, total transformed file size, and total line count.

ia job stats <job_name> [-s/--start <start date YYYY-MM-dd or
YYYY-MM-ddTHH:mm:ss>] [-e/--end <end date YYYY-MM-dd or
YYYY-MM-ddTHH:mm:ss>]

If a job's file pattern does not include dates (year, month, day, hour), then the date field is empty, and the only grouping is for Finished and Error files.

ia job stats <job_ID> [--start <start> --end <end>]


The following example displays statistics for job_ID 6.

ia job stats 6

Date                 Status    Files  Raw File Size  Transformed File Size  Line Count
2016-10-10T00:00:00 Finished 34 2351345 5245724 635621
2016-10-11T00:00:00 Finished 23 1346136 4574635 435742
2016-10-12T00:00:00 Finished 16 1246575 4034695 309233
2016-10-13T00:00:00 Finished 26 2023546 4824084 490234
2016-10-14T00:00:00 Finished 39 2609324 6023821 720935


5. Export table and pipeline configurations

You can export table and pipeline configurations in the same format used by their respective export commands.

Export table configurations

When you export a table configuration, by default the file is saved in the current directory in the <table_name>_table_config.txt format.

The table's columns and ingest jobs are exported, along with the basic table configuration.

To export a table configuration, use the following command:

ia table export <table_name> [-o/--output-file <output_file_path>]

In the following examples, only the table_name is specified in the first command. Both the table_name and the --output-file option are used in the second command.

ia table export import_stats
ia table export mycompany --output-file mycompany_export.txt
Export pipeline configurations

You can export pipeline configurations to a file. If an output file is not specified, the config is exported to the current directory in the <pipeline_name>_pipeline_config.txt format.

The following example uses the --output-file option to export two pipeline configurations to the today1_job_export.txt file.

ia pipeline export today1_20161010 today1_20161011 --output-file today1_job_export.txt

In the following example, the --output-file option is not used, and the resulting configuration file will be in the default format of music-a123e-pipe_pipeline_config.txt.

ia pipeline export music-a123e-pipe

Config file quick reference

Use this section to look up arguments and options for the ingest commands used in this document. For more information on the Scuba CLI, see the Scuba CLI Reference.

Table configurations

This section covers the parameters used by event tables and lookup tables. 

Event table configuration parameters:
  • name — table name
  • type — table type, event or lookup
  • columns — list of column config objects
  • time_column — name of the time column in your data
  • shard_keys — list of shard keys
  • time_column_format — not required if you provide the column definition of the time column, see Timestamps and date strings
  • colocated_columns — an optional dictionary of colocation mappings

The following example shows a dictionary of colocation mappings under the colocated_columns parameter.

"colocated_columns" : {
     "company_permalink" : "company_name"
}
Timestamps and date strings:

For Unix (epoch) timestamps, the options for the time_column_format column are:

  • seconds
  • milliseconds
  • microseconds

For date strings, the following formats are supported. For more information see Python datetime formatters

%Y%m%d, %Y%m%d %H:%M:%S, %Y%m%d%H%M%S, %Y%m%dT%H%M%S,
%Y-%d-%m, %Y-%m-%d, %Y-%m-%d %H:%M, %Y-%m-%d %H:%M:%S,
%Y-%m-%d %H:%M:%S.%f, %Y-%m-%dT%H:%M:%S%Z, %Y-%m-%dT%H:%M:%S.%fZ, %Y-%m-%dT%H:%M:%SZ,
%d-%m-%Y, %d-%m-%y, %d/%b/%Y:%H:%M:%S, %d/%m/%Y,
%d/%m/%Y %H:%M, %d/%m/%Y %H:%M:%S, %d/%m/%Y %I:%M %p, %d/%m/%Y %I:%M:%S %p,
%d/%m/%y, %d/%m/%y %H:%M, %d/%m/%y %H:%M:%S, %d/%m/%y %I:%M %p,
%d/%m/%y %I:%M:%S %p, %m-%d-%Y, %m-%d-%y, %m/%d/%Y,
%m/%d/%Y %H:%M, %m/%d/%Y %H:%M:%S, %m/%d/%Y %I:%M %p, %m/%d/%Y %I:%M:%S %p,
%m/%d/%y, %m/%d/%y %H:%M, %m/%d/%y %H:%M:%S, %m/%d/%y %I:%M %p,
%m/%d/%y %I:%M:%S %p, %y-%d-%m, %y-%m-%d
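All of these are standard Python datetime format strings, so you can confirm how a sample timestamp will parse before choosing a time_column_format:

```python
from datetime import datetime

# Check a sample value against a candidate time_column_format.
sample = "2016-10-10 14:30:00"
parsed = datetime.strptime(sample, "%Y-%m-%d %H:%M:%S")
assert parsed == datetime(2016, 10, 10, 14, 30, 0)

# Day-first and month-first formats parse the same digits differently,
# which is why choosing the right format matters for ambiguous dates.
dmy = datetime.strptime("01/05/2017", "%d/%m/%Y")
mdy = datetime.strptime("01/05/2017", "%m/%d/%Y")
assert (dmy.month, dmy.day) == (5, 1)
assert (mdy.month, mdy.day) == (1, 5)
```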
Lookup table configuration parameters:
  • event_table_name — name of event table to which the lookup table is joined
  • event_table_column_name — name of event table column to which the lookup table is joined
  • lookup_column_name — name of lookup column to which the event table is joined
  • force_non_sharded_lookup — optional, if you're creating a lookup table joined on a shard key, you can force the lookup table to be non-sharded with this parameter

The following example illustrates how to force a lookup table to be non-sharded with the use of the force_non_sharded_lookup parameter.

"table" : {
     "name" : "events_lookup_ns",
     "type" : "Lookup",
     "event_table_name" : "events",
     "event_table_column_name" : "shard_key",
     "lookup_column_name" : "id",
     "force_non_sharded_lookup" : 1
}


A column requires two parameters:

  • name — column name in your data; this is the name displayed in the UI unless display_name is set
  • type — data type

For further customization, you can use the following column parameters:

  • display_name — the name of the column displayed in the UI.
  • conversion_params — Used for date_to_milli and identifier columns.
  • attributes — Sets column attributes. 
  • deleted — Hides the column in the UI. If set to true, data continues to be imported even though the column is hidden. To actually remove the column (hide it from the UI and stop importing data), you must also set the type to omit.

The following example shows a column configuration.

{
     "name" : "event",
     "display_name" : "event",
     "type" : "string",
     "conversion_params" : "",
     "attributes" : [],
     "deleted" : 0
}
Column Types

Here are all the possible column types for the type column parameter:

  • int
  • identifier
  • decimal
  • dollars
  • int_set
  • string_set
  • string
  • url
  • ip
  • user_agent
  • seconds
  • milliseconds
  • microseconds
  • datetime
Column attributes

The values for column attributes are the same as those stored in the database.

  • filterable — The column can be used in filters.
  • aggregable — The column can be used as an aggregation measure (Sum, Average, Median, etc.).
  • groupable — The column can be used in Groups, in the compare section; you can split the results of a query with this column.
  • samplesDefault — The column is displayed by default in the samples view.


Each ingest pipeline has a set of required parameters that must be set, as well as optional parameters.

Required ingest pipeline parameters
  • name — name of the pipeline.
  • data_source_type — type of system the data is stored in; see the following table.
  • table_name — name of the table into which the data will be imported.
  • data_source_parameters — dictionary of parameters required to connect to the data source and discover the files to be imported; see the following table.
Data source                        data_source_type value
AWS                                aws
Azure                              azure_blob
Local file system                  file_system
Remote file system                 remote_fs
Kafka (external)                   kafka
HTTP ingest (streaming ingest)     http_listener

All data_source_parameters require a file_pattern parameter that is stored in the database. The following table lists supported file systems and their respective data_source_parameters.

AWS                    Azure              Remote file system
s3_bucket              storage_account    user
s3_access_key          storage_key        hostname
s3_secret_access_key   sas_key            port
s3_region              container          ssh_key

For Azure file systems, you must set either the storage_key or sas_key parameter.

Optional ingest pipeline parameters
  • data_transformations — Specifies the transformation config.
  • advanced_parameters — Specifies a dictionary of less commonly used parameters, common to all data sources.

Optional advanced parameters

  • directory — The working directory of the job, which is the location on the disk where files will be downloaded and transformed. 
  • minimum_disk_space — Threshold for minimum disk space (in bytes) needed to start a new batch process. There are disk space checks that make sure the import-pipeline doesn't fill up the disk. The minimum_disk_space value plus the size of the current batches determine whether a new batch can be started. Example: If minimum_disk_space = 20000000000 (20GB); in-flight batches = 5000000000 (5GB); proposed batch = 1000000000 (1GB); then 26GB of free disk space is needed to continue with the new batch.
  • concat_file_size — Specifies the size of the batch files in bytes. Each batch of files is downloaded, transformed, and then purified (sent to data and string tiers). Example:  If the data source has files of 100MB, 10 can be processed at one time: concat_file_size = 1000000000 (1GB) 
  • copy_id — Allows you to import the same file more than once. A hash of the filename is generated, then the database is checked to determine whether the file has been imported or not. If the copy_id value is set, the value is appended to the filename so the generated hash is different from the hash of the original filename. This parameter is generally used to do backfills over time ranges for previously imported files. Example: filename = data/backblaze/2014-01-01.csv; copy_id = 123456789. With the copy_id set, "data/backblaze/2014-01-01.csv123456789" is hashed as well as "data/backblaze/2014-01-01.csv".
  • wait_seconds — Specifies the wait time (in seconds) before scanning the data source for new files. The default is 60 seconds. Example: The s3 bucket is scanned for new files. Then, there is a wait time of 60 seconds before a new scan starts.
  • max_concurrent_batches — For each job, the number of batches that can be processed at once. The default is the number of CPUs.
  • max_scan_threads — The maximum number of threads used to scan the data source. The default is 20.
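The minimum_disk_space rule above reduces to a simple predicate: a new batch may start only when free space covers the reserve threshold plus all in-flight batches plus the proposed batch. A sketch of that rule as described (not the importer's actual code):

```python
def can_start_batch(free_disk, minimum_disk_space, in_flight_bytes, proposed_bytes):
    # Free space must cover the reserve threshold plus every batch
    # already in flight plus the proposed batch itself.
    return free_disk >= minimum_disk_space + in_flight_bytes + proposed_bytes

GB = 1_000_000_000
# The example from the minimum_disk_space bullet: 20 GB reserve
# + 5 GB in flight + 1 GB proposed means 26 GB free is required.
assert can_start_batch(26 * GB, 20 * GB, 5 * GB, 1 * GB)
assert not can_start_batch(25 * GB, 20 * GB, 5 * GB, 1 * GB)
```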


Known issues

This section covers known issues you may encounter while ingesting data, along with the recommended workarounds.

Single-node cluster runs out of space during continuous import

Known issue: On a single node cluster, when continuous import jobs are run while also processing queries, import may stop due to lack of space.

Workaround: Consider resizing your cluster to accommodate the workload. For an immediate fix, use the following commands to reset the cache to allow for more space, and then restart the cluster. The Scuba service must be restarted to apply the new setting.

ia settings update query-server cache_evict_at_disk_free_percent 25

sudo /opt/Scuba/backend/common/py/Scuba_init.pyc

sudo service Scuba restart