Set up Scuba using an AMI

Welcome to Scuba!

This article helps you install and onboard data into an Amazon Machine Image (AMI) of a Scuba instance.

Provisioning Scuba

Provision Scuba using AWS as follows:

  1. In your AWS Management Console, click the Services tab, then click EC2.
  2. Under Create Instance, click Launch Instance.
  3. Under Quick Start, click My AMIs > Shared with me. If the AMI isn't available, try switching your region in the top right corner to be us-east-1 (N. Virginia).
  4. Find the Scuba AMI and click Select.
  5. Under Choose an Instance Type, choose r5d.xlarge. This machine type was chosen carefully, so please use it. Click Next: Configure Instance Details.
  6. Accept the default values for Steps 3 and 4 (Configure Instance Details and Add Storage). In Step 5: Add Tags, you can optionally add tags. Otherwise, click Next: Configure Security Group.
  7. In Step 6, select an existing security group or create a new security group, then click Review and Launch.
  8. Double check that:
    • you have an r5d.xlarge,
    • you have associated the instance with the proper security group, and
    • you have the proper tags associated with the instance.
  9. Click Launch.
  10. In the Select an existing key pair or create a new key pair dialog, we recommend selecting Create a new key pair. Give your new key pair a name that is easy to remember and associated with your machine (for example, my-standalone-1). Then click Download Key Pair. Once you've downloaded your key pair, click Launch Instances.
  11. Click View Instances to watch your instance being provisioned. Watch its progress in the Status Checks column. A green checkbox indicates that it's finished.

Getting started

Now that you’ve provisioned Scuba, let’s get started with using it. You'll be SSHing into your AWS instance as the "ubuntu" user, and then creating an admin user. 

SSH into your instance

You will need:

  • The key pair you created at the end of the Provisioning Scuba instructions above.
  • The public DNS for your instance. Find the public DNS in one of two ways:
    • Click the Instances tab on the left and view the "Public DNS" column.
    • Or, select the Instance and click the Connect button at the top. The resulting prompt should reveal the public DNS.

To SSH into your AWS instance, select your instance and click the Connect button at the top. The resulting prompt should describe how to connect! If you'd like to use PuTTY to connect, there should be a link in the prompt. Otherwise, you'll be running a command similar to:

> ssh -i <key-pair-filename.pem> ubuntu@<public-dns>

Create an admin user

Once your terminal is connected to your AWS instance, you can administer Scuba with the program ia. Try this to start:

> ia --help

The --help flag also works for specific commands, for example

> ia user --help
> ia user create --help

This is the first command you'll want to run, to create a user for the Scuba UI:

> ia user create <email> <password>

Next, you'll probably want to make this user an admin:

> ia user update <email> --add-role admin

Accessing the UI

To access the UI:

  1. Using a web browser, visit the public IP/DNS of your instance.
  2. Proceed through the security warning. This message appears because there is no certificate installed. For more information on installing a certificate, see the Advanced section.
  3. Log in using the username and password you created in the previous step. 

You’ll find that there’s an "events" table already created, which does not yet have any data. To import data into your Scuba instance, see the next section, Importing data.

Importing data

Prerequisite: JSON data, formatted as described in Data format below.

To import data:

  1. Get the data file onto your Scuba server.
  2. Move it to the directory /data/events

This puts it in a table in Scuba called "events", which has already been created.

The import takes a minute or two if the file is smaller than a few million records, and longer if it's larger than that.

There's also a sample file ~/sample_data.json for your use and reference. To load the sample data into Scuba, run

> mv ~/sample_data.json /data/events/

Data format

The data must be in a specific format.

It must be newline-delimited JSON (jsonl), which means it's a text file where each line is a JSON dictionary representing a single event.

Each JSON event must have a field called "time" in lowercase. The times must be in epoch milliseconds (that is, milliseconds since midnight January 1, 1970 UTC). 

Each JSON event should also have a field called "actor" in lowercase that describes who or what is performing the action. This is typically a user id, username, or device id. The actor field is the one on which you'll be able to ask behavioral questions, so it should be the thing whose behavior you care about.
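The format described above can be sketched in Python as follows. This is a minimal illustration, not part of Scuba: the helper names to_event and write_jsonl and the extra "action" field are hypothetical, and only the "time" and "actor" fields come from the requirements in this section.

```python
import json
import time


def to_event(actor, action, when_seconds=None, **extra):
    """Build one event dict with the two fields the default table expects."""
    if when_seconds is None:
        when_seconds = time.time()
    event = {
        "time": int(when_seconds * 1000),  # epoch milliseconds, not seconds
        "actor": str(actor),               # the default actor column is a string
        "action": action,                  # hypothetical extra field
    }
    event.update(extra)
    return event


def write_jsonl(events, path):
    """Write one JSON dictionary per line (newline-delimited JSON)."""
    with open(path, "w", encoding="utf-8") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")
```

For example, write_jsonl([to_event("user_1", "login")], "my_events.json") produces a one-line file that could then be moved into /data/events/.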

Data format summary

To recap:

  •  Format your data as jsonl: each line is a json dict that represents an event.
  •  Make sure each line has an "actor" that's a string, and a "time" in epoch milliseconds.
  •  Put the file in /data/events/ on your Scuba instance.

Look at the results

Once you import some data, look in the UI, and you should see your data! For help running queries and creating boards, check out the tutorial.

If you don't see what you expect, debug with the following command:

> ia table status events

This gives you information about what data has been ingested to the table.

Possible import problems and solutions include:

Nothing shows up at all. This means Scuba did not find a file. Double check that you dropped the files in the correct directory, /data/events/

Files with a status of ERROR. This means Scuba failed with an entire file. Double check that your source files are text files (not compressed!) with one json dictionary per line.

Files with 0 parsed lines. This means that Scuba wasn't able to parse any lines in the file. Double check that your source files are text files (not compressed!) with one json dictionary per line.

Files with "lines without timestamp" that is more than zero. Double check that you have a column called “time” (lowercase) that's an integer representing the number of milliseconds (not seconds!) since the epoch.

Warning messages in the “Had Warnings” field. Scuba can have problems with any other value in the row. This doesn't stop it from importing the row, but the problematic value might not show up. For more information, inspect the end of the status output for some examples of the problems.

You see data but it's not in a format you like. You can change the format of your data and try again. See Iterate below.

If you really want the gory details, this is the raw log:

> journalctl -u import-pipeline -f
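Several of the problems above can be caught before a file is moved into /data/events/. The following Python sketch checks each line for the issues described in this section. It is not a Scuba tool: the function names and the MIN_PLAUSIBLE_MS threshold used to guess "seconds instead of milliseconds" are assumptions made for illustration.

```python
import json

# Assumption: a "time" below this value is probably in seconds, not
# milliseconds. 10**11 ms corresponds to early 1973.
MIN_PLAUSIBLE_MS = 10**11


def check_line(line):
    """Return a problem description for one line, or None if it looks fine."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return "not valid JSON"
    if not isinstance(event, dict):
        return "not a JSON dictionary"
    t = event.get("time")
    # bool is a subclass of int in Python, so exclude it explicitly
    if not isinstance(t, int) or isinstance(t, bool):
        return 'missing or non-integer "time"'
    if t < MIN_PLAUSIBLE_MS:
        return '"time" looks like seconds, not milliseconds'
    return None


def check_lines(lines):
    """Count problems by description across an iterable of lines."""
    problems = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        problem = check_line(line)
        if problem:
            problems[problem] = problems.get(problem, 0) + 1
    return problems
```

Passing an open file to check_lines returns a dictionary of problem counts; an empty dictionary means none of these particular checks failed.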


Iterate

The first time you load your data, it's usually not exactly how you want it. The easy way to iterate with the evaluation version of Scuba is to delete and recreate the table and the source data as follows:

  1. Drop the Scuba table:

> ia table drop <table_name>

  2. Recreate the table:

> ia table create

This recreates the default table that Scuba shipped with. If you’ve created a fancier table definition (see Advanced below), then you’ll want to repeat that table create command here, not the default command.

In addition, any data that was in /data/events/ will be moved into /data/archive/.

  3. Drop your new files into /data/events/, and Scuba will see them and ingest them. This takes a minute or two if the data is small (a few million rows) or longer if it is larger.

Feel free to do this as many times as you like. If you’d like to keep around one version of a table while you try a new one, simply omit the drop command, and read the Advanced section below to learn more about creating tables.


Advanced

The command ia table create has a number of additional options that might be useful.

To explain them, first some background. A table in Scuba has two columns that are special: the time column and the actor column(s).

The time column is used to place the event on the timeline for simple things like count per day, and more complicated ones like time between two steps in a flow. The same column must be used for time in all events, and it must have the same format.

The default time column is called “time” and is formatted as milliseconds since Jan 1, 1970, but both the column name and format can be overridden. You can change the column name with the -t flag to ia table create, and the format with the -f option. Valid options for format are ‘seconds’, ‘milliseconds’, ‘microseconds’, or a python strptime format such as %Y-%m-%dT%H:%M:%S.%fZ.
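Before choosing a strptime format for the -f option, it can help to confirm that the format actually parses your timestamps. The sketch below uses the example format from the paragraph above and converts the result to epoch milliseconds. The function name is hypothetical, and treating the parsed time as UTC is an assumption made here, not documented Scuba behavior.

```python
from datetime import datetime, timedelta, timezone

FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"  # the example format from the text
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(text, fmt=FORMAT):
    """Parse a timestamp string and return epoch milliseconds, assuming UTC."""
    # strptime raises ValueError if the text does not match the format
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids floating-point rounding
    return (parsed - EPOCH) // timedelta(milliseconds=1)
```

If to_epoch_millis raises ValueError on a sample of your timestamps, the format string does not match your data.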

The actor column is the column whose behavior you care about. This is typically an ID or name for a user or device, or maybe something more abstract like a piece of content. Some of the fancier behavioral features in Scuba (such as flows) are limited to working only with an actor column. You can have more than one actor, but typically it’s fewer than five.

The default actor column is called “actor” and is of type “string”. You can override this with the -a flag to ia table create. This needs both the column name and the type, which is either string or int. If you want more than one actor column, you need to put in a separate -a. For example, if I have a string column user_id and an integer column anonymous_id I would use this command:

> ia table create my_data -a user_id string -a anonymous_id int

It's also worth noting that you can make a table pointing to a directory that already has data in it, and Scuba will import that data.

So if you want to change your table definition but not your source data, just drop and recreate the Scuba table without touching the source data.

Regarding data types, Scuba cares whether fields are numbers or strings, and it uses the json type to determine that.

So if you have a field "my_number": 123, it will show up as a number and you can use it to compute things, but if you have "my_number": "123", Scuba will interpret it as a string and you'll only be able to do string operations (like substring match) with it.
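The distinction is visible as soon as the JSON is parsed. As a rough Python sketch (the helper names are hypothetical and this is not how Scuba itself is implemented), the first function reports how a parsed value would be classified, and the second converts numeric strings in chosen fields to numbers before you write the file:

```python
import json


def json_kind(value):
    """Classify a parsed JSON value as "number", "string", or "other"."""
    if isinstance(value, bool):
        return "other"  # JSON true/false are not numbers
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    return "other"


def coerce_numeric_strings(event, fields):
    """Return a copy of event with the named fields converted from str to a number."""
    fixed = dict(event)
    for name in fields:
        value = fixed.get(name)
        if isinstance(value, str):
            try:
                fixed[name] = int(value)
            except ValueError:
                try:
                    fixed[name] = float(value)
                except ValueError:
                    pass  # leave non-numeric strings untouched
    return fixed
```

Here json_kind(json.loads('{"my_number": "123"}')["my_number"]) reports a string, while the same field after coerce_numeric_strings reports a number.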

Simple nesting of dictionaries gets flattened in Scuba. That is,

{"foo": {"a":1, "b":2}}

gives you two columns, foo.a and foo.b.
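To preview which column names a nested event would produce, the dotted naming can be imitated in a few lines of Python. This sketch only illustrates the naming described above; it is not Scuba's implementation, and the function name is hypothetical.

```python
def flatten(event, prefix=""):
    """Flatten nested dictionaries into a single dict with dotted key names."""
    flat = {}
    for key, value in event.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))  # recurse into nested dicts
        else:
            flat[name] = value  # lists and simple values are kept as-is
    return flat
```

For the example above, flatten({"foo": {"a": 1, "b": 2}}) yields the keys foo.a and foo.b.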

Scuba can have columns that are lists of things, so if you have a list of simple types in your JSON, Scuba will probably do what you expect.

Scuba does not import lists of dictionaries well, however. Compare the following two examples:

{"list_of_ints": [ 1, 2, 3], "list_of_strings": ["foo", "bar", "baz"]}

The above JSON is great. This will get you two columns in Scuba of type integer list and string list.

{"list_of_objects": [{"a":1, "b":2}, {"a":3, "c":4}]}

Not so great. Scuba will still import it, but you won't love the result. You'll get columns list_of_objects.a, list_of_objects.b, and so on, and it's difficult to query.

Regarding installing a certificate, this article should be helpful!

Also, if you're interested in installing the AWS CLI to download files from S3 and put them in /data/events, the AWS CLI documentation should be helpful.

Configuring BAQL

To query your data from outside of the Scuba server or the UI, try using BAQL. BAQL is the Scuba behavioral analytics query language that wraps around the Scuba query model. It’s similar to SQL but has some key differences. For more information about the syntax, see BAQL syntax and usage.

To get started, you’ll need a token. In your browser, log into Scuba. Then visit


Copy your token and put it in a safe and secure place. Then, from a terminal or a script, you can use the token to run BAQL queries. Here’s an example curl command that counts all events from the beginning of time until now.

> curl -k -H "Authorization: Token vVnACM3xGPnmxEXse5A8dYVVxyl/YOpXlzYNZwRXgUDqveKpme+rfuFrFVMcZ8euJccQMm7kMstijL2kG+YNqxsDvb2e0000" https://<public_ip>/v1/query -d '{"bql": "select count(*) from <table_name> between beginning_of_time and now"}'
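The same request can be issued from a script. The following Python sketch mirrors the curl example using only the standard library. The function names are hypothetical, the endpoint, header, and "bql" key are taken from the curl example, and the response is returned as raw text because its format is not described here.

```python
import json
import ssl
import urllib.request


def build_query_request(host, token, table):
    """Build the same POST request as the curl example, without sending it."""
    query = f"select count(*) from {table} between beginning_of_time and now"
    return urllib.request.Request(
        url=f"https://{host}/v1/query",
        data=json.dumps({"bql": query}).encode("utf-8"),
        headers={"Authorization": f"Token {token}"},
        method="POST",
    )


def run_query(host, token, table):
    """Send the request and return the raw response body as text."""
    # Equivalent of curl -k: skip certificate verification. This is only
    # reasonable while no certificate is installed on the instance.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    request = build_query_request(host, token, table)
    with urllib.request.urlopen(request, context=context) as response:
        return response.read().decode("utf-8")
```

Calling run_query with your instance's public IP, your token, and a table name such as events should return the same body that the curl command prints. Once a certificate is installed, the verification bypass should be removed.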
