Planning your Scuba deployment

This document offers guidelines for planning a Scuba deployment that meets your company's current data analytics needs, as well as scaled growth for the future.

Note that for most deployments, you will work with a Scuba representative to determine sizing, set up cluster(s), and deploy to production. If this is not the case, please reach out to help@scuba.io to ensure you have access to all necessary Scuba Admin documentation.

Scuba—behavioral analytics for your whole team

Scuba is full stack behavioral analytics software with a web-based visual interface and scalable distributed back-end datastore to process queries on event data.

Scuba's intuitive graphical user interface (GUI) provides interactive data exploration for a wide range of users across the spectrum of digital businesses. The Visual Explorer encourages rapid iteration with point-and-click query building and interactive visualizations. All without having to deal with complicated query syntax. You can go from any board to explore the underlying data, change parameters, and drill down to the granular details of the summary.

To plan a Scuba deployment that is optimized for your company, review the following topics:

High-level overview—how Scuba works

Scuba is full stack behavioral analytics software that allows users to explore the activity of digital services. Scuba includes both its own web-based visual interface and a highly scalable distributed back end database to store the data and process queries. Scuba supports Ubuntu 14.04.x in cloud environments, as well as virtual machines or bare-metal systems. Scuba enables you to ingest data in a variety of ways, including live data streams.

The following image shows an example of the flow of data into the Scuba cluster from imported files (single and batch) to live data streams from HTTP and Kafka sources. Ingested data is transformed and stored in the appropriate node, data or string, then processed in queries, with the results delivered to the requesting user.

The basics—system requirements

It is important to understand the hardware, software, and networking infrastructure requirements to properly plan your Scuba deployment. Before making a resource investment, read through this entire document. The following sections explain sizing guidelines, data formats, and platform specific information needed to accurately estimate a Scuba production system for your company.

A Scuba cluster consists of the following nodes that can be installed on a single server (single-node cluster), or across multiple servers (multi-node cluster).

config node — Node from which you administer the cluster. MySQL database (DB) is only installed on this node for storage of Scuba metadata. Configure this node first.
api node — Serves the Scuba application, merges query results from data and string nodes, and then presents those results to the user. Nginx is only installed on the API node.
import (ingest) node — Connects to data repositories (cloud, live streaming, remote or local file system), streams live data, downloads new files, processes the data and then sends to data and string tiers, as appropriate.
data node — Data storage. Must have enough space to accommodate all events and stream simultaneous query results.
string node — String storage for the active strings in the dataset, stored in compressed format. Requires sufficient memory to hold the working set of strings accessed during queries.
listener node — Streams live data from the web or cloud, also known as streaming ingest. Only applicable when ingesting from HTTP.

The following table outlines basic software, hardware, and network requirements for a Scuba deployment.

Software

Hardware

Browser

Networking

Operating system (OS):

Ubuntu 18.04

Linux desktop and server systems in a private cloud, or locally at your company site.

Minimum requirements for each node, unless otherwise noted:

4 CPU cores
8 GB RAM
100 GB storage

Microsoft Edge—Latest version

Chrome—Latest version

Gigabit network connection
All nodes must be on the same subnet.
Each node must have a routable IP address, that has not been remapped (via NAT).
There should be no restriction on local port connections.
Required open ports across the cluster are specified in Node configurations.

Guidelines for calculating the number of node type for your deployment are covered in the following sections.

Production workflow—test, review, revise, and go

It is strongly recommended that you first set up a sandbox test cluster (or work with your Scuba rep to have one set up for you) with a sample of your data, so any necessary adjustments can be made before deploying a production environment. This will allow you to assess the quality of your data and better determine the appropriate size for a production cluster.

For self-service Scuba deployments, follow these steps:

Review the rest of this document to gain a preliminary assessment of the needs for your production cluster.
Install a single node cluster in a sandbox test environment.
Load a week's worth of data.
Modify your data formats until you get the desired results.
Review the usage for the string and data nodes.
From the usage for one week's worth of data, you can estimate usage for one month, and then one year.
Factor in the estimated growth percentage for your company to create a multi-year usage projection.
Set up your Scuba production environment and ingest your data.

Cluster configurations—the big picture

This section provides an overview of a several typical cluster configurations:

Stacked vs. single nodes
Single-node cluster
Two node cluster
Five node cluster
Ten node cluster

Stacked vs. single nodes

The configuration of your cluster depends on the amount of data you have, its source, and projected growth. To validate the quality of your data and the analytics you wish to perform, we require a sample of your data prior to going into production.

Stacked nodes are not recommended for large event volumes or workloads, or if you are planning to scale your cluster at a later date.

For smaller deployments, it is possible to stack nodes—install multiple services on one system—in the following ways:

API, config, ingest—admin
string, data—worker

Review the following configuration examples and sizing guidelines, followed by data format and platform-specific issues.

Single-node cluster

The stacked single-node cluster is only recommended for sandbox test environments that are used to validate data and assess the requirements for a production cluster.

Two node cluster

The stacked two-node cluster configuration is for deployments with very low data volumes. In this configuration, the API and config nodes are stacked on the same node. A Listener node for streaming ingest from a live data source is not included, as this tier is optional.

Five node cluster

The following is an example of a basic five node cluster configuration, in which each functional component is installed on a separate system. A Listener node for streaming ingest from a live data source is not included, as this tier is optional.

Ten node cluster

The ten node cluster configuration illustrates how to scale a cluster for increased data capacity with added data and string nodes. To increase disk capacity, it is recommended that you scale the cluster horizontally to maintain your original performance.

Sizing guidelines—for today and tomorrow

Determining the appropriate Scuba cluster size for your company is largely based on the quantity of daily events that are generated. First, determine the current number of generated events, then estimate your company's projected data growth over a year, then multiply by the number of years in your expected growth cycle.

Consider the following when planning the size and capacity of your production Scuba cluster:

Node configurations
Capacity planning
Planning the data tier
Planning the ingest tier
Planning the string tier
Cluster configuration guidelines

Node configurations

The following table provides guidelines for Scuba cluster node configurations.

Component	Recommended configuration	Open ports across the cluster
Data nodes (#)	8 CPU Cores (1 CPU Core scans 1 billion rows of data) 64 GB (8 GB per core on the node) 1:1 ratio of disk space to total data size for initial test cluster. For example, 1 GB of data requires 1 GB disk space. Adjust the data:disk-space ratio after you know the compression rate of your data.	8500, 8050, 7070
Import nodes (#)	An import node can be deployed on its own for intensive ingest requirements, or co-located (stacked) with other nodes (such as a data node) for optimal performance. 8 CPU Cores 64 GB (8 GB per core on the node) 80 GB SSDs	8500, 8600, 8000
String nodes (3 or more)	8 CPU Cores (1 CPU Core scans 1 billion rows of data) 64 GB (8 GB per core on the node) 1:1 ratio of disk space to total data size Example: 1 GB of data requires 1 GB disk space	8050, 2000-4000
API node	60 GB disk space (per CPU Core) The API node includes the following: Analytics server (processes incoming query requests and returns the query results) Web service Front-end GUI The API node can be deployed on its own, or stacked with the config node.	80, 443, 8600, 8200, 8050, 8500, 8400
Config node	60 GB disk space (per CPU Core) The config node can be deployed on its own node or stacked with the API node.	3306, 8050, 8700
Listener node (optional)	Manages streaming ingest, live data (data-in-motion) from a web or cloud source. The three streaming ingest utilities (Listener, Kafka, and Zookeeper) can be deployed separately or on a single node.	2181, 2888, 3888, 9092

Capacity planning

Use the following guidelines to estimate the capacity for a Scuba production cluster.

Resource	How to calculate capacity
CPU	Billions of events / number of cores per data node
Storage	Bytes per event x number of events / available disk per data node
Memory	Memory = 0.35* disk requirements (for half the events in memory)
Data nodes (#)	Max (disk, CPU, memory)
Ingest nodes (#)	Number of data nodes / 4 during bulk import, then 1 after.
String nodes	Number of data nodes / 4, rounded to the closest odd number. Use 3 to 5 string nodes. Do not use less than 3. It is recommended that you use SSD storage for the string nodes. Bytes per event can be computed after importing a sample of data. 40 is a conservative estimate.

Planning the data tier

The main constraint for a data tier is disk space. However, beware of under-provisioning CPU and memory resources as it can result in reduced performance. Consider the following guidelines when planning a production data tier:

CPU

Memory and Disk Space

The number of rows and columns in a table directly affects performance. All columns matter, even if they are not utilized. Prior to importing data, omit columns that are not used.

If you have excessively large tables, consider increasing the CPUs of the data nodes.

Read performance scales linearly in proportion with the number of CPU cores.

A good estimate is that a data set with 1B events fits in 25 GB memory. Memory increases and scales linearly in proportion with the number of events and nodes.

Scuba compression rate is approximately 15-20x (raw data bytes to bytes stored on data servers).

DO NOT store data on the root partition.

AWS—Use either an ephemeral drive (SSD) or separate EBS volume to store data. A minimum GP2 is recommended.

Azure—Use premium local redundant storage (LRS)

For older generation CPUs, 8 MB of L3 cache is the required minimum. Pre-2010 CPUs are not supported.

Planning the import tier

Consider the following guidelines when planning a production import (ingest) tier:

Capacity

Sizing

50 GB of raw data / hour / core is a conservative estimate. A single 8 core import node is sufficient in most instances.

SSDs are recommended.

1 import node per 4 data nodes is recommended, in general. You can add ingest nodes to accommodate bulk imports, then remove the extra nodes after the bulk import is done. Performance scales linearly with the number of nodes.

Beware of over-provisioning. If the data or string tiers are over capacity, the ingest nodes can be affected causing a drop in performance.

Planning the string tier

Consider the following guidelines when planning a production string tier:

Capacity

Sizing

String columns typically have a lot of duplicates, but are stored only once, verbatim.

High-cardinality string columns impede performance, so it's fine to omit duplicate columns.

The main constraint is the disk space. SSDs are strongly recommended.

An odd number of string nodes is required. Use 3 or 5 such instances and keep an eye on the disk usage.

Cluster configuration guidelines

It is recommended that you follow the production workflow, and first deploy a sandbox test cluster with sample data to properly assess your data and accurately estimate a production cluster.

Use the following guidelines in planning your cluster:

For a sandbox test cluster, start with one main dataset on the cluster, with 1 shard key. A separate table copy is required for each shard key. This means that if you want N shard keys for your dataset, you need N times the storage on your data tier. Defining the appropriate number and type of shard keys may be an iterative process for your production cluster. It is recommended that you start small and add additional shard keys as necessary, since each shard key requires a full copy of the dataset.
Log data usually ranges from 50 - 150 bytes / event (storage required for 1 compressed event on the data tier). 80 bytes / event is a reasonable density estimate. High cardinality integer data, such as timestamps and unique identifiers and set columns, take up the most space in a dataset. If your data contains a lot of this type of data, estimate a few more data nodes for your production deployment.
String data should be stored efficiently. The following estimates account for data with several high cardinality string columns: 1 or 2 parsed URL columns, 1 user agent column, and 1 IP address column. Hex identifier columns should be stored as integers, through the use of a hex transform (applied by default in Scuba). Other "string" identifiers (like base64 encoded session ids) are hashed into integers and stored on the data tier.
Consider your expected data retention and adjust the configuration accordingly. You will need more data nodes for a 90-day retention than for 30 days.
Data input volume is another factor to consider when planning a cluster. For instance, a single-node cluster which has a total capacity of 24B events is adequate for 267M (24B / 90D) events per day. To import a higher volume of events per day, larger import and data tiers are required to keep up with the ingest rate and service queries in a timely manner.

Data types and formats—consider your source

It's important to consider the source of your data, as well the data type and how it's structured. Streaming ingest requires a special cluster configuration and installation. Be aware that some data types may require transformation for optimum analytics.

Data types

Scuba accepts the following data types:

JSON—Scuba's preferred data format. The JSON format is a flat set of name-value pairs (without any nesting), which is easy for Scuba to parse and interpret. If you use a different format, your data must be transformed into JSON format before it can be imported into Scuba.
Apache log format—Log files generated using mod_log_config. For details, see mod_log_config - Apache HTTP Server Version 2.4. It's helpful to provide the mod_log_config format string used to generate the logs, as Scuba can use that same format string to ingest the logs.
CSV—Scuba accepts CSV format with some exceptions. First, ensure sure you have a complete header row. Next, ensure you are using a supported separator character (tab, comma, semicolon, or ascii character 001). Finally, ensure your separator is clean and well-escaped—for example, if you use comma as your separator, make sure to quote or escape any commas within your actual data set.

For more information, see the Data types reference. For best practices in logging your data, see Best practices for formatting your data.

Data sources

Scuba can ingest event data a variety of ways:

Amazon Simple Storage Service (S3)—Cloud storage (Amazon Web Services, AWS)
Microsoft Azure—Blob storage
Google Cloud Platform—Unified object storage
Live data on an HTTP port—Streaming ingest
Local file systems—See the Ingest file types and formats reference

Logging and adding data

How your data is structured is important for optimum analytics results and performance. Review the following topics:

Platform-specific planning—how much is enough?

The following table compares system recommendations for cluster configurations on the following platforms:

On-premise installations should only use local SSDs, no Storage Area Networks (SAN).

The Azure recommendations in the following table allow you to add premium storage.

Configuration

System recommendation

Single-node:

1 node, 4 CPU, 8 GB memory

AWS—i3.xlarge or m5.xlarge,

Azure—DS3_V2

Stacked nodes:

2 nodes, 4 CPU, 16 GB memory

AWS:

API/config/ingest—C3.2xlarge
data/string—i3.xlarge

Azure: DS3_v2

Blob Storage is different between the 2 nodes:

Stacked API/config/ingest can use a P1 LRS storage.
Stacked data/string can use P2 LRS storage.

Small multi-node:

4 nodes, 4 CPU, 64 GB memory

Depends on the stacked configuration, but in general the following configurations are recommended.

AWS:

API/config—c5.2xlarge
ingest—r5.xlarge
data/string—i3.xlarge

Azure:

API/config—DS3_v2
ingest—DS4_v2
data/string—DS12_v2

Larger multi-node:

10 nodes, 4 CPU, 800 GB memory

AWS:

API—c5.2xlarge
config - m5.large
ingest—r5.xlarge
string—i3.xlarge
data—i3.xlarge

Azure:

API - DS13_v2
config, DS12_v2
import - DS12_v2
data - DS13_v2
string - DS13_v2