This document offers guidelines for planning an Interana deployment that meets your company's current data analytics needs, as well as scaled growth for the future.
Interana—behavioral analytics for your whole team
Interana is full stack behavioral analytics software with a web-based visual interface and scalable distributed back-end datastore to process queries on event data.
Interana's intuitive graphical user interface (GUI) provides interactive data exploration for a wide range of users across the spectrum of digital businesses. The Visual Explorer encourages rapid iteration with point-and-click query building and interactive visualizations. All without having to deal with complicated query syntax. You can go from any Living Dashboard to explore the underlying data, change parameters, and drill down to the granular details of the summary.
To plan an Interana deployment that is optimized for your company, review the following topics:
High-level overview—how Interana works
Interana is full stack behavioral analytics software that allows users to explore the activity of digital services. Interana includes both its own web-based visual interface and a highly scalable distributed back end database to store the data and process queries. Interana supports Ubuntu 14.04.x in cloud environments, as well as virtual machines or bare-metal systems. Interana enables you to ingest data in a variety of ways, including live data streams.
The following image shows the flow of data into the Interna cluster, from imported files (single and batch) to live data streams from HTTP and Kafka sources. Ingested data is transformed and stored in the appropriate node, data or string, then processed in queries, with the results delivered to the requesting user.
The basics—system requirements
It is important to understand the hardware, software, and networking infrastructure requirements to properly plan your Interana deployment. Before making a resource investment, read through this entire document. The following sections explain sizing guidelines, data formats, and platform specific information needed to accurately estimate an Interana production system for your company.
An Interana cluster consists of the following nodes that can be installed on a single server (single-node cluster), or across multiple servers (multi-node cluster).
- config node — Node from which you administer the cluster. MySQL database (DB) is only installed on this node for storage of Interana metadata. Configure this node first.
- api node — Serves the Interana application, merges query results from data and string nodes, and then presents those results to the user. Nginx is only installed on the api node.
- ingest node — Connects to data repositories (cloud, live streaming, remote or local file system), streams live data, downloads new files, processes the data and then sends to data and string tiers, as appropriate.
- data node — Data storage, must have enough space to accommodate all events and stream simultaneous query results.
- string node — String storage for the active strings in the dataset, stored in compressed format. Requires sufficient memory to hold the working set of strings accessed during queries.
- listener node — Streams live data from the web or cloud, also known as streaming ingest. Optional during installation.
The following table outlines basic software, hardware, and network requirements for an Interana deployment.
Operating system (OS):
Linux desktop and server systems in a private cloud, or locally at your company site.
Minimum requirements for each node, unless otherwise noted:
Microsoft Edge—Latest version
Guidelines for calculating the number of node type for your deployment are covered in the following sections.
Guidelines for calculating the number of each type of node for your deployment are covered in the following sections.
Production workflow—test, review, revise, and go
It is strongly recommended that you first set up a sandbox test cluster with a sampling of your data, so you can make any necessary adjustments before deploying a production environment. This will allow you to assess the quality of your data and the better determine the appropriate size for a production cluster.
Follow these steps:
- Review the rest of this document to gain a preliminary assessment of the needs for your production cluster.
- Install a single node cluster in a sandbox test environment, as described in Install single-node Interana.
- Load a week's worth of data.
- Modify your data formats until you get the desired results.
- Go to the Resource Usage page:
- Review the usage for the string and data nodes. For more information, see Track your data usage.
- From the usage for one week's worth of data, you can estimate usage for one month, and then one year.
- Factor in the estimated growth percentage for your company to create a multi-year usage projection.
- Install your Interana production environment and add your data.
Cluster configurations—the big picture
We recommend that you initially deploy a single-node sandbox cluster on which to test sample data. The sandbox test cluster will help you to accurately estimate the necessary size and capacity of a production cluster. This section provides an overview of a several
typical cluster configurations:
Stacked vs. single nodes
The configuration of your cluster depends on the amount of data you have, its source, and projected growth. To validate the quality of your data and the analytics you wish to perform, we recommend you install a test sandbox single-node cluster prior to going into production.
Stacked nodes are not recommended for large workloads, or if you are planning to scale your cluster at a later date.
For smaller deployments, it is possible to stack nodes—install multiple services on one system—in the following ways:
- API, config, ingest—admin
- string, data—worker
The stacked single-node cluster is recommended for sandbox test environments that are used to validate data and assess the requirements for a production cluster.
Two node cluster
The stacked two-node cluster configuration is for deployments with low data volumes. In this configuration, the API and config nodes are stacked on the same node. A Listener node for streaming ingest from a live data source is not included.
Five node cluster
The following is an example of a basic five node cluster configuration, in which each functional component is installed on a separate system. A Listener node for streaming ingest from a live data source is not included, as this tier is optional.
Ten node cluster
The ten node cluster configuration illustrates how to scale a cluster for increased data capacity with added data and string nodes. To increase disk capacity, it is recommended that you scale the cluster linearly to maintain your original performance.
Sizing guidelines—for today and tomorrow
Determining the appropriate Interana cluser size for your company is largely based on the quantity of daily events that are generated. First, determine the current number of generated events, then estimate your company's projected data growth over a year, then multiply by the number of years in your expected growth cycle.
Consider the following when planning the size and capacity of your production Interana cluster:
- Node configurations
- Capacity planning
- Planning the data tier
- Planning the ingest tier
- Planning the string tier
- Cluster configuration guidelines
The following table provides guidelines for Interana cluster node configurations.
|Component||Recommended configuration||Open ports across the cluster|
|Data nodes (#)||
||8500, 8050, 7070|
|Ingest nodes (#)||
The ingest node can be deployed on its own for intensive ingest requirements, or co-located (stacked) with other nodes (such as a data node) for optimal performance.
|8500, 8600, 8000|
(3 or more)
60 GB disk space (per CPU Core)
The API node includes the following:
The API node can be deployed on its own, or stacked with the config node.
|80, 443, 8600, 8200, 8050, 8500, 8400|
60 GB disk space (per CPU Core)
The config node can be deployed on its own node or stacked with the API node.
|3306, 8050, 8700|
|Listener node||Manages streaming ingest, live data (data-in-motion) from a web or cloud source. The three streaming ingest utilities (Listener, Kafka, and Zookeeper) can be deployed separately or on a single node. For more information, see Install multi-node Interana.||2181, 2888, 3888, 9092|
Use the following guidelines to estimate the capacity for an Interana production cluster.
|Resource||How to calculate capacity|
|CPU||Billions of events / number of cores per data node|
|Storage||Bytes per event x number of events / available disk per data node|
|Memory||Memory = 0.35* disk requirements (for half the events in memory)|
|Data nodes (#)||Max (disk, CPU, memory)|
|Ingest nodes (#)||
Number of data nodes / 4 during bulk import, then 1 after.
Number of data nodes / 4, rounded to the closest odd number. Use 3 to 5 string nodes. Do not use less than 3.
It is recommended that you use SSD storage for the string nodes. Bytes per event can be computed after importing a sample of data. 40 is a conservative estimate.
Planning the data tier
The main constraint for a data tier is disk space. However, beware of under-provisioning CPU and memory resources as it can result in reduced performance. Consider the following guidelines when planning a production data tier:
|CPU||Memory and Disk Space|
The number of rows and columns in a table directly affects performance. All columns matter, even if they are not utilized. Prior to importing data, omit columns that are not used.
If you have excessively large tables, consider increasing the CPUs of the data nodes.
Read performance scales linearly in proportion with the number of CPU cores.
A good estimate is that a data set with 1B events fits in 25 GB memory. Memory increases and scales linearly in proportion with the number of events and nodes.
Interana compression rate is approximately 15-20x (raw data bytes to bytes stored on data servers).
DO NOT store data on the root partition.
AWS—Use either an ephemeral drive (SSD) or separate EBS volume to store data. A minimum GP2 is recommended.
Azure—Use premium local redundant storage (LRS)
For older generation CPUs, 8 MB of L3 cache is the required minimum. Pre-2010 CPUs are not supported.
Planning the ingest tier
Consider the following guidelines when planning a production ingest tier:
50 GB of raw data / hour / core is a conservative estimate. A single 8 core ingest node is sufficient in most instances.
SSDs are recommended.
1 ingest node / 4 data nodes is recommended, in general. You can add ingest nodes to accommodate bulk imports, then remove the extra nodes after the bulk import is done. Performance scales linearly with the number of nodes.
Beware of over-provisioning. If the data or string tiers are over capacity, the ingest nodes can be affected causing a drop in performance.
Planning the string tier
Consider the following guidelines when planning a production string tier:
String columns typically have a lot of duplicates, but are stored only once, verbatim.
High-cardinality string columns impede performance, so it's fine to omit duplicate columns.
The main constraint is the disk space. SSDs are strongly recommended.
An odd number of string nodes is required. Use 3 or 5 such instances and keep an eye on the disk usage.
Cluster configuration guidelines
It is recommended that you follow the production workflow, and first deploy a sandbox test cluster with sample data to properly assess your data and accurately estimate a production cluster.
Use the following guidelines in planning your cluster:
- For a sandbox test cluster, start with one main dataset on the cluster, with 1 shard key. A separate table copy is required for each shard key. This means that if you want N shard keys for your dataset, you need N times the storage on your data tier. Defining the appropriate number and type of shard keys may be an iterative process for your production cluster. It is recommended that you start small and add additional shard keys as necessary, since each shard key requires a full copy of the dataset.
- Log data usually ranges from 50 - 150 bytes / event (storage required for 1 compressed event on the data tier). 80 bytes / event is a reasonable density estimate. High cardinality integer data, such as timestamps and unique identifiers and set columns, take up the most space in a dataset. If your data contains a lot of this type of data, estimate a few more data nodes for your production deployment.
- String data should be stored efficiently. The following estimates account for data with several high cardinality string columns: 1 or 2 parsed URL columns, 1 user agent column, and 1 IP address column. Hex identifier columns should be stored as integers, through the use of a hex transform (applied by default in Interana). Other "string" identifiers (like base64 encoded session ids) are hashed into integers and stored on the data tier.
- Consider your expected data retention and adjust the configuration accordingly. You will need more data nodes for a 90-day retention than for 30 days.
- Data input volume is another factor to consider when planning a cluster. For instance, a single-node cluster which has a total capacity of 24B events is adequate for 267M (24B / 90D) events per day. To import a higher volume of events per day, larger import and data tiers are required to keep up with the ingest rate and service queries in a timely manner.
Given these guidelines, the following table lists some cluster sizing examples, based on the number of events stored in the cluster. If you want to store more than 80 Billion events, it is recommended that you build one of the following clusters and import data first, to be able to more accurately size your larger cluster.
The AWS estimated costs in the following table are for backing up data and string nodes, and do not take into account organizational discounts. I also assume the use of 1-year partial upfront reserved instances to reduce costs.
|Example Cluster||Events||Nodes||AWS estimate|
1 stacked node: API, ingest, config, data, string
$4,713.76 / year (partial upfront RIs)
|Two node cluster||8B||
$8,388.40 / year (partial upfront RIs)
|Individual node cluster||24B||
$18,274.96 / year (partial upfront RIs)
$52,372.77 / year
Data types and formats—consider your source
It's important to consider the source of your data, as well the data type and how it's structured. Streaming ingest requires a special cluster configuration and installation (see Install multi-node Interana). Be aware that some data types may require transformation for optimum analytics.
Interana accepts the following data types:
- JSON—Interana's preferred data format. The JSON format is a flat set of name-value pairs (without any nesting), which is easy for Interana to parse and interpret. If you use a different format, your data must be transformed into JSON format before it can be imported into Interana.
- Apache log format—Log files generated using mod_log_config. For details, see http://httpd.apache.org/docs/current/mod/mod_log_config.html. It's helpful to provide the mod_log_config format string used to generate the logs, as Interana can use that same format string to ingest the logs.
- CSV—Interana accepts CSV format with some exceptions. First, ensure sure you have a complete header row. Next, ensure you are using a supported separator character (tab, comma, semicolon, or ascii character 001). Finally, ensure your separator is clean and well-escaped—for example, if you use comma as your separator, make sure to quote or escape any commas within your actual data set.
Interana can ingest event data a variety of ways:
Logging and adding data
How your data is structured is important for optimum analytics results and performance. Review the following topics:
Platform-specific planning—how much is enough?
The following table compares system recommendations for cluster configurations on the following platforms:
On-premise installations should only use local SSDs, no Storage Area Networks (SAN).
The Azure recommendations in the following table allow you to add premium storage.
1 node, 4 CPU, 8 GB memory
AWS—t2.xlarge or m3.xlarge,
2 nodes, 4 CPU, 16 GB memory
Blob Storage is different between the 2 nodes:
4 nodes, 4 CPU, 64 GB memory
Depends on the stacked configuration, but in general the following configurations are recommended.
10 nodes, 4 CPU, 800 GB memory