This topic explores the runtime architecture Interana uses to ingest data.
The diagram below (which reads from left to right) shows the flow of data through the system. Data arrives either as files (in one of our supported file formats) or as the payload of an HTTP call to our add_events API. Interana then transforms and loads the data, eventually sending it to the Data and String servers where it becomes available for queries.
Ingest from S3 / Azure (aka File Scanner)
The File Scanner component is responsible for detecting and downloading files from a remote system (like Amazon S3 or Microsoft Azure Blob Storage). In addition to the connection information and credentials needed to connect to the remote system, the File Scanner requires information about which directory to scan and how files are timestamped within that directory. While running, the File Scanner keeps track of which files (name and size) have already been downloaded and processed, and by default will not process the same file more than once for the same table. The most common directives to give the File Scanner are:
- Continuous load: Continuously scan for new files that are timestamped "yesterday" or "today"
- Backfill: Do a single scan for files that are timestamped between two dates (for example, from 2017-10-01 back to 2017-08-01)
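The bookkeeping and the two directives above can be sketched as follows. This is a minimal illustration, not the File Scanner's actual code: the class `SeenFiles`, the function `dates_to_scan`, and the choice of `(table, name, size)` as the tracking key are all assumptions made for the example.

```python
from datetime import date, timedelta


class SeenFiles:
    """Remembers which files were already processed (hypothetical helper)."""

    def __init__(self):
        self._seen = set()

    def should_process(self, table, name, size):
        # Assumption: a file is identified by table, name and size together.
        key = (table, name, size)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def dates_to_scan(mode, today, start=None, end=None):
    """Return the dates whose files should be scanned (hypothetical helper)."""
    if mode == "continuous":
        # Continuous load: files timestamped "yesterday" or "today".
        return [today - timedelta(days=1), today]
    if mode == "backfill":
        # Backfill: every date in the range, newest first.
        lo, hi = sorted((start, end))
        return [hi - timedelta(days=i) for i in range((hi - lo).days + 1)]
    raise ValueError(f"unknown mode: {mode}")


seen = SeenFiles()
for day in dates_to_scan("continuous", date(2017, 10, 2)):
    name = f"{day.isoformat()}/events.json.gz"
    if seen.should_process("events", name, 1024):
        print("download", name)
```

Running the loop a second time with the same `seen` object prints nothing, because every file is already recorded for that table.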
Ingest through HTTP API (aka Kafka Consumer)
When the optional Listener tier is configured in your Interana instance, you can call the add_events API to pass events via an HTTP GET or POST request. Interana uses a uWSGI server to listen for these events, and then places them on an internal Apache Kafka bus for robustness. The Interana Import servers pull events from the Kafka bus and flush them to the next stage of the ingest pipeline based on a time and volume algorithm (to balance latency and throughput).
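To illustrate what a time and volume flush policy can look like, here is a minimal sketch. It is not the Import server's actual algorithm: the class `FlushBuffer`, its thresholds, and the injectable `clock` parameter are assumptions made for the example.

```python
import time


class FlushBuffer:
    """Buffers events and releases them as a batch (hypothetical helper).

    A batch is released when it holds max_events events (volume) or when
    its oldest event has waited max_age_seconds (time), whichever is first.
    """

    def __init__(self, max_events=1000, max_age_seconds=5.0, clock=time.monotonic):
        self.max_events = max_events
        self.max_age_seconds = max_age_seconds
        self._clock = clock
        self._events = []
        self._first_at = None

    def add(self, event):
        """Buffer one event; return a batch if a flush was triggered, else None."""
        if not self._events:
            self._first_at = self._clock()
        self._events.append(event)
        return self._maybe_flush()

    def poll(self):
        """Call periodically so that a quiet stream still flushes on time."""
        return self._maybe_flush()

    def _maybe_flush(self):
        if not self._events:
            return None
        too_many = len(self._events) >= self.max_events
        too_old = self._clock() - self._first_at >= self.max_age_seconds
        if too_many or too_old:
            batch, self._events = self._events, []
            self._first_at = None
            return batch
        return None
```

A larger `max_events` favors throughput (fewer, bigger batches), while a smaller `max_age_seconds` favors latency (events wait less before moving on).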
The Transform Engine is a Python module that performs decompression, file format conversion, and JSON processing, based on a configuration provided by the admin user. This module provides a number of built-in transformations, and also allows the admin user to write custom Python code to transform data (subject to some sandboxing restrictions). See Transformer Library Steps and Examples for more information.
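The general shape of such a pipeline (decompress, parse, then apply per-event steps) can be sketched in plain Python. This does not use the Transformer Library's real API or configuration format: `decompress`, `parse_json_lines`, `rename_column`, and `run_pipeline` are hypothetical names, and the input is assumed to be newline-delimited JSON, optionally gzip-compressed.

```python
import gzip
import json


def decompress(raw: bytes) -> bytes:
    # gzip streams start with the magic bytes 0x1f 0x8b.
    return gzip.decompress(raw) if raw[:2] == b"\x1f\x8b" else raw


def parse_json_lines(raw: bytes):
    # Assumption: one JSON object per non-empty line.
    for line in raw.decode("utf-8").splitlines():
        if line.strip():
            yield json.loads(line)


def rename_column(old, new):
    """Build a step that renames one column (stand-in for a custom step)."""
    def step(event):
        if old in event:
            event = dict(event)  # copy so the input event is not mutated
            event[new] = event.pop(old)
        return event
    return step


def run_pipeline(raw: bytes, steps):
    """Decompress, parse, then apply each step to each event in order."""
    out = []
    for event in parse_json_lines(decompress(raw)):
        for step in steps:
            event = step(event)
        out.append(event)
    return out


raw = gzip.compress(b'{"ts": 1, "user": "a"}\n{"ts": 2}\n')
print(run_pipeline(raw, [rename_column("user", "user_id")]))
```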
The JSON loader performs the following tasks:
1. Flatten nested objects and arrays of objects
2. Perform expansion of URL, User Agent and IP Address columns
3. Detect the data type of new columns
4. Parse all columns according to their types
5. Send integers and strings to the Data and String servers respectively
The Data types reference covers most of steps 1-4 above in more detail.
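As a rough picture of the flattening step, the sketch below turns nested objects into dotted column names and turns an array of objects into one list-valued column per key. The naming scheme and the list handling are assumptions for illustration; the function `flatten` is hypothetical and may not match the exact columns the JSON loader produces.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts (and lists of dicts) into a single-level dict."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested object: recurse with a dotted prefix.
            flat.update(flatten(value, prefix=name + "."))
        elif isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
            # Array of objects: collect each key's values into one list.
            for item in value:
                for sub_key, sub_value in flatten(item, prefix=name + ".").items():
                    flat.setdefault(sub_key, []).append(sub_value)
        else:
            # Scalars and lists of scalars are kept as they are.
            flat[name] = value
    return flat


event = {
    "user": {"id": 7, "geo": {"city": "Oslo"}},
    "items": [{"sku": "a", "qty": 1}, {"sku": "b", "qty": 2}],
}
print(flatten(event))
# {'user.id': 7, 'user.geo.city': 'Oslo', 'items.sku': ['a', 'b'], 'items.qty': [1, 2]}
```

Note that in this sketch, objects in an array that omit a key produce shorter lists for that key, so positions across the resulting columns are only aligned when every object has the same keys.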
For step 5, note that if one particular Data server host is temporarily unavailable, the JSON loader will "fail over" by sending data to a custodian host. Once data arrives at the custodian host, that host makes periodic attempts to send the data back to the natural shard where it belongs. In the meantime, the data is not available for querying, but this approach has the benefit that the Import server can move on to ingest the next set of data.
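The failover behavior can be modeled in a few lines. This is a toy, in-memory sketch under stated assumptions, not Interana's implementation: the `Host` class and the `send` and `retry_held` functions are hypothetical, and real hosts would communicate over the network.

```python
class Host:
    """Toy stand-in for a Data server host (hypothetical)."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.rows = []  # rows this host owns; these are queryable
        self.held = []  # (owner, row) pairs held on behalf of other hosts

    def store(self, row):
        if not self.available:
            raise ConnectionError(self.name)
        self.rows.append(row)


def send(row, natural, custodian):
    """Send a row to its natural host, failing over to the custodian."""
    try:
        natural.store(row)
        return natural.name
    except ConnectionError:
        # Held rows are not queryable, but the sender can move on.
        custodian.held.append((natural, row))
        return custodian.name


def retry_held(custodian):
    """Periodic task: try to hand held rows back to their natural hosts."""
    still_held = []
    for owner, row in custodian.held:
        try:
            owner.store(row)
        except ConnectionError:
            still_held.append((owner, row))
    custodian.held = still_held
```

Calling `retry_held` repeatedly is harmless while the natural host is still down: the rows simply stay in `held` until a later attempt succeeds.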
De-duplication of events
A common problem in event analytics is that the same event might get ingested into the system more than once. This can happen if your logging pipeline accidentally puts the same raw event into multiple files, or if the same raw file is ingested into the system more than once.
Interana handles de-duplication by computing a "unique token" for each event and then rejecting events at the Data server layer if that token is not unique. By default, we compute the unique token by taking a hash of the entire event data row (as a JSON string). However, the Transformer Library provides other options for computing the unique token based on a particular set of columns in the data, and you can even pass in a unique token directly as part of the data you log.
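The three ways of obtaining a unique token described above can be sketched as follows. The hash function (SHA-256), the canonical JSON form (sorted keys), and the names `unique_token` and `DedupStore` are assumptions for illustration; the text above does not specify which hash or serialization Interana uses.

```python
import hashlib
import json


def unique_token(event, columns=None, token_column=None):
    """Compute a de-duplication token for one event (hypothetical helper)."""
    # Option 3: the token is passed in directly as part of the logged data.
    if token_column is not None and token_column in event:
        return str(event[token_column])
    # Option 2: hash a chosen set of columns. Option 1 (default): hash the row.
    subset = event if columns is None else {c: event.get(c) for c in columns}
    # Assumption: sort keys so that key order does not change the hash.
    canonical = json.dumps(subset, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DedupStore:
    """Toy stand-in for a Data server that rejects repeated tokens."""

    def __init__(self):
        self._tokens = set()
        self.rows = []

    def insert(self, event, **token_options):
        token = unique_token(event, **token_options)
        if token in self._tokens:
            return False  # duplicate: rejected
        self._tokens.add(token)
        self.rows.append(event)
        return True


store = DedupStore()
event = {"ts": 1506816000, "user": "a", "action": "click"}
print(store.insert(event))        # True: first time this event is seen
print(store.insert(dict(event)))  # False: same content, same token
```

Hashing only selected columns is useful when a field such as an ingestion timestamp differs between otherwise identical copies of an event.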