This article explains the problems unbalanced actors can cause in a data tier, and how to redistribute them for more effective sampling. The following topics are covered:
- Unbalanced actors and the solution
- Working with your CSM to detect data whales
- Filtering out whales from queries
Unbalanced actors and the solution
In a perfect data tier, events would be allocated equally across all shards, but this isn't always the case. Sometimes one actor creates a disproportionate volume of events compared to other actors, causing a significant imbalance in your data. An extreme volume of events for any one actor can result in the following:
- Unbalanced sampled query results
- Poor-performing shards that are bigger than average
- Disk space issues for the data node
Unbalanced actors = whales
A person who is an exceptionally big spender in a gambling casino is known as a whale. Likewise, in behavioral data analytics, an actor that creates a disproportionate volume of events compared to the other actors in a dataset is a whale. The following illustration is a simplified example of a data node with six shards. The purple shard is a whale.
Whale actors can be any of the following:
- Real actor: a user who is an order of magnitude more active than most users
- System actor: services, notifications, and other sources of automated events
- Bot: a utility that automatically generates events
The imbalance a whale creates can cause sampling errors, especially if the whale happens to be picked as a sampling shard. For unsampled queries, the whale shard impedes performance. If the whale continues to expand unchecked, the node runs out of disk space much faster than the other nodes in the cluster. How do you resolve the imbalance a whale creates? You splash the whale, of course.
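To make the sampling problem concrete, here is a minimal sketch in Python. The shard counts are invented, and the estimator (scale one shard's count by the number of shards) is a deliberately naive stand-in, not a description of how Interana actually samples.

```python
# Hypothetical event counts for six shards; shard 3 holds a whale actor.
shard_event_counts = [1000, 1050, 980, 9000, 1020, 990]


def estimate_total_from_shard(counts, sampled_shard):
    """Naive sampled estimate: scale one shard's count up to the whole dataset."""
    return counts[sampled_shard] * len(counts)


def relative_error(estimate, truth):
    """Absolute error as a fraction of the true value."""
    return abs(estimate - truth) / truth


true_total = sum(shard_event_counts)

# Sampling an ordinary shard undercounts, because the whale is missed entirely.
estimate_normal = estimate_total_from_shard(shard_event_counts, 0)

# Sampling the whale shard overcounts, because the whale is scaled up six times.
estimate_whale = estimate_total_from_shard(shard_event_counts, 3)

error_normal = relative_error(estimate_normal, true_total)
error_whale = relative_error(estimate_whale, true_total)
```

In this toy data both estimates are off, but the estimate taken from the whale shard is off by far more, which is the situation the paragraph above describes.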
Rebalancing your data
Interana dissipates the data of an actor detected as a whale. In effect, the whale actor's data is splashed across the other shards in the dataset, which balances the distribution of data and keeps sampling efficient. Splashing is accomplished with shard function exceptions. Later, when behavioral queries are run on the dataset, whale actors are filtered out. Besides making sampling efficient, splashing has the following effects:
- The actor associated with the whale data goes away. You cannot run a behavioral query on that actor.
- The actor is filtered out of certain types of queries, since its data was effectively spread across all shards in the dataset.
Whale events are filtered out under specific circumstances only, as described in Filtering out whales from queries.
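The idea behind a shard function exception can be sketched as follows. This is an illustration only: the hash function, the shard count, the actor name `bot-42`, and the choice to spread whale events by event ID are all assumptions, not Interana's actual implementation.

```python
import hashlib

NUM_SHARDS = 6                  # hypothetical cluster size
SHARD_EXCEPTIONS = {"bot-42"}   # hypothetical whale actor on the exceptions list


def stable_hash(value: str) -> int:
    """Deterministic hash, so the same key always maps to the same shard."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def shard_for_event(actor_id: str, event_id: str) -> int:
    """Pick a shard for one event.

    Normal actors are sharded by actor, so all of their events land together.
    Actors on the exceptions list are sharded by event instead (an assumption
    made for this sketch), which splashes their events across the shards.
    """
    if actor_id in SHARD_EXCEPTIONS:
        return stable_hash(event_id) % NUM_SHARDS
    return stable_hash(actor_id) % NUM_SHARDS


# A normal actor stays on one shard; the whale is spread out.
normal_shards = {shard_for_event("user-7", f"evt-{i}") for i in range(600)}
whale_shards = {shard_for_event("bot-42", f"evt-{i}") for i in range(600)}
```

Because a splashed actor's events no longer live on a single shard, per-actor computations cannot see the actor's full history in one place, which is why the actor is excluded from behavioral queries.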
Working with your CSM to detect data whales
Interana can detect and redistribute data whales for Managed Edition customers. Communicating fully with your CSM is essential for successfully rebalancing your data. To detect whales in your data, your CSM needs to know the following:
- What is the name of the table with the suspected whale?
- What is the time interval in which the whale appears?
- How far back in time does the whale appear, and what is the length of the time periods?
- What is the name of the suspected whale actor?
Your CSM uses the following process to detect whales:
Query: An unsampled count event group query is run (by shard key, with a maximum of 1000 queries) on a specified table copy, and the results are stored.
Analysis: The results are examined. If there are fewer than 500 unique actors in a time period, or the mean count of events per actor is less than min-sample-mean, the time period is excluded from analysis. Each remaining time period is analyzed for whale actors. If the event count for an actor is more than outlier-threshold-stddev standard deviations above the mean, the actor is marked as a candidate whale for that time period.
Reporting: A table is printed listing the candidate whales, the total event count for each, and the number of time periods in which each appeared. You can choose to include only candidate whales that appeared in at least N periods, then re-run the analysis and view a report with different thresholds to determine which actors in the dataset are valid whales.
Generation: The candidate whales can then be automatically added to the shard function exceptions list of the table. This causes the whale actor events to be distributed across all shards in the dataset and excluded from behavioral and sharded count unique queries.
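The analysis and reporting steps above can be sketched in Python. The function names and data layout are hypothetical. The parameters `min_sample_mean` and `outlier_threshold_stddev` mirror the option names in the text, and 500 is the unique-actor minimum stated there. Two details are assumptions: the sketch uses the population standard deviation, and it totals events only over the periods in which an actor was flagged.

```python
from statistics import mean, pstdev

MIN_UNIQUE_ACTORS = 500  # periods with fewer unique actors are skipped


def candidate_whales(counts_by_actor, min_sample_mean, outlier_threshold_stddev):
    """Return the actors flagged as candidate whales in one time period."""
    if len(counts_by_actor) < MIN_UNIQUE_ACTORS:
        return set()
    counts = list(counts_by_actor.values())
    mu = mean(counts)
    if mu < min_sample_mean:
        return set()
    # Assumption: population standard deviation of per-actor event counts.
    cutoff = mu + outlier_threshold_stddev * pstdev(counts)
    return {actor for actor, n in counts_by_actor.items() if n > cutoff}


def whale_report(periods, min_sample_mean, outlier_threshold_stddev, min_periods):
    """Build rows of (actor, total events, periods flagged), largest first."""
    totals, appearances = {}, {}
    for counts_by_actor in periods:
        flagged = candidate_whales(
            counts_by_actor, min_sample_mean, outlier_threshold_stddev
        )
        for actor in flagged:
            totals[actor] = totals.get(actor, 0) + counts_by_actor[actor]
            appearances[actor] = appearances.get(actor, 0) + 1
    rows = [
        (actor, totals[actor], appearances[actor])
        for actor in appearances
        if appearances[actor] >= min_periods
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Toy data: 600 ordinary actors with 10 events each, plus one heavy actor.
busy_period = {f"user-{i}": 10 for i in range(600)}
busy_period["bot-42"] = 100_000
quiet_period = {f"user-{i}": 10 for i in range(100)}  # too few actors, skipped
quiet_period["bot-42"] = 100_000

report = whale_report(
    [busy_period, busy_period, quiet_period],
    min_sample_mean=5,
    outlier_threshold_stddev=3,
    min_periods=2,
)
```

Re-running `whale_report` with different values for the thresholds or for `min_periods` corresponds to the re-analysis described in the reporting step.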
Filtering out whales from queries
Whale actor events are distributed across all shards, and therefore excluded from behavioral and sharded count unique queries. The following diagram illustrates the process for filtering out whales in per-actor metric, per-flow metric, and shard lookup table queries.
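As a rough illustration of this filtering, the sketch below contrasts a plain event count with a per-actor metric. The event layout, the function names, and the actor `bot-42` are hypothetical, and treating a plain event count as unfiltered is an assumption drawn from the statement that whales are filtered out only in specific circumstances.

```python
SHARD_EXCEPTIONS = {"bot-42"}  # hypothetical whale actor on the exceptions list

events = [
    {"actor": "user-1", "type": "click"},
    {"actor": "user-1", "type": "view"},
    {"actor": "user-2", "type": "click"},
    {"actor": "bot-42", "type": "ping"},
    {"actor": "bot-42", "type": "ping"},
    {"actor": "bot-42", "type": "ping"},
]


def count_events(events):
    """Plain event count: assumed here to include whale events."""
    return len(events)


def events_per_actor(events, shard_exceptions):
    """Per-actor metric: events from splashed actors are skipped."""
    counts = {}
    for event in events:
        actor = event["actor"]
        if actor in shard_exceptions:
            continue  # whale data is spread across shards, so it is excluded
        counts[actor] = counts.get(actor, 0) + 1
    return counts


total_events = count_events(events)
per_actor = events_per_actor(events, SHARD_EXCEPTIONS)
```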