When you run a query that contains a count unique operation, Scuba might use an adaptive sampling algorithm, when the cardinality of the column is high. The count unique can be a top-level aggregator in the query, or included in the definition of a named expression used in the query. Adaptive sampling allows Scuba to return statistically significant results quickly even when processing queries that reference shard keys with a high number of unique values. 

With adaptive sampling, each shard sends a sample of values to the merge server. The merge server aggregates the truncated set of values, which limits the network, memory, and CPU resources required for the computation.

Although there is the risk that this can introduce a small amount of inaccuracy when computing a count unique on high cardinality columns (columns with a large number of unique values), in practice our users rarely reach the default sampling limit (8192 unique values).

Adaptive sampling and population sampling

The Sampled / All toggle at the bottom left of the Scuba UI determines whether Scuba performs population sampling, and is independent of adaptive sampling. Scuba might use the adaptive sampling algorithm for count unique queries even when running an unsampled query. See How Scuba performs data sampling for detailed information about how Scuba performs population sampling.

If Scuba did not use adaptive sampling, each shard in your cluster would have to return the entire list of unique values in the shard. Scuba would then compute the union of all unique values and perform the count operation. This is a resource-intensive operation: Scuba would need to perform operations on each shard, send the (large) results over the network, and then perform the count operation, requiring a large amount of CPU and memory resources on the merge server.

Why are we using this limit?

The sampling strategy that Scuba uses is taken from the paper "On Adaptive Sampling" (Flajolet 1990). We determined that this value provides accurate values, up to a 1% error rate (at most), and that error rate occurs only with data sets that include a greater than that number of unique values in the columns being analyzed.

We added the ability to configure this limit in Scuba version 2.24.2. You can now configure the adaptive sampling limits for shard key and non-shard key columns.  

When does adaptive sampling kick in?

Adaptive sampling activates when you run a query that either:

  • uses a count unique aggregation on a non-actor column; or

  • uses a count unique aggregation on an actor and also uses a time offset or split by.

For example, if a shard key (actor) is "user," running count unique user unsampled does not activate adaptive sampling. However, count unique user group by platform unsampled does activate adaptive sampling.

On Scuba version 2.x, adaptive sampling activates for any count unique on a column with a cardinality higher than the limits set on the system. By default, this limit is 8192 for both shard keys and non-shard keys.

Configuring the adaptive sampling limits

If you have access to the Scuba command line interface and the appropriate permissions level, you can configure the adaptive sampling limits. If you do have the required access, contact your technical account manager or support to change these settings.

Use the following CLI command to set the adaptive sampling values: 

ia settings update query_api adaptive_sampling_limits '{"<table_copy_id>": [<shard limit>, <non-shard limit>]}'
CODE

The configuration string is a JSON object, with string keys representing the table_copy_id, and each key value is a list of two numbers:

  • The first number is the limit to use when this count unique is running on a shard key on its own table copy.

  • The second number is the limit to use in all other cases.

The default sampling limit in both cases is 8192. If you want to change only one of the limits, you must set the other value to 8192 to preserve the default value. For example, if you want to lower your shard key limit to 4096, but leave the non-shard key limit at the default value, run the command with the values [4096, 8192].

Scuba automatically rounds any values up to the nearest power of 2. For example, if you specify 10000 as a limit, Scuba automatically rounds it up to 16384.

System performance considerations

Increasing the adaptive sampling limits can significantly affect system performance. You might need to increase the size of your Scuba cluster to preserve query performance. Before requesting to increase the limits, talk to your TAM about potential performance/infrastructure impacts.