Scuba Docs

Add data from Confluent

To ingest data from Confluent into your Scuba cluster, we recommend using the Kafka Connect Amazon S3 Sink connector. Once data is uploaded to S3, Scuba can download, transform, and import it so that you can start querying and exploring.

For more information about the Kafka Connect Amazon S3 Sink connector, see the Confluent documentation.

Configure data options

In the connector, set the configuration options in etc/kafka-connect-s3/ as follows:

Name of the S3 object. Scuba accepts the default format <prefix>/<topic>/<encodedPartition>/<topic>+<kafkaPartition>+<startOffset>.<format> for the name of the S3 object. In general, Scuba is flexible about the object name.

Partitioning. Use the Time Based Partitioner (or Daily Partitioner or Hourly Partitioner).

Object Format. Data uploaded to S3 to be ingested into Scuba should be in JSON format.

Object Upload Rotation Strategy. If the connector is generating objects larger than 100 MB, consider decreasing the partitioning intervals to produce smaller objects. Scuba prefers that each uncompressed file be no larger than about 100 MB to avoid performance degradation when importing the data from S3.
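For instance, a scheduled rotation can cap how long the connector accumulates records before writing an object, keeping objects small regardless of traffic volume. A minimal sketch of the relevant properties (the 10-minute interval is illustrative, not prescriptive; rotate.schedule.interval.ms also requires timezone to be set):

```json
{
  "rotate.schedule.interval.ms": "600000",
  "timezone": "UTC"
}
```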

Flush size. Set the flush.size configuration property based on the size of each individual record. An individual object should not exceed 100 MB uncompressed.
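As a back-of-the-envelope check, you can derive an upper bound on flush.size from your average record size and the ~100 MB target. A quick sketch (the 2 KB record size in the example is an assumption for illustration):

```python
# Estimate a flush.size ceiling so an uncompressed object stays under a target size.
TARGET_OBJECT_BYTES = 100 * 1024 * 1024  # ~100 MB uncompressed, per the guidance above


def max_flush_size(avg_record_bytes: int, target_bytes: int = TARGET_OBJECT_BYTES) -> int:
    """Largest record count per object that keeps the object under target_bytes."""
    return target_bytes // avg_record_bytes


# Example: with ~2 KB JSON records, flush.size should stay at or below this value.
print(max_flush_size(2048))  # 51200
```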

Part size. The s3.part.size configuration property default of 26214400 (25 MB) should be fine.

Schema Evolution. Scuba's schema evolution is backward compatible. Once a new column is added to a table in Scuba, the column can be queried for all events (although values for the column from events before the addition will be *null*). Because of this backward compatibility, the connector's default compatibility setting should work just fine. For example, say you want to update the data schema in your Kafka topic with a new column "foo". Once "foo" is uploaded in a new S3 object via the S3 connector, the Scuba software automatically detects column "foo" upon importing the S3 object. Then, Scuba updates its table schema and applies "foo" retroactively to all previous events for the table.
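To illustrate, suppose the topic originally produced events without "foo", and later events include it (the field names below are hypothetical):

```json
{"ts": "2021-06-01T00:00:00Z", "user": "alice"}
{"ts": "2021-06-02T00:00:00Z", "user": "bob", "foo": "bar"}
```

After both objects are imported, a query on "foo" returns "bar" for the second event and *null* for the first.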

For more information about these connector settings, see the connector documentation.

Connect to S3

Along with the data configuration properties, make sure that the configurations to connect to S3 are correct. All of these configurations can be set in etc/kafka-connect-s3/. For example:

{
  "name": "s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "tasks.max": "1",
    "topics": "<topic_name>",
    "s3.region": "<region_name>",
    "s3.bucket.name": "<bucket_name>",
    "s3.part.size": "5242880",
    "flush.size": "3",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "partitioner.class": "io.confluent.connect.storage.partitioner.DailyPartitioner",
    "locale": "en-US",
    "timezone": "UTC",
    "schema.compatibility": "NONE",
    "name": "s3-sink"
  },
  "tasks": []
}

In this example, <topic_name> might be Scuba_topic and <bucket_name> might be my_Scuba_bucket. The bucket's region <region_name> (for example, us-west-2) should be the same as the region of your Scuba cluster. For s3.part.size, the default of 26214400 (25 MB) is also fine.
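Once the placeholders are filled in, the configuration can be submitted to the Kafka Connect REST API. A minimal sketch, assuming a Connect worker is reachable at localhost:8083 (the default port); the helper function and its name are illustrative, not part of the connector:

```python
import json
import urllib.request


def build_s3_sink_config(topic: str, bucket: str, region: str) -> dict:
    """Assemble the connector payload shown above with the placeholders filled in."""
    return {
        "name": "s3-sink",
        "config": {
            "connector.class": "io.confluent.connect.s3.S3SinkConnector",
            "tasks.max": "1",
            "topics": topic,
            "s3.region": region,
            "s3.bucket.name": bucket,
            "s3.part.size": "5242880",
            "flush.size": "3",
            "storage.class": "io.confluent.connect.s3.storage.S3Storage",
            "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
            "schema.compatibility": "NONE",
        },
    }


def submit(config: dict, connect_url: str = "http://localhost:8083") -> None:
    """POST the connector config to the Kafka Connect REST API."""
    req = urllib.request.Request(
        f"{connect_url}/connectors",
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)  # raises on HTTP errors


# Build the payload; calling submit(payload) requires a running Connect worker.
payload = build_s3_sink_config("Scuba_topic", "my_Scuba_bucket", "us-west-2")
```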

Contact Scuba support to create a new pipeline

To create a new pipeline, the Scuba software needs the appropriate permissions to read and download files from the path in S3 (usually granted by specifying an ARN in an IAM policy). In addition, please provide your Technical Account Manager or the Support Engineering team with the following pieces of information:

  • whether the files are for an event table or a lookup table
  • the name of the table
  • the timestamp column
  • the shard key(s) column
  • file pattern or folder structure 
  • any transformations required
  • how often files are uploaded
  • how many days back from today Scuba should search for new files on import
  • how many files are uploaded per day
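The read permissions typically take the form of an S3 read-only IAM policy scoped to the bucket and prefix. A hedged sketch (the bucket name and prefix are placeholders; Scuba support can confirm the exact ARN to grant):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::<bucket_name>"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::<bucket_name>/<prefix>/*"
    }
  ]
}
```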

