The purpose of this how-to is to provide best practices for S3 bucket layout. If you follow these guidelines, you can be confident that your Scuba ingest is as efficient as possible with respect to your S3 bucket. This how-to covers the following topics:
- How Scuba looks for new files to import in the S3 bucket
- Step 1: Organize by dataset
- Step 2: Organize by time of events in files
- Step 3: Organize by file source
- Step 4: Optimize file size
How Scuba looks for new files to import in the S3 bucket
Understanding how Scuba looks for files to import is key to understanding why we suggest the S3 bucket layout described in the following sections.
The configuration related to how data is imported into a cluster is stored in an "import pipeline". Each pipeline has the following important characteristics:
- The table the data will be imported into
- The pattern of the file names in S3 that should be imported (like bucketname/tableid/year/month/day/hour/file.gz)
- A transformer configuration that specifies any transformations to be made during ingest
Using these pipelines, Scuba runs import jobs that bring the data into the system. A job scans the bucket for files that need to be processed, and then imports those files. Scuba supports two types of import jobs:
- One-time backfill jobs: import all files available between two dates and times
- Continuous import jobs: continuously scan your bucket for new files to import
It is important that when we run one of these import jobs, we can identify the files that need to be downloaded and processed as quickly and efficiently as possible. The recommendations in this document help Scuba accomplish this task by addressing these potential inefficiencies:
- Both types of import jobs import files for a certain date or date range. If we organize our bucket by date, we can avoid scanning all files in the bucket to find new files that need to be processed.
- AWS only allows listing an S3 bucket by file name prefix (not by attributes such as update time or file size), and that listing does not support wildcards. Scuba can approximate wildcard functionality by issuing many list requests, but that is inefficient and slows down the scanning process.
- If we can import multiple data sources to the same dataset using a single pattern, we can avoid the overhead of running many jobs.
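To illustrate the first point, here is a minimal sketch of how a date-organized layout narrows a scan to a handful of prefixes. The helper name `daily_prefixes` and the zero-padded year/month/day components are assumptions made for illustration; this is not Scuba's actual implementation. The bucket and table names are the placeholders from the pattern above.

```python
from datetime import date, timedelta


def daily_prefixes(bucket: str, dataset: str, start: date, end: date) -> list[str]:
    """Return one key prefix per day in [start, end], inclusive.

    Hypothetical helper: assumes a bucket/dataset/YYYY/MM/DD/ layout.
    """
    prefixes = []
    day = start
    while day <= end:
        prefixes.append(f"{bucket}/{dataset}/{day:%Y/%m/%d}/")
        day += timedelta(days=1)
    return prefixes


# A three-day backfill only needs to look under three prefixes,
# no matter how many other files the bucket holds.
for prefix in daily_prefixes("bucketname", "tableid", date(2021, 1, 14), date(2021, 1, 16)):
    print(prefix)
```

Because each prefix is a plain leading substring of the file names, it can be used directly in a list request without any wildcard expansion.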
Step 1: Organize by dataset
Now we're ready to work on our structure! Let's say that we have a bucket named Scuba-s3-logs ready to go. For organizational purposes, it's a good idea to separate your bucket into the datasets that you will be importing into. This also allows Scuba to potentially import all files for a particular dataset with a single pipeline or job.
For instance, if you have two datasets in Scuba (clientlogs and serverlogs), you might start by structuring your bucket as follows:
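A minimal sketch of that top-level layout, using the bucket and dataset names from the text. The helper `dataset_prefix` and the trailing-slash "folder" convention are assumptions for illustration.

```python
BUCKET = "Scuba-s3-logs"  # bucket name used in the text
DATASETS = ["clientlogs", "serverlogs"]  # dataset names used in the text


def dataset_prefix(bucket: str, dataset: str) -> str:
    """Top-level "folder" for one dataset (hypothetical helper)."""
    return f"{bucket}/{dataset}/"


for name in DATASETS:
    print(dataset_prefix(BUCKET, name))
# Scuba-s3-logs/clientlogs/
# Scuba-s3-logs/serverlogs/
```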
Step 2: Organize by time of events in files
After this preliminary organization, identify the year, month, day, and (optionally) hour that each file corresponds to, in UTC. So, if a file contains timestamps from 3:00 am UTC on January 15, 2021, put that file in:
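As a sketch of where such a file would land, the prefix can be derived from an event timestamp. The year/month/day/hour order follows the pipeline pattern shown earlier; the zero-padded components, the choice of the clientlogs dataset, and the helper name `time_prefix` are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def time_prefix(bucket: str, dataset: str, event_time: datetime, hourly: bool = True) -> str:
    """Build the time-based "folder" for a file (hypothetical helper).

    The timestamp must be timezone-aware so it can be converted to UTC.
    """
    if event_time.tzinfo is None:
        raise ValueError("event_time must be timezone-aware")
    utc = event_time.astimezone(timezone.utc)
    fmt = "%Y/%m/%d/%H" if hourly else "%Y/%m/%d"
    return f"{bucket}/{dataset}/{utc.strftime(fmt)}/"


event = datetime(2021, 1, 15, 3, 0, tzinfo=timezone.utc)
print(time_prefix("Scuba-s3-logs", "clientlogs", event))
# Scuba-s3-logs/clientlogs/2021/01/15/03/

# 10:00 pm on January 14 at UTC-5 is 3:00 am UTC on January 15.
local = datetime(2021, 1, 14, 22, 0, tzinfo=timezone(timedelta(hours=-5)))
print(time_prefix("Scuba-s3-logs", "clientlogs", local, hourly=False))
# Scuba-s3-logs/clientlogs/2021/01/15/
```

Converting to UTC before formatting keeps files from different time zones in the same daily folder.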
Including the hour allows us to import files from a specific hour of the day, which is occasionally useful but not necessary. Running an hourly pipeline in a continuous job requires 24 times as many list requests as a daily pipeline, because we issue a list request for each hour. List requests aren't all that expensive, so it's not a big issue, but to minimize costs we recommend the daily structure. It's fairly rare that an hourly pipeline is required.
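The factor of 24 can be seen by enumerating the prefixes that would need to be listed for a single day. This is a simplified sketch: it assumes one list request per folder and zero-padded components, and `prefixes_for_day` is a hypothetical helper.

```python
from datetime import date


def prefixes_for_day(base: str, day: date, hourly: bool) -> list[str]:
    """Prefixes to list for one day: one daily folder, or 24 hourly folders."""
    day_prefix = f"{base}/{day:%Y/%m/%d}/"
    if not hourly:
        return [day_prefix]
    return [f"{day_prefix}{hour:02d}/" for hour in range(24)]


base = "Scuba-s3-logs/clientlogs"
daily = prefixes_for_day(base, date(2021, 1, 15), hourly=False)
hourly = prefixes_for_day(base, date(2021, 1, 15), hourly=True)
print(len(daily), len(hourly))
# 1 24
```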
Step 3: Organize by file source
Some datasets are made up of events from multiple data sources. Now that we have identified the time the files represent, we can specify the source in either an additional "folder" or just in the filename itself. Since you never know if you'll want to import different types of events into the dataset, you may want to specify the source even if you have just one source.
This allows us to set up separate pipelines if the different file types require separate transformer configurations – or if they do not, we can set up one pipeline for everything after the daily / hourly folders.
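The following sketch shows how the source "folder" lets one prefix select everything for a day while a longer prefix selects a single source. The source names (ios, android), the file names, and the `select` helper are hypothetical; plain prefix matching stands in for a list request.

```python
# Hypothetical file names for one day, with the source as an extra "folder".
keys = [
    "Scuba-s3-logs/clientlogs/2021/01/15/ios/events-0001.gz",
    "Scuba-s3-logs/clientlogs/2021/01/15/ios/events-0002.gz",
    "Scuba-s3-logs/clientlogs/2021/01/15/android/events-0001.gz",
]


def select(keys: list[str], prefix: str) -> list[str]:
    """Return the keys that start with the given prefix."""
    return [key for key in keys if key.startswith(prefix)]


day = "Scuba-s3-logs/clientlogs/2021/01/15/"
everything = select(keys, day)          # one pipeline for all sources
ios_only = select(keys, day + "ios/")   # separate pipeline for one source
print(len(everything), len(ios_only))
# 3 2
```

The same approach works when the source is part of the file name instead, for example a prefix ending in `ios-` rather than `ios/`.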
Step 4: Optimize file size
For best import speed and flexibility, aim for file sizes of 200–500 MB uncompressed. Compressing files is fine, and recommended, as long as each file decompresses to 200–500 MB. Files in this range allow Scuba to maximize import parallelism.
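One way to stay in that range is to split records by their uncompressed size before compressing each part. This is a minimal sketch, not a prescribed tool: the 350 MB target, the `split_by_size` helper, the sample records, and the part names are all assumptions for illustration.

```python
import gzip

TARGET_BYTES = 350 * 1024 * 1024  # assumed target inside the 200-500 MB range


def split_by_size(lines, target_bytes=TARGET_BYTES):
    """Group byte lines into chunks of at most target_bytes uncompressed.

    A single line larger than target_bytes still gets a chunk of its own.
    """
    chunk, size = [], 0
    for line in lines:
        if chunk and size + len(line) > target_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(line)
        size += len(line)
    if chunk:
        yield chunk


# Tiny demonstration with a 40-byte target so the split is visible.
records = [b'{"event": "click"}\n'] * 5  # hypothetical newline-delimited records
for index, chunk in enumerate(split_by_size(records, target_bytes=40)):
    payload = gzip.compress(b"".join(chunk))  # compress each part separately
    uncompressed = sum(len(line) for line in chunk)
    print(f"part-{index:04d}.gz: {uncompressed} bytes uncompressed, {len(payload)} compressed")
```

Sizing by uncompressed bytes matters because the compressed size depends on the data and says little about how much work each file represents at import time.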