Import Data Deduplication [v5]
You’ve learned what happens behind the scenes when Scuba ingests your data. This document explains what safeguards are in place to ensure that Scuba does not ingest duplicate records.
Deduplication
Scuba has two layers of duplicate protection: file hashing and event checks.
File Hashing
First, Scuba evaluates every file that arrives in your cloud storage solution, using its upload time, size, name, and other metadata, and assigns the file a hash. Scuba then begins ingesting the file into your data tier.
For ingest jobs that revisit the same directory in your cloud storage solution over the course of several days, your files are evaluated and hashed again. When a hash matches that of an existing file, the file is ignored as a duplicate. If anything about a previously downloaded file has changed (upload time, size, arrangement of fields, column names, and so on), a new hash is generated, the file is treated as new, and it is ingested.
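Conceptually, this first layer works like the following sketch. This is illustrative Python only, not Scuba's actual implementation; the metadata fields and the `seen_hashes` store are assumptions for the sake of the example.

```python
import hashlib

# Hashes recorded for files that have already been ingested
# (assumed to be a persistent store in practice).
seen_hashes: set[str] = set()

def file_fingerprint(name: str, size: int, upload_time: str) -> str:
    """Hash the file's metadata; any change produces a new fingerprint."""
    return hashlib.sha256(f"{name}|{size}|{upload_time}".encode()).hexdigest()

def should_ingest(name: str, size: int, upload_time: str) -> bool:
    """Skip files whose fingerprint matches one already seen."""
    fingerprint = file_fingerprint(name, size, upload_time)
    if fingerprint in seen_hashes:
        return False   # duplicate: same name, size, and upload time
    seen_hashes.add(fingerprint)
    return True        # new or changed file: ingest it
```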
Event Checks
For the second layer of deduplication, the data tier checks incoming events against stored events and discards events that are already stored. For a duplicate event to be dropped at this stage, the incoming event must be an exact match: all fields must have the same names, all properties must contain exactly the same information, and all fields and properties must appear in the same order.
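A minimal sketch of this exact-match rule, assuming events arrive as ordered field/value pairs; the `stored_events` set stands in for the data tier and is not Scuba's real storage model.

```python
# Events are compared as ordered (field, value) pairs: names, values, and
# field order must all match exactly for an event to count as a duplicate.
stored_events: set[tuple] = set()

def deduplicate(incoming: list[dict]) -> list[dict]:
    """Return only the events that are not already stored."""
    new_events = []
    for event in incoming:
        # Python dicts preserve insertion order, so this captures field order too.
        key = tuple(event.items())
        if key not in stored_events:
            stored_events.add(key)
            new_events.append(event)
    return new_events
```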
Example:
File Hashing:
Scuba has a daily ingest job with a 3-day lookback, meaning it scans its target directory structure for folders and files once a day for 3 days.
On Day one, it encounters a file called muppets_6-23-2023.json that contains the following:
Timestamp | ID | Muppet Name | action | Language | location |
---|---|---|---|---|---|
6/23/2023 1:01:01 AM | 1 | Kermit | hosts | English | main stage |
6/23/2023 1:10:01 AM | 2 | Animal | plays drums | Howling | main stage |
6/23/2023 1:15:01 AM | 3 | Fozzy Bear | tells a joke | English | main stage |
6/23/2023 1:20:01 AM | 4 | Rolf | plays piano | English | piano bar |
6/23/2023 1:05:01 AM | 5 | Chef | throws fish | Swedish | kitchen |
A unique hash is created and logged, and the file is ingested.
On Day two, ingest creates a hash for muppets_6-23-2023.json and compares it to the existing hash. The file has not changed, so the hashes match, and muppets_6-23-2023.json is ignored as a duplicate.
On Day three, ingest evaluates muppets_6-23-2023.json. This time the file contains:
Timestamp | ID | Muppet Name | action | Language | location |
---|---|---|---|---|---|
6/23/2023 1:01:01 AM | 1 | Kermit | hosts | English | main stage |
6/23/2023 1:10:01 AM | 2 | Animal | plays drums | Howling | main stage |
6/23/2023 1:15:01 AM | 3 | Fozzy Bear | tells a joke | English | main stage |
6/23/2023 1:20:01 AM | 4 | Rolf | plays piano | English | piano bar |
6/23/2023 1:05:01 AM | 5 | Chef | throws fish | Swedish | kitchen |
6/23/2023 1:25:01 AM | 6 | Gonzo | soars through the air | English | trapeze |
6/23/2023 1:30:01 AM | 4 | Rolf | takes a nap | English | back stage |
Because two rows have been added, the file size is different, the hash is different, and the file is ingested.
Event Checking:
In this example, on the third day’s pass, only the new records are ingested: event checking evaluates each row for duplication, ignores the first 5 records, and ingests the last 2, as shown in the table and the sketch below.
Timestamp | ID | Muppet Name | action | Language | location |
---|---|---|---|---|---|
6/23/2023 1:25:01 AM | 6 | Gonzo | soars through the air | English | trapeze |
6/23/2023 1:30:01 AM | 4 | Rolf | takes a nap | English | back stage |
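To make the example concrete, the following toy sketch applies the same exact-match rule to the Day three rows, assuming the Day one rows are what the data tier has already stored (again, illustrative Python only, not Scuba's implementation).

```python
# Field order mirrors the tables above; names and values must match exactly.
fields = ("Timestamp", "ID", "Muppet Name", "action", "Language", "location")

day_one_rows = [
    ("6/23/2023 1:01:01 AM", 1, "Kermit", "hosts", "English", "main stage"),
    ("6/23/2023 1:10:01 AM", 2, "Animal", "plays drums", "Howling", "main stage"),
    ("6/23/2023 1:15:01 AM", 3, "Fozzy Bear", "tells a joke", "English", "main stage"),
    ("6/23/2023 1:20:01 AM", 4, "Rolf", "plays piano", "English", "piano bar"),
    ("6/23/2023 1:05:01 AM", 5, "Chef", "throws fish", "Swedish", "kitchen"),
]

day_three_rows = day_one_rows + [
    ("6/23/2023 1:25:01 AM", 6, "Gonzo", "soars through the air", "English", "trapeze"),
    ("6/23/2023 1:30:01 AM", 4, "Rolf", "takes a nap", "English", "back stage"),
]

stored = set(day_one_rows)   # events already in the data tier
new_rows = [row for row in day_three_rows if row not in stored]
print(new_rows)              # only the Gonzo and Rolf rows remain
```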