
Import Data Deduplication [v5]

You’ve learned what happens behind the scenes when Scuba ingests your data. This document explains what safeguards are in place to ensure that Scuba does not ingest duplicate records.

Deduplication

Scuba has two layers of duplicate protection: file hashing and event checks.

File Hashing

First, Scuba evaluates every file that arrives in your cloud storage solution, capturing its upload time, size, name, and other metadata, and computes a hash from that information. Scuba then begins ingesting the file into your data tier.

For ingest jobs that may revisit the same directory in your cloud storage solution over the course of days, your files are evaluated and hashed again on each pass. When a hash matches that of an existing file, the file is ignored as a duplicate. If anything has changed about a previously downloaded file (upload time, size, arrangement of fields, column names, etc.), a new hash is generated; the file is treated as new and is ingested.
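To make the mechanism concrete, here is a minimal sketch of a metadata-based file hash with a skip-on-match check. The choice of metadata fields, the `file_fingerprint` and `should_ingest` names, and the in-memory `seen_hashes` registry are illustrative assumptions, not Scuba's actual implementation.

```python
import hashlib

def file_fingerprint(name: str, size_bytes: int, upload_time: str) -> str:
    """Hypothetical fingerprint built from file metadata (name, size, upload time)."""
    canonical = f"{name}|{size_bytes}|{upload_time}"
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical registry of hashes recorded on earlier passes of the ingest job.
seen_hashes: set[str] = set()

def should_ingest(name: str, size_bytes: int, upload_time: str) -> bool:
    """Ignore a file whose fingerprint matches one already seen; otherwise record it."""
    fp = file_fingerprint(name, size_bytes, upload_time)
    if fp in seen_hashes:
        return False   # duplicate: ignore the file
    seen_hashes.add(fp)
    return True        # new or changed file: ingest it
```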

Event Checks

For the second layer of deduplication, the data tier checks incoming events against stored events and discards any event that is already stored. For a duplicate event to be dropped at this stage, the event must match exactly: all fields must have the same names, all properties must contain exactly the same information, and all fields and properties must appear in the same order.
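The sketch below illustrates this kind of exact-match event check. The `event_key`, `stored_events`, and `deduplicate` names are hypothetical, and serializing the ordered field/value pairs is just one way to capture the "same names, same values, same order" requirement.

```python
import hashlib
import json

# Hypothetical record of serialized events already held in the data tier.
stored_events: set[str] = set()

def event_key(event: dict) -> str:
    """Serialize the event preserving field order; any difference in names,
    values, or ordering produces a different key, so the event is not a duplicate."""
    payload = json.dumps(list(event.items()))  # ordered (field, value) pairs
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def deduplicate(incoming: list[dict]) -> list[dict]:
    """Return only the events that are not already stored, recording them as it goes."""
    new_events = []
    for event in incoming:
        key = event_key(event)
        if key not in stored_events:
            stored_events.add(key)
            new_events.append(event)
    return new_events
```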

Example:

File Hashing:

Scuba has a daily ingest job with a 3-day lookback, meaning it scans its target directory structure for folders and files once a day for 3 days.

On Day one, it encounters a file called muppets_6-23-2023.json that contains the following:

| Timestamp | ID | Muppet Name | action | Language | location |
| --- | --- | --- | --- | --- | --- |
| 6/23/2023 1:01:01 AM | 1 | Kermit | hosts | English | main stage |
| 6/23/2023 1:10:01 AM | 2 | Animal | plays drums | Howling | main stage |
| 6/23/2023 1:15:01 AM | 3 | Fozzy Bear | tells a joke | English | main stage |
| 6/23/2023 1:20:01 AM | 4 | Rolf | plays piano | English | piano bar |
| 6/23/2023 1:05:01 AM | 5 | Chef | throws fish | Swedish | kitchen |

A unique hash is created and logged, and the file is ingested.

On Day two, ingest creates a hash for muppets_6-23-2023.json and compares it to the existing hash. The file is the same, so the hashes are the same, and muppets_6-23-2023.json is ignored.

On Day three, ingest evaluates muppets_6-23-2023.json. This time the file contains:

| Timestamp | ID | Muppet Name | action | Language | location |
| --- | --- | --- | --- | --- | --- |
| 6/23/2023 1:01:01 AM | 1 | Kermit | hosts | English | main stage |
| 6/23/2023 1:10:01 AM | 2 | Animal | plays drums | Howling | main stage |
| 6/23/2023 1:15:01 AM | 3 | Fozzy Bear | tells a joke | English | main stage |
| 6/23/2023 1:20:01 AM | 4 | Rolf | plays piano | English | piano bar |
| 6/23/2023 1:05:01 AM | 5 | Chef | throws fish | Swedish | kitchen |
| 6/23/2023 1:25:01 AM | 6 | Gonzo | soars through the air | English | trapeze |
| 6/23/2023 1:30:01 AM | 4 | Rolf | takes a nap | English | back stage |

Because rows have been added, the file size is different, the hash is different, and the file is ingested.
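Continuing the file-hashing sketch above, with made-up byte sizes and upload times, the appended rows change the metadata and therefore the fingerprint:

```python
# Day 1: five rows; Day 3: two more rows appended, so the size (and upload time) changes.
day_one = file_fingerprint("muppets_6-23-2023.json", 1204, "2023-06-23T02:00:00Z")
day_three = file_fingerprint("muppets_6-23-2023.json", 1544, "2023-06-25T02:00:00Z")

assert day_one != day_three  # different metadata => different hash => the file is re-ingested
```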

Event Checking:

In this example, on the third day’s pass, only the new records are ingested: event checking evaluates each row for duplication, ignores the first 5 records, and ingests the last 2.

| Timestamp | ID | Muppet Name | action | Language | location |
| --- | --- | --- | --- | --- | --- |
| 6/23/2023 1:25:01 AM | 6 | Gonzo | soars through the air | English | trapeze |
| 6/23/2023 1:30:01 AM | 4 | Rolf | takes a nap | English | back stage |
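Continuing the event-check sketch above (with the day-one rows abbreviated for brevity), the first five events are already recorded, so only the two new rows come back from the hypothetical `deduplicate` helper:

```python
# Day 1: the day-one events are ingested and recorded.
day_one_rows = [
    {"Timestamp": "6/23/2023 1:01:01 AM", "ID": 1, "Muppet Name": "Kermit",
     "action": "hosts", "Language": "English", "location": "main stage"},
    # ... remaining day-one rows omitted for brevity ...
]
deduplicate(day_one_rows)

# Day 3: the same file now carries two extra rows; only those survive the event check.
day_three_rows = day_one_rows + [
    {"Timestamp": "6/23/2023 1:25:01 AM", "ID": 6, "Muppet Name": "Gonzo",
     "action": "soars through the air", "Language": "English", "location": "trapeze"},
    {"Timestamp": "6/23/2023 1:30:01 AM", "ID": 4, "Muppet Name": "Rolf",
     "action": "takes a nap", "Language": "English", "location": "back stage"},
]
print(deduplicate(day_three_rows))  # returns only the Gonzo and Rolf events
```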

