Deduplication

Product Data Consolidation

Duplicate products in your feed can cause issues with accuracy and data consumption. SmartFeeds provides a deduplication module to help remove duplicates based on specific keys, ensuring high-quality, unique product feeds. This guide also addresses the risks of duplication at various stages, including preprocessing, joins, and post-join mapping.

This article is ideal for users looking to improve feed quality, understand potential sources of duplication, and leverage advanced SQL techniques for handling duplicates.

Prerequisites

  • A basic understanding of product flow structure and keys.
  • (Optional) Knowledge of SQL for advanced deduplication methods.

Table of Contents

  1. Understanding and Preventing Duplication
  2. Using the Deduplication Module
  3. Advanced Deduplication Using SQL
  4. Quick Answers to Common Issues
  5. Related Articles

Understanding and Preventing Duplication

Sources of Duplication

Duplication in your feed can occur at various stages:

  • Preprocessing Stage:
    • Duplication in the source file itself.
    • Mismatched or incomplete identifiers causing records to be treated as duplicates.

You can verify and clean data before importing it into SmartFeeds.

  • Join Stage:
    • Incorrect joins can create duplicate rows if the wrong key or join model is used.
    • Ensure you use the correct join key and consider adding a second column if necessary to create a unique identifier.

For more information on joins, refer to the Join article.

  • Post-Join Stage:
    • If the deduplication module is not used, duplicates may persist in the final joined file.
    • SmartFeeds includes an automatic check to detect duplicates in the final joined file. Here's how it works:
      • If no duplicates are found: Your feed is ready for mapping, and no further action is needed.
      • If duplicates are found: You should ensure they are intentional and, if not, use the deduplication module to clean your data accordingly
    • Performing this step ensures a clean and unique data set, creating a solid base for accurate mapping and channel export.

Key Considerations

  • Avoid working with duplicates at any stage: They can propagate errors across your feed.
  • Choose the correct deduplication key: This should uniquely identify products (e.g., ID, SKU, or EAN).

Using the Deduplication Module

Step 1: Select the Key

  • Open the Deduplication Module in your product flow.
  • Choose a key (e.g., ID, SKU, or EAN) to base the deduplication process on.
  • If duplicates are found, a message will appear. Click "See More" to review the results.

Step 2: Analyze the Deduplication Screen

  • Green Rows: Products retained after deduplication.
  • Duplicate Rows: Products marked for removal.

Step 3: Configure Sorting and Retention Rules

  • Select one or more keys to determine which product to retain.
  • Sort the data in ascending or descending order to prioritize products.

Example:

To retain the product with the highest price:

  1. Select the Price key.
  2. Sort by ascending order.

Advanced Deduplication Using SQL

For advanced users, SQL offers flexibility in deduplication:

Step 1: Create a Variable

  • Go to Mapping and select any field.
  • Open the Advanced Formula Editor and create a variable using the following formula:
ROW_NUMBER() OVER (PARTITION BY import.id_column)  
  • Replace import with your file name.
  • Replace id_column with the column used as the unique product identifier (e.g., ID, EAN, or GTIN).

This assigns a unique number to each duplicate product.

Step 2: Apply a Filter

  • Navigate to the Filters section.
  • Add a condition to retain only the first instance of each product:
variable_duplicates = 1
  • Replace variable_duplicates with the name of your variable.

This ensures that only the first version of each product is pushed to your feed.
 


Quick Answers to Common Issues

Can I run deduplication directly on my import or channel?

  • For imports: Not yet. You can contact our team to request preprocessing setup for handling duplicates.
  • For channels: Generally, if you’re using the deduplication module on the joined file, the issue is resolved before reaching the channel.

How can I check for duplicates without entering the Join page?

Compare the number of rows in your import with the number of rows in your channel. A mismatch could indicate duplication.

I updated my join, but I don’t see the changes reflected in my product count. Why?

Remember to fetch your channel to apply and update any changes made in your joins.


Related Articles

For further assistance with SmartFeeds, consider reviewing these articles:

For additional help, feel free to reach out via our Contact Us page.

Was this article helpful?