Finding Duplicate Records in a Table Using Windowed Aggregates in SQL Server

Finding Duplicate Records in a Table

====================================================

When working with databases, it’s not uncommon to encounter duplicate records that need to be identified and addressed. In this article, we’ll explore how to find duplicate records based on two columns using SQL Server.

Understanding the Problem

Let’s consider an example table named employee with three columns: fullname, address, and city. The table contains several records, some of which are duplicates. For instance, there are multiple records with the same fullname and city.

We want to write a SQL query that returns only the duplicate records based on these two columns.

Reviewing the Current Query

The original query provided by the user attempts to achieve this using the GROUP BY clause:

SELECT fullname, city FROM employee GROUP BY fullname, city HAVING COUNT(*) > 1

However, this query has a flaw: it only returns duplicate records for the specified columns (fullname and city). Unfortunately, this results in an empty output set, as if there were no duplicates.

Alternative Approach Using Windowed Aggregates


Instead of relying on the GROUP BY clause, we can use windowed aggregations to access other columns in the table. This approach allows us to filter out non-duplicate records while preserving the original data.

The Solution

We’ll employ a Common Table Expression (CTE) to create a temporary view of the data, which includes the duplicate count for each record. We’ll then use this CTE to filter out non-duplicate records.

Here’s the modified query:

SELECT fullname,
       address,
       city
FROM   (SELECT *,
               COUNT(*) OVER (PARTITION BY fullname, city) AS cnt
        FROM   employee) e
WHERE  cnt > 1

This query works as follows:

  1. The subquery within the CTE selects all columns (*) from the employee table.
  2. We apply a windowed aggregation using the COUNT function with an OVER clause, which partitions the data by fullname and city. This produces a count of duplicate records for each group.
  3. The outer query filters out non-duplicate records by selecting only those rows where the cnt column is greater than 1.

How it Works

Let’s walk through an example to illustrate this process:

Suppose we have the following data in our employee table:

+---------+--------+-------+
| fullname | address | city   |
+=========+========+=======+
| AA      | address1| City1  |
| AA      | address3| City1  |
| AA      | address8| City2  |
| BB      | address5| City2  |
| BB      | address2| City1  |
| CC      | address6| City1  |
| CC      | address7| City2  |
| DD      | address4| City1  |
+---------+--------+-------+

When we apply the windowed aggregation:

SELECT fullname,
       address,
       city
FROM   (SELECT *,
               COUNT(*) OVER (PARTITION BY fullname, city) AS cnt
        FROM   employee) e
WHERE  cnt > 1

The resulting CTE looks like this:

+---------+--------+-------+-----+
| fullname | address | city   | cnt |
+=========+========+=======+=====+
| AA      | address1| City1  | 2   |
| AA      | address3| City1  | 2   |
| BB      | address5| City2  | 2   |
| BB      | address2| City1  | 2   |
| CC      | address6| City1  | 2   |
| CC      | address7| City2  | 2   |
+---------+--------+-------+-----+

Now, we filter out non-duplicate records by selecting only those rows where cnt is greater than 1:

SELECT fullname,
       address,
       city
FROM   (SELECT *,
               COUNT(*) OVER (PARTITION BY fullname, city) AS cnt
        FROM   employee) e
WHERE  cnt > 1

+---------+--------+-------+
| fullname | address | city   |
+=========+========+=======+
| AA      | address1| City1  |
| AA      | address3| City1  |
| BB      | address5| City2  |
| BB      | address2| City1  |
| CC      | address6| City1  |
| CC      | address7| City2  |
+---------+--------+-------+

The final result set contains only the duplicate records, as expected.

Conclusion

In this article, we’ve explored how to find duplicate records in a table based on two columns using SQL Server. By employing windowed aggregations and filtering out non-duplicate records, we can efficiently identify duplicates while preserving the original data. This approach is particularly useful when working with large datasets or when you need to perform complex filtering operations.


Last modified on 2024-07-24