Understanding Almost Duplicates in SQL Results
In a recent Stack Overflow question, a user was struggling to identify and remove “almost duplicate” rows from their SQL results. The issue arose when a USPS address match process created new fields with slightly different abbreviations, causing the query to produce duplicate or near-duplicate records.
This article walks through one way to identify and remove almost duplicates: split each address into its parts, expand abbreviations to a common form with a lookup table, and then compare the normalized parts to find rows that describe the same address.
Background and Context
The original question provided a SQL query that joins multiple tables to retrieve address information for individuals. The query also includes a delete statement to eliminate rows where the current address differs from the previous address in all three fields (address1, address2, and address3). However, due to the USPS match process, new fields were introduced with varying abbreviations, making it challenging to remove duplicates using unique field combinations.
Identifying Almost Duplicates
To begin, we need to identify which rows are almost duplicates. One approach is to split both Old Address and New Address on spaces and count the resulting substrings. When the two counts match for a row, the addresses have the same structure and can be compared part by part. For example, "123 East Main Street" and "123 E Main St" both contain four substrings.
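The counting idea can be sketched outside the database first. The following is a minimal Python illustration, not the original poster's code; the function names and the sample rows are hypothetical, and it assumes addresses are delimited by whitespace.

```python
def token_count(address: str) -> int:
    """Count the whitespace-delimited substrings in an address."""
    return len(address.split())


def same_structure(old_address: str, new_address: str) -> bool:
    """True when both addresses split into the same number of substrings."""
    return token_count(old_address) == token_count(new_address)


# Hypothetical sample rows: (old address, USPS-matched new address)
rows = [
    ("123 East Main Street", "123 E Main St"),
    ("456 Oak Avenue", "456 Oak Ave Apt 2"),
]

# Only rows with matching counts are candidates for a part-by-part comparison.
comparable = [same_structure(old, new) for old, new in rows]
```

A matching count does not prove two addresses are the same; it only says they can be lined up part by part for the comparison steps that follow.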
We’ll use a combination of SQL functions, such as CHARINDEX, LEFT, RIGHT, and SUBSTRING, to split the addresses into their constituent parts (street number, street name, and suffix).
-- Create a lookup table for abbreviations
CREATE TABLE lookup_abbreviations (
unabbreviated_name varchar(50),
abbreviated_name varchar(50)
);
INSERT INTO lookup_abbreviations (unabbreviated_name, abbreviated_name)
VALUES ('East', 'E'),
       ('Street', 'St');
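-- Split the old addresses into their constituent parts (New_Address would be split the same way).
-- Assumption: addresses are trimmed, single-space delimited, and have 3 tokens (number, name, suffix) or 4 tokens (number, prefix, name, suffix); other shapes yield NULL parts.
SELECT DISTINCT
    Old_Street_Nbr = LEFT(Old_Address, p.sp1 - 1),
    Old_Street_Nm_Prefix = CASE WHEN p.tokens = 4 THEN SUBSTRING(Old_Address, p.sp1 + 1, p.sp2 - p.sp1 - 1) END,
    Old_Street_Nm = CASE p.tokens WHEN 4 THEN SUBSTRING(Old_Address, p.sp2 + 1, p.sp_last - p.sp2 - 1)
                                  WHEN 3 THEN SUBSTRING(Old_Address, p.sp1 + 1, p.sp_last - p.sp1 - 1) END,
    Old_Street_Suffix = RIGHT(Old_Address, LEN(Old_Address) - p.sp_last)
INTO #AbbreviationSort
FROM Results
CROSS APPLY (SELECT
    tokens = LEN(Old_Address) - LEN(REPLACE(Old_Address, ' ', '')) + 1,  -- count of substrings
    sp1 = NULLIF(CHARINDEX(' ', Old_Address), 0),                        -- first space
    sp2 = NULLIF(CHARINDEX(' ', Old_Address, CHARINDEX(' ', Old_Address) + 1), 0),  -- second space
    sp_last = LEN(Old_Address) - NULLIF(CHARINDEX(' ', REVERSE(Old_Address)), 0) + 1  -- last space
) AS p;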
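To make the splitting step concrete, here is a small Python sketch of the same idea. It is an illustration under stated assumptions rather than a full address parser: `AddressParts` and `split_address` are hypothetical names, and only two shapes are handled (number, name, suffix; or number, prefix, name, suffix).

```python
from typing import NamedTuple, Optional


class AddressParts(NamedTuple):
    number: str
    prefix: Optional[str]  # directional such as "E" or "East"; None when absent
    name: str
    suffix: str


def split_address(address: str) -> Optional[AddressParts]:
    """Split an address into parts based on how many substrings it has."""
    tokens = address.split()
    if len(tokens) == 4:
        number, prefix, name, suffix = tokens
        return AddressParts(number, prefix, name, suffix)
    if len(tokens) == 3:
        number, name, suffix = tokens
        return AddressParts(number, None, name, suffix)
    # Any other shape is left for manual review in this sketch.
    return None
```

Real addresses include unit numbers, multi-word street names, and other shapes, so anything that returns `None` here would need its own rule or a dedicated parsing library.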
Grouping and Counting Substrings
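Once the addresses are split, the individual parts still differ wherever one side is abbreviated. To make them comparable, we look each prefix up in the abbreviation table and replace it with its unabbreviated form, leaving the value unchanged when no match is found.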
-- Expand abbreviated prefixes to their full form via the lookup table
SELECT
    s.Old_Street_Nbr,
    Old_Prefix = COALESCE(l.unabbreviated_name, s.Old_Street_Nm_Prefix)
    -- New_Street_Nbr and New_Prefix would be produced the same way from the New_Address parts
INTO #SortedAddresses
FROM #AbbreviationSort AS s
LEFT JOIN lookup_abbreviations AS l
    ON l.abbreviated_name = s.Old_Street_Nm_Prefix;
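The lookup-and-replace step amounts to a dictionary lookup with a fallback. The sketch below mirrors the two rows in the lookup table ('East'/'E' and 'Street'/'St'); the names `ABBREVIATIONS`, `expand`, and `normalize` are hypothetical, and matching case-insensitively and ignoring a trailing period are assumptions added for illustration.

```python
# Abbreviated form (lower-cased) mapped to the unabbreviated form,
# mirroring the two sample rows in lookup_abbreviations.
ABBREVIATIONS = {"e": "East", "st": "Street"}


def expand(token: str) -> str:
    """Return the unabbreviated form of a token, or the token itself if unknown."""
    return ABBREVIATIONS.get(token.rstrip(".").lower(), token)


def normalize(address: str) -> str:
    """Expand every known abbreviation in an address."""
    return " ".join(expand(token) for token in address.split())


old_address = "123 East Main Street"
new_address = "123 E Main St"
# After normalization the two spellings compare equal.
is_almost_duplicate = normalize(old_address) == normalize(new_address)
```

Expanding every token regardless of position is a simplification: "St" can also stand for "Saint", which is why the SQL above applies the lookup to one specific part (the prefix) instead of the whole string.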
Unifying Address Parts
To unify the address parts, we stack the old and new parts into a single two-column result with UNION ALL and then apply SELECT DISTINCT, so that each combination of street number and prefix appears only once.
-- Unified address parts
SELECT DISTINCT *
FROM (
SELECT Old_Street_Nbr, Old_Prefix FROM #SortedAddresses
UNION ALL
SELECT New_Street_Nbr, New_Prefix FROM #SortedAddresses
) AS DupSearch;
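The effect of stacking and then de-duplicating can be checked with a small in-memory database. This sketch uses Python's built-in `sqlite3` module rather than SQL Server, so the table is a regular table named `sorted_addresses` instead of a temp table, and the three sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sorted_addresses ("
    "old_street_nbr TEXT, old_prefix TEXT, new_street_nbr TEXT, new_prefix TEXT)"
)
conn.executemany(
    "INSERT INTO sorted_addresses VALUES (?, ?, ?, ?)",
    [
        ("123", "East", "123", "East"),    # old and new agree after normalization
        ("456", "West", "456", "West"),
        ("789", "North", "790", "North"),  # street number genuinely changed
    ],
)

# Same shape as the query above: stack old and new parts, then de-duplicate.
stacked = conn.execute(
    """
    SELECT DISTINCT * FROM (
        SELECT old_street_nbr, old_prefix FROM sorted_addresses
        UNION ALL
        SELECT new_street_nbr, new_prefix FROM sorted_addresses
    ) AS DupSearch
    ORDER BY 1, 2
    """
).fetchall()

# UNION (without ALL) removes duplicates itself and returns the same rows.
unioned = conn.execute(
    """
    SELECT old_street_nbr, old_prefix FROM sorted_addresses
    UNION
    SELECT new_street_nbr, new_prefix FROM sorted_addresses
    ORDER BY 1, 2
    """
).fetchall()
```

Here the six stacked rows collapse to four distinct pairs, and the plain UNION form is a shorter way to express the same result.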
Removing Almost Duplicates
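With the normalized address parts in hand, we can now find the almost duplicates. The query below uses a windowed COUNT(*) partitioned by the old and new street numbers and returns every row whose group contains more than one row. These are the candidates to collapse into a single record or delete.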
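-- Flag almost duplicates: rows whose (Old_Street_Nbr, New_Street_Nbr) pair occurs more than once
SELECT DISTINCT *
FROM (
    SELECT Old_Street_Nbr, New_Street_Nbr, Old_Prefix, New_Prefix,
        COUNT(*) OVER (PARTITION BY Old_Street_Nbr, New_Street_Nbr) AS count_value
    FROM #SortedAddresses  -- DupSearch is a derived-table alias and is not visible to a separate statement
) AS AlmostDuplicates
WHERE count_value > 1;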
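The same check can also be phrased with GROUP BY and HAVING, and extended to actually delete the extra rows. The sketch below again uses Python's `sqlite3` module as a stand-in for SQL Server; the `id` column, the table layout, and the sample rows are assumptions made for illustration, and keeping the lowest `id` in each group is one arbitrary choice of survivor.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sorted_addresses ("
    "id INTEGER PRIMARY KEY, street_nbr TEXT, prefix TEXT, street_nm TEXT)"
)
conn.executemany(
    "INSERT INTO sorted_addresses (street_nbr, prefix, street_nm) VALUES (?, ?, ?)",
    [
        ("123", "East", "Main"),  # originally "123 East Main"
        ("123", "East", "Main"),  # originally "123 E Main", now identical
        ("456", "West", "Oak"),
    ],
)

# List the groups of normalized parts that occur more than once.
duplicate_groups = conn.execute(
    """
    SELECT street_nbr, prefix, street_nm, COUNT(*) AS count_value
    FROM sorted_addresses
    GROUP BY street_nbr, prefix, street_nm
    HAVING COUNT(*) > 1
    """
).fetchall()

# Keep the lowest id in each group and delete the rest.
conn.execute(
    """
    DELETE FROM sorted_addresses
    WHERE id NOT IN (
        SELECT MIN(id) FROM sorted_addresses
        GROUP BY street_nbr, prefix, street_nm
    )
    """
)
remaining = conn.execute(
    "SELECT street_nbr, prefix, street_nm FROM sorted_addresses ORDER BY id"
).fetchall()
```

Running the listing query before the delete makes it possible to review the groups first, which is worth doing when the normalization rules are still being tuned.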
Conclusion
In this article, we explored the problem of almost duplicates in SQL results and walked through one way to identify and remove them. By counting substrings to split each address into parts, expanding abbreviations with a lookup table, unifying the old and new parts, and flagging groups that occur more than once, we can reduce the near-duplicate rows that the USPS match process introduces. The lookup table only covers the abbreviations it lists, so it needs to grow as new variants turn up in the data.
Last modified on 2025-04-30