Deleting Duplicate Records



Many of us have faced the problem of duplicate records in a SQL Server table. First, I'll need some duplicates to work with. I use the following script to create a table called dup_authors in the pubs database.

-- If the table exists, drop it
use pubs
go
If exists (select * from INFORMATION_SCHEMA.Tables where TABLE_NAME = 'dup_authors')
 drop table dup_authors

-- put the records in the database
select au_lname, au_fname, city, state
into dup_authors
from authors

-- duplicate the CA records
insert dup_authors
select au_lname, au_fname, city, state
from authors
where state = 'CA'

-- duplicate the KS record, twice
insert dup_authors
select au_lname, au_fname, city, state
from authors
where state = 'KS'

insert dup_authors
select au_lname, au_fname, city, state
from authors
where state = 'KS'

-- identify the duplicated records
select au_lname, au_fname, city, state, count(*)
from dup_authors
group by au_lname, au_fname, city, state
order by count(*) desc, au_lname

It selects a subset of the columns and creates some duplicate records. At the end it runs a SELECT statement to identify the duplicate records:

select au_lname, au_fname, city, state, count(*)
from dup_authors
group by au_lname, au_fname, city, state
having count(*) > 1
order by count(*) desc, au_lname, au_fname

The easiest way I know of to identify duplicates is to do a GROUP BY on all the columns in the table. It can get a little cumbersome if you have a large table. My duplicates look something like this:

au_lname        au_fname   city       state
--------------- ---------- ---------- ----- -----------
Smith           Meander    Lawrence   KS    3
Bennet          Abraham    Berkeley   CA    2
Carson          Cheryl     Berkeley   CA    2
except there are thirteen additional duplicates identified.

Second, back up your database. Third, make sure you have a good backup of your database.
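A minimal full backup looks something like this; the target path is only a placeholder, so point it at wherever you actually keep backups:

```sql
-- Full backup of pubs before de-duplicating anything.
-- The disk path below is an example only.
backup database pubs
to disk = 'C:\Backups\pubs_before_dedupe.bak'
with init
```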

Temp Table and Truncate

The simplest way to eliminate the duplicate records is to SELECT DISTINCT into a temporary table, truncate the original table and SELECT the records back into the original table. That query looks like this:

select distinct *
into #holding
from dup_authors

truncate table dup_authors

insert dup_authors
select *
from #holding

drop table #holding

If this is a large table, it can quickly fill up your tempdb. It isn't very fast either: it makes one copy of your data and then makes another. And while the script is running, your data is unavailable. It may not be the best solution, but it certainly works.
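One way to hedge against a failure partway through is to wrap the whole sequence in a transaction; TRUNCATE TABLE is logged enough in SQL Server to roll back. A sketch:

```sql
-- Same temp-table approach, but atomic: if anything fails
-- before the commit, the truncate can be rolled back.
begin transaction

select distinct *
into #holding
from dup_authors

truncate table dup_authors

insert dup_authors
select *
from #holding

drop table #holding

commit transaction
```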

Rename and Copy Back

The second option is to rename the original table to something else, and copy the unique records into the original table. That looks like this:

exec sp_rename 'dup_authors', 'temp_dup_authors'

select distinct *
into dup_authors
from temp_dup_authors

drop table temp_dup_authors

This has a couple of benefits over the first option. It doesn't use tempdb and it only makes one copy of the data. On the downside, you'll need to rebuild any indexes or constraints on the table when you're done. This one also makes the data unavailable during the process.
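For example, if the original table had an index, you would recreate it after the copy, since SELECT INTO carries over only the data. The index name and columns below are purely illustrative:

```sql
-- Recreate whatever indexes existed before the rename;
-- this particular index is just an example.
create index ix_dup_authors_name
on dup_authors (au_lname, au_fname)
```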

Create a Primary Key

Our last option is more complex. It has the benefit of not making a copy of the data and of deleting only the records that are duplicates. Its main drawback is that we have to alter the original table and add a sequential record number column to uniquely identify each record. That script looks like this:

-- Add a new column
-- In real life I'd put an index on it
Alter table dup_authors add NewPK int NULL
go

-- populate the new column with sequential values
-- (this UPDATE assigns the variable and the column in one pass,
-- so each row receives the next counter value)
declare @intCounter int
set @intCounter = 0
update dup_authors
SET @intCounter = NewPK = @intCounter + 1

-- ID the records to delete and get one primary key value also
-- We'll delete all but this primary key
select au_lname, au_fname, city, state, RecCount = count(*), PKtoKeep = max(NewPK)
into #dupes
from dup_authors
group by au_lname, au_fname, city, state
having count(*) > 1
order by count(*) desc, au_lname, au_fname

-- delete dupes except one Primary key for each dup record
delete dup_authors
from dup_authors a
join #dupes d
  on d.au_lname = a.au_lname
 and d.au_fname = a.au_fname
 and d.city = a.city
 and d.state = a.state
where a.NewPK not in (select PKtoKeep from #dupes)

-- remove the NewPK column
ALTER TABLE dup_authors DROP COLUMN NewPK
go

drop table #dupes

It's actually possible to combine the SELECT INTO #dupes and the DELETE into one DELETE statement. My script is easier to read and understand and shouldn't be much slower. This will run a single delete statement against your table and only remove the duplicate records. If you have a large table the join statement can get kind of large. I guess that's the price you pay for letting duplicates into your database.
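The combined version mentioned above might look like this; it is a sketch that assumes the NewPK column from the script is still in place:

```sql
-- One statement: for each group of identical
-- (au_lname, au_fname, city, state) values, keep the row
-- with the highest NewPK and delete the rest.
delete a
from dup_authors a
where a.NewPK not in (
    select max(NewPK)
    from dup_authors
    group by au_lname, au_fname, city, state
)
```

Rows that were never duplicated survive too, because each one is the max NewPK of its own single-row group.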


Submitted By : Nayan Patel  (Member Since : 5/26/2004 12:23:06 PM)

© 2008 BinaryWorld LLC. All rights reserved.