Deleting duplicate records
Sometimes it is necessary to remove duplicate, or partial duplicate records from a table. Usually this happens when a user has to go back and normalize data that wasn't previously done. It can be because of inexperience, or sometimes, when taking over a job from a different developer.
The simplest way is to add a unique index across all the columns of your table and then drop the index later. This is good when you don't care which row of a group is saved. Again you should do a back-up of your data for safety, you can't undelete rows!
ALTER IGNORE TABLE cars
ADD PRIMARY KEY deletelater(make, model, year);
Your primary key index now only allows one row for each make, model and year in your table. You can drop the index if you wanted to. You might do that because you need to change the structure of your table and then add a different primary key.
ALTER TABLE cars DROP INDEX deletelater
What if though, you did care which rows were saved? You might have an id column that was added later, or might go by some other column. Let's look at our cars table again and only keep one of each make and model from each model year. Perhaps it is an inventory of cars sold, but now we only care about which year was the last sold for each make and model. We certainly don't need three Ford Tauruses from 1999, nor do we need the older years of that car.
This method uses a co-related sub-query to get the latest record per group. If you don't know how to do that then read this article.
CARS
| Make | Model | Year |
| Ford | Taurus | 1999 |
| Ford | Taurus | 1999 |
| Ford | Taurus | 1999 |
| Ford | Taurus | 1998 |
| Ford | Taurus | 1998 |
| Ford | Taurus | 1997 |
| Ford | Taurus | 1996 |
| Chevy | S-10 | 2005 |
| Chevy | S-10 | 2004 |
| Chevy | S-10 | 2003 |
| Chevy | S-10 | 2002 |
| Chevy | S-10 | 2001 |
| Chrysler | Concorde | 2003 |
| Chrysler | Concorde | 2002 |
| Chrysler | Concorde | 2000 |
The three rows that we want to retrieve from the table:
Ford Taurus 1999
Chevy S-10 2005
Chrysler Concorde 2003
Don't run this query as is, I'm just showing it to you to refresh your memory. The full explanation on how it works is in the above article. Note this will delete all rows except for those three so do wait until we modify the query.
SELECT
c1.make,
c1.model,
c1.year
FROM cars as c1
WHERE c1.year = (SELECT MAX(year)
FROM cars
WHERE make=c1.make
AND model=c1.model)
For data back-up purposes it is always advisable to run such a query into a temp table, examine the rows in the new table and make sure they are what you want. Then rename the old table for deletion later and then rename the temp table back to the original name.
When making a temporary table you can do it on-the-fly like so:
CREATE TEMPORARY TABLE safedata
SELECT
c1.make,
c1.model,
c1.year
FROM cars as c1
WHERE c1.year = (SELECT MAX(year)
FROM cars
WHERE make=c1.make
AND model=c1.model)
Now check your data, rename your old table, rename the temp table and then delete the old table. You can drop the old table without renaming it, you just might want to retrieve data later so I always rename then drop.
ALTER TABLE oldtablename RENAME dropthistable;
ALTER TABLE safedata oldtablename;
DROP TABLE dropthistable;