T-SQL Recipes之删除重复行
准备基础数据
(1)创建辅助表,方便以后倾向于Set-base方式解决问题
-- Creating and Populating the Nums Auxiliary Table SET NOCOUNT ON; IF OBJECT_ID('dbo.Nums', 'U') IS NOT NULL DROP TABLE dbo.Nums; CREATE TABLE dbo.Nums(n INT NOT NULL PRIMARY KEY); DECLARE @max AS INT , @rc AS INT; SET @max = 1000000; SET @rc = 1; INSERT INTO dbo.Nums ( n ) VALUES ( 1 ); WHILE @rc * 2 <= @max BEGIN INSERT INTO dbo.Nums ( n ) SELECT n + @rc FROM dbo.Nums; SET @rc = @rc * 2; END INSERT INTO dbo.Nums ( n ) SELECT n + @rc FROM dbo.Nums WHERE n + @rc <= @max; GO
(2) 准备一些重复数据,这里使用AdventureWorks2014
PS:造成重复数据的原因有很多种,比如没有设置主键,批量导入的时候可能被执行了多次,归根到底还是人为原因。
USE AdventureWorks2014; GO IF OBJECT_ID('dbo.OrdersDups') IS NOT NULL BEGIN DROP TABLE dbo.OrdersDups; END SELECT SalesOrderID , CustomerID , SalesPersonID , OrderDate INTO dbo.OrdersDups FROM Sales.SalesOrderHeader CROSS JOIN dbo.Nums WHERE n <= 3;
如图,我们有了许多重复数据,现在就要考虑如何去重了。
解决方案
这里小陈提供三种去重方案,每一种方案都有相似性,也有独特之处。
1. 使用ROW_NUMBER() 函数去重,这种方式适合于比例小的重复数据。
USE AdventureWorks2014; GO WITH Dups AS ( SELECT SalesOrderID , CustomerID , SalesPersonID , OrderDate , ROW_NUMBER() OVER ( PARTITION BY SalesOrderID ORDER BY ( SELECT0) ) AS rn FROM dbo.OrdersDups) DELETE FROM Dups WHERE rn > 1;
在未删除之前,先看一下前10行的数据是什么模样.
2. 使用ROW_NUMBER() + RANK() 函数去重,这种方式跟第一种类似,也是适合比例小的重复数据。
USE AdventureWorks2014; GO WITH Dups AS ( SELECT SalesOrderID , CustomerID , SalesPersonID , OrderDate , ROW_NUMBER() OVER ( PARTITION BY SalesOrderID ORDER BY ( SELECT 0) ) AS rn, RANK() OVER(PARTITION BY SalesOrderID ORDER BY (SELECT 0)) AS rnk FROM dbo.OrdersDups) DELETE FROM Dups WHERE rn > 1;
3. 删除大量的重复行时,会导致日志行为,最终DELETE会执行的很慢。所以这里推荐的做法是复制唯一行到新表,减少日志操作。然后移除老表,把新表的名字重命名为老表。
USE AdventureWorks2014; GO WITH Dups AS ( SELECT SalesOrderID , CustomerID , SalesPersonID , OrderDate , ROW_NUMBER() OVER ( PARTITION BY SalesOrderID ORDER BY ( SELECT 0) ) AS rn FROM dbo.OrdersDups) SELECT SalesOrderID , CustomerID , SalesPersonID , OrderDate INTO dbo.OrdersDupsTmp FROM Dups WHERE rn = 1; DROP TABLE dbo.OrdersDups; EXEC sp_rename 'dbo.OrdersDupsTmp', 'OrdersDups';