In SQL, how do I collapse two rows into one?



Suppose I have the following table:

C1          C2      C3      C4
------------------------------------------------
Alton       James   Webs    AltonJamesWebs
Alton       Webs    Jams    AltonJamsWebs
Buddarakh   Izme    Grill   BuddarakhGrillIzme
Buddarakh   Gri     Izmezh  BuddarakhGriIzmezh

Converting the sample data to DDL/DML:

DECLARE @Table TABLE (C1 NVARCHAR(20), C2 NVARCHAR(20), C3 NVARCHAR(20), C4 NVARCHAR(20));
INSERT INTO @Table (C1, C2, C3, C4) VALUES
('Alton     ', 'James   ', 'Webs    ', 'AltonJamesWebs    '),
('Alton     ', 'Webs    ', 'Jams    ', 'AltonJamsWebs     '),
('Buddarakh ', 'Izme    ', 'Grill   ', 'BuddarakhGrillIzme'),
('Buddarakh ', 'Gri     ', 'Izmezh  ', 'BuddarakhGriIzmezh'),
('Buddarakh ', 'Gric    ', 'Izmezh  ', 'BuddarakhGriIzmezh');

We can perform a self-join, but first we assign row numbers so that we can keep track of the rows later:

;WITH nowWithRowNumber AS (
SELECT t.C1, t.C2, t.C3, t.C4, ROW_NUMBER() OVER (PARTITION BY C1 ORDER BY c2, c3, c4) AS rn
FROM @Table t
)
SELECT t.C1, t.C2, t.c3, t.C4, t2.C2 AS C2_2, t2.C3 AS C3_2, t2.C4 AS C4_2, t2.rn
FROM nowWithRowNumber t
INNER JOIN nowWithRowNumber t2
ON t.C1 = t2.C1
AND t2.rn <> 1
AND (
t.c2 <> t2.c2
OR t.c3 <> t2.c3
) 
WHERE t.rn = 1;

C1              C2          c3          C4                  C2_2        C3_2        C4_2                rn
----------------------------------------------------------------------------------------------------------
Alton           James       Webs        AltonJamesWebs      Webs        Jams        AltonJamsWebs       2
Buddarakh       Gri         Izmezh      BuddarakhGriIzmezh  Gric        Izmezh      BuddarakhGriIzmezh  2
Buddarakh       Gri         Izmezh      BuddarakhGriIzmezh  Izme        Grill       BuddarakhGrillIzme  3

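This assumes logic that you will need to confirm or tune: that rows should be joined when they match on C1 but differ in the other columns, and that rows should be partitioned by C1 and ordered by C2, C3, C4.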
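As one illustration of tuning that logic, here is a minimal sketch that returns exactly one output row per C1 value by pairing row 1 only with row 2. It rests on different assumptions from the query above: the "other columns differ" condition is dropped, any third or later row in a group is ignored, and a LEFT JOIN keeps C1 groups that have only a single row. The CTE name `numbered` is made up for this example.

```sql
DECLARE @Table TABLE (C1 NVARCHAR(20), C2 NVARCHAR(20), C3 NVARCHAR(20), C4 NVARCHAR(20));
INSERT INTO @Table (C1, C2, C3, C4) VALUES
('Alton', 'James', 'Webs', 'AltonJamesWebs'),
('Alton', 'Webs', 'Jams', 'AltonJamsWebs'),
('Buddarakh', 'Izme', 'Grill', 'BuddarakhGrillIzme'),
('Buddarakh', 'Gri', 'Izmezh', 'BuddarakhGriIzmezh'),
('Buddarakh', 'Gric', 'Izmezh', 'BuddarakhGriIzmezh');

;WITH numbered AS (
    -- Same numbering as before: restart at 1 for every C1 value
    SELECT t.C1, t.C2, t.C3, t.C4,
           ROW_NUMBER() OVER (PARTITION BY t.C1 ORDER BY t.C2, t.C3, t.C4) AS rn
    FROM @Table t
)
SELECT t.C1, t.C2, t.C3, t.C4,
       t2.C2 AS C2_2, t2.C3 AS C3_2, t2.C4 AS C4_2
FROM numbered t
LEFT JOIN numbered t2          -- LEFT JOIN: groups with one row get NULLs
    ON t2.C1 = t.C1
   AND t2.rn = 2               -- assumption: only the second row is paired
WHERE t.rn = 1;
```

With the sample data this gives two rows, one for Alton and one for Buddarakh. Whether discarding the third Buddarakh row is acceptable depends on what the collapsed rows are used for.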

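Trying to understand the broader context of your question, I think this is an XY problem. In my experience, whenever I have wanted to calculate Levenshtein distance I was trying to find duplicate rows, and once I had found them I always wanted to do something with them. Pivoting them into columns actually makes further processing much harder. So I would solve this by keeping the rows as they are, but matching each one to the first duplicate found in its C1 group. This also handles as many potential duplicates as exist, although to be fair it is fairly simplistic logic.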
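The query that follows calls `dbo.LevenshteinDistance`, which the answer names but does not define. Below is a minimal sketch of what such a scalar function might look like, using the standard two-row dynamic-programming approach. Everything in it is an assumption made for illustration: the 1000-character parameter size, the trick of storing each row of distances as a string of `NCHAR` codes, and the fact that character equality follows the database collation (so it may be case-insensitive). A loop-based scalar function like this is likely to be slow on large tables.

```sql
-- Hypothetical helper; CREATE FUNCTION must be the first statement in its batch.
CREATE FUNCTION dbo.LevenshteinDistance (@s nvarchar(1000), @t nvarchar(1000))
RETURNS int
AS
BEGIN
    IF @s IS NULL OR @t IS NULL RETURN NULL;

    -- LEN ignores trailing spaces, so padded values are measured as if trimmed.
    DECLARE @sLen int = LEN(@s), @tLen int = LEN(@t);
    IF @sLen = 0 RETURN @tLen;
    IF @tLen = 0 RETURN @sLen;

    -- Each row of the distance matrix is kept as a string: the character at
    -- position j + 1 encodes the distance for column j as NCHAR(distance + 1).
    DECLARE @prev nvarchar(1001) = N'', @curr nvarchar(1001);
    DECLARE @i int = 1, @j int = 0, @best int, @candidate int;

    -- Row 0: building the first j characters of @t from nothing costs j.
    WHILE @j <= @tLen
    BEGIN
        SET @prev = @prev + NCHAR(@j + 1);
        SET @j = @j + 1;
    END;

    WHILE @i <= @sLen
    BEGIN
        SET @curr = NCHAR(@i + 1);  -- column 0: delete the first i characters of @s
        SET @j = 1;
        WHILE @j <= @tLen
        BEGIN
            -- Substitution or match: prev[j - 1] + (0 if characters are equal, else 1)
            SET @best = UNICODE(SUBSTRING(@prev, @j, 1)) - 1
                      + CASE WHEN SUBSTRING(@s, @i, 1) = SUBSTRING(@t, @j, 1) THEN 0 ELSE 1 END;

            -- Deletion: prev[j] + 1 (the stored code is already distance + 1)
            SET @candidate = UNICODE(SUBSTRING(@prev, @j + 1, 1));
            IF @candidate < @best SET @best = @candidate;

            -- Insertion: curr[j - 1] + 1
            SET @candidate = UNICODE(SUBSTRING(@curr, @j, 1));
            IF @candidate < @best SET @best = @candidate;

            SET @curr = @curr + NCHAR(@best + 1);
            SET @j = @j + 1;
        END;
        SET @prev = @curr;
        SET @i = @i + 1;
    END;

    -- The last column of the final row holds the full distance.
    RETURN UNICODE(SUBSTRING(@prev, @tLen + 1, 1)) - 1;
END;
```

A quick sanity check is `SELECT dbo.LevenshteinDistance(N'kitten', N'sitting');`, a pair whose textbook distance is 3. The function has to be created in its own batch before the query that uses it is run.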

DECLARE @Table TABLE (Id int, C1 nvarchar(20), C2 nvarchar(20), C3 nvarchar(20), C4 nvarchar(20));
INSERT INTO @Table (Id, C1, C2, C3, C4) VALUES
(1, 'Alton', 'James', 'Webs', 'AltonJamesWebs'),
(2, 'Alton', 'Webs', 'Jams', 'AltonJamsWebs'),
(3, 'Buddarakh', 'Izme', 'Grill', 'BuddarakhGrillIzme'),
(4, 'Buddarakh', 'Gri', 'Izmezh', 'BuddarakhGriIzmezh'),
(5, 'Buddarakh', 'Gric', 'Izmezh', 'BuddarakhGriIzmezh');
WITH cte1 AS (
-- First find the row number within the C1 group
SELECT *
, ROW_NUMBER() OVER (PARTITION BY C1 ORDER BY Id) rn 
FROM @Table
), cte2 AS (
-- Second using lag for all but the first row, lag back using rn to the
-- first row in the C1 group
SELECT *
, CASE WHEN rn > 1 THEN LAG(Id, rn-1, null) OVER (PARTITION BY C1 ORDER BY Id) ELSE NULL END baseId
, CASE WHEN rn > 1 THEN LAG(C2, rn-1, null) OVER (PARTITION BY C1 ORDER BY Id) ELSE NULL END baseC2
, CASE WHEN rn > 1 THEN LAG(C3, rn-1, null) OVER (PARTITION BY C1 ORDER BY Id) ELSE NULL END baseC3
, CASE WHEN rn > 1 THEN LAG(C4, rn-1, null) OVER (PARTITION BY C1 ORDER BY Id) ELSE NULL END baseC4
FROM cte1
)
SELECT Id
, C1, C2, C3, C4
, baseId, baseC2, baseC3, baseC4
-- Some function to calculate Levenshtein Distance
, dbo.LevenshteinDistance(baseC4, C4) LevenshteinDistance
FROM cte2;

This returns:

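Id  C1         C2     C3      C4                  baseId  baseC2  baseC3  baseC4
----------------------------------------------------------------------------------------------
1   Alton      James  Webs    AltonJamesWebs      NULL    NULL    NULL    NULL
2   Alton      Webs   Jams    AltonJamsWebs       1       James   Webs    AltonJamesWebs
3   Buddarakh  Izme   Grill   BuddarakhGrillIzme  NULL    NULL    NULL    NULL
4   Buddarakh  Gri    Izmezh  BuddarakhGriIzmezh  3       Izme    Grill   BuddarakhGrillIzme
5   Buddarakh  Gric   Izmezh  BuddarakhGriIzmezh  3       Izme    Grill   BuddarakhGrillIzme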
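Since every non-first row looks back to the first row of its C1 group, the same lookup can probably also be written with `FIRST_VALUE` instead of `LAG` with a computed offset. The sketch below is an alternative formulation, not the method used above; the CTE name `firsts` and the `first*` column names are made up for the example, and the Levenshtein column is left out so that the block runs without any user-defined function.

```sql
DECLARE @Table TABLE (Id int, C1 nvarchar(20), C2 nvarchar(20), C3 nvarchar(20), C4 nvarchar(20));
INSERT INTO @Table (Id, C1, C2, C3, C4) VALUES
(1, 'Alton', 'James', 'Webs', 'AltonJamesWebs'),
(2, 'Alton', 'Webs', 'Jams', 'AltonJamsWebs'),
(3, 'Buddarakh', 'Izme', 'Grill', 'BuddarakhGrillIzme'),
(4, 'Buddarakh', 'Gri', 'Izmezh', 'BuddarakhGriIzmezh'),
(5, 'Buddarakh', 'Gric', 'Izmezh', 'BuddarakhGriIzmezh');

WITH firsts AS (
    SELECT *
         -- Position of the row within its C1 group
         , ROW_NUMBER()    OVER (PARTITION BY C1 ORDER BY Id) AS rn
         -- Values taken from the first row of the C1 group
         , FIRST_VALUE(Id) OVER (PARTITION BY C1 ORDER BY Id) AS firstId
         , FIRST_VALUE(C2) OVER (PARTITION BY C1 ORDER BY Id) AS firstC2
         , FIRST_VALUE(C3) OVER (PARTITION BY C1 ORDER BY Id) AS firstC3
         , FIRST_VALUE(C4) OVER (PARTITION BY C1 ORDER BY Id) AS firstC4
    FROM @Table
)
SELECT Id
     , C1, C2, C3, C4
     -- The first row of each group has no base row, so blank out its own values
     , CASE WHEN rn > 1 THEN firstId END AS baseId
     , CASE WHEN rn > 1 THEN firstC2 END AS baseC2
     , CASE WHEN rn > 1 THEN firstC3 END AS baseC3
     , CASE WHEN rn > 1 THEN firstC4 END AS baseC4
FROM firsts
ORDER BY Id;
```

The `CASE` expressions are still needed because `FIRST_VALUE` returns the first row's own values for that first row, whereas the table above shows NULL there.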
