如何使用SAS、SQL或伪代码创建更准确的组ID



我有一个大数据集,其中一部分看起来如下。假设每组相同的KEY1(或KEY2(值代表相同的组:

YEAR    KEY1   KEY2
1998    218    883
1999    218    883
2000    218    883
2001    218    883
1995    300    633
1996    300    633
1997    300    633
2003    822    119
2004    822    119
2005    822    119
2005    992    476

我使用KEY1来表示每个组,但是有些组的KEY1不一致:

YEAR    KEY1    KEY2
2002    712     683
2003    344     683
2004    712     683

KEY2也可能不一致:

YEAR    KEY1    KEY2
2002    473     588
2003    473     588
2004    473     957

我正在治疗";344〃;以及";957";作为数据错误,以及KEY1和KEY2 中的所有其他不一致

我的目标是:创建一个名为GROUPID的变量,它在分组方面比只使用KEY1更准确

其他需要注意的事项:

  1. 年份并不总是连续的
  2. KEY1比KEY2更准确
  3. 对于同一组,KEY1和KEY2可能不一致
  4. 数据集很大,所以我无法手动搜索

可能的解决方案:

  1. 一个程序,它搜索整个数据集并返回KEY1的问题行,这样我就可以将KEY1复制到GROUPID,然后手动编辑它们
  2. 一个程序,可以在使用KEY1和KEY2之间切换,具体取决于哪个创建了一个更大的组。然后创建一个GROUPID来利用它

现在我只使用KEY1。对于上述任一解决方案,SAS、SQL或伪代码可能是什么样子?我对其他解决方案持开放态度。

这里有一个sql fiddle,它是使用MS sql Server的解决方案。对于大型数据集,索引临时表的使用将优于普通表表达式。然而,在时间允许的情况下,这只是可能改进的地方。

SQL Fiddle

MS SQL Server 2017架构设置:

CREATE TABLE MyData( Year INT, Key1 INT, Key2 INT )
INSERT INTO MyData VALUES
(1998,    218,    883),
(1999,    218,    883),
(2000,    218,    883),
(2001,    218,    883),
(1995,    300,    633),
(1996,    300,    633),
(1997,    300,    633),
(2003,    822,    119),
(2004,    822,    119),
(2005,    822,    119),
(2005,    992,    476),
--
(2002,    712,    683),
(2003,    344,    683),
(2004,    712,    683),
(2002,    473,    588),
(2003,    473,    588),
(2004,    473,    957),
--
(2002,    712,    222),
(2003,    344,    222),
(2004,    712,    222),
(2002,    473,    111),
(2003,    473,    111),
(2004,    473,    111)

查询1

;WITH DataInSomeOrder AS
(
SELECT Year,Key1,Key2, RowNumber = ROW_NUMBER() OVER(PARTITION BY 1 ORDER BY (SELECT 1))  
FROM MyData 
),
InstancesOfKeys AS
(
SELECT Year,Key1,Key2,Key1InstanceNumber,Key2InstanceNumber,RowNumber FROM (
SELECT Year,Key1,Key2,RowNumber,
Key1InstanceNumber = RANK() OVER(PARTITION BY Key1 ORDER BY RowNumber),
Key2InstanceNumber = RANK() OVER(PARTITION BY Key2 ORDER BY RowNumber)
FROM DataInSomeOrder
)AS X 
)
SELECT 
D.Year, D.Key1, D.Key2, D.RowNumber,
DuplicateKeyNumber = CASE WHEN K1.Year IS NULL THEN 2 ELSE 1 END   
FROM 
InstancesOfKeys D
LEFT JOIN InstancesOfKeys K1 ON K1.Key1=D.Key1 AND K1.Key2=D.Key2 AND K1.Key1InstanceNumber=1
WHERE
(D.Key1InstanceNumber<>D.Key2InstanceNumber) 
AND
(K1.Key1InstanceNumber IS NULL OR K1.Key1InstanceNumber<>K1.Key2InstanceNumber)
ORDER BY 
D.RowNumber

结果

| Year | Key1 | Key2 | RowNumber | DuplicateKeyNumber |
|------|------|------|-----------|--------------------|
| 2003 |  344 |  683 |        13 |                  1 |
| 2004 |  473 |  957 |        17 |                  2 |
| 2002 |  712 |  222 |        18 |                  2 |
| 2004 |  712 |  222 |        20 |                  2 |
| 2002 |  473 |  111 |        21 |                  2 |
| 2003 |  473 |  111 |        22 |                  2 |
| 2004 |  473 |  111 |        23 |                  2 |

不使用CTE的

SQL Fiddle

查询1

SELECT 
D.Year, D.Key1, D.Key2, D.RowNumber,
DuplicateKeyNumber = CASE WHEN K1.Year IS NULL THEN 2 ELSE 1 END   
FROM 
(
SELECT Year,Key1,Key2,Key1InstanceNumber,Key2InstanceNumber,RowNumber FROM (
SELECT Year,Key1,Key2,RowNumber,
Key1InstanceNumber = RANK() OVER(PARTITION BY Key1 ORDER BY RowNumber),
Key2InstanceNumber = RANK() OVER(PARTITION BY Key2 ORDER BY RowNumber)
FROM 
(
SELECT Year,Key1,Key2, RowNumber = ROW_NUMBER() OVER(PARTITION BY 1 ORDER BY (SELECT 1))  
FROM MyData
) SomeOrder  
)AS X 
)AS D
LEFT JOIN 
(
SELECT Year,Key1,Key2,Key1InstanceNumber,Key2InstanceNumber,RowNumber FROM (
SELECT Year,Key1,Key2,RowNumber,
Key1InstanceNumber = RANK() OVER(PARTITION BY Key1 ORDER BY RowNumber),
Key2InstanceNumber = RANK() OVER(PARTITION BY Key2 ORDER BY RowNumber)
FROM 
(
SELECT Year,Key1,Key2, RowNumber = ROW_NUMBER() OVER(PARTITION BY 1 ORDER BY (SELECT 1))  
FROM MyData
) SomeOrder  
)AS X 
)AS K1 ON K1.Key1=D.Key1 AND K1.Key2=D.Key2 AND K1.Key1InstanceNumber=1
WHERE
(D.Key1InstanceNumber<>D.Key2InstanceNumber) 
AND
(K1.Key1InstanceNumber IS NULL OR K1.Key1InstanceNumber<>K1.Key2InstanceNumber)
ORDER BY 
D.RowNumber

结果

| Year | Key1 | Key2 | RowNumber | DuplicateKeyNumber |
|------|------|------|-----------|--------------------|
| 2003 |  344 |  683 |        13 |                  1 |
| 2004 |  473 |  957 |        17 |                  2 |
| 2002 |  712 |  222 |        18 |                  2 |
| 2004 |  712 |  222 |        20 |                  2 |
| 2002 |  473 |  111 |        21 |                  2 |
| 2003 |  473 |  111 |        22 |                  2 |
| 2004 |  473 |  111 |        23 |                  2 |

您可以使用返回机器拨打2004年的"如何根据人们的名字或姓氏对他们进行分组;参见

给定一组对(key1,key2(,找到对,使得组中的每对都具有以下属性:

  • key1匹配组中任何其他对的key1
    OR
    key2与组中任何其他对的key2匹配

使用版本9哈希的迭代方式。两个散列维护分配给每个键值的groupId。另外两个散列用于维护组映射路径。当数据可以在不引起映射的情况下传递时,则组具有已完全确定。然后完成最后一次传递,在这一点上将groupId分配给每个一对

样本代码

%let seed =
%sysfunc(mod(%sysfunc(compress(%sysfunc(constant(e)),.)),2**31));
* random pair data;
data pairs;
do id = 1 to 1e4;
key1 = int (1e4*ranuni(&seed));
key2 = int (1e4*ranuni(&seed));
output;
end;
run;
/*
data pairs;
id + 1;
input key1 $ key2 $;
cards;
John Smith
George Smith
Bill Clinton
George Bush
;
data pairs;
id + 1;
input key1 key2 ;
format _numeric_ 4.;
cards;
1 2
3 2
4 5
3 6
4 2
;
*/
%let dbg = *;
data pairsWithGroupAssignments ;
declare hash one();
one.definekey ('key1');
one.definedata ('key1', 'groupid');
one.definedone();
declare hash two();
two.definekey ('key2');
two.definedata ('key2', 'groupid');
two.definedone();
declare hash map1();
map1.definekey ('from');
map1.definedata ('from', 'to');
map1.definedone();
declare hash map2();
map2.definekey ('from');
map2.definedata ('from', 'to');
map2.definedone();
_groupId = 0;
noMappings = 0;
do until (noMappings and outputDone);
doOutput = noMappings;
noMappings = 1;
do _n_ = 1 to numberOfPairs;
set pairs nobs=numberOfPairs point=_n_;
rc1 = one.find(); g1 = groupId;
rc2 = two.find(); g2 = groupId;
if doOutput then do;
output;
continue;
end;
&dbg.
put id= '(' key1 +(-1) ', ' key2 +(-1) ') ' @;
if rc1 ne 0 and rc2 ne 0 then do;
/** /
addboth:
/**/
_groupId + 1;
groupId = _groupId;
one.add ();
two.add ();
&dbg. put 'add ' key1= 'and ' key2= 'to ' groupId=;
end;
else
if rc1 ne 0 and rc2 = 0 then do;
/** /
add1:
/**/
groupId = g2;
one.add();
&dbg. put 'add ' key1= 'to ' groupId=;
end;
else
if rc1 = 0 and rc2 ne 0 then do;
/** /
add2:
/**/
groupId = g1;
two.add();
&dbg. put 'add ' key2= 'to ' groupId=;
end;
else
if g1 > g2 then do;
/** /
g1g2:
/**/
from = g1;
to = g2;
* determine groupid by following map1;
_to = to;
do while (map1.find(key:_to) = 0);
_to = to;
end;
from = g1;
map1.replace();
groupId = to;
one.replace();
&dbg. put 'add ' key1= 'to ' groupId= 'mapped from key1 group ' from;
noMappings = 0;
end;
else
if g2 > g1 then do;
/** /
g2g1:
/**/
from = g2;
to = g1;
* determine groupid by following map2;
to_ = to;
do while (map2.find(key:to_) = 0);
to_ = to;
end;
from = g2;
map2.replace();
groupId = to;
two.replace();
&dbg. put 'add ' key2= 'to ' groupId= 'mapped from key2 group ' from;
noMappings = 0;
end;
else do;
/** /
same:
/**/
&dbg. put rc1= rc2= g1= g2=;
end;
end;
nPass + 1;
outputDone = doOutput;
end;
put 'NOTE: Data iterated ' npass 'times.';
/*
two.output(dataset:'g2');
one.output(dataset:'g1');
map1.output(dataset:'map1');
map2.output(dataset:'map2');
*/
stop;
keep id key1 key2 groupId;
format _numeric_ 8.;
run;
proc sql noprint;
select count(distinct groupId) into :ngroups from &syslast;
%put &=ngroups;

在中讨论了通过网络跟踪关系或路径多年来的几个先前SAS-L listserv线程(<=2004(。

你可能还会发现https://www.devenezia.com/downloads/sas/samples#superset有趣的或有用的。

相关内容

  • 没有找到相关文章

最新更新