Comparing two data sets in SAS



I have the following data sets:

data data_one;
length X 3
Y $ 20; 
input x y ;
datalines;
1 test
2 test
3 test1
4 test1
5 test
6 test
7 test1
;
run;
data data_two;
length Z 3
       A $ 20;
input Z A;
datalines;
1 test
2 test1
3 test2
;
run;

What I want is a data set that tells me how often each string in column A of data_two occurs in column Y of data_one. The result should look like this:

 Obs    test    test1    test2
  1       4       3        0

Thanks in advance!

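  1. First, we need a count of each Y value that occurs in data_one.
  2. Then we create a sorted, de-duplicated list of the values present in data_two (sorted because the next step is a merge).
  3. The Y counts from step 1 are merged with the list from step 2. Y values that exist in data_two but not in data_one (b and not a) are assigned count=0, and Y values that do not exist in data_two are dropped (if b).
  4. The last step transposes the vertical list of counts into a set of horizontal variables.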

proc freq data=data_one noprint;
    table y / out=count_one (keep=y count);
run;
proc sort data=data_two out=list_two (keep=a rename=(a=y)) nodupkey;
    by a;
run;
data count_all;
    merge count_one (in=a) list_two (in=b);
    by y;
    if (b and not a) then count=0;
    if b;
run;
proc transpose data=count_all out=final (drop=_name_ _label_);
    id y;
run;

The first three steps can be replaced by a single PROC SQL step:

proc sql;
    create table count_all as
    select distinct
            coalesce(t1.y,t2.a) as y,
            case
                when missing(t1.y) then 0 
                else count(t1.y)
            end as N
        from data_one as t1
        right join data_two as t2
            on t1.y=t2.a
        group by 1
        order by 1;
quit;
proc transpose data=count_all out=final (drop=_name_);
    id y;
run;
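
A possible simplification of the SQL step, offered as a sketch rather than a tested replacement: in PROC SQL, count() of a column counts only its non-missing values, so a value of A with no match in data_one should already come out as 0 without the CASE expression. The inline view distinct_two is a name introduced here for illustration; it de-duplicates A so that repeated values in data_two would not inflate the counts.

```sas
proc sql;
    create table count_all as
    select t2.a as y,
           /* count() ignores missing t1.y, so unmatched values of A get 0 */
           count(t1.y) as N
        /* distinct_two is a hypothetical alias: one row per distinct A */
        from (select distinct a from data_two) as t2
        left join data_one as t1
            on t1.y = t2.a
        group by t2.a
        order by t2.a;
quit;
```

With the sample data above, this should give N=4 for test, N=3 for test1 and N=0 for test2, and the same PROC TRANSPOSE step can then be applied to count_all.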
