首先.最后.多次发生



我正试图使用SAS的first.variable和last.variable从数据集中收集日期。以下是创建可复制示例的代码:

data example;
infile datalines delimiter = ",";
input id $ code $ valid_from valid_to;
format valid_from IS8601DA10. valid_to IS8601DA10.;
datalines;
1A,ABC,20058,20177
1A,DEF,20178,20481
1A,DEF,20482,20605
1A,DEF,20606,21548
1A,DEF,21549,21638
1A,DEF,21639,21729
1A,ABC,21730,21733
1A,ABC,21734,21808
1B,MNO,20200,20259
1B,PQR,20260,20269
1B,STU,20270,20331
1B,VWX,20332,20361
1B,VWX,20362,22108
1B,VWX,22109,22164
1B,VWX,22165,22165
1B,VWX,22166,2936547
;
run;

其想法是,对于每个id,每个code只得到一个观测值及其所涵盖的相应日期范围。

这是我的代码:

proc sort data=example out=example_sorted; by code valid_from; run;
data collapse_val_dates; 
set example_sorted;
by code valid_from;
if first.code = 1 and last.code = 1 then do;
output;
end;
if first.code = 1 and last.code = 0 then do;
hold = valid_from;
retain hold;
end;
if first.code = 0 and last.code = 1 then do;
valid_from = hold;
output;
end;
drop hold;
run;

以下是结果(表collapse_val_dates(:

+----+------+------------+------------+
| id | code | valid_from |  valid_to  |
+----+------+------------+------------+
| 1A | ABC  | 2014-12-01 | 2019-09-16 |
| 1A | DEF  | 2015-03-31 | 2019-06-29 |
| 1B | MNO  | 2015-04-22 | 2015-06-20 |
| 1B | PQR  | 2015-06-21 | 2015-06-30 |
| 1B | STU  | 2015-07-01 | 2015-08-31 |
| 1B | VWX  | 2015-09-01 | 9999-12-31 |
+----+------+------------+------------+

它产生了我对id=1B的期望,而不是对id=1A的期望。事实上,由于code=ABC在开始时出现一次,在结束时出现两次,因此结果表将valid_from=2014-12-01放入。

我想要的是valid_from,而code=ABC2019-06-30。换句话说,我希望SAS";忘记";如果中间有一个(或多个(其他代码,则该代码的第一次出现。最后的表格是这样的:

+----+------+------------+------------+
| id | code | valid_from |  valid_to  |
+----+------+------------+------------+
| 1A | DEF  | 2015-03-31 | 2019-06-29 |
| 1A | ABC  | 2019-06-30 | 2019-09-16 |
| 1B | MNO  | 2015-04-22 | 2015-06-20 |
| 1B | PQR  | 2015-06-21 | 2015-06-30 |
| 1B | STU  | 2015-07-01 | 2015-08-31 |
| 1B | VWX  | 2015-09-01 | 9999-12-31 |
+----+------+------------+------------+

对于数据的单次传递,在串行处理组时不能输出code的日期范围,因为组中稍后的行可能会覆盖所需的日期范围。

您需要选择

  • 编码多个步骤,或者
  • 执行单次传递并使用临时存储

假定haveid valid_from排序,并且valid_to从不与id组内的后续valid_from重叠。

多步骤

计算由连续code分组的行的group number,以用于最终排序。

* multi step way;
data stage1;
set have;
by id code notsorted;
if first.code then group_number+1;
run;
proc sort data=stage1 out=stage2;
by id code group_number valid_from;
run;
* remember there can be multiple contiguous code groups within id & group;
data stage3;
do until (last.code);
set stage2;
by id code group_number;
if first.group_number then _start = valid_from;
if last.code then do;
valid_from = _start;
OUTPUT;                /* date range for contiguous code group */
end;
end;
drop _start;
run;
proc sort data=stage3 out=want(drop=group_number);
by id valid_from;
run;

单程

DOW循环(其中包含SET语句的循环(可以在组和子群上计算结果,并为每个组合输出一行。临时存储可以是散列(对于任意数量的子组(,也可以是假设最大子组数量的数组。

示例:

固定大小1,000的临时数组用于存储在检查组时应修改的临时数据。

* find the range of the dates from the last set of contiguous rows of a code within id;
data want(keep=id code valid_:);
array dates (1000,2)   8 _temporary_; /* ,1 for _from and ,2 for _to */
array codes (1000)   $50 _temporary_;
array seq   (1000)     8 _temporary_; /* sequence for output order */
* process the id group;
do _n_ = 1 by 1 until (last.id);
set have;
by id code notsorted;
* save start of date range in temporay storage;
if first.code then do;
* linear search for slot to use for subgroup code;
do _index = 1 by 1 
until (missing(codes(_index)) or codes(_index)=code);
end;
codes(_index) = code;
dates(_index,1) = valid_from;
seq  (_index) = _n_ + _index / 1000; * encode order value with lookup index;
end;
* save end of date range;
if last.code then 
dates(_index,2) = valid_to;
end;
*---;
* process each code within id;
call sort (of seq(*)); * order of first date for last code subgroup;
do _index = 1 to dim(seq);
if missing(seq(_index)) then continue;
* extract encoded information;
_ix = round((seq(_index) - int(seq(_index))) * 1000);
code = codes(_ix);
valid_from = dates(_ix,1);
valid_to   = dates(_ix,2);
OUTPUT;
end;
* clear out temporary arrays for next group processing;
call missing (of dates(*), of codes(*), of seq(*));
run;

相关内容

  • 没有找到相关文章

最新更新