如何在 SAS 中对同时包含字符变量和数值变量的表进行聚类分析?


Account_id <- c("00qwerf1”, “00uiowe3”, “11heooiue” , “11heooihe” , 
"00sdffrg3”, “03vthjygjj”, “11mpouhhu” , “1poihbusw”)
Postcode <- c(“EN8 7WD”, “EN7 9BB”, “EN6 8YQ”, “EN8 7TT”, “EN7 9BC”, “EN6 
8YQ”, “EN8 7WD”, “EN7 7WB) 
Age <- c(“30”, “35”, “40”, “50”, “60”, “32”, “34”, “45”)
DF <- data.frame(Account_id, Postcode, Age)

我想在 SAS 中对数据帧进行聚类分析。我知道从技术上讲,SAS中不使用数据帧,但是我只是将这种格式用于说明目的。Account_idPostcode都是字符变量,Age是数值变量。

下面是我在执行数据步骤后使用的代码;

Proc fastclus data=DF maxc-8 maxiter=10 seed=5 out=clus;
Run;

聚类分析不起作用,因为Account_idPostcode是字符变量。有没有办法将这些变量更改为数值变量,或者是否有一种同时适用于字符变量和数值变量的聚类方法?

在执行聚类分析之前,需要定义一个可用于计算观测值之间距离的指标。默认情况下,proc fastclus使用欧几里得度量。这要求所有输入变量都是数值,并且如果将它们全部重新缩放为具有相同的均值和方差,则效果最佳,以便在增长聚类时它们都同样重要。

如果要对每个邮政编码执行单独的聚类分析,则可以在by语句中使用 postcode,但如果要将 postcode 本身用作聚类变量,则需要将其转换为数字形式。将邮政编码替换为邮政编码质心的纬度和经度的两个变量可能是一个不错的选择。

对于您的账户 ID 变量来说,什么是好的选择不太明显,因为这似乎不是任何东西的度量。我会尝试掌握其他内容,例如帐户创建日期或上次活动日期,可以以更明显的方式将其转换为数值。

您可以确定每个变量的唯一值,然后将原始值的序数分配为数字表示形式,以便fastclus

示例代码

注意:FASTCLUS seed= 选项是一个数据集说明符,而不是一个简单的数字(如与随机数生成器一起使用(

* hacky tweak to place your R coded data values in a SAS data set;
data have;
array _Account_id(8) $20 _temporary_ ("00qwerf1", "00uiowe3", "11heooiue" , "11heooihe" , 
"00sdffrg3", "03vthjygjj", "11mpouhhu" , "1poihbusw");
array _postcode(8) $7 _temporary_ ("EN8 7WD", "EN7 9BB", "EN6 8YQ", "EN8 7TT", "EN7 9BC", "EN6 
8YQ", "EN8 7WD", "EN7 7WB");
array _age (8) $3  _temporary_ ("30", "35", "40", "50", "60", "32", "34", "45");
do _n_ = 1 to dim (_account_id);
Account_id = _account_id(_n_);
Postcode = _postcode(_n_);
Age = _age(_n_);
output;
end;
run;
* get lists of distinct values for each variable;
proc means noprint data=have;
class _all_;
ways 1;
output out=have_freq;
run;
* compute ordinal of each variables original value;
data have_freq2;
set have_freq;
if not missing(Account_id) then unum_Account_id + 1;
if not missing(Postcode) then unum_Postcode + 1;
if not missing(Age) then unum_Age + 1;
run;
* merge back by original value to obtain ordinal values;
proc sql;
create table have_unumified as
select 
Account_id, Postcode, Age
, (select unum_Account_id from have_freq2 where have_freq2.Account_id = have.Account_id) as unum_Account_id
, (select unum_Postcode   from have_freq2 where have_freq2.Postcode = have.Postcode) as unum_Postcode
, (select unum_Age   from have_freq2 where have_freq2.Age = have.Age) as unum_Age
from have
;
run;
* fastclus on the ordinal values (seed= not specified);
Proc fastclus data=have_unumified maxc=8 maxiter=10 out=clus_on_unum;
var unum_:;
Run;

最新更新