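I need to write R code that does the following:
- Loop through a column
- Split each value on commas and assign the pieces to a variable
- Compare the values in that variable against the existing column names
- If a column name does not exist, create a new column for each comma-separated value
- Fill in 1 for the current observation in that new column
- If the column name already exists, put 1 in the existing column with that name
The data (column) before manipulation looks like: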
jobTitle
1 <NA>
2 <NA>
3 <NA>
4 Functional Architect, Business Technology
5 <NA>
6 <NA>
7 <NA>
8 <NA>
9 <NA>
10 Founder and President
11 Product Manager
12 <NA>
13 <NA>
14 <NA>
15 Head of Customer Experience & Online Sales
16 <NA>
17 <NA>
18 Founder and President
19 <NA>
20 <NA>
21 Product Manager
22 <NA>
23 Customer Value Manager
24 <NA>
25 Lead Software Developer
...
The output I need is:
Founder and President  Product Manager
                    0                1
                    1                0
                    0                1
                    1                0
The output I am getting is:
Founder and President  Product Manager  Founder and President  Product Manager
                    0                1                      0                0
                    1                0                      0                0
                    0                0                      1                0
                    0                0                      0                1
The code I have is:
library(plyr)
library(stringr)
library(gdata)
library(readxl)
train <- read_excel("data.xlsx")
#looping through the jobTitle column
for (i in seq_len(nrow(train))) {
  if (!is.na(train[i, 4])) {
    #split every value by the comma, convert to lower case
    list2char <- strsplit(tolower(train$jobTitle[i]), ",", fixed = TRUE)
    for (j in 1:length(list2char[[1]])) {
      #populate the current observation for the newly created column with 1
      if (!(list2char[[1]][j] %in% names(train))) {
        #if the name does not match existing column name, create a new column and assign 1
        train[i, str_trim(list2char[[1]][j])] <- 1
      } else {
        #if the name matches an existing column name, assign 1 to that column
      }
    }
  }
}
#replace all NAs with 0s
train[is.na(train)] <- 0
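A minimal sketch of how the loop described above might be restructured, offered as one possible reading of the requirements rather than a confirmed fix. It assumes the intent is one 0/1 column per distinct piece. Two details differ from the loop above: the piece is trimmed before it is compared with names(train), and the 1 is assigned whether or not the column already existed. The small data frame is a hypothetical stand-in for data.xlsx.

```r
# Hypothetical sample standing in for the jobTitle column of data.xlsx
train <- data.frame(
  jobTitle = c(NA, "Founder and President", "Product Manager",
               "Founder and President",
               "Functional Architect, Business Technology"),
  stringsAsFactors = FALSE
)

for (i in seq_len(nrow(train))) {
  # skip observations with no job title
  if (is.na(train$jobTitle[i])) next

  # split on commas, then lower-case and trim BEFORE comparing with names
  parts <- strsplit(train$jobTitle[i], ",", fixed = TRUE)[[1]]
  parts <- trimws(tolower(parts))

  for (p in parts) {
    if (!(p %in% names(train))) {
      # unseen value: create the column once, filled with 0
      train[[p]] <- 0
    }
    # mark the current observation, new column or existing one
    train[i, p] <- 1
  }
}

print(train[, -1])
```

Because new columns start at 0, the final step that replaces NAs with 0s is not needed for the indicator columns in this version.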
I think you are trying to count the frequency of each value in the comma-delimited strings?
s<-data.frame(A=c("A1,B", "A2,C1"),B=c("B1,B2","C1,A1"), C=c("C1,C2,C3","C4"))
# A B C
#1 A1,B B1,B2 C1,C2,C3
#2 A2,C1 C1,A1 C4
table( unlist(apply(s,1, function(s.row) {
strsplit(s.row,",")
})) )
#A1 A2 B B1 B2 C1 C2 C3 C4
#2 1 1 1 1 3 1 1 1
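If per-row 0/1 columns are wanted rather than overall counts, the same strsplit idea can be extended. The following is a rough sketch under that assumption, using base R only. The titles vector is hypothetical sample data, and levels_found and indicators are names introduced here for illustration.

```r
# Hypothetical sample of the jobTitle column
titles <- c(NA, "Founder and President", "Product Manager",
            "Founder and President",
            "Functional Architect, Business Technology")

# one trimmed character vector per observation; NA entries stay NA
pieces <- lapply(strsplit(titles, ",", fixed = TRUE), trimws)

# the distinct non-missing values become the column names
all_pieces <- unlist(pieces)
levels_found <- sort(unique(all_pieces[!is.na(all_pieces)]))

# one row per observation: 1 where the value occurs in that row, else 0
indicators <- do.call(rbind, lapply(pieces, function(p) {
  as.integer(levels_found %in% p)
}))
colnames(indicators) <- levels_found

print(indicators)
print(colSums(indicators))  # column sums give the per-value frequencies
```

Building the set of distinct values first means each value gets exactly one column, so whitespace or ordering differences between rows should not produce repeated column names.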