R: How to compare a string against a list of existing column names

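I need to write R code that does the following:

Loop through a column
Split each value on commas and assign the pieces to a variable
Compare the values in that variable against the existing column names
If the column name does not exist, create a new column for each comma-separated value
Fill in 1 for the current observation in that new column
If the column name already exists, put 1 in the existing column with that name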

Before any manipulation, the data (the column) looks like this:

                                     jobTitle
1                                        <NA>
2                                        <NA>
3                                        <NA>
4   Functional Architect, Business Technology
5                                        <NA>
6                                        <NA>
7                                        <NA>
8                                        <NA>
9                                        <NA>
10                      Founder and President
11                            Product Manager
12                                       <NA>
13                                       <NA>
14                                       <NA>
15 Head of Customer Experience & Online Sales
16                                       <NA>
17                                       <NA>
18                      Founder and President
19                                       <NA>
20                                       <NA>
21                            Product Manager
22                                       <NA>
23                     Customer Value Manager
24                                       <NA>
25                    Lead Software Developer
  ...

The output I need is:

Founder and President  Product Manager
       0                       1        
       1                       0      
       0                       1
       1                       0

The output I get is:

Founder and President  Product Manager  Founder and President  Product Manager
       0                       1                   0                 0      
       1                       0                   0                 0     
       0                       0                   1                 0      
       0                       0                   0                 1

The code I have is:

library(plyr)
library(stringr)
library(gdata) 
library(readxl)
train <- read_excel("data.xlsx")
#looping through the jobTitle column
for(i in 1:nrow(train)){
    if (!is.na(train[i,4])) {
        #split every value by the comma, convert to lower case
        list2char <- strsplit(tolower(train$jobTitle[i]),",", fixed = T)
        for(j in 1:length(list2char[[1]])) {
            #populate the current observation for the newly created column with 1
            if(!(list2char[[1]][j] %in% names(train))){
                #if the name does not match existing column name, create a new column and assign 1
                train[i, str_trim(list2char[[1]][j])] <- 1
            }else{
                #if the name matches an existing column name, assign 1 to that column
            }
        }
    }
}
#replace all NAs with 0s
train[is.na(train)] <- 0
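
Two things stand out in the loop above: the `%in%` test uses the untrimmed piece while the assignment uses the trimmed one, and the `else` branch is empty, so a title that already has a column never receives its 1. A minimal sketch of one possible fix follows. The small `train` data frame is a made-up stand-in for the spreadsheet (an assumption, not the real data), and base R's `trimws` is swapped in for `str_trim` so the example needs no packages.

```r
# Hypothetical stand-in for the spreadsheet; only jobTitle matters here.
train <- data.frame(
  jobTitle = c(NA, "Founder and President", "Product Manager",
               "Functional Architect, Business Technology",
               "Founder and President", NA, "Product Manager"),
  stringsAsFactors = FALSE
)

for (i in seq_len(nrow(train))) {
  # Rows without a job title contribute nothing
  if (is.na(train$jobTitle[i])) next

  # Split on commas, lower-case, and trim BEFORE comparing with column names
  parts <- trimws(strsplit(tolower(train$jobTitle[i]), ",", fixed = TRUE)[[1]])

  for (p in parts[nzchar(parts)]) {
    if (!(p %in% names(train))) {
      train[[p]] <- 0   # new indicator column, 0 in every row
    }
    train[i, p] <- 1    # mark the current row, whether the column is new or not
  }
}
```

Because each new column starts at 0, the final `train[is.na(train)] <- 0` step is not needed in this version, and the `NA` job titles are left untouched.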

I think you are trying to count the frequency of each value in comma-separated strings?

    s<-data.frame(A=c("A1,B", "A2,C1"),B=c("B1,B2","C1,A1"), C=c("C1,C2,C3","C4"))
    #      A     B        C
    #1  A1,B B1,B2 C1,C2,C3
    #2 A2,C1 C1,A1       C4
    table( unlist(apply(s,1, function(s.row) {
       strsplit(s.row,",")
    })) )
    #A1 A2  B B1 B2 C1 C2 C3 C4 
    #2  1  1  1  1  3  1  1  1
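
If what is needed is one 0/1 column per title for each row, rather than overall counts, the same `strsplit` plus `table` idea can be extended by pairing each piece with its row number. This is only a sketch under assumptions: the `jobTitle` vector below is made-up sample data, and the names `tokens`, `row_id` and `indicators` are illustrative.

```r
# Hypothetical sample of the jobTitle column
jobTitle <- c(NA, "Founder and President", "Product Manager",
              "Functional Architect, Business Technology",
              "Founder and President", NA, "Product Manager")

# One vector of trimmed pieces per row; NA rows end up empty
tokens <- lapply(strsplit(jobTitle, ",", fixed = TRUE), function(x) {
  x <- trimws(x)
  x[!is.na(x) & nzchar(x)]
})

# Pair every piece with its row number; the factor levels keep empty rows
row_id <- factor(rep(seq_along(tokens), lengths(tokens)),
                 levels = seq_along(tokens))

# Cross-tabulate rows against titles, then cap counts at 1 (presence only)
indicators <- table(row = row_id, title = unlist(tokens))
indicators[indicators > 1] <- 1
```

`indicators` has one row per observation and one column per distinct title. If a data frame is preferred, `as.data.frame.matrix(indicators)` should convert it, and the result can be bound back onto the original data with `cbind`.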
