R: How to compare a string against a list of existing column names



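I need to write R code that does the following:

  • Loops through a column
  • Splits each value on commas and assigns the pieces to a variable
  • Compares the values in that variable against the existing column names
  • If a column name does not exist, creates a new column for each comma-separated value
  • Fills the current observation of that new column with 1
  • If the column name already exists, puts 1 in the existing column with that name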

Before any manipulation, the data (column) looks like this:

                                     jobTitle
1                                        <NA>
2                                        <NA>
3                                        <NA>
4   Functional Architect, Business Technology
5                                        <NA>
6                                        <NA>
7                                        <NA>
8                                        <NA>
9                                        <NA>
10                      Founder and President
11                            Product Manager
12                                       <NA>
13                                       <NA>
14                                       <NA>
15 Head of Customer Experience & Online Sales
16                                       <NA>
17                                       <NA>
18                      Founder and President
19                                       <NA>
20                                       <NA>
21                            Product Manager
22                                       <NA>
23                     Customer Value Manager
24                                       <NA>
25                    Lead Software Developer
  ...

The output I need is:

Founder and President  Product Manager
       0                       1        
       1                       0      
       0                       1
       1                       0

The output I get is:

Founder and President  Product Manager  Founder and President  Product Manager
       0                       1                   0                 0      
       1                       0                   0                 0     
       0                       0                   1                 0      
       0                       0                   0                 1

The code I have is:

library(plyr)
library(stringr)
library(gdata) 
library(readxl)
train <- read_excel("data.xlsx")
#looping through the jobTitle column
for(i in seq_len(nrow(train))){ 
        if (!is.na(train[i,4])) {
            #split every value by the comma, convert to lower case
            list2char <- strsplit(tolower(train$jobTitle[i]),",", fixed = TRUE)
            for(j in 1:length(list2char[[1]])) {
                    #populate the current observation for the newly created column with 1
                    if(!(list2char[[1]][j] %in% names(train))){
                            #if the name does not match existing column name, create a new column and assign 1
                            train[i, str_trim(list2char[[1]][j])] <- 1
                    }else{
                            #if the name matches an existing column name, assign 1 to that column
                    }
            }
    }
}
#replace all NAs with 0s
train[is.na(train)] <- 0

I think you are trying to count the frequency of each value in comma-separated strings?

    s<-data.frame(A=c("A1,B", "A2,C1"),B=c("B1,B2","C1,A1"), C=c("C1,C2,C3","C4"))
    #      A     B        C
    #1  A1,B B1,B2 C1,C2,C3
    #2 A2,C1 C1,A1       C4
    table( unlist(apply(s,1, function(s.row) {
       strsplit(s.row,",")
    })) )
    #A1 A2  B B1 B2 C1 C2 C3 C4 
    #2  1  1  1  1  3  1  1  1
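
If what you want is one 0/1 column per distinct value rather than overall counts, the same split-then-tabulate idea can be adapted. The following is only a minimal sketch under some assumptions: the small `train` data frame is made-up sample data standing in for your `jobTitle` column, and the names `tokens`, `levels_all` and `indicators` are hypothetical. Trimming and lower-casing the pieces before using them as column names should help avoid ending up with two columns for the same title.

```r
# Hypothetical sample data standing in for the asker's jobTitle column
train <- data.frame(
  jobTitle = c(NA, "Founder and President", "Product Manager",
               "Functional Architect, Business Technology",
               "Founder and President"),
  stringsAsFactors = FALSE
)

# Split each title on commas, lower-case and trim the pieces;
# rows that are NA end up with no tokens at all
tokens <- lapply(strsplit(tolower(train$jobTitle), ",", fixed = TRUE),
                 function(x) {
                   x <- trimws(x)
                   x[!is.na(x) & nzchar(x)]
                 })

# One column per distinct token, in sorted order
levels_all <- sort(unique(unlist(tokens)))

# Start from an all-zero matrix and set 1 where a row contains the token
indicators <- matrix(0L, nrow = length(tokens), ncol = length(levels_all),
                     dimnames = list(NULL, levels_all))
for (i in seq_along(tokens)) {
  indicators[i, tokens[[i]]] <- 1L
}

# Attach the indicator columns to the original data
result <- cbind(train, indicators)
```

Because every column name is fixed once, before any row is filled in, there is no need to check `names(train)` inside the loop, and rows with `NA` simply stay at 0.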
