R - 将文本字符串解析为多列并提取数据值



我有一个大数据集,格式如下:

ID 得分
1 英语3,法语7,地理8
2 西班牙语7,经典4
3 物理5、英语5、体育7、艺术4

您可以首先将strsplitScores列按主题分数对(在列表中(进行拆分,然后将列表列unnest划分为行。然后separate将受试者得分配对为SubjectScore列。最后将数据从";"长";格式为";宽";总体安排

谢谢@G。Grothendieck改进我的代码:(

library(tidyverse)
df %>% 
separate_rows(Scores, sep = ", ") %>% 
separate(Scores, sep = " ", into = c("Subject", "Score")) %>% 
pivot_wider(names_from = "Subject", values_from = "Score")
# A tibble: 3 × 9
ID English French Geography Spanish Classics Physics PE    Art  
<int> <chr>   <chr>  <chr>     <chr>   <chr>    <chr>   <chr> <chr>
1     1 3       7      8         NA      NA       NA      NA    NA   
2     2 NA      NA     NA        7       4        NA      NA    NA   
3     3 5       NA     NA        NA      NA       5       7     4    

使用数据表

library(data.table)
setDT(dt)
dt <- dt[, .(class_grade = unlist(str_split(Scores, ", "))), by = ID]
dt[, c("class", "grade") := tstrsplit(class_grade, " ")]
dcast(dt, ID ~ class, value.var = c("grade"), sep = "")

结果

#    ID  Art Classics English French Geography   PE Physics Spanish
# 1:  1 <NA>     <NA>       3      7         8 <NA>    <NA>    <NA>
# 2:  2 <NA>        4    <NA>   <NA>      <NA> <NA>    <NA>       7
# 3:  3    4     <NA>       5   <NA>      <NA>    7       5    <NA>

数据

dt <- structure(list(ID = 1:3, Scores = c("English 3, French 7, Geography 8", 
"Spanish 7, Classics 4", "Physics 5, English 5, PE 7, Art 4")), row.names = c(NA, 
-3L), class = c("data.frame"))

最新更新