I have a vector of college basketball matchups:
c("#34 Colorado at #36 California", "#31 Utah at #87 Stanford",
"#26 USC at #112 Wash State", "#56 UCLA at #134 Washington",
"#187 W Illinois at #116 Neb Omaha", "#222 Denver at #58 S Dakota St",
"#245 IUPUI at #170 South Dakota", "#268 Rice at #208 TX El Paso",
"#274 North Texas at #344 TX-San Ant", "#14 Iowa at #3 Purdue"
)
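I want two separate vectors: one for the teams that appear before "at" and one for the teams that appear after "at". For example, the first vector would contain Colorado, Utah, USC, and so on, and the second would contain California, Stanford, Wash State, and so on.
Note that I don't want the # rankings, only the team names. I tried str_split, but it didn't work well because the spacing isn't consistent.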
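We can use strsplit to split on the word "at", which gives two parts per string. From each part we remove the "#" followed by digits, trim the whitespace, and put the result in a data frame.
# string is the character vector from the question
data.frame(t(sapply(strsplit(string, "\\bat\\b"),
                    function(x) trimws(sub("#[0-9]+", "", x)))))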
#            X1           X2
#1     Colorado   California
#2         Utah     Stanford
#3          USC   Wash State
#4         UCLA   Washington
#5   W Illinois    Neb Omaha
#6       Denver  S Dakota St
#7        IUPUI South Dakota
#8         Rice   TX El Paso
#9  North Texas   TX-San Ant
#10        Iowa       Purdue
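Since the question asks for two separate vectors rather than a data frame, the same split can be taken one step further. This is a minimal base R sketch on a subset of the input; the names cleaned, parts, team_before and team_after are illustrative and do not come from the original answer.

```r
string <- c("#34 Colorado at #36 California", "#31 Utah at #87 Stanford",
            "#26 USC at #112 Wash State", "#14 Iowa at #3 Purdue")

# Drop each "#<digits>" ranking together with the whitespace that follows it.
cleaned <- trimws(gsub("#[0-9]+\\s*", "", string))

# Split on "at" surrounded by whitespace, so "State" is never split.
parts <- strsplit(cleaned, "\\s+at\\s+")

# Take the first and second piece of every split as plain character vectors.
team_before <- vapply(parts, `[`, character(1), 1)
team_after  <- vapply(parts, `[`, character(1), 2)
```

Using vapply instead of sapply makes the code fail loudly if a string ever lacks one of the two pieces.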
Or use tidyr::separate:
# split on "at" together with its surrounding whitespace so no stray spaces remain
tidyr::separate(data.frame(col = trimws(gsub("#[0-9]+", "", string))),
                col, into = c("T1", "T2"), sep = "\\s+at\\s+")
#            T1           T2
#1     Colorado   California
#2         Utah     Stanford
#3          USC   Wash State
#4         UCLA   Washington
#5   W Illinois    Neb Omaha
#6       Denver  S Dakota St
#7        IUPUI South Dakota
#8         Rice   TX El Paso
#9  North Texas   TX-San Ant
#10        Iowa       Purdue
Another solution, using str_extract_all():
df <- data.frame(stringsAsFactors = FALSE,
text = c("#34 Colorado at #36 California", "#31 Utah at #87 Stanford",
"#26 USC at #112 Wash State", "#56 UCLA at #134 Washington",
"#187 W Illinois at #116 Neb Omaha", "#222 Denver at #58 S Dakota St",
"#245 IUPUI at #170 South Dakota", "#268 Rice at #208 TX El Paso",
"#274 North Texas at #344 TX-San Ant", "#14 Iowa at #3 Purdue")
)
library(stringr)
library(dplyr)
df %>%
  mutate(team_a = str_extract_all(text, "(?<=\\s).+(?=\\s+at)"),
         team_b = str_extract_all(text, "(?<=\\d\\s)[^\\d]+$"))
#>                                   text      team_a       team_b
#> 1       #34 Colorado at #36 California    Colorado   California
#> 2             #31 Utah at #87 Stanford        Utah     Stanford
#> 3           #26 USC at #112 Wash State         USC   Wash State
#> 4          #56 UCLA at #134 Washington        UCLA   Washington
#> 5    #187 W Illinois at #116 Neb Omaha  W Illinois    Neb Omaha
#> 6       #222 Denver at #58 S Dakota St      Denver  S Dakota St
#> 7      #245 IUPUI at #170 South Dakota       IUPUI South Dakota
#> 8         #268 Rice at #208 TX El Paso        Rice   TX El Paso
#> 9  #274 North Texas at #344 TX-San Ant North Texas   TX-San Ant
#> 10               #14 Iowa at #3 Purdue        Iowa       Purdue
Created on 2019-03-29 by the reprex package (v0.2.1)
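We can also do this in base R by replacing the "at" separator with a comma, removing the rankings from the "text" column, and reading the result with read.csv:
read.csv(text = trimws(gsub("#\\d+\\s*", "", gsub("\\s+at\\s+", ",", df$text))),
         header = FALSE, col.names = c("T1", "T2"), stringsAsFactors = FALSE)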
#            T1           T2
#1     Colorado   California
#2         Utah     Stanford
#3          USC   Wash State
#4         UCLA   Washington
#5   W Illinois    Neb Omaha
#6       Denver  S Dakota St
#7        IUPUI South Dakota
#8         Rice   TX El Paso
#9  North Texas   TX-San Ant
#10        Iowa       Purdue
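To see why read.csv can parse the result, it may help to look at the intermediate strings. The following is a step-by-step sketch of the same idea on two of the inputs; the names x, with_commas, csv_lines and teams are illustrative only.

```r
x <- c("#34 Colorado at #36 California", "#26 USC at #112 Wash State")

# Step 1: turn the "at" separator (and its surrounding whitespace) into a comma.
with_commas <- gsub("\\s+at\\s+", ",", x)

# Step 2: drop each "#<digits>" ranking and the whitespace after it.
csv_lines <- gsub("#\\d+\\s*", "", with_commas)

# Step 3: each element is now one comma-separated line that read.csv can parse.
teams <- read.csv(text = csv_lines, header = FALSE,
                  col.names = c("T1", "T2"), stringsAsFactors = FALSE)
```

This approach assumes that no team name contains a comma; if one did, read.csv would split it into an extra field.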