Remove all characters before and after a piece of text in R, then create columns from the new text



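I have a string that I am trying to parse, and I want to create 3 columns from the data I extract. From what I have seen, stringr does not really cover this case, and the gsub approach I have used so far is excessive: it involves creating several columns, parsing from those new columns, and then deleting them, which seems really inefficient.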

The format is as follows:

"blah, grabbed by ???-??-?????."

I need this:

???-??-?????

I used placeholders here, but this is what the strings usually look like:

"blah, grabbed by PHI-80-J.Matthews."

"blah, grabbed by NE-5-J.Mills."

Sometimes there is additional text after the name, like this:

"blah, grabbed by KC-10-T.Hill. Blah blah blah."

This is the end result I want:

Place Number Name
PHI   80     J.Matthews
NE    5      J.Mills
KC    10     T.Hill

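This solution simply uses extract on the components, following the logic the OP described: it captures the required characters as three groups. 1) One or more uppercase letters ([A-Z]+) followed by a dash (-), 2) then one or more digits (\\d+), and finally 3) the non-whitespace characters (\\S+) after the second dash, up to a literal dot.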

library(tidyr)
# Backslashes must be doubled inside R string literals
extract(df1, col1, into = c("Place", "Number", "Name"), 
    ".*grabbed by\\s([A-Z]+)-(\\d+)-(\\S+)\\..*", convert = TRUE)

-output

# A tibble: 4 x 3
Place Number Name      
<chr>  <int> <chr>     
1 PHI       80 J.Matthews
2 NE         5 J.Mills   
3 KC        10 T.Hill    
4 KC        10 T.Hill    

Or in base R

# Keep only the Place-Number-Name token, then split it on the dashes
read.table(text = sub(".*grabbed by\\s((\\w+-){2}\\S+)\\..*", "\\1", 
    df1$col1), header = FALSE, col.names = c("Place", "Number", "Name"), sep = '-')
Place Number       Name
1   PHI     80 J.Matthews
2    NE      5    J.Mills
3    KC     10     T.Hill
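
To see what each capture group in the pattern picks up, the groups can be inspected directly with `regexec()` and `regmatches()`. This is only a minimal base R sketch on the three example strings from the question; the variable names (`x`, `parts`, `result`) are illustrative, and `perl = TRUE` is an assumption made here so that the matching is done by PCRE.

```r
# Example strings taken from the question
x <- c("blah, grabbed by PHI-80-J.Matthews.",
       "blah, grabbed by NE-5-J.Mills.",
       "blah, grabbed by KC-10-T.Hill. Blah blah blah.")

# Same pattern as above: three capture groups after "grabbed by "
pattern <- ".*grabbed by\\s([A-Z]+)-(\\d+)-(\\S+)\\..*"

# Each list element holds the full match followed by the three groups
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Drop the full match and stack the groups into a character matrix
parts <- do.call(rbind, lapply(m, function(v) v[-1]))
colnames(parts) <- c("Place", "Number", "Name")

result <- data.frame(Place  = parts[, "Place"],
                     Number = as.integer(parts[, "Number"]),
                     Name   = parts[, "Name"],
                     stringsAsFactors = FALSE)
print(result)
```

Converting `Number` with `as.integer()` plays the role that `convert = TRUE` plays in the tidyr call.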

Data
df1 <- structure(list(col1 = c("blah, grabbed by PHI-80-J.Matthews.", 
"blah, grabbed by NE-5-J.Mills.", "blah, grabbed by KC-10-T.Hill. Blah blah blah.", 
"Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))

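This solution does what you actually describe in the title: it first removes the text around the target substring, and then splits that substring into columns: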

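library(dplyr)   # mutate() and %>%
library(tidyr)
library(stringr)
df1 %>%
  # keep only the Place-Number-Name token
  mutate(col1 = str_extract(col1, "\\w+-\\w+-\\w\\.\\w+")) %>%
  # split the token on the dashes
  separate(col1, 
           into = c("Place", "Number", "Name"), 
           sep = "-")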
# A tibble: 3 x 3
Place Number Name      
<chr> <chr>  <chr>     
1 PHI   80     J.Matthews
2 NE    5      J.Mills   
3 KC    10     T.Hill 

Here we take advantage of the fact that the character class \\w matches letters regardless of case as well as digits (and the underscore).
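
As a quick check of that claim, and of how the pattern behaves when a string contains more than one player token (as in the fourth row of the sample data), here is a small base R sketch. It is not part of the answer above: `regexpr()` stands in for `str_extract()` because both return the first match per string, and the lookbehind variant is just one possible way to anchor the match to "grabbed by ".

```r
# \w matches letters, digits and the underscore, but not a dash
word_only <- grepl("^\\w+$", c("abc", "A_1", "a-b"), perl = TRUE)

# Two strings from the sample data; the second holds several player tokens
x <- c("blah, grabbed by KC-10-T.Hill. Blah blah blah.",
       "Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr")

# First match per string, which is also what str_extract() returns
pattern <- "\\w+-\\w+-\\w\\.\\w+"
first_match <- regmatches(x, regexpr(pattern, x, perl = TRUE))

# Lookbehind (requires perl = TRUE): only accept a token right after "grabbed by "
anchored <- "(?<=grabbed by )\\w+-\\w+-\\w\\.\\w+"
anchored_match <- regmatches(x, regexpr(anchored, x, perl = TRUE))

print(word_only)
print(first_match)
print(anchored_match)
```

On the second string the unanchored pattern picks up the first token in the sentence rather than the one after "grabbed by", so the anchored form is the safer choice when rows can mention several players.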

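Here is another approach using sub with the regex "([A-Za-z]+\\.[A-Za-z]+).*" and the replacement "\\1", which removes everything from the second dot onward. The first separate splits the string at " by ", and a second separate then splits the remainder on the dashes to get the required columns.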

library(dplyr)
library(tidyr)
df1 %>% 
mutate(test1 = sub("([A-Za-z]+\\.[A-Za-z]+).*", "\\1", col1)) %>% 
separate(test1, c('remove', 'keep'), sep = " by ") %>% 
separate(keep, c("Place", "Number", "Name"), sep = "-") %>% 
select(Place, Number, Name)

Output:

Place Number Name      
<chr> <chr>  <chr>     
1 PHI   80     J.Matthews
2 NE    5      J.Mills   
3 KC    10     T.Hill
