Split a character vector into columns on multiple spaces or tabs (rather than a single space)



I have a data frame with a single column whose contents were extracted from a messy PDF table:

my_df <- structure(list(value = c("Jon         Doe          Managing Director                                           My Company                                                   Elk View            IL      (312) 726-1578      email5@email.com", 
"John        Smith           Director                                                    Acme Corp                       Springfield          IA      (111) 111-1111      email1@email.com", 
"Mike          Jones           Manager              MyCo inc                                                        Jonestown        MN      (111) 111-1111      email2@email.com", 
"Dorothy       Baker          CEO                                           Our Company Inc                                              Philadelphia       PA      (111) 111-111      email3@email.com"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))

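I am trying to split it into multiple columns. Some values, such as job titles and phone numbers, contain spaces themselves, so I need to split on multiple spaces or tabs.

If possible, I would like to use separate from the tidyverse, so the basic code might look like: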

my_df |>
  separate(
    value,
    c(
      "First Name",
      "Last Name",
      "Job Title",
      "Company Name",
      "City",
      "State",
      "Phone Number",
      "Email Address"
    )
  )

I am just stuck on which regex or option to use. I have seen solutions here for other languages, but none for R. Thanks.

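library(tidyr)

my_df %>%
  separate(
    value,
    into = c(
      "First Name",
      "Last Name",
      "Job Title",
      "Company Name",
      "City",
      "State",
      "Phone Number",
      "Email Address"
    ),
    # split on runs of two or more whitespace characters;
    # the backslash must be doubled inside an R string
    sep = "\\s{2,}"
  )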
# A tibble: 4 × 8
`First Name` `Last Name` `Job Title`       `Company Name`  City         State `Phone Number` `Email Address` 
<chr>        <chr>       <chr>             <chr>           <chr>        <chr> <chr>          <chr>           
1 Jon          Doe         Managing Director My Company      Elk View     IL    (312) 726-1578 email5@email.com
2 John         Smith       Director          Acme Corp       Springfield  IA    (111) 111-1111 email1@email.com
3 Mike         Jones       Manager           MyCo inc        Jonestown    MN    (111) 111-1111 email2@email.com
4 Dorothy      Baker       CEO               Our Company Inc Philadelphia PA    (111) 111-111  email3@email.com

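The regex here is quite simple: because the elements are separated by several whitespace characters, the split pattern only needs to match at least two, but possibly more, whitespace characters. Conveniently, the "words" that form a single unit, such as "My Company", have only one space between them, so the split pattern does not match there.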
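One caveat: the question also mentions tabs, and a field separated by a single tab would not be matched by a pattern that requires two or more whitespace characters. A minimal sketch of one way to cover that case, using base R's strsplit on hypothetical sample lines (the second line, with tab separators, is an assumption and is not part of the original data):

```r
# Hypothetical sample lines; the second uses single tabs as field separators.
x <- c(
  "Jon   Doe   Managing Director   My Company",
  "John\tSmith\tDirector\tAcme Corp"
)

# Split on a run of two or more whitespace characters, or on a single tab.
# The longer alternative comes first so that a run such as "\t\t" is
# consumed as one separator rather than producing an empty field.
pattern <- "\\s{2,}|\\t"

parts <- strsplit(x, pattern, perl = TRUE)
print(parts)
```

The same pattern string could presumably be passed as sep to separate, since that argument is also interpreted as a regular expression.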
