I have a data frame with a single column whose contents were extracted from a messy PDF table:
my_df <- structure(list(value = c("Jon  Doe  Managing Director  My Company  Elk View  IL  (312) 726-1578  email5@email.com",
"John  Smith  Director  Acme Corp  Springfield  IA  (111) 111-1111  email1@email.com",
"Mike  Jones  Manager  MyCo inc  Jonestown  MN  (111) 111-1111  email2@email.com",
"Dorothy  Baker  CEO  Our Company Inc  Philadelphia  PA  (111) 111-111  email3@email.com"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
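I'm trying to split it into multiple columns. Some values, such as the job title and phone number, contain spaces themselves, so I need to split on runs of multiple spaces or on tabs.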
If possible, I'd like to use separate from the tidyverse, so the basic code might look like:
my_df |>
  separate(
    value,
    c(
      "First Name",
      "Last Name",
      "Job Title",
      "Company Name",
      "City",
      "State",
      "Phone Number",
      "Email Address"
    )
  )
I'm just stuck on which regex or option to use. I've seen solutions for other languages here, but none for R. Thanks.
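library(tidyverse)  # loads tidyr (separate) and the %>% pipe

my_df %>%
  separate(value,
           into = c(
             "First Name",
             "Last Name",
             "Job Title",
             "Company Name",
             "City",
             "State",
             "Phone Number",
             "Email Address"
           ),
           # split on runs of two or more whitespace characters;
           # the backslash must be doubled inside an R string
           sep = "\\s{2,}")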
# A tibble: 4 × 8
  `First Name` `Last Name` `Job Title`       `Company Name`  City         State `Phone Number` `Email Address`
  <chr>        <chr>       <chr>             <chr>           <chr>        <chr> <chr>          <chr>
1 Jon          Doe         Managing Director My Company      Elk View     IL    (312) 726-1578 email5@email.com
2 John         Smith       Director          Acme Corp       Springfield  IA    (111) 111-1111 email1@email.com
3 Mike         Jones       Manager           MyCo inc        Jonestown    MN    (111) 111-1111 email2@email.com
4 Dorothy      Baker       CEO               Our Company Inc Philadelphia PA    (111) 111-111  email3@email.com
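The regex here is very simple: because the elements are separated by multiple whitespace characters, the split pattern only needs to match at least two (but possibly more) whitespace characters. Conveniently, the "words" that together form a single unit, such as "My Company", have only one space between them, so the split pattern does not match there.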
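To see the pattern in isolation, here is a minimal sketch using base R's strsplit on a single row. The sample string and its exact spacing (two spaces between fields) are assumptions made for illustration; the real PDF extract may use longer runs of spaces, which the same pattern also matches.

```r
# Hypothetical sample row: fields are assumed to be separated by two spaces,
# while words inside a field are separated by a single space.
x <- "Jon  Doe  Managing Director  My Company  Elk View  IL  (312) 726-1578  email5@email.com"

# Split on runs of two or more whitespace characters.
# The backslash is doubled because it sits inside an R string literal.
fields <- strsplit(x, "\\s{2,}")[[1]]

print(fields)
# [1] "Jon"               "Doe"               "Managing Director"
# [4] "My Company"        "Elk View"          "IL"
# [7] "(312) 726-1578"    "email5@email.com"
```

Single spaces, as in "Managing Director" or "(312) 726-1578", are left alone because the pattern requires at least two consecutive whitespace characters. Note that a lone tab between two fields would not be matched by this pattern either; if the data contains such tabs, an alternation like "\t|\\s{2,}" may be worth trying.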