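I would also accept a pandas solution; my company is not keen on using R.
I have been handed a nightmare of a dataset and need some help transforming it with tidyr in R.
Example df record: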
id date people things
12 12/12/12 last, first [id124] last, first middle [id1782] thing 1\nthing 2\nthing 3\n  thing 4\nthing 5
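I need to split the people apart by their IDs, then split the things and match them to the people. The things are listed in the same order as the people, and the groups belonging to different people are separated by a newline followed by extra spaces ("\n  ").
Desired final result: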
id date people things
12 12/12/12 last, first [id124] thing 1
12 12/12/12 last, first [id124] thing 2
12 12/12/12 last, first [id124] thing 3
12 12/12/12 last, first middle [id1782] thing 4
12 12/12/12 last, first middle [id1782] thing 5
I have not managed an attempt good enough to even share here.
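We could use a double cSplit: first split at the ] followed by a space, or (|) at a newline (\n) that is followed by more than one space (\s{2,}). In the "long" format this returns, do a second split on the "things" column at the newline and, if needed, restore the ] in "people" that was removed by the split (regex lookaround does not seem to work with cSplit).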
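library(splitstackshape)
library(dplyr)
library(stringr)
# 1st split: both columns at "] " or at a newline followed by 2+ whitespace characters
cSplit(df1, c("people", "things"), sep = '\\] |\n\\s{2,}', direction = 'long',
       fixed = FALSE) %>%
  # 2nd split: one row per thing
  cSplit("things", sep = "\n", direction = "long") %>%
  # put back the "]" that the first split consumed
  mutate(people = str_replace(people, "(\\d+)$", "\\1]"))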
-output
# id date people things
#1: 12 12/12/12 last, first [id124] thing 1
#2: 12 12/12/12 last, first [id124] thing 2
#3: 12 12/12/12 last, first [id124] thing 3
#4: 12 12/12/12 last, first middle [id1782] thing 4
#5: 12 12/12/12 last, first middle [id1782] thing 5
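To see what the first split pattern does to each column, here is a small illustration in Python. It uses Python's re module rather than cSplit, so it only demonstrates the regex, not the R call; the variable names are mine, and the sample strings assume the newline between the two groups is followed by two spaces.

```python
import re

# Same alternation as the first cSplit call:
# "] " (bracket + space) OR a newline followed by 2+ whitespace characters.
FIRST_SPLIT = re.compile(r"\] |\n\s{2,}")

people = "last, first [id124] last, first middle [id1782]"
things = "thing 1\nthing 2\nthing 3\n  thing 4\nthing 5"

# The "] " separator is consumed, so the first person loses the closing bracket.
people_parts = FIRST_SPLIT.split(people)

# Only the newline followed by extra spaces matches, so things fall into
# one group per person; plain newlines inside a group are left alone.
things_parts = FIRST_SPLIT.split(things)

print(people_parts)
print(things_parts)
```

The first element of people_parts ends in "id124" without its bracket, which is why the answer adds the str_replace step at the end.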
data
df1 <- structure(list(id = 12L, date = "12/12/12",
    people = "last, first [id124] last, first middle [id1782]",
    things = "thing 1\nthing 2\nthing 3\n  thing 4\nthing 5"),
    row.names = c(NA, -1L), class = "data.frame")
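Since the question also allows a pandas route, the same idea can be sketched in Python. This is not a port of the cSplit code, only a minimal sketch under the same assumptions: every person ends with a closing bracket, and each person's group of things is separated from the next by a newline followed by at least two spaces. The function name split_people_things and the dict-based record are hypothetical, chosen for illustration.

```python
import re


def split_people_things(record):
    """Expand one record into one row per (person, thing) pair."""
    # Split after each closing bracket; the lookbehind keeps the "]".
    people = re.split(r"(?<=\])\s+", record["people"].strip())
    # Groups of things are separated by a newline plus 2+ spaces or tabs.
    groups = re.split(r"\n[ \t]{2,}", record["things"])
    if len(people) != len(groups):
        raise ValueError("people and thing groups do not line up")

    rows = []
    for person, group in zip(people, groups):
        # Within a group, each thing sits on its own line.
        for thing in group.split("\n"):
            thing = thing.strip()
            if thing:
                rows.append({"id": record["id"], "date": record["date"],
                             "people": person, "things": thing})
    return rows


record = {
    "id": 12,
    "date": "12/12/12",
    "people": "last, first [id124] last, first middle [id1782]",
    "things": "thing 1\nthing 2\nthing 3\n  thing 4\nthing 5",
}
rows = split_people_things(record)
for row in rows:
    print(row)
```

The list of dicts can be handed to pandas.DataFrame(rows); for a whole frame, one option is to loop over df.to_dict("records") and collect the rows from every record before building the new frame. The length check is there because a mismatch between people and groups would otherwise silently pair things with the wrong person.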