I have the following dataset:
id = 1:5
col1 = c("john", "henry", "adam", "jenna", "peter")
col2 = c("river B8C 9L4", "Field U9H 5E2 PP", "NA", "ocean A1B 5H1 dd", "dave")
col3 = c("matt", "steve", "forest K0Y 1U9 hu2", "NA", "NA")
col4 = c("Phone: 111 1111 111", "Phone: 222 2222", "Phone: 333 333 1113", "Phone: 444 111 1153", "Phone: 111 111 1121")
my_data = data.frame(id, col1, col2, col3, col4)
  id  col1             col2               col3                col4
1  1  john    river B8C 9L4               matt Phone: 111 1111 111
2  2 henry Field U9H 5E2 PP              steve     Phone: 222 2222
3  3  adam               NA forest K0Y 1U9 hu2 Phone: 333 333 1113
4  4 jenna ocean A1B 5H1 dd                 NA Phone: 444 111 1153
5  5 peter             dave                 NA Phone: 111 111 1121
I found this regex code, which identifies the following pattern (it can then be wrapped into a function):
apply(my_data, 1, function(x) gsub('(([A-Z] ?[0-9]){3})|.', '\\1', toString(x)))
[1] "B8C 9L4" "U9H 5E2" "K0Y 1U9" "A1B 5H1" ""
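To make the behaviour of that pattern concrete, here is a minimal sketch on a single string. The variable names `s`, `via_gsub` and `via_regmatches` are only illustrative, and the `regmatches()` call is an alternative I am assuming is acceptable here, not part of the code I found.

```r
# One row of the data, collapsed the way toString() would collapse it
s <- "1, john, river B8C 9L4, matt, Phone: 111 1111 111"

# First alternative: letter, optional space, digit, three times (captured as \\1).
# Second alternative: any other single character, replaced by an empty group.
via_gsub <- gsub("(([A-Z] ?[0-9]){3})|.", "\\1", s)

# Alternative: pull out the first match directly instead of deleting the rest
via_regmatches <- regmatches(s, regexpr("([A-Z] ?[0-9]){3}", s))

print(via_gsub)
print(via_regmatches)
```

Both calls keep only the matched code, not the cell that contains it.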
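Once this is done, is there a way to extend this code so that:
- once a row/column matching the regex condition is identified, the entire content of that cell is extracted?
For example:
[1] "river B8C 9L4" "Field U9H 5E2 PP" "forest K0Y 1U9 hu2" "ocean A1B 5H1 dd"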
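One option is to loop over the rows, drop the elements that are "NA" or that contain the substring "Phone", and then keep the first remaining element that has more than one word (counted with str_count):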
library(stringr)
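na.omit(apply(my_data[-1], 1, function(x) {
  x <- x[x != "NA"]                    # drop the literal "NA" strings
  x1 <- x[!str_detect(x, "Phone")]     # drop the phone-number cells
  x1[str_count(x1, "\\w+") > 1][1]     # first remaining cell with more than one word
}))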
Output:
[1] "river B8C 9L4" "Field U9H 5E2 PP"
[3] "forest K0Y 1U9 hu2" "ocean A1B 5H1 dd"
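The word-count approach above does not use the pattern from the question at all, so it would also pick up any other multi-word cell. As a hedged alternative, the sketch below selects cells with the question's own regex and returns the whole cell. It is base R only; `first_match` and `res` are hypothetical names introduced for illustration, and it assumes at most one matching cell per row is wanted.

```r
# Same sample data as in the question
id <- 1:5
col1 <- c("john", "henry", "adam", "jenna", "peter")
col2 <- c("river B8C 9L4", "Field U9H 5E2 PP", "NA", "ocean A1B 5H1 dd", "dave")
col3 <- c("matt", "steve", "forest K0Y 1U9 hu2", "NA", "NA")
col4 <- c("Phone: 111 1111 111", "Phone: 222 2222", "Phone: 333 333 1113",
          "Phone: 444 111 1153", "Phone: 111 111 1121")
my_data <- data.frame(id, col1, col2, col3, col4)

# Pattern from the question: letter, optional space, digit, three times
pattern <- "([A-Z] ?[0-9]){3}"

# Hypothetical helper: full content of the first cell in a row that matches,
# or NA when no cell in the row matches
first_match <- function(x, pattern) {
  hit <- grep(pattern, x, value = TRUE)
  if (length(hit) > 0) hit[[1]] else NA_character_
}

res <- apply(my_data[-1], 1, first_match, pattern = pattern)
res <- unname(res[!is.na(res)])   # drop rows without a match
print(res)
```

On this data the phone column never matches, because the pattern requires an uppercase letter directly followed (optionally after one space) by a digit, so no extra filter on "Phone" is needed.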