I have a data frame named df_parse that looks like this:
error error_position
1 0
2 0
3 0
4 1 24 - 26
5 1 29 - 30
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
21 0
22 0
23 1 78 - 78
24 0
25 1 83 - 84
26 0
27 0
28 0
29 1 92 - 92
30 1 95 - 95
31 0
32 0
33 0
34 0
35 0
36 0
37 1 111 - 113
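I want to find where the strings in the error_position column occur in my raw data (a file read in as a character vector). Here is a sample of the raw data: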
HUBUSL1 2 ENTER LINE NUMBER 81 - 82
FOR HUBUS = 1 VALID ENTRIES
83 - 84
VALID ENTRIES
1 MIN VALUE
99 MAX VALUE
HUBUSL3 2 See BUSL1 85 - 86
VALID ENTRIES
1 MIN VALUE
99 MAX VALUE
HUBUSL4 2 See BUSL1 87 - 88
VALID ENTRIES
1 MIN VALUE
99 MAX VALUE
A2. GEOGRAPHIC INFORMATION
GEREG 2 REGION 89 - 90
EDITED UNIVERSE: ALL HHLD's IN SAMPLE VALID ENTRIES
1 NORTHEAST
2 MIDWEST (FORMERLY NORTH CENTRAL)
3 SOUTH
4 WEST
GEDIV 1 DIVISION 91 - 91
EDITED UNIVERSE: ALL HHLD's IN SAMPLE VALID ENTRIES
92 – 92
GESTFIPS 2 FEDERAL INFORMATION 93 - 94
PROCESSING STANDARDS (FIPS) STATE CODE
For example, in the error_position column of the df_parse data frame, row 25 ("83 - 84") matches row 5 of the raw file:
FOR HUBUS = 1 VALID ENTRIES
83 - 84
and "92 - 92" matches the end of the sample raw data file:
92 – 92
GESTFIPS 2 FEDERAL INFORMATION 93 - 94
I wrote a for loop that uses grep to return, for each pattern value in "error_position", the position of the matching element in the raw data vector.
results1<- vector(mode = "character", length = length(df_parse$error)) #empty vector
for(i in seq_along(df_parse$error)){
results1[i]<- ifelse(df_parse$error[i] == 1, grep(pattern = paste(df_parse$error_position[i]), x = raw, value = FALSE), "")
}
results1
Here is a sample of the results:
[1] "" "" "" "37" "95" "" "" "" "" "" "" "" "" ""
[15] "" "" "" "" "" "" "" "" "288" "" "298" "" "" ""
[29] NA "381" "" "" "" "" "" "" "444" "" "" "" "" ""
[43] "" "532" "" "551" "" "" "" "" NA "" "" "677" "" ""
[57] "712" "" "" "" "" "" "" "" "838" "" "" "" "" ""
[71] "" "" NA "" "" "" "" "991" "" "" "" "" "" ""
[85] "" "" NA "1140" "" "1158" "" "" "" "" "" "" "" ""
[99] "" "" "" "" "" "1283" "" "" "" NA "" "" "" ""
[113] "" "" "" "" "" "" "" "" "" "" "" "" "" "1658"
[127] "" "" "" NA "" "" "1749" "" "" "" "" "" "" ""
[141] "1824" "" "" "" "" "" "" "" "" "" "" "" "" ""
[155] "" "" "2065" "" "" "2109" "" "" "" "" "" "2161" "" ""
[169] "" NA "" "" "" "" "" "" "" "" "2344" "" "" ""
This is the result I want, because it tells me where every pattern match occurs in the raw data file, but I noticed that it contains NAs.
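I found that the NAs appear because some of the number ranges are separated by a long dash (em dash) instead of a hyphen. For example, the raw data contains "92 – 92" (with a long dash/em dash), while my grep, which is built from the error_position column, looks for a regular hyphen, as in "24 - 26".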
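To make the mismatch concrete, here is a minimal sketch. It assumes the long dash in the raw file is an en dash (U+2013); it could equally be an em dash (U+2014), and `raw_line` is a hypothetical stand-in for one element of the real raw vector. The last line shows why the loop reports NA rather than an empty result: `ifelse()` needs a length-one value for `yes`, and a zero-length `grep()` result gets padded to NA.

```r
hyphen  <- "-"        # ASCII hyphen-minus
en_dash <- "\u2013"   # en dash (assumed to be what the raw file contains)
em_dash <- "\u2014"   # em dash

# The three characters have different code points, so they never match each other
utf8ToInt(hyphen)    # 45
utf8ToInt(en_dash)   # 8211
utf8ToInt(em_dash)   # 8212

# Hypothetical raw line that uses the en dash
raw_line <- paste0("92 ", en_dash, " 92")

# A pattern built with a plain hyphen finds nothing
hits <- grep("92 - 92", raw_line)
length(hits)         # 0

# ifelse() turns the zero-length result into NA
na_result <- ifelse(TRUE, hits, "")
is.na(na_result)     # TRUE
```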
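I tried to troubleshoot this by searching for the long dash/em dash instead, but it still returns NA. For example, I know that element 29 of my loop results is NA because the loop searches the raw data vector for "92 - 92" (hyphen) rather than "92 – 92" (long dash/em dash).
My problem: even when I simply search the raw data file for the "92 – 92" value directly, it returns NA, or more precisely integer(0).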
Some of my attempts:
grep(pattern = "92 – 92", x = raw, value = FALSE) == integer(0)
grep(pattern = paste(df_parse$error_position[29]), x = raw, value = FALSE) == integer(0)
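When a typed dash still fails to match, one way to check what the file really contains is to look at the code points of the suspicious line. The sketch below is only an illustration: `raw_demo` is a hypothetical stand-in for the real `raw` vector, and it assumes the text is valid UTF-8. Any value above 127 is a non-ASCII character, for instance an en dash (8211), an em dash (8212), or a non-breaking space (160).

```r
# Hypothetical stand-in for the real `raw` character vector
raw_demo <- c("GEDIV 1 DIVISION 91 - 91", "92 \u2013 92")

# Flag elements that contain at least one non-ASCII character
has_non_ascii <- vapply(
  raw_demo,
  function(s) any(utf8ToInt(s) > 127L),
  logical(1),
  USE.NAMES = FALSE
)
non_ascii_idx <- which(has_non_ascii)

# Code points of the non-ASCII characters in the first flagged element
cp <- utf8ToInt(raw_demo[non_ascii_idx[1]])
odd_chars <- cp[cp > 127L]
odd_chars
```

Once the code point is known, the same character can be written in a pattern with a `\u` escape, which avoids depending on how the dash was typed or pasted.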
I would appreciate any suggestions. Thanks.
You could make the pattern match the long dash in addition to the regular hyphen.
You can try:
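for (i in seq_along(df_parse$error)) {
  results1[i] <- if (df_parse$error[i] == 1) {
    # Let the hyphen in the pattern match either a hyphen or a long dash
    pat <- sub('-', '[-–]', df_parse$error_position[i])
    res <- grep(pat, raw)
    # Keep the first match, or "" when nothing matches
    if (length(res)) res[1] else ""
  } else ""
}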
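For anyone who wants to try the idea without the original files, here is a self-contained sketch on toy data. `df_demo`, `raw_demo`, and `find_first` are hypothetical names made up for this example, and the character class also includes the em dash (U+2014) in case the file uses that instead of the en dash. It searches only the rows where error is 1 and leaves the others as "".

```r
# Toy versions of the data frame and the raw character vector
df_demo <- data.frame(
  error = c(0, 1, 1, 1),
  error_position = c("", "83 - 84", "92 - 92", "999 - 999"),
  stringsAsFactors = FALSE
)
raw_demo <- c(
  "FOR HUBUS = 1 VALID ENTRIES",
  "83 - 84",
  "VALID ENTRIES",
  "92 \u2013 92"   # en dash, as in the problematic raw lines
)

# Return the index of the first matching element as text, or "" if none
find_first <- function(pos, x) {
  # Replace the hyphen with a class matching hyphen, en dash, or em dash
  pat <- sub("-", "[-\u2013\u2014]", pos)
  hits <- grep(pat, x)
  if (length(hits)) as.character(hits[1]) else ""
}

results_demo <- character(nrow(df_demo))
err_rows <- which(df_demo$error == 1)
results_demo[err_rows] <- vapply(
  df_demo$error_position[err_rows],
  find_first,
  character(1),
  x = raw_demo,
  USE.NAMES = FALSE
)
results_demo
```

Rows with no match come back as "" rather than NA, so a genuine miss (such as "999 - 999" here) stays distinguishable from a row that had no error only through the error column.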