grep returns NA when it should return a string



I have a data frame called df_parse that looks like this:

error             error_position
1       0                           
2       0                           
3       0                           
4       1                    24 - 26
5       1                    29 - 30
6       0                           
7       0                           
8       0                           
9       0                           
10      0                           
11      0                           
12      0                           
13      0                           
14      0                           
15      0                           
16      0                           
17      0                           
18      0                           
19      0                           
20      0                           
21      0                           
22      0                           
23      1                    78 - 78
24      0                           
25      1                    83 - 84
26      0                           
27      0                           
28      0                           
29      1                    92 - 92
30      1                    95 - 95
31      0                           
32      0                           
33      0                           
34      0                           
35      0                           
36      0                           
37      1                  111 - 113

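I want to find where the strings in the error_position column occur in my raw data (a character vector read from a file). Here is a sample of the raw data: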

HUBUSL1 2   ENTER LINE NUMBER   81 - 82
FOR HUBUS = 1 VALID ENTRIES


83 - 84

VALID ENTRIES

1   MIN VALUE
99  MAX VALUE

HUBUSL3 2   See BUSL1   85 - 86

VALID ENTRIES

1   MIN VALUE
99  MAX VALUE

HUBUSL4 2   See BUSL1   87 - 88

VALID ENTRIES

1   MIN VALUE
99  MAX VALUE



A2. GEOGRAPHIC INFORMATION
GEREG   2   REGION  89 - 90

EDITED UNIVERSE:    ALL HHLD's IN SAMPLE VALID ENTRIES
1   NORTHEAST
2   MIDWEST (FORMERLY NORTH CENTRAL)
3   SOUTH
4   WEST

GEDIV   1   DIVISION    91 - 91

EDITED UNIVERSE:    ALL HHLD's IN SAMPLE VALID ENTRIES




92 – 92

GESTFIPS    2   FEDERAL INFORMATION 93 - 94
PROCESSING STANDARDS (FIPS) STATE CODE

For example, in the error_position column of the df_parse data frame, row 25 ("83 - 84") matches line 5 of the raw file:
FOR HUBUS = 1 VALID ENTRIES

83 - 84

and "92 - 92" matches near the end of the sample raw data file:

92 – 92

GESTFIPS    2   FEDERAL INFORMATION 93 - 94

I wrote a for loop that uses grep to return the element positions in the raw data vector where the pattern values from error_position match.

results1<- vector(mode = "character", length = length(df_parse$error)) #empty vector
for(i in seq_along(df_parse$error)){
results1[i]<- ifelse(df_parse$error[i] == 1, grep(pattern = paste(df_parse$error_position[i]), x = raw, value = FALSE), "")

}
results1 

Here are the sample results:

[1] ""     ""     ""     "37"   "95"   ""     ""     ""     ""     ""     ""     ""     ""     ""    
[15] ""     ""     ""     ""     ""     ""     ""     ""     "288"  ""     "298"  ""     ""     ""    
[29] NA     "381"  ""     ""     ""     ""     ""     ""     "444"  ""     ""     ""     ""     ""    
[43] ""     "532"  ""     "551"  ""     ""     ""     ""     NA     ""     ""     "677"  ""     ""    
[57] "712"  ""     ""     ""     ""     ""     ""     ""     "838"  ""     ""     ""     ""     ""    
[71] ""     ""     NA     ""     ""     ""     ""     "991"  ""     ""     ""     ""     ""     ""    
[85] ""     ""     NA     "1140" ""     "1158" ""     ""     ""     ""     ""     ""     ""     ""    
[99] ""     ""     ""     ""     ""     "1283" ""     ""     ""     NA     ""     ""     ""     ""    
[113] ""     ""     ""     ""     ""     ""     ""     ""     ""     ""     ""     ""     ""     "1658"
[127] ""     ""     ""     NA     ""     ""     "1749" ""     ""     ""     ""     ""     ""     ""    
[141] "1824" ""     ""     ""     ""     ""     ""     ""     ""     ""     ""     ""     ""     ""    
[155] ""     ""     "2065" ""     ""     "2109" ""     ""     ""     ""     ""     "2161" ""     ""    
[169] ""     NA     ""     ""     ""     ""     ""     ""     ""     ""     "2344" ""     ""     ""    

This is the result I want, because it tells me where every pattern match occurs in the raw data file. However, I noticed that some entries are NA.

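I found that the NAs occur because some number ranges are separated by a long dash (em dash) rather than a hyphen. For example, the raw data contains "92 – 92" (with a long dash/em dash), while my grep, which is based on the error_position column, is looking for a regular hyphen, as in "24 - 26".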
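To make the difference between the dash characters concrete, here is a minimal sketch. It assumes the character in the raw file is an en dash (U+2013), which is what the sample shown above appears to contain; the original file may use a different dash. The variable names are illustrative only.

```r
hyphen  <- "-"        # ASCII hyphen-minus
en_dash <- "\u2013"   # en dash
em_dash <- "\u2014"   # em dash

# Each character has a different code point, so they never match each other literally
code_points <- c(utf8ToInt(hyphen), utf8ToInt(en_dash), utf8ToInt(em_dash))
code_points                                   # 45 8211 8212

# A line like the one in the raw sample, built with an en dash
line <- paste("92", en_dash, "92")

hyphen_hit <- grepl("92 - 92", line, fixed = TRUE)                  # FALSE
dash_hit   <- grepl(paste("92", en_dash, "92"), line, fixed = TRUE) # TRUE
```

Running utf8ToInt() on the suspicious character copied from the real file is a quick way to confirm which dash is actually present.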

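I tried troubleshooting by searching for the long dash/em dash instead, but it still returns NA. For example, I know that element 29 in my loop results is NA because the loop looks for "92 - 92" rather than "92 – 92" (long dash/em dash).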

My problem: even when I simply search the raw data vector for the "92 - 92" value directly, it returns NA, or more precisely integer(0).

Some of my attempts:

grep(pattern = "92 - 92", x = raw, value = FALSE) == integer(0)
grep(pattern = paste(df_parse$error_position[29]), x = raw, value = FALSE) == integer(0)
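The link between integer(0) and the NA values in the loop output can be reproduced in isolation. The sketch below uses a single made-up line rather than the real raw vector: grep() returns a zero-length result when nothing matches, and ifelse() recycles that zero-length value into NA, whereas a plain if/else lets the empty case be handled explicitly.

```r
# A hyphen pattern searched in a line that contains an en dash: no match
no_match <- grep("92 - 92", "92 \u2013 92")
n_hits <- length(no_match)                  # 0, i.e. integer(0)

# ifelse() must return a value as long as its test, so a zero-length
# `yes` argument is recycled into NA
via_ifelse <- ifelse(TRUE, no_match, "")    # NA

# Plain if/else does no recycling, so the empty result can be replaced by ""
via_if <- if (length(no_match)) no_match[1] else ""   # ""

# Comparing with == integer(0) yields logical(0), not TRUE;
# checking the length is the usual test for "no match"
cmp <- no_match == integer(0)               # logical(0)
is_empty <- length(no_match) == 0           # TRUE
```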

I'd appreciate any suggestions. Thanks.

Do you want to match the regular hyphen as well as the long dash?

You can try:

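for (i in seq_along(df_parse$error)) {
  results1[i] <- if (df_parse$error[i] == 1) {
    # Replace the hyphen with a character class that matches a hyphen or an en dash
    pat <- sub('-', '[-–]', df_parse$error_position[i])
    res <- grep(pat, raw)
    # grep() returns integer(0) when nothing matches; keep the first hit, if any
    if (length(res)) res[1] else ""
  } else ""
}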
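For anyone who wants to try the idea without the original file, here is a self-contained sketch of the same approach on toy data. The raw and df_parse objects below are small hypothetical stand-ins, find_first is an illustrative helper name, and the character class also accepts an em dash (U+2014), which is an addition beyond the answer above.

```r
# Toy stand-ins for the real objects (hypothetical data, not the original file)
raw <- c("HUBUSL1 2   ENTER LINE NUMBER   81 - 82",
         "83 - 84",
         "GEDIV   1   DIVISION    91 - 91",
         "92 \u2013 92",
         "GESTFIPS    2   FEDERAL INFORMATION 93 - 94")
df_parse <- data.frame(
  error = c(0, 1, 1, 1),
  error_position = c("", "83 - 84", "92 - 92", "111 - 113"),
  stringsAsFactors = FALSE
)

# Accept a hyphen, an en dash or an em dash between the two numbers
dash_class <- "[-\u2013\u2014]"

# Hypothetical helper: first matching position in raw as a string, or "" if none
find_first <- function(flag, pos) {
  if (flag != 1) return("")
  pat <- sub("-", dash_class, pos, fixed = TRUE)
  hits <- grep(pat, raw)
  if (length(hits)) as.character(hits[1]) else ""
}

results1 <- unname(mapply(find_first, df_parse$error, df_parse$error_position))
results1   # "" "2" "4" ""
```

The pattern is not anchored, so a value such as "92 - 92" would also match inside a longer run of digits; adding word boundaries may be worth considering if that can occur in the real data.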
