Non-greedy gsub



I have a log dataset:

V1  duration  id  startpoint
T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512  7771    1   2012-05-07_12-29-51
T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn 7771    1   2012-05-07_12-29-51
T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501    7771    1   2012-05-07_12-29-51
T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804  7771    1   2012-05-07_12-29-51 211

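I am trying to extract information from the first column (timepoint, process, pid, url, etc.). At first I tried:

df$timepoint <- gsub("T<=>(.*)[=].*", "\\1", df$V1)

It returns something like 161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<, so then I tried:

df$timepoint <- gsub("T<=>([0-9]*).*", "\\1", df$V1)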

It works, but it won't work for text fields such as the process name, so I searched for 'regex minimal match' and found the term non-greedy. I tried again:

df$timepoint <- gsub("T<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$process <- gsub(".*P<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$pid <- gsub(".*I<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$url <- gsub(".*U<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$addr <- gsub(".*A<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$tab <- gsub(".*B<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$ver <- gsub(".*V<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$window <- gsub(".*W<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$name <- gsub(".*N<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$company <- gsub(".*C<=>(.*?)", "\\1", df$V1)

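Not every row contains all of the fields, and that is where the problem occurs. If a row has no software name or company name, R simply copies V1 into the new variable. Likewise, if the software version is the last field in V1, the regex ".*V<=>(.*?)\\[=\\].*" also copies the whole string into the new variable: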

V1  duration  id  startpoint  timepoint process pid url addr  tab ver window  name  company
T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512  7771    1   2012-05-07_12-29-51 161 explorer.exe    1820    T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512  T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512  T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512  T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512  20094   T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512  T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512
T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn 7771    1   2012-05-07_12-29-51 195 360Safe.exe 1732    T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn 7, 5, 0, 1501   1017e   360安全卫士 360.cn
T<=>203[=]P<=>360chrome.exe[=]I<=>436[=]U<=>NULL[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804[=]N<=>360极速浏览器[=]C<=>360.cn    7771    1   2012-05-07_12-29-51 203 360chrome.exe   436 NULL    2027a   20290   5.2.0.804   T<=>203[=]P<=>360chrome.exe[=]I<=>436[=]U<=>NULL[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804[=]N<=>360极速浏览器[=]C<=>360.cn    360极速浏览器    360.cn
T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501    7771    1   2012-05-07_12-29-51 209 360Safe.exe 1732    T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501    T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501    T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501    T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501    1017e   T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501    T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501
T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804  7771    1   2012-05-07_12-29-51 211 360chrome.exe   436 www.hao123.com  2027a   20290   T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804  T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804  T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804  T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804

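I thought that if R could not find 'C<=>' (for example), there would be nothing for the (.*?) after it to capture, so the result would be an empty string. Instead the output is the whole string. Can anyone help me fix this? Thanks!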
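A small sketch of the two behaviors at play here. The strings are shortened, made-up examples rather than rows from the dataset, and perl = TRUE is an assumption added so that lazy matching follows PCRE rules. It shows the difference between greedy and non-greedy capture, and that sub()/gsub() hand back the input unchanged when the pattern does not match at all.

```r
x <- c("T<=>161[=]P<=>explorer.exe[=]V<=>6.00",
       "T<=>195[=]V<=>7, 5[=]N<=>360[=]C<=>360.cn")

# Greedy: (.*) extends to the LAST "[=]" in the string.
greedy <- sub("T<=>(.*)\\[=\\].*", "\\1", x, perl = TRUE)

# Non-greedy: (.*?) stops at the FIRST "[=]".
lazy <- sub("T<=>(.*?)\\[=\\].*", "\\1", x, perl = TRUE)

# The first element has no "N<=>", so nothing is replaced and
# the original string comes back as-is (not an empty string).
name_raw <- sub(".*N<=>(.*?)\\[=\\].*", "\\1", x, perl = TRUE)

# One way to get NA instead: test for a match first.
has_name <- grepl("N<=>", x, fixed = TRUE)
name <- ifelse(has_name, name_raw, NA_character_)
```

Here lazy holds the two timepoints, name_raw[1] is still the full first string, and name[1] is NA.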

Update

Thanks to MrFlick's comment, I just got a solution based on this answer:

Taking the extraction of the software name as an example,

ind1 <- grep(".*N<=>(.*?)\\[=\\].*", df$V1, value = FALSE) # rows where N<=> is followed by another field
ind2 <- grep(".*N<=>(.*?)", df$V1, value = FALSE) # rows where N<=> appears at all
df$name <- ""
df$name[ind2] <- gsub(".*N<=>(.*?)", "\\1", df$V1[ind2]) # N<=> present: strip everything up to and including it
df$name[ind1] <- gsub(".*N<=>(.*?)\\[=\\].*", "\\1", df$V1[ind1]) # N<=> followed by [=]: keep only the value

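But this snippet looks terrible, and if it is the final solution I would have to repeat it for every other field (process, pid, version, company, etc.)... Can anyone help me optimize it? Thanks!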
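One possible way to avoid repeating the grep/gsub pair for every field is a small helper. This is only a sketch: extract_field is a hypothetical name, the sample strings are shortened, perl = TRUE is an added assumption, and it assumes each key appears either at the start of the string or right after a [=] delimiter.

```r
# Hypothetical helper: value stored under `key`, or NA when the key is absent.
extract_field <- function(x, key) {
  # Key must sit at the start or right after "[=]"; the value runs lazily
  # up to the next "[=]" or to the end of the string.
  pattern <- paste0("^(?:.*\\[=\\])?", key, "<=>(.*?)(?:\\[=\\].*)?$")
  ifelse(grepl(pattern, x, perl = TRUE),
         sub(pattern, "\\1", x, perl = TRUE),
         NA_character_)
}

V1 <- c("T<=>161[=]P<=>explorer.exe[=]V<=>6.00.2900.5512",
        "T<=>195[=]P<=>360Safe.exe[=]V<=>7, 5, 0, 1501[=]C<=>360.cn")

# Column name -> key letter, following the names used in the question.
keys <- c(timepoint = "T", process = "P", ver = "V", company = "C")

fields <- as.data.frame(lapply(keys, function(k) extract_field(V1, k)),
                        stringsAsFactors = FALSE)
```

Rows that lack a key get NA in that column instead of a copy of the whole string, and a field at the end of the string (such as the version in the first row) is still captured.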

Here is another strategy. We can use gregexpr to separate each part of the stacked data. Here is the data in a vector:

V1<-c("T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512", 
"T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn", 
"T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501", 
"T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804")

Now we can split it with

m <- gregexpr("(\\w)<=>(.*?)(?:\\[=\\]|$)", V1, perl=TRUE)

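Grabbing the captured matches can be messy, but I use a regcapturedmatches helper function (not part of base R) that makes it easy to get all the matched data. I use it just like you would use the built-in regmatches: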

data <- regcapturedmatches(V1,m)
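Since regcapturedmatches is not defined in this post, here is a minimal base R stand-in. It is a sketch under assumptions: capture_matches is a hypothetical name, and it assumes the helper returns one character matrix per input string (one row per match, one column per capture group). It relies on the capture.start and capture.length attributes that gregexpr attaches when perl = TRUE and the pattern has capture groups.

```r
# Hypothetical stand-in for regcapturedmatches(): list of character matrices,
# one per input string, rows = matches, columns = capture groups.
capture_matches <- function(x, m) {
  lapply(seq_along(x), function(i) {
    st  <- attr(m[[i]], "capture.start")
    len <- attr(m[[i]], "capture.length")
    # No match at all: return a matrix with zero rows.
    if (m[[i]][1] == -1L) {
      return(matrix(character(0), nrow = 0, ncol = ncol(st)))
    }
    # substring() drops dimensions, so rebuild the matrix (column-major order).
    matrix(substring(x[i], st, st + len - 1L), nrow = nrow(st))
  })
}

x <- c("T<=>161[=]P<=>explorer.exe", "T<=>195[=]C<=>360.cn")
m <- gregexpr("(\\w)<=>(.*?)(?:\\[=\\]|$)", x, perl = TRUE)
caps <- capture_matches(x, m)
```

Each element of caps then has the key letters in column 1 and the values in column 2.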

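If you then inspect data, you can see that all the information is there. The remaining problem is that we need to turn it into columns instead of the rows it is in now. Using reshape2: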

library(reshape2)
#combine list into one data.frame
sdata<-do.call(rbind, lapply(1:length(data), 
    function(i) data.frame(data[[i]], S=i)))    
#turn rows into columns
dcast(sdata, S~X1, value.var="X2")

which returns

  S    I             P   T              V     W      C           N     A     B
1 1 1820  explorer.exe 161 6.00.2900.5512 20094   <NA>        <NA>  <NA>  <NA>
2 2 1732   360Safe.exe 195  7, 5, 0, 1501 1017e 360.cn 360安全卫士  <NA>  <NA>
3 3 1732   360Safe.exe 209  7, 5, 0, 1501 1017e   <NA>        <NA>  <NA>  <NA>
4 4  436 360chrome.exe 211      5.2.0.804  <NA>   <NA>        <NA> 2027a 20290
               U
1           <NA>
2           <NA>
3           <NA>
4 www.hao123.com

You can rename the columns and so on, but it's really not that much code to do the whole transformation in one go.
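The renaming step could look roughly like this. It is a sketch: wide is a small hand-built stand-in for the dcast() result (only a few of its columns), and the lookup table simply reuses the column names from the question.

```r
# Stand-in for part of the dcast() output above.
wide <- data.frame(S = 1:2,
                   I = c("1820", "1732"),
                   P = c("explorer.exe", "360Safe.exe"),
                   T = c("161", "195"),
                   stringsAsFactors = FALSE)

# Key letter -> descriptive column name, as used in the question.
lookup <- c(T = "timepoint", P = "process", I = "pid", U = "url", A = "addr",
            B = "tab", V = "ver", W = "window", N = "name", C = "company")

# Rename only the columns that have an entry in the lookup table.
hit <- names(wide) %in% names(lookup)
names(wide)[hit] <- unname(lookup[names(wide)[hit]])
```

Columns without an entry in lookup, such as S, keep their original names.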
