I have a log dataset:
V1 duration id startpoint
T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 7771 1 2012-05-07_12-29-51
T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn 7771 1 2012-05-07_12-29-51
T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 7771 1 2012-05-07_12-29-51
T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804 7771 1 2012-05-07_12-29-51 211
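I am trying to extract information (timepoint, process, pid, url, etc.) from the first column. At first I tried:
df$timepoint <- gsub("T<=>(.*)[=].*", "\\1", df$V1)
It returns something like
161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<
so then I tried:
df$timepoint <- gsub("T<=>([0-9]*).*", "\\1", df$V1)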
That works for the timepoint, but not for text fields such as the process name, so I searched for 'regex minimal match' and found the term non-greedy. I tried again:
df$timepoint <- gsub("T<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$process <- gsub(".*P<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$pid <- gsub(".*I<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$url <- gsub(".*U<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$addr <- gsub(".*A<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$tab <- gsub(".*B<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$ver <- gsub(".*V<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$window <- gsub(".*W<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$name <- gsub(".*N<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$company <- gsub(".*C<=>(.*?)", "\\1", df$V1)
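Not every row contains all of the fields, and that is where the problems occur. If a row has no software name or company name, R simply copies V1 into the new variable. Likewise, if the software version is the last field in V1, the regex ".*V<=>(.*?)\\[=\\].*" also copies the whole string into the new variable: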
V1 duration id startpoint timepoint process pid url addr tab ver window name company
T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 7771 1 2012-05-07_12-29-51 161 explorer.exe 1820 T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 20094 T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512
T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn 7771 1 2012-05-07_12-29-51 195 360Safe.exe 1732 T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn 7, 5, 0, 1501 1017e 360安全卫士 360.cn
T<=>203[=]P<=>360chrome.exe[=]I<=>436[=]U<=>NULL[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804[=]N<=>360极速浏览器[=]C<=>360.cn 7771 1 2012-05-07_12-29-51 203 360chrome.exe 436 NULL 2027a 20290 5.2.0.804 T<=>203[=]P<=>360chrome.exe[=]I<=>436[=]U<=>NULL[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804[=]N<=>360极速浏览器[=]C<=>360.cn 360极速浏览器 360.cn
T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 7771 1 2012-05-07_12-29-51 209 360Safe.exe 1732 T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 1017e T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501
T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804 7771 1 2012-05-07_12-29-51 211 360chrome.exe 436 www.hao123.com 2027a 20290 T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804 T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804 T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804 T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804
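I assumed that if R cannot find 'C<=>' (for example), there would be nothing for (.*?) to capture and the result would be an empty string, but instead the output is the entire string. Can anyone help me fix this? Thanks!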
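The behaviour described above follows from how gsub treats input that does not match: it returns that element unchanged instead of an empty string. The following is only a minimal sketch on two made-up rows (x, res, matched and company are illustrative names, not part of the original data frame), showing the effect and one possible guard with grepl.

```r
# Two shortened example rows: the first has no C<=> field, the second does.
x <- c("T<=>161[=]P<=>explorer.exe",
       "T<=>195[=]C<=>360.cn")

# gsub only rewrites elements where the pattern matches;
# elements without a match come back exactly as they went in.
res <- gsub(".*C<=>(.*?)", "\\1", x)

# Guard: keep the gsub result only where the key is actually present.
matched <- grepl("C<=>", x, fixed = TRUE)
company <- ifelse(matched, res, "")

print(res)
print(company)
```

Here res[1] is the untouched input row, while company[1] is the empty string the question expected.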
Update: thanks to MrFlick's comment, I just arrived at a solution based on this answer.
Taking the extraction of the software name as an example:
ind1 <- grep(".*N<=>(.*?)\\[=\\].*", df$V1, value = FALSE) # rows where the field is followed by another field
ind2 <- grep(".*N<=>(.*?)", df$V1, value = FALSE) # rows where the field exists at all
df$name <- ""
df$name[ind2] <- gsub(".*N<=>(.*?)", "\\1", df$V1[ind2]) # field present: strip everything before the value
df$name[ind1] <- gsub(".*N<=>(.*?)\\[=\\].*", "\\1", df$V1[ind1]) # field followed by more: also strip what comes after
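But this snippet looks ugly, and if it is the final solution I have to repeat it for every other field (process, pid, version, company, etc.). Can anyone help me optimize it? Thanks!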
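One way to avoid repeating the two-step grep/gsub for every field is to wrap the lookup in a small function. This is only a sketch under some assumptions: extract_field is a hypothetical helper name, it assumes each key is a single letter followed by <=>, that values never contain the text of another key marker, and it uses regexpr with perl = TRUE so that lookarounds are available. Rows without the key get NA rather than a copy of the input.

```r
# Hypothetical helper: return the value stored under `key` in each string,
# or NA when the key is absent.
extract_field <- function(x, key) {
  # Lookbehind anchors just after "key<=>"; the lazy .*? stops at the
  # next "[=]" separator or at the end of the string.
  pattern <- paste0("(?<=", key, "<=>).*?(?=\\[=\\]|$)")
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)  # regmatches drops non-matching elements
  out
}

# Shortened example rows in the same format as V1.
rows <- c("T<=>161[=]P<=>explorer.exe[=]V<=>6.00.2900.5512",
          "T<=>195[=]P<=>360Safe.exe[=]V<=>7, 5, 0, 1501[=]C<=>360.cn")

ver <- extract_field(rows, "V")
company <- extract_field(rows, "C")
print(ver)
print(company)
```

With such a helper each new column would become a single call, for example df$name <- extract_field(df$V1, "N").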
Here is another strategy. We can use gregexpr to separate each part of the stacked data. Here is the vector:
V1<-c("T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512",
"T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn",
"T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501",
"T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804")
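Now we can split it with
m <- gregexpr("(\\w)<=>(.*?)(?:\\[=\\]|$)", V1, perl=TRUE)
Extracting the captured groups can be messy, but I use a regcapturedmatches helper function (it is not part of base R) that makes it easy to get all of the matched data. I use it just as you would use the built-in regmatches: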
data <- regcapturedmatches(V1,m)
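Because regcapturedmatches is not part of base R, a rough base-only stand-in for this step might look like the sketch below. It is not the helper used above: parse_fields, sample_rows and data_base are hypothetical names, and the code assumes fields are always separated by the literal text [=] and that every field contains <=>. Each list element is a two-column character matrix of keys and values.

```r
# Shortened example rows in the same format as V1.
sample_rows <- c("T<=>161[=]P<=>explorer.exe[=]V<=>6.00.2900.5512",
                 "T<=>211[=]P<=>360chrome.exe[=]U<=>www.hao123.com")

# Hypothetical helper: split one string into a key/value matrix.
parse_fields <- function(s) {
  # Split on the literal separator "[=]" (fixed = TRUE: no regex involved).
  fields <- strsplit(s, "[=]", fixed = TRUE)[[1]]
  # Locate the first "<=>" in each field.
  pos <- regexpr("<=>", fields, fixed = TRUE)
  keys <- substr(fields, 1, pos - 1)
  vals <- substr(fields, pos + 3, nchar(fields))
  # Column 1 holds the keys, column 2 the values.
  matrix(c(keys, vals), ncol = 2)
}

data_base <- lapply(sample_rows, parse_fields)
print(data_base[[1]])
```

Since the matrices carry no column names, data.frame() would label the columns X1 and X2.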
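Then, if you inspect data, you can see that all of the information is there. The remaining problem is that we need to build it into columns rather than the rows it is in now. Using reshape2: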
library(reshape2)
#combine list into one data.frame
sdata<-do.call(rbind, lapply(1:length(data),
function(i) data.frame(data[[i]], S=i)))
#turn rows into columns
dcast(sdata, S~X1, value.var="X2")
which returns
S I P T V W C N A B
1 1 1820 explorer.exe 161 6.00.2900.5512 20094 <NA> <NA> <NA> <NA>
2 2 1732 360Safe.exe 195 7, 5, 0, 1501 1017e 360.cn 360安全卫士 <NA> <NA>
3 3 1732 360Safe.exe 209 7, 5, 0, 1501 1017e <NA> <NA> <NA> <NA>
4 4 436 360chrome.exe 211 5.2.0.804 <NA> <NA> <NA> 2027a 20290
U
1 <NA>
2 <NA>
3 <NA>
4 www.hao123.com
You can rename the columns and so on, but it really is not much code to do the whole transformation in one go.
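As a sketch of the renaming step, the single-letter keys can be mapped to the column names used in the question with a named lookup vector. The small wide data frame below is a made-up stand-in for the dcast result, and new_names simply mirrors the key-to-name pairs from the question's gsub calls.

```r
# Made-up stand-in for the wide result (only a few columns shown).
wide <- data.frame(S = 1:2,
                   T = c("161", "195"),
                   P = c("explorer.exe", "360Safe.exe"),
                   stringsAsFactors = FALSE)

# Lookup from log key to the descriptive column name used in the question.
new_names <- c(T = "timepoint", P = "process", I = "pid", U = "url",
               A = "addr", B = "tab", V = "ver", W = "window",
               N = "name", C = "company")

# Rename only the columns that have an entry in the lookup; S is left alone.
hit <- names(wide) %in% names(new_names)
names(wide)[hit] <- unname(new_names[names(wide)[hit]])

print(names(wide))
```

Columns that are missing from the lookup keep their original names, so the same snippet works whichever keys happen to occur in the data.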