R: String manipulation of Apache web server data



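I have a data file of Apache web server logs. I want to parse the file and build a data frame made up of the different parts of each log entry. This will require some string manipulation and regular expressions, but my experience with string manipulation is very limited.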

Each line of the data is one log entry, for example:

[1] "79.133.215.123 - - [14/Jun/2014:10:30:13 -0400] \"GET /home HTTP/1.1\" 200 1671 \"-\" \"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36\""

For the IP address, I used the regexpr function to find the first space and then took a substring up to that position:

> first_space <- regexpr(pattern = " ", text=web_logs)
> IP <- substr(x=web_logs, start=1, stop=first_space-1)

However, I'm unsure how to extract the other variables. For example, to extract the date enclosed in square brackets I tried regexpr with pattern = " [", but I got an error.

What other functions can I use to extract the information I need?
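On the error mentioned above: `[` is a regex metacharacter that opens a character class, so an unescaped `[` leaves the pattern incomplete. The sketch below shows one way around this in base R, escaping the bracket or matching it literally with `fixed = TRUE`. The variable names `log_line`, `open_pos`, `close_pos` and `timestamp` are hypothetical, and the sample line uses a shortened user agent.

```r
# Hypothetical one-line sample (shortened user agent); inner quotes are escaped
log_line <- "79.133.215.123 - - [14/Jun/2014:10:30:13 -0400] \"GET /home HTTP/1.1\" 200 1671 \"-\" \"Mozilla/5.0\""

# "[" opens a character class in a regex, so escape it or match it literally
open_pos  <- regexpr("\\[", log_line)              # escaped bracket
close_pos <- regexpr("]", log_line, fixed = TRUE)  # literal match, no regex

# Take the text strictly between the two brackets
timestamp <- substr(log_line, open_pos + 1, close_pos - 1)
timestamp
```

Because `regexpr` and `substr` are vectorised, the same three calls should work unchanged on a whole vector of log lines.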

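As a quick fix, the dplyr and tidyr data-manipulation tools can help. separate() will parse your string using a simple regex separator, and you can then use select() and merge() on the columns to form the data frame.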

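# Single quotes delimit the R string, so the log's double quotes need no escaping
string <- '79.133.215.123 - - [14/Jun/2014:10:30:13 -0400] "GET /home HTTP/1.1" 200 1671 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"'
library(tidyr)    # separate(); also re-exports %>%
library(stringr)  # str_count()
# One column per space-separated token: x1, x2, ..., x22
string.df <- data.frame(string = string, stringsAsFactors = FALSE) %>%
  separate(string, paste0("x", seq_len(str_count(string, " ") + 1)), sep = " ", extra = "merge")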

The extra argument is set to "merge" for safety: if separate() runs out of columns, it keeps everything that is left over in the last one. Result:

              x1 x2 x3                    x4     x5   x6    x7        x8  x9  x10 x11
1 79.133.215.123  -  - [14/Jun/2014:10:30:13 -0400] "GET /home HTTP/1.1" 200 1671 "-"
           x12      x13 x14  x15    x16                x17     x18  x19    x20
1 "Mozilla/5.0 (Windows  NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)
                   x21            x22
1 Chrome/35.0.1916.153 Safari/537.36"

Sample data:

web_logs <- c("79.133.215.123 - - [14/Jun/2014:10:30:13 -0400] \"GET /home HTTP/1.1\" 200 1671 \"-\" \"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36\"",
              "162.235.161.200 - - [14/Jun/2014:10:30:13 -0400] \"GET /department/apparel/category/featured%20shops/product/adidas%20Kids'%20RG%20III%20Mid%20Football%20Cleat HTTP/1.1\" 200 1175 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.76.4 (KHTML, like Gecko) Version/7.0.4 Safari/537.76.4\"")

To retrieve the request from the logs, I combined strsplit() and sapply():

split_log <- strsplit(x = web_logs, split=" ")
request <- sapply(split_log, "[", 6)

This returns a character vector. A sample:

> request[1:2]
[1] "\"GET" "\"GET"

Now all that is left is to remove the " from the request.
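One way to do that last step is `gsub` with `fixed = TRUE`, which treats the quote as a literal character and not as a regex. A minimal sketch, where `request` is re-created by hand to mirror the output above and `request_clean` is a hypothetical name:

```r
# Stand-in for the strsplit()/sapply() result shown above
request <- c("\"GET", "\"GET")

# Replace every double quote with nothing; fixed = TRUE skips regex parsing
request_clean <- gsub("\"", "", request, fixed = TRUE)
request_clean
```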

This may not be exactly the right regular expression, but using http://statmodeling.com/regular-expression-for-apache-log-parsing.html and https://httpd.apache.org/docs/2.4/logs.html as guides, I came up with:

web_logs <- rep("79.133.215.123 - - [14/Jun/2014:10:30:13 -0400] \"GET /home HTTP/1.1\" 200 1671 \"-\" \"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36\"",
                3)
library(stringi)
# Backslashes are doubled and inner quotes escaped so this is a valid R string
apache.log.lazy.regex <- "([\\d.]+) ([\\w.-]+) ([\\w.-]+) \\[(.*)?\\] \"(.*)?\" (\\d{3}) ([\\d-]+) \"(.*)?\" \"(.*)?\""
do.call(rbind, stri_match_all_regex(web_logs, apache.log.lazy.regex))[, -1]
##      [,1]             [,2] [,3] [,4]                         [,5]                 [,6]  [,7]   [,8]
## [1,] "79.133.215.123" "-"  "-"  "14/Jun/2014:10:30:13 -0400" "GET /home HTTP/1.1" "200" "1671" "-" 
## [2,] "79.133.215.123" "-"  "-"  "14/Jun/2014:10:30:13 -0400" "GET /home HTTP/1.1" "200" "1671" "-" 
## [3,] "79.133.215.123" "-"  "-"  "14/Jun/2014:10:30:13 -0400" "GET /home HTTP/1.1" "200" "1671" "-" 
##      [,9]                                                                                                           
## [1,] "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
## [2,] "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
## [3,] "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"

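It works for this case, and I think it will work in most cases. There may be some inputs where it fails, for example if the user agent field contains embedded quotes.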
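To get from the matrix above to the data frame the question asks for, the captured groups can be given names and types. The sketch below is one possible approach, not part of the original answer: it uses base R's `regexec` and `regmatches` with `perl = TRUE` in place of stringi so that it runs without extra packages. The column names, the `apache_regex` and `log_df` names, and the shortened sample lines are all assumptions.

```r
# Hypothetical shortened sample lines; inner quotes are escaped
web_logs <- c(
  "79.133.215.123 - - [14/Jun/2014:10:30:13 -0400] \"GET /home HTTP/1.1\" 200 1671 \"-\" \"Mozilla/5.0\"",
  "162.235.161.200 - - [14/Jun/2014:10:30:13 -0400] \"GET /department HTTP/1.1\" 200 1175 \"-\" \"Mozilla/5.0\""
)

# Same pattern as in the answer, written as a valid R string
apache_regex <- "([\\d.]+) ([\\w.-]+) ([\\w.-]+) \\[(.*)?\\] \"(.*)?\" (\\d{3}) ([\\d-]+) \"(.*)?\" \"(.*)?\""

# Each list element holds the full match followed by the nine capture groups
parts <- regmatches(web_logs, regexec(apache_regex, web_logs, perl = TRUE))

# Stack the matches into a matrix and drop the full-match column
log_mat <- do.call(rbind, parts)[, -1, drop = FALSE]

log_df <- as.data.frame(log_mat, stringsAsFactors = FALSE)
names(log_df) <- c("ip", "ident", "user", "timestamp", "request",
                   "status", "bytes", "referer", "user_agent")

# Numeric fields; a byte count logged as "-" becomes NA
log_df$status <- as.integer(log_df$status)
log_df$bytes  <- suppressWarnings(as.integer(log_df$bytes))
log_df
```

Lines that do not match the pattern come back as empty vectors and are silently dropped by `rbind`, so it is worth comparing `nrow(log_df)` with `length(web_logs)` on real data.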
