Extracting data from a file with regular expressions


146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622
197.109.77.178 - kertzmann3129 [21/Jun/2019:15:45:25 -0700] "DELETE /virtual/solutions/target/web+services HTTP/2.0" 203 26554
156.127.178.177 - okuneva5222 [21/Jun/2019:15:45:27 -0700] "DELETE /interactive/transparent/niches/revolutionize HTTP/1.1" 416 14701
100.32.205.59 - ortiz8891 [21/Jun/2019:15:45:28 -0700] "PATCH /architectures HTTP/1.0" 204 6048

I just want to convert the data above into a list of dictionaries, where each dictionary looks like this:

example_dict = {"host":"146.204.224.152", 
"user_name":"feest6811", 
"time":"21/Jun/2019:15:45:24 -0700",
"request":"POST /incentivize HTTP/1.1"}

Please help, I'm new to this!!

You can use

^
(?P<host>\d+\S+)[-\s]+
(?P<user_name>\S+)\s+
\[(?P<time>[^][]+)\]\s+
"(?P<request>[^"]+)"

See a demo on regex101.com.


In Python, this could be

import re

# re.VERBOSE lets the pattern span several lines; re.MULTILINE makes ^ match at the start of each line.
pattern = re.compile(r"""
^
(?P<host>\d+\S+)[-\s]+
(?P<user_name>\S+)\s+
\[(?P<time>[^][]+)\]\s+
"(?P<request>[^"]+)"
""", re.MULTILINE | re.VERBOSE)

data = """
146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622
197.109.77.178 - kertzmann3129 [21/Jun/2019:15:45:25 -0700] "DELETE /virtual/solutions/target/web+services HTTP/2.0" 203 26554
156.127.178.177 - okuneva5222 [21/Jun/2019:15:45:27 -0700] "DELETE /interactive/transparent/niches/revolutionize HTTP/1.1" 416 14701
100.32.205.59 - ortiz8891 [21/Jun/2019:15:45:28 -0700] "PATCH /architectures HTTP/1.0" 204 6048
"""

# groupdict() returns the named groups of each match as a dictionary.
for match in pattern.finditer(data):
    dct = match.groupdict()
    print(dct)

which yields

{'host': '146.204.224.152', 'user_name': 'feest6811', 'time': '21/Jun/2019:15:45:24 -0700', 'request': 'POST /incentivize HTTP/1.1'}
{'host': '197.109.77.178', 'user_name': 'kertzmann3129', 'time': '21/Jun/2019:15:45:25 -0700', 'request': 'DELETE /virtual/solutions/target/web+services HTTP/2.0'}
{'host': '156.127.178.177', 'user_name': 'okuneva5222', 'time': '21/Jun/2019:15:45:27 -0700', 'request': 'DELETE /interactive/transparent/niches/revolutionize HTTP/1.1'}
{'host': '100.32.205.59', 'user_name': 'ortiz8891', 'time': '21/Jun/2019:15:45:28 -0700', 'request': 'PATCH /architectures HTTP/1.0'}
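
Since the question asks for a list of dictionaries rather than printed output, the matches could also be collected with a list comprehension. This is only a minimal sketch that reuses the pattern above; the names LOG_PATTERN, sample, and records are illustrative and not part of the original answer, and the sample is shortened to two of the lines from the question.

```python
import re

# Same pattern as above; comments are allowed because of re.VERBOSE.
LOG_PATTERN = re.compile(r"""
    ^
    (?P<host>\d+\S+)[-\s]+      # IP address followed by the dash separator
    (?P<user_name>\S+)\s+       # user name
    \[(?P<time>[^][]+)\]\s+     # timestamp between square brackets
    "(?P<request>[^"]+)"        # request line between double quotes
""", re.MULTILINE | re.VERBOSE)

# Two of the lines from the question, used as in-memory sample data.
sample = (
    '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622\n'
    '100.32.205.59 - ortiz8891 [21/Jun/2019:15:45:28 -0700] "PATCH /architectures HTTP/1.0" 204 6048\n'
)

# One dictionary per matching line, collected into a list.
records = [match.groupdict() for match in LOG_PATTERN.finditer(sample)]
print(len(records))
print(records[0]["request"])
```

To read from a file instead, the whole content could be passed to finditer, for example via the result of the file object's read() method.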

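In this code I use re to search each line for the pattern and collect the matched groups in the dictionary unit_d. The list fulllist holds all the dictionaries.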

import re

filename = 'c:/test/log.txt'
fulllist = []
with open(filename) as file:
    for line in file:
        unit_d = dict()
        text = line.rstrip()
        # Groups: 1 = host, 2 = user name, 3 = timestamp, 4 = request
        finder = re.search(r'([\d.]+)[\s-]+(\w+) \[([\w/: -]+)\] "([^"]+)', text)
        unit_d['host'] = finder.group(1)
        unit_d['user_name'] = finder.group(2)
        unit_d['time'] = finder.group(3)
        unit_d['request'] = finder.group(4)
        print(unit_d)
        fulllist.append(unit_d)

Result (the key order may differ depending on the Python version)

{'request': 'POST /incentivize HTTP/1.1', 'host': '146.204.224.152', 'user_name': 'feest6811', 'time': '21/Jun/2019:15:45:24 -0700'}
{'request': 'DELETE /virtual/solutions/target/web+services HTTP/2.0', 'host': '197.109.77.178', 'user_name': 'kertzmann3129', 'time': '21/Jun/2019:15:45:25 -0700'}
{'request': 'DELETE /interactive/transparent/niches/revolutionize HTTP/1.1', 'host': '156.127.178.177', 'user_name': 'okuneva5222', 'time': '21/Jun/2019:15:45:27 -0700'}
{'request': 'PATCH /architectures HTTP/1.0', 'host': '100.32.205.59', 'user_name': 'ortiz8891', 'time': '21/Jun/2019:15:45:28 -0700'}
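
The loop above assumes that every line matches. re.search returns None for a line that does not, and calling .group() on None raises AttributeError. One possible way to guard against that is sketched below; the helper parse_line and the names LINE_RE and FIELDS are hypothetical and not part of the original answer, and a few in-memory lines stand in for the log file.

```python
import re

# Same positional groups as above: host, user name, timestamp, request.
LINE_RE = re.compile(r'([\d.]+)[\s-]+(\w+) \[([\w/: -]+)\] "([^"]+)"')
FIELDS = ("host", "user_name", "time", "request")


def parse_line(line):
    """Return a dict for a matching log line, or None if the line does not match."""
    found = LINE_RE.search(line)
    if found is None:
        return None
    # Pair each field name with the corresponding captured group.
    return dict(zip(FIELDS, found.groups()))


lines = [
    '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622',
    '',                # blank line: no match
    'not a log line',  # malformed line: no match
]

# Keep only the lines that could be parsed.
parsed = [parse_line(line) for line in lines]
fulllist = [entry for entry in parsed if entry is not None]
print(fulllist)
```

Whether unparseable lines should be skipped silently, logged, or treated as an error depends on the data; skipping is just the simplest choice for this sketch.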
