Python - Searching a text file for groups of strings



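I want to search a text file (.txt or .log) for a list of string groups.

  1. A line must contain group A or group B (or C, D, E, ...).
  2. Every word of group A or group B must be on the same line, but the words need not be adjacent (e.g. ["123456", "Login"] or ["123457", "Login"]). If all words of a group are on the same line, the line is saved to a new txt file.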

Some sample lines:

20221110,1668057560.965,AE111,123457,0,"Action=Account Login,XXX,XXX",XXX,XXX
20221110,1668057560.965,AE112,123458,0,"Action=Account Login,XXX,XXX",XXX,XXX
20221111,1668057560.965,AE113,123458,0,"Action=Order,XXX,XXX",XXX,XXX

Here is my code:

import os, re

path = "Log\\"
file_list = [path + f for f in os.listdir(path) if f.endswith('.log')]
outfile = "output.txt"  # assumed name; outfile is not defined in the original post

keep_phrases1 = ["123456", "Login"]
keep_phrases2 = ["123457", "Login"]
# \b anchors each word at a word boundary; .*? allows any text in between
pat = r"\b.*?\b".join([re.escape(word) for word in keep_phrases1])
pat = re.compile(r"\b" + pat + r"\b")
pat2 = r"\b.*?\b".join([re.escape(word) for word in keep_phrases2])
pat2 = re.compile(r"\b" + pat2 + r"\b")
print(pat2, pat)
if len(file_list) != 0:
    for infile in sorted(file_list):
        with open(infile, encoding="latin-1") as f:
            f = f.readlines()
            for line in f:
                found1 = pat.search(line)
                found2 = pat2.search(line)
                if found1 or found2:
                    with open(outfile, "a") as wf:
                        wf.write(line)

This works for me, but it is not easy to add more word groups, and I don't think the code is easy to understand.

My question is: how can I simplify the code, and how can I make it easier to add other groups to search for, e.g. ["123458", "Login"], ["123456", "order"], ["123457", "order"]?

import os, re
path = "Log\\"
file_list = [path + f for f in os.listdir(path) if f.endswith('.log')]

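Put all the keep_phrases in one container. I chose a dictionary, but since the groups are only identified by their order, a list would work just as well: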

keep_phrases = {'keep_phrases1': ["123456", "Login"], 'keep_phrases2':["123457", "Login"]}
# Alternative, a list would work:
# keep_phrases = [["123456", "Login"], ["123457", "Login"]]

Now let's build a list of the compiled patterns:

def compile_pattern(keep_phrase):
    # Escape each word, then require the words in order with any text in between
    pat = r"\b.*?\b".join([re.escape(word) for word in keep_phrase])
    pat = re.compile(r"\b" + pat + r"\b")
    return pat

patterns = [compile_pattern(keep_phrases[keep_phrase]) for keep_phrase in keep_phrases.keys()]
# if keep_phrases had been a list, we would do
# patterns = [compile_pattern(keep_phrase) for keep_phrase in keep_phrases]
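
To see what the compiled patterns actually accept, here is a minimal sketch that runs them against the three sample lines from the question. It uses the list form of keep_phrases; the names sample_lines and matched are only illustrative. Note that the pattern requires the words in the order they are listed in each group.

```python
import re

def compile_pattern(keep_phrase):
    # Escape each word and join with a lazy wildcard, so the words must
    # appear in the listed order, on one line, each at a word boundary.
    body = r"\b.*?\b".join(re.escape(word) for word in keep_phrase)
    return re.compile(r"\b" + body + r"\b")

# The three sample lines from the question
sample_lines = [
    '20221110,1668057560.965,AE111,123457,0,"Action=Account Login,XXX,XXX",XXX,XXX',
    '20221110,1668057560.965,AE112,123458,0,"Action=Account Login,XXX,XXX",XXX,XXX',
    '20221111,1668057560.965,AE113,123458,0,"Action=Order,XXX,XXX",XXX,XXX',
]

keep_phrases = [["123456", "Login"], ["123457", "Login"]]
patterns = [compile_pattern(group) for group in keep_phrases]

# Keep a line when at least one group pattern matches it
matched = [line for line in sample_lines if any(p.search(line) for p in patterns)]
print(matched)
```

Only the first sample line is kept: it is the only one containing 123456 or 123457 followed by Login.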

Finally, we look for a match with each pattern, and if there are any findings, we write the line to the file.

if len(file_list) != 0:
    for infile in sorted(file_list):
        with open(infile, encoding="latin-1") as f:
            f = f.readlines()
            for line in f:
                findings = [pat.search(line) for pat in patterns]  # can do this because there's a list with patterns
                if any(findings):
                    with open(outfile, "a") as wf:
                        wf.write(line)
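
The loop above reopens the output file for every matching line. One possible variation, sketched here under the assumption that the output file can stay open while the input is scanned, is a small helper that takes any iterable of lines and any writable object. The name copy_matching_lines is hypothetical, and io.StringIO stands in for real files so the example runs on its own.

```python
import io
import re

def copy_matching_lines(src, dst, patterns):
    """Write each line of src that matches at least one pattern to dst.

    Returns the number of lines written.
    """
    written = 0
    for line in src:  # iterating a file object yields one line at a time
        if any(pat.search(line) for pat in patterns):
            dst.write(line)
            written += 1
    return written

patterns = [
    re.compile(r"\b123456\b.*?\bLogin\b"),
    re.compile(r"\b123457\b.*?\bLogin\b"),
]

# In-memory stand-ins for the input log and the output file
src = io.StringIO(
    'a,123457,0,"Action=Account Login"\n'
    'a,123458,0,"Action=Order"\n'
)
dst = io.StringIO()
count = copy_matching_lines(src, dst, patterns)
print(count, dst.getvalue())
```

With real files, the same helper could be called as copy_matching_lines(f, wf, patterns) inside a single with block that opens both infile and outfile.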

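Try this. I read the whole file into a single string to keep the code fast and readable; findall returns a list containing all matching lines of the file. If memory is a concern, the pattern can also be applied line by line: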

import re

file_list = ["sit.txt"]
keep_phrases = [["123456", "Login"], ["123457", "Login"]]
# One alternative per group; each alternative consumes a whole line up to \n or end of text
pat = [r"(?:.*?(?:" + p1 + r"\b.*?" + p2 + r".*?(?:\n|$)))" for p1, p2 in keep_phrases]
pat = r"|".join(pat)
for infile in sorted(file_list):
    with open(infile, encoding="latin-1") as f:
        text = f.read()
    print(re.findall(pat, text))
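
For the line-by-line case mentioned above, a rough sketch could combine all groups into one alternation and test each line lazily with a generator. The helper names group_regex and matching_lines are assumptions made for this illustration, and the groups may have any number of words.

```python
import re

keep_phrases = [["123456", "Login"], ["123457", "Login"]]

def group_regex(group):
    # Words of one group, in order, each at a word boundary
    return r"\b" + r"\b.*?\b".join(re.escape(w) for w in group) + r"\b"

# One regex with a non-capturing alternative per group
combined = re.compile("|".join("(?:" + group_regex(g) + ")" for g in keep_phrases))

def matching_lines(lines):
    # Yield matching lines one at a time, so the whole file is never held in memory
    for line in lines:
        if combined.search(line):
            yield line

sample = [
    'x,123456,0,"Action=Account Login"\n',
    'x,123458,0,"Action=Account Login"\n',
    'x,123457,0,"Action=Order"\n',
]
result = list(matching_lines(sample))
print(result)
```

Passing an open file object instead of the sample list works the same way, because a file object iterates over its lines.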

Without regex:

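def match_words(line, words):
    # True when every word of the group occurs somewhere in the line.
    # The whole line is tested, because "Login" is only part of a comma-separated field.
    return all(word in line for word in words)

# infile, outfile, keep_phrases1 and keep_phrases2 are the names used in the question
with open(infile, encoding="latin-1") as f:
    for line in f:
        if any(match_words(line, words) for words in [keep_phrases1, keep_phrases2]):
            with open(outfile, "a") as wf:
                wf.write(line)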
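
To make adding groups easier with the regex-free approach, the check can be folded into one helper that takes a list of groups. This is only a sketch: line_matches is a hypothetical name, and plain substring tests differ from the regex versions in two ways, since word order does not matter and a word also matches inside a longer token.

```python
def line_matches(line, groups):
    # True when all words of at least one group occur in the line (substring test)
    return any(all(word in line for word in group) for group in groups)

groups = [["123456", "Login"], ["123457", "Login"]]

login_line = '20221110,1668057560.965,AE111,123457,0,"Action=Account Login,XXX,XXX",XXX,XXX'
order_line = '20221111,1668057560.965,AE113,123458,0,"Action=Order,XXX,XXX",XXX,XXX'

print(line_matches(login_line, groups))   # True
print(line_matches(order_line, groups))   # False

# Caveats of substring matching: order is ignored, and word boundaries are not checked
print(line_matches("Login by 123456", groups))     # True
print(line_matches("91234567 Login", groups))      # True, 123456 is inside 91234567
```

Adding another group is then a matter of appending one more list to groups.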
