Python: scrape the strings that follow a given string



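I have 1000+ .txt files to scrape (Python). I have already created the file_list variable, which lists the paths of all the .txt files. There are five fields I want to scrape: file_form, date, company, company ID, and price range. The first four are no problem, because they appear in a very structured way, on separate lines, at the start of every .txt file: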

FILE FORM:      10-K
DATE:           20050630
COMPANY:        APPLE INC
COMPANY CIK:    123456789

I used the following code for these four fields:

import sys, os, re

exemptions = []
for eachfile in file_list:
    line2 = ""  # the loop below reads the file line by line; collect the full text here as well
    with open(eachfile, 'r') as f:
        for line in f:
            line2 += line  # append each line
            if "FILE FORM" in line:
                # strip the newline and remove the "FILE FORM:" label
                exemptions.append(line.strip('\n').replace("FILE FORM:", ""))
            elif "COMPANY CIK" in line:
                # checked before "COMPANY", which is a substring of "COMPANY CIK"
                exemptions.append(line.rstrip('\n').replace("COMPANY CIK:", ""))
            elif "COMPANY" in line:
                exemptions.append(line.rstrip('\n').replace("COMPANY:", ""))  # rstrip removes the trailing '\n'
            elif "DATE" in line:
                exemptions.append(line.rstrip('\n').replace("DATE:", ""))
print(exemptions)

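This gives me a list, exemptions, containing all the relevant values, as in the example above. However, the "price range" field sits in the middle of the .txt file, in a sentence like this: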

We anticipate that the initial public offering price will be between $         and
$         per share.

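And I don't know how to get the $whateveritis;and $whateveritis;per share. part as my fifth and last variable. The good news is that many files use the same structure, and sometimes the actual $ amounts appear instead of "&nbsp;". Example: We anticipate that the initial public offering price will be between $12.00 and $15.00  per share.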

I would like to have this "12.00;and;15.00" as the fifth variable in the exemptions list (or something similar that I can easily work with later in a csv file).

Thank you very much in advance.

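It looks like you have already imported re, so why not use it? A regular expression like \$[\d.]+ and \$[\d.]+ should match the prices, and you can easily refine it from there: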

import sys, os, re

exemptions = []
for eachfile in file_list:
    line2 = ""
    with open(eachfile, 'r') as f:
        for line in f:
            line2 = line2 + line
            # raw string, with "$" escaped so it matches a literal dollar sign
            m = re.search(r'\$[\d.]+ and \$[\d.]+', line)
            if "FILE FORM" in line:
                .
                .
                .
            elif m:
                exemptions.append(m.group(0))   # m.group(0) is the text of the first match on the line; you can refine it from there
print(exemptions)
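
One possible refinement, as a rough sketch rather than a definitive solution: in the sample from the question the sentence breaks after "and", so a line-by-line search with a single space around "and" would miss it. The version below searches a whole file's text instead, allows any whitespace (including a newline) around "and", and uses capture groups to build the "12.00;and;15.00" format asked for. The names PRICE_RANGE and extract_price_range are made up for this illustration, and the pattern assumes plain amounts such as 12.00 without thousands separators.

```python
import re

# Assumed pattern: "$<amount> and $<amount>", where any whitespace
# (spaces or a line break) may surround "and".
PRICE_RANGE = re.compile(r"\$\s*(\d+(?:\.\d+)?)\s+and\s+\$\s*(\d+(?:\.\d+)?)")


def extract_price_range(text):
    """Return 'low;and;high' for the first price range in text, or '' if none is found."""
    # Treat "&nbsp;" entities and non-breaking spaces as ordinary spaces.
    cleaned = text.replace("&nbsp;", " ").replace("\xa0", " ")
    m = PRICE_RANGE.search(cleaned)
    if m is None:
        # Blank placeholders ("$      and $      per share") end up here.
        return ""
    return f"{m.group(1)};and;{m.group(2)}"


sample = ("We anticipate that the initial public offering price "
          "will be between $12.00 and\n$15.00  per share.")
print(extract_price_range(sample))
```

In the loop above this could be called once per file, after the inner for loop, as exemptions.append(extract_price_range(line2)). Appending an empty string when no amounts are filled in keeps five values per file, which should make the later csv step simpler.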
