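I am trying to extract certain text based on the surrounding words/patterns and write the results to a file called sample.csv.
For example, I have a directory of files:
file1.html file2.html file3.html
Each file has the following structure. For example, file1.html: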
<strong>Hello world</strong>
<p><strong>Name:</strong> John Smith</p>
<p>Some text</p>
<p><strong>Location</strong></p>
<blockquote>
<p>122 Main Street & City, ST 12345 ></p>
</blockquote>
<p>Some text</p>
Based on the HTML structure above, I want to write the output to a sample.csv file like this:
filename,name,location
file1.html,John Smith,122 Main Street
file2.html,Mary Smith,123 North Road
file3.html,Kate Lee,90 Winter Lane
I have the following Python code:
import os
import csv
import re
csv_cont = []
directory = os.getcwd()
for root,dir,files in os.walk(directory):
    for file in files:
        if file.endswith(".html"):
            f = open(file, 'r')
            name = re.search('<p><strong>Name:</strong>(.*)</p>', f)
            location = re.search('<p><strong>Location</strong></p><blockquote><p>(.*)&', f)
            tmp = []
            tmp.append(file)
            tmp.append(name)
            tmp.append(location)
            csv_cont.append(tmp)
            f.close()
#Change name of test.csv to whatever you want
with open("sample.csv", 'w', newline='') as myfile:
    wr = csv.DictWriter(myfile, fieldnames = ["filename", "name", "location"], delimiter = ',')
    wr.writeheader()
    wr = csv.writer(myfile)
    wr.writerows(csv_cont)
I get the following error:
return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object
What is the problem here?
You need to read the file's contents and run the search on that text, not on the file object. Replace
f = open(file, 'r')
name = re.search('<p><strong>Name:</strong>(.*)</p>', f)
location = re.search('<p><strong>Location</strong></p><blockquote><p>(.*)&', f)
with
f = open(file, 'r')
file_content = f.read()
name = re.search('<p><strong>Name:</strong>(.*)</p>', file_content).group(1)
location = re.search('<p><strong>Location</strong></p>\n\n<blockquote>\n<p>(.*)&', file_content).group(1)
Corrections: use file_content instead of f in the search.
Use group(1) to get the captured text.
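To illustrate the two corrections, here is a minimal sketch. It assumes io.StringIO as a stand-in for an open file (purely for illustration, it is not part of the original script). It shows that re.search rejects a file object with a TypeError but accepts the string returned by read(), and that group(1) returns the captured text rather than the match object.

```python
import io
import re

# io.StringIO stands in for an open text file (illustrative assumption).
f = io.StringIO("<p><strong>Name:</strong> John Smith</p>\n")
pattern = r"<p><strong>Name:</strong>(.*)</p>"

try:
    re.search(pattern, f)        # a file object is not a string
    raised = False
except TypeError:
    raised = True                # this is the error from the question

content = f.read()               # read() returns the file's text as a str
match = re.search(pattern, content)
name = match.group(1).strip()    # group(1) is the first capture group
```

The strip() call is an optional extra that removes the space the pattern captures after the closing strong tag.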
Output:
filename,name,location
file1.html, John Smith,122 Main Street
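For a somewhat more defensive variant, the sketch below may help. It rests on several assumptions that are not in the original answer: the helper names extract_fields and build_csv are hypothetical, the files are assumed to be UTF-8, and \s* is swapped in for the literal newlines so the pattern tolerates any whitespace between tags. It also joins each filename with root so files in subdirectories open correctly, and writes an empty field instead of raising when a pattern is not found.

```python
import csv
import os
import re

# Whitespace-tolerant patterns; \s* absorbs any newlines between tags.
NAME_RE = re.compile(r"<p><strong>Name:</strong>\s*(.*?)\s*</p>")
LOCATION_RE = re.compile(
    r"<p><strong>Location</strong></p>\s*<blockquote>\s*<p>\s*(.*?)\s*&"
)


def extract_fields(html):
    """Return (name, location); a field is '' when its pattern is absent."""
    name = NAME_RE.search(html)
    location = LOCATION_RE.search(html)
    return (
        name.group(1) if name else "",
        location.group(1) if location else "",
    )


def build_csv(directory, out_path="sample.csv"):
    """Walk `directory`, extract fields from each .html file, write a CSV."""
    rows = []
    for root, _dirs, files in os.walk(directory):
        for filename in sorted(files):
            if filename.endswith(".html"):
                # Join with root so files in subdirectories open correctly.
                path = os.path.join(root, filename)
                with open(path, "r", encoding="utf-8") as f:
                    rows.append((filename, *extract_fields(f.read())))
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["filename", "name", "location"])
        writer.writerows(rows)
    return rows
```

Calling build_csv(os.getcwd()) would mirror the original workflow. For HTML that varies more than this sample, a real HTML parser is usually a safer choice than regular expressions.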