I am using this code to cut a section out of locally stored HTML files and save each shortened document to a .txt file.
import glob
import os
import re


def extractor():
    os.chdir(r"F:\Test")  # the directory containing your html
    for file in glob.iglob("*.html"):  # iterates over all files in the directory ending in .html
        # both files are opened with an explicit encoding; the "with" block closes them
        with open(file, encoding="utf8") as f, open((file.rsplit(".", 1)[0]) + ".txt", "w", encoding="utf8") as out:
            contents = f.read()
            # matches everything from "Start" through the nearest "End", case-insensitively
            extract = re.compile(r'(Start).*?End', re.I | re.S)
            cut = extract.sub('', contents)
            if extract.search(contents) is not None:
                out.write(cut)


extractor()
It works for most of my files, but for a few of them I run into an encoding problem and get:
Traceback (most recent call last):
  File "C:/Users/6930p/PycharmProjects/untitled/Versuch/CutFile.py", line 16, in <module>
    extractor()
  File "C:/Users/6930p/PycharmProjects/untitled/Versuch/CutFile.py", line 14, in extractor
    out.write(cut)
  File "C:\Users\6930p\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 241205-241210: character maps to <undefined>
Does anyone know what is going wrong? I thought that by using encoding="utf8" I would not have any encoding problems...
Any help is appreciated!
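OK, it was a problem with encoding="utf8" after all: I had forgotten to pass it when opening the newly created .txt file. Without it, open() writes with the system default encoding (cp1252 here, as the traceback shows), which cannot represent every character read from the UTF-8 source. The code has been updated and now works!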
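For anyone hitting the same error, here is a minimal sketch of what goes wrong. It is not taken from the original script: the sample string, the temporary file name `out.txt`, and the variable names are made up for illustration. It assumes the text contains at least one character that cp1252 has no mapping for (an arrow, U+2192, in this case).

```python
import os
import tempfile

# Hypothetical sample text; U+2192 (rightwards arrow) has no cp1252 mapping.
text = "before \u2192 after"

# This is roughly what happens when a file is opened for writing without
# encoding= on a Windows machine whose default encoding is cp1252.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Passing encoding="utf8" on the output file avoids the problem.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.txt")  # hypothetical output file
    with open(path, "w", encoding="utf8") as out:
        out.write(text)
    with open(path, encoding="utf8") as f:
        round_trip = f.read()

print("cp1252 can encode the text:", cp1252_ok)
print("utf8 round trip matches:", round_trip == text)
```

The takeaway is that encoding="utf8" has to be given separately on every open() call, for the file being read and for the file being written.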