Splitting a very large text into documents by matching a regular expression in Python



I have a very large file (~4 GB) that looks like this:

<P ID=000ajevz>
OBJECTIVE: Fludarabine, cyclophosphamide and rituximab (FCR) therapy for lymphoid malignancies has historically been associated with a low reported incidence of Pneumocystis jirovecii pneumonia (PJP). However, prophylaxis was routinely used in early studies,.............................
</P>
<P ID=000q5l5n>
SIMPLE SUMMARY: The role of rodents in the transmission of many diseases is widely known. Wild rats abundant in urban environments may transmit diseases to humans and other animals, including laboratory rodents used for biomedical research in research facilities,......
</P>

I am trying to read the entire file at once and then split it with the regular expression '<P ID=(\w+)>(.*?)</P>'

so that my file's text is turned into a posting list for applying TF-IDF.

My code looks like this:

import time
import re
import json
import string
import numpy as np
import collections
from collections import Counter
from nltk.corpus import stopwords

filename = 'corpus3.txt'
text = open(filename, encoding='utf8').read() #read all text of file at once
doc = re.finditer(r'<P ID=(\w+)>(.*?)</P>', text, re.S) #split text into documents by matching regular expression
STOPWORDS = set(stopwords.words('english')) #load stopwords from nltk
no_docs = 0 #number of processed documents
doc_freq = Counter() #document frequency of each term
col_freq = Counter() #collection frequency
doc_id = 0
id_dict = dict()
data = collections.defaultdict(list) #dictionary of key:word and values:(docid, termfrequency)
for (docid, text) in [(x.group(1), x.group(2)) for x in doc]:
    no_docs += 1
    text = text.lower()
    #wordcount = Counter(text.split())
    wordcount = Counter(re.split(r'\W+', text))
    #wordcount = Counter(word[:5] for word in text.split()) #5-stemming
    doc_id += 1
    id_dict[doc_id] = docid
    for (word, count) in wordcount.items():
        if word.isalpha():
            if word not in STOPWORDS:
                #word_stem = word[:5]
                doc_freq[word] += 1
                col_freq[word] += count
                data[word].append((doc_id, count))
flattend = (item for tag in data.values() for item in tag) #[(docid, tf)]
posting = (item for tag in flattend for item in tag) #[docid, tf]
posting_list = list(posting)

But I keep getting a memory error when it tries to read the whole file at once. I tried it in Google Colab and my Colab session crashed. I tried slicing the file and reading only a quarter of it, and it still gave me a memory error. I also tried reading it line by line, but then I don't know how to iterate over the lines and split them with the regular expression so that I end up with a single iterator to process later.

You can make a generator like this:

def read_until(f, close_tag):
    line = f.readline()
    while not line.startswith(close_tag):
        yield line
        line = f.readline()

def get_docid(line):
    id_start = line.find("ID=") + 3
    id_end = line.find(">")
    return line[id_start:id_end]

def read_doc(f):
    """Split text into documents"""
    text = ""
    line = f.readline()
    while line:
        if line.startswith("<P"):
            docid = get_docid(line)
            # read until closing tag
            for rest_line in read_until(f, "</P"):
                text += rest_line
            yield (docid, text)
            text = ""
        line = f.readline()

This lets you read the file line by line while still splitting it the way you want. It works because the <P> tags are always on their own lines. You can replace the for loop in your code with:

docfile = open(filename)
for docid, text in read_doc(docfile):
    ...
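
For reference, here is a minimal sketch of how the generator could replace the original loop while keeping the rest of your counting logic unchanged (it assumes read_doc is defined as above, that the file is again 'corpus3.txt', and that the NLTK stopwords are already downloaded); only one document's text is held in memory at a time:

import re
import collections
from collections import Counter
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))
doc_freq = Counter()                   # document frequency of each term
col_freq = Counter()                   # collection frequency of each term
data = collections.defaultdict(list)   # word -> list of (doc_id, term frequency)
id_dict = dict()
doc_id = 0

with open('corpus3.txt', encoding='utf8') as docfile:
    for docid, text in read_doc(docfile):
        doc_id += 1
        id_dict[doc_id] = docid
        # same tokenization as in the original code, applied per document
        wordcount = Counter(re.split(r'\W+', text.lower()))
        for word, count in wordcount.items():
            if word.isalpha() and word not in STOPWORDS:
                doc_freq[word] += 1
                col_freq[word] += count
                data[word].append((doc_id, count))

The posting-list construction at the end of your script (flattend / posting / posting_list) can stay exactly as it is, since data has the same shape as before.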
