How to split a regex result list on new lines after stemming and removing punctuation



The generated files are two very long single-element lists, with all the processed text lumped together. When I tried moving the list.append calls under the if-else statement, I got a huge list in which every few words were grouped together, followed by the same previous words with a few new ones added, and so on until a full sentence was built up; then the same thing started again for the next match. I'm sure this could be solved with a better loop. I also tried post-processing the generated files, but that is very inefficient because I no longer have anything to split them on. Could this be caused by the "or" operator in the regular expression I wrote?
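A minimal sketch of what I suspect should happen instead, with placeholder names (matched_rows, stemmer and stop_set stand in for result2, h and stop_words below): if the per-row token list is reset inside the row loop and the joined string is appended once per row, each matched line ends up as its own list element instead of everything accumulating into one string.

# Hypothetical sketch, not the original code: reset the buffer per row so that
# every matched CSV row produces exactly one element in the result list.
from nltk.tokenize import word_tokenize

def process_rows(matched_rows, stemmer, stop_set):
    result = []
    for row in matched_rows:
        tokens = []                      # fresh buffer for every row
        for w in word_tokenize(row):
            if w not in stop_set and w.isalpha():
                tokens.append(stemmer.stem(w))
        result.append(' '.join(tokens))  # one joined string per matched row
    return result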

import csv
import re 
import string
import nltk
from nltk.tokenize import punkt, word_tokenize 
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
import langid
h=SnowballStemmer("hungarian") # hungarian stemmer
stop_words=set(stopwords.words("hungarian")) # - {"Nem,nem"} 
i=0.0
j=0.0
latin_counter=0.0
result=[]
result2=[]
tokenized_txt=[]
tokenized_txt_latin=[]
unstemmed_list=[]
auxlist=[]
stop_words_latin={'ab', 'ac', 'ad', 'adhic', 'aliqui', 'aliquis', 'an', 'ante', 'apud', 'at', 'atque',
'aut', 'autem', 'cum', 'cur', 'de', 'deinde', 'dum', 'ego', 'enim', 'ergo', 'es', 'est', 'et', 'etiam', 'etsi', 'ex', 'fio', 'haud', 
'hic', 'iam', 'idem', 'igitur', 'ille', 'in', 'infra', 'inter', 'interim', 'ipse', 'is', 'ita', 'magis', 'modo',
'mox', 'nam', 'ne', 'nec', 'necque', 'neque', 'nisi', 'non', 'nos', 'o', 'ob', 'per', 'possum', 'post', 'pro', 'quae', 'quam', 'quare', 'qui',
'quia', 'quicumque', 'quidem', 'quilibet', 'quis', 'quisnam', 'quisquam', 'quisque', 'quisquis', 'quo', 'quoniam', 'sed', 'si', 'sic',
'sive', 'sub', 'sui', 'sum', 'super', 'suus', 'tam', 'tamen', 'trans', 'tu', 'tum', 'ubi', 'uel', 'uero'}
with open('data/onkology.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';')
    exp = r'[l L]u.r[o i]n\b|(\w)*peptyl\b|(\w)*lutamid\b'
    for line in csv_reader:
        i += 1
        for lineElem in line:
            if re.search(exp, lineElem) and len(lineElem) > 80:
                result2.append(lineElem)  # if we want to see what we matched
                tst_txt = lineElem
                j += 1

        #if(i >= 10000):
        #    break


for listElem in result2:
    k, _ = langid.classify(listElem)
    if k == 'la':
        #print(tst_txt)
        latin_counter += 1
        words = word_tokenize(listElem)
        # removing stop words
        for w in words:
            if w not in stop_words_latin:
                # stemming and add to a list
                tokenized_txt_latin.append(w)
        # removing punctuation
        tokenized_txt_latin = [word for word in tokenized_txt_latin if word.isalpha()]
        words = ' '.join(tokenized_txt_latin)  # rejoining tokens to form a string

    else:
        words = word_tokenize(listElem)
        # removing stop words
        for w in words:
            if w not in stop_words:
                # stemming and add to a list
                auxlist.append(w)
                tokenized_txt.append(h.stem(w))
        #unstemmed_list.append(words)
        # removing punctuation
        auxlist = [word for word in auxlist if word.isalpha()]
        words2 = ' '.join(auxlist)  # rejoining tokens to form a string
        tokenized_txt = [word for word in tokenized_txt if word.isalpha()]
        words = ' '.join(tokenized_txt)  # rejoining tokens to form a string

result.append(words)
unstemmed_list.append(words2) 

print("Matching rate is :",  (j/i) )
print(unstemmed_list ,"n")
print(result,"n")
# write results to a file 
with open('listfile.txt', 'w') as filehandle:
for listitem in result:
filehandle.write('%sn' % listitem)
with open('listfile_unstemmed.txt', 'w') as filehandle:
for listitem in unstemmed_list:
filehandle.write('%sn' % listitem)
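For what it's worth, once each processed row is written with its own trailing newline (one write per matched row rather than one write for the whole corpus), the output file can be split back into a list without any extra delimiter bookkeeping. A small check along these lines, assuming the listfile.txt written above:

# Hypothetical check, not part of the original script: read the written file
# back into a list, one processed sentence per element.
with open('listfile.txt', 'r') as filehandle:
    restored = filehandle.read().splitlines()
print(len(restored), "processed lines recovered")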

After running the code on another machine (I migrated the project to Google Colab) and comparing the results, I found that the behaviour was caused by a memory overflow on the old machine and had nothing to do with the code itself.
