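I'm currently comparing the text of one file against the text of another. The approach: for each line in the source text file, check it against every line in the compare text file. If the word is present in the compare file, write the word with "present" next to it. If the word is not present, write the word with not_present next to it. So far I can do this just fine by printing to the console, as shown below: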
import sys

filein = 'source.txt'
compare = 'compare.txt'
source = 'source.txt'

# change to lower case
with open(filein, 'r+') as fopen:
    string = ""
    for line in fopen.readlines():
        string = string + line.lower()
with open(filein, 'w') as fopen:
    fopen.write(string)

# search and list
with open(compare) as f:
    searcher = f.read()
    if not searcher:
        sys.exit("Could not read data :-(")

# search and output the results
with open(source) as f:
    for item in (line.strip() for line in f):
        if item in searcher:
            print(item, ',present')
        else:
            print(item, ',not_present')
The output looks like this:
dog ,present
cat ,present
mouse ,present
horse ,not_present
elephant ,present
pig ,present
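I'd like to put this into a pandas dataframe, preferably with two columns: one for the word and one for its state. I can't seem to wrap my head around how to do this.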
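I'm making several assumptions here, including:
- compare.txt is a text file consisting of a list of words, one word per line.
- source.txt is a free-flowing text file with multiple words per line, each word separated by a space.
- When comparing to determine whether a compare word is in the source, a match is found if and only if the word in the source has no punctuation attached (i.e. " , . ? etc.).
- The output dataframe will only contain the words found in compare.txt.
- The final output is a printed version of the pandas dataframe.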
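The punctuation assumption matters because a plain split on spaces leaves punctuation attached to the neighbouring word. A small sketch of the effect, using a made-up sample line (not taken from the actual source.txt) and an optional clean-up step that is my own addition:

```python
import string

# Hypothetical sample line; not taken from the actual source.txt.
line = "the dog, the cat and the mouse."

# Splitting on spaces keeps punctuation glued to the words.
raw_tokens = line.split(' ')

# Optional clean-up: strip leading/trailing punctuation from each token.
clean_tokens = [tok.strip(string.punctuation) for tok in raw_tokens]

print('dog' in raw_tokens)    # False: the token is 'dog,'
print('dog' in clean_tokens)  # True
```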
With those assumptions in place:
import pandas as pd
from collections import defaultdict

compare = 'compare.txt'
source = 'source.txt'
rslt = defaultdict(list)

def getCompareTxt(fid: str) -> list:
    # Read the compare words, one per line, lower-cased and without the newline
    clist = []
    with open(fid, 'r') as cmpFile:
        for line in cmpFile.readlines():
            clist.append(line.lower().strip('\n'))
    return clist

cmpList = getCompareTxt(compare)
if cmpList:
    # Flatten the source file into a single list of space-separated words
    with open(source, 'r') as fsrc:
        items = []
        for item in (line.strip().split(' ') for line in fsrc):
            items.extend(item)
    print(items)
    # Record each compare word together with its state
    for cmpItm in cmpList:
        rslt['Name'].append(cmpItm)
        if cmpItm in items:
            rslt['State'].append('Present')
        else:
            rslt['State'].append('Not Present')
    df = pd.DataFrame(rslt, index=range(len(cmpList)))
    print(df)
else:
    print('No compare data present')
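For comparison, here is a more compact sketch of the same idea, assuming both files are small enough to hold in memory. The function name `word_status` and the inline sample data are made up for illustration; they stand in for the contents of compare.txt and source.txt. Unlike the code above, this variant lower-cases the source and ignores attached punctuation, and it uses a set for the lookups.

```python
import re
import pandas as pd

def word_status(compare_words, source_text):
    """Return a DataFrame with one row per compare word and its state."""
    # Lower-case the source and pull out runs of letters/digits/underscores,
    # so 'Dog,' and 'dog' both count as 'dog'.
    source_words = set(re.findall(r"\w+", source_text.lower()))
    rows = []
    for word in compare_words:
        word = word.strip().lower()
        if not word:
            continue  # skip blank lines
        state = 'Present' if word in source_words else 'Not Present'
        rows.append((word, state))
    return pd.DataFrame(rows, columns=['Name', 'State'])

# Hypothetical stand-ins for the contents of compare.txt and source.txt.
compare_words = ['dog', 'cat', 'horse']
source_text = "The dog chased the cat.\nA mouse watched."

df = word_status(compare_words, source_text)
print(df)
```

With real files, the first argument could be the lines of compare.txt and the second the full text of source.txt. Building the rows as a list of tuples and passing `columns` to the DataFrame constructor avoids having to keep two lists in step.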