I have a list of about 10,000 3-gram terms in a .txt file. I want to match these terms across multiple files in a directory and count the number of occurrences of each term in each file.
One of the files looks like this:
ntdll.dll+0x1e8bd ntdlll.dll+0x11a7 ntdll.dll+0x1e6f4 kernel32.dll+0xaa7f kernel32.dll+0xb50b ntdlll.dll+0x1e8 bd ntdll+0x11a5 ntdll.dll+0X1a6f4 kernel32.dll+0xaa7f kernel 32.dll+0xb50 b ntdll.dll+0x1e8abd ntdll.dll+0x11a1a7 ntdll.dll+0x1e6f4 kernel32.dll+0xaa7 kernel32.dll+0xb50-b ntdll dll.dll+0x1e8bd ntdll.dll+0x11a7 ntdll.dll+0x1e6f4 kernel32.dll+0xaa7f kernel32.dll+0xb50bkernel32.dll+0xb511 kernel32.dll+0x16d4f
I would like the output in a dataframe like this:
N_gram_term_1 N_gram_term_2 ............ N_gram_term_n
2 1 0
3 2 4
3 0 3
Here row 2 gives the count of each N_gram term in the first file, and row 3 the counts in the second file.
Please let me know if I need to clarify anything.
I believe ready-made implementations exist for this purpose, perhaps in sklearn. A simple from-scratch implementation, though, is:
import sys
import pandas

d = {}  # outer key = file name, inner key = 3-gram term, value = count
for file in sys.argv[1:]:  # these are all the files to be analyzed
    d[file] = {}  # the value here is a nested dictionary of term counts
    with open(file) as f:  # opening each file in turn
        for line in f:  # going through every row of the file
            for g in line.split():  # terms are whitespace-separated
                if g in d[file]:
                    d[file][g] += 1
                else:
                    d[file][g] = 1

print(pandas.DataFrame(d).T.fillna(0))  # rows = files, columns = terms
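Note that the loop above counts every token it encounters, not just the ~10,000 terms from your .txt list. A minimal sketch of restricting the counts to that list (the `count_terms` helper and the toy data are my own, not from your files) could look like:

```python
from collections import Counter

def count_terms(text, terms):
    """Count occurrences of each term from `terms` in `text`;
    terms absent from the text get a count of 0."""
    wanted = set(terms)  # set membership test is O(1) per token
    counts = Counter(tok for tok in text.split() if tok in wanted)
    return {t: counts[t] for t in terms}

# toy stand-ins for the real term list and file contents
terms = ["ntdll.dll+0x1e8bd", "kernel32.dll+0xaa7f", "missing.dll+0x1"]
text = "ntdll.dll+0x1e8bd kernel32.dll+0xaa7f ntdll.dll+0x1e8bd"
print(count_terms(text, terms))
# {'ntdll.dll+0x1e8bd': 2, 'kernel32.dll+0xaa7f': 1, 'missing.dll+0x1': 0}
```

Building one such dict per file and passing the resulting dict-of-dicts to `pandas.DataFrame(...).T` gives the file-by-term table you describe, with zeros already filled in for terms that never appear.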