I have a dataset in CSV format that contains lists of tokens, like this:
song, tokens
aaa,"['everyon', 'pict', 'becom', 'somebody', 'know']"
bbb,"['tak', 'money', 'tak', 'prid', 'tak', 'littl']"
First, I want to find all the words that appear in the texts at least a certain number of times, say 5, which is easy to do:
import pandas as pd

# The converter rebuilds a real list of tokens from the stringified list
lyrics = pd.read_csv('dataset.csv',
                     converters={'tokens': lambda x: x.strip("[]").replace("'", "").split(", ")})
# List of all words
allwords = [word for tokens in lyrics['tokens'] for word in tokens]
allwords = pd.DataFrame(allwords, columns=['word'])
# Keep the words that occur at least 5 times overall
more5 = allwords[allwords.groupby("word")["word"].transform('size') >= 5]
more5 = set(more5['word'])
frequentwords = [token.strip() for token in more5]
frequentwords.sort()
Now, for each list of tokens, I want to remove the tokens that do not appear in frequentwords. To do that I use the following code:
import numpy as np
from multiprocessing import Pool

def remove_non_frequent(x):
    # Keep only the tokens that are in frequentwords (a list, so each lookup is a linear scan)
    global frequentwords
    output = []
    for token in x:
        if token in frequentwords:
            output.append(token)
    return output

def remove_on_chunk(df):
    df['tokens'] = df.apply(lambda x: remove_non_frequent(x['tokens']), axis=1)
    return df

def parallelize_dataframe(df, func, n_split=10, n_cores=4):
    # Split the frame into chunks and process them in a pool of workers
    df_split = np.array_split(df, n_split)
    pool = Pool(n_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

lyrics_reconstructed = parallelize_dataframe(lyrics, remove_on_chunk)
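The non-multiprocessing version takes about 2.30-3 hours to compute, while this version takes 1 hour.
Of course this is a slow process, since I have to look up about 130 million tokens in a list of 30k elements, but I am fairly sure my code is not particularly good.
Is there a faster and better way to achieve something like this?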
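Go for set operations. I have saved your sample data to a file named "tt1", so this should work. Also, if you are generating the data yourself in some way, do yourself a favour and drop the quotes and square brackets. It will save you time in pre-processing.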
from collections import Counter
import re

rgx = re.compile(r"[\[\]\"' \n]")  # data cleanup: brackets, quotes, spaces, newlines

# load and pre-process the data
counter = Counter()
data = []
with open('tt1', 'r') as o:
    o.readline()  # skip the header line
    for line in o:
        parts = line.split(',')
        # a set, so each word is counted at most once per song
        clean_parts = {re.sub(rgx, "", i) for i in parts[1:]}
        counter.update(clean_parts)
        data.append((parts[0], clean_parts))

n = 2  # <- here set threshold for number of occurrences
common_words = {i[0] for i in counter.items() if i[1] > n}

# process the data
clean_data = []
for s, r in data:
    # set difference drops the common words; use r & common_words to keep them instead
    clean_data.append((s, r - common_words))
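To tie this back to the code in the question: the expensive part there is `token in frequentwords` against a 30k-element list, which scans the list for every token. Below is a minimal sketch of the same filter with a set lookup instead. The sample data, the threshold, and the helper name `keep_frequent` are made up for illustration and are not taken from the question.

```python
from collections import Counter

# Hypothetical sample data standing in for the parsed 'tokens' column.
songs = {
    "aaa": ["tak", "money", "tak", "prid"],
    "bbb": ["tak", "money", "littl"],
}

# Count total occurrences of every token across all songs.
counts = Counter(tok for toks in songs.values() for tok in toks)

min_count = 2  # assumed threshold for this toy data; the question uses 5
frequent = {tok for tok, c in counts.items() if c >= min_count}


def keep_frequent(tokens, frequent_set):
    """Keep tokens found in frequent_set, preserving order and duplicates."""
    return [tok for tok in tokens if tok in frequent_set]


filtered = {song: keep_frequent(toks, frequent) for song, toks in songs.items()}
print(filtered)
```

Unlike the set difference above, the list comprehension keeps repeated tokens and their original order, while each membership test is a hash lookup on average rather than a scan.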
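It has been a while, but I will post the correct solution to the problem, thanks to Marek, since it is only a slight modification of his code. He uses sets, which cannot handle duplicates, so the obvious idea is to reuse the same code with multisets instead. I used this implementation: https://pypi.org/project/multiset/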
from collections import Counter
import re
from multiset import Multiset

rgx = re.compile(r"[\[\]\"' \n]")  # data cleanup: brackets, quotes, spaces, newlines

# load and pre-process the data
counter = Counter()
data = []
with open('tt1', 'r') as o:
    o.readline()  # skip the header line
    for line in o:
        parts = line.split(',')
        # a list this time, so repeated words are counted every time
        clean_parts = [re.sub(rgx, "", i) for i in parts[1:]]
        counter.update(clean_parts)
        ms = Multiset()
        for word in clean_parts:
            ms.add(word)
        data.append([parts[0], ms])

n = 2  # <- here set threshold for number of occurrences
common_words = Multiset()
# I'm using intersection with the most common words since
# common_words is way smaller than uncommon_words.
# Intersection returns the lowest count between the two multisets,
# e.g. ('sky', 10) and ('sky', 1) will produce ('sky', 1).
# I want the number of repeated words in my document, so I set the
# common words counter to be very high.
for item in counter.items():
    if item[1] >= n:
        common_words.add(item[0], 100)

# process the data
clean_data = []
for s, r in data:
    clean_data.append((s, r.intersection(common_words)))

# expand each multiset back into a flat list of tokens
output_data = []
for s, ms in clean_data:
    tokens = []
    for item in ms.items():
        for i in range(0, item[1]):
            tokens.append(item[0])
    output_data.append([s] + [tokens])
This code extracts the most frequent words and filters each document against that list. On a 110 MB dataset it does the job in less than 2 minutes.
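The intersection step relies on the minimum-count rule described in the comments. Here is a small sketch of that rule, using the standard library's `collections.Counter` as a stand-in for `Multiset` (a substitution made for illustration only, not what the code above uses). The words and counts are made up.

```python
from collections import Counter

# Hypothetical document: 'sky' appears three times, the others once.
doc = Counter(["sky", "sky", "sky", "rare", "tak"])

# Common words get an artificially high count, as in the code above,
# so the document's own count is always the smaller one.
common = Counter({"sky": 100, "tak": 100})

# '&' keeps each element with the minimum of the two counts;
# elements missing from either side are dropped.
kept = doc & common

# Expand back into a flat token list (sorted only to make the order predictable).
tokens = sorted(kept.elements())
print(tokens)
```

The word 'rare' disappears because its count in `common` is zero, while 'sky' keeps all three occurrences from the document.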