Selecting IDs/rows from data based on n-grams

I have the following dataset:

ID       Text
12     Coolest fan we’ve ever seen.
12     SHARE this with anyone you know who can use this tip!
31     Time for a Royal Celebration! Save the date.
54     The way to a sports fan’s heart? Behind-the-scenes content from their favourite teams.
419    Start asking your questions now for tomorrow’s LIVE Q&A on careers you can do without going to university.
451    Save the date, we’re hosting a fabulous & fun meetup at Coffee Bar Bryant on 9/20. Stay tuned

I use ngrams to analyse the text and the frequency of words/sentences.

from nltk import ngrams

text = df.Text.tolist()
list_n = []

for i in text:
    n_grams = ngrams(i.split(), 3)
    for grams in n_grams:
        list_n.append(grams)
list_n

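Since I am interested in finding which texts a specific word or sequence of words is used in, I need to create an association between each text (i.e. its ID) and the specific ngrams it contains. For example: I am interested in finding the texts that contain "Save the date", i.e. ID=31 and ID=451. To find the n-grams around a word, I have been using this: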

def ngram_filter(col, word, n):
    tokens = col.split()
    all_ngrams = ngrams(tokens, n)
    filtered_ngrams = [x for x in all_ngrams if word in x]
    return filtered_ngrams

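However, I do not know how to find the ID associated with a text, or how to select more than one word in the function above.

How could I do this? Any ideas?

Feel free to change the tags if needed. Thanks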

I don't have much experience with ngrams, but you can get what you want with str.contains, for example:

import re

txt = 'save the date'
print(f'The ngrams "{txt}" is in IDs: ',
      df.loc[df['Text'].str.contains(txt, flags=re.IGNORECASE), 'ID'].tolist())
The ngrams "save the date" is in IDs:  [31, 451]
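
If you need to check several phrases at once, the same str.contains idea can be wrapped in a small helper. This is only a sketch: `ids_containing` is a name I made up, the DataFrame is rebuilt from the rows in the question, and `re.escape` is added on the assumption that the phrases should be matched literally rather than as regular expressions.

```python
import re
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "ID": [12, 12, 31, 54, 419, 451],
    "Text": [
        "Coolest fan we’ve ever seen.",
        "SHARE this with anyone you know who can use this tip!",
        "Time for a Royal Celebration! Save the date.",
        "The way to a sports fan’s heart? Behind-the-scenes content from their favourite teams.",
        "Start asking your questions now for tomorrow’s LIVE Q&A on careers you can do without going to university.",
        "Save the date, we’re hosting a fabulous & fun meetup at Coffee Bar Bryant on 9/20. Stay tuned",
    ],
})

def ids_containing(frame, phrases):
    """Hypothetical helper: map each phrase to the IDs whose Text contains it."""
    result = {}
    for phrase in phrases:
        # re.escape keeps characters such as "&" or "?" from being read as regex syntax.
        mask = frame["Text"].str.contains(re.escape(phrase), flags=re.IGNORECASE, regex=True)
        result[phrase] = frame.loc[mask, "ID"].tolist()
    return result

found = ids_containing(df, ["save the date", "sports fan"])
print(found)  # {'save the date': [31, 451], 'sports fan': [54]}
```

Note that this is a plain substring match, so a phrase can also match inside a longer word (for example "fan" would match "fantastic"). If that matters, an n-gram based approach on whole tokens is safer.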

Another option could be to do it like this, though the performance may not be good:

txt = 'save the date'
df_ = (df.assign(ngrams=df.Text.str.replace(r'[^\w\s]+', '', regex=True)  # remove punctuation
                          .str.lower()   # lower case
                          .str.split()   # split on whitespace
                          .apply(lambda x: list(ngrams(x, 3))))
         .explode('ngrams'))  # create a row per ngram
print(df_.loc[df_['ngrams'].isin(ngrams(txt.lower().split(), 3))])
    ID                                               Text             ngrams
2   31       Time for a Royal Celebration! Save the date.  (save, the, date)
5  451  Save the date, we’re hosting a fabulous & fun ...  (save, the, date)
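
For the part of the question about getting the IDs back for a phrase of more than one word, one possible sketch is to build a lookup from each n-gram to the set of IDs it occurs in. Everything here is an assumption for illustration: `word_ngrams` is a small stand-in for `nltk.ngrams` so the snippet runs without nltk, and `tokenize` and `build_index` are names I made up. The rows are the ones from the question.

```python
import re
from collections import defaultdict

# (ID, Text) pairs copied from the question.
rows = [
    (12, "Coolest fan we’ve ever seen."),
    (12, "SHARE this with anyone you know who can use this tip!"),
    (31, "Time for a Royal Celebration! Save the date."),
    (54, "The way to a sports fan’s heart? Behind-the-scenes content from their favourite teams."),
    (419, "Start asking your questions now for tomorrow’s LIVE Q&A on careers you can do without going to university."),
    (451, "Save the date, we’re hosting a fabulous & fun meetup at Coffee Bar Bryant on 9/20. Stay tuned"),
]

def word_ngrams(tokens, n):
    """Stand-in for nltk.ngrams: all runs of n consecutive tokens, as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

def tokenize(text):
    # Lower-case and drop punctuation so "date." and "date," both become "date".
    return re.sub(r"[^\w\s]+", "", text.lower()).split()

def build_index(rows, n):
    """Map every n-gram to the set of IDs whose text contains it."""
    index = defaultdict(set)
    for row_id, text in rows:
        for gram in word_ngrams(tokenize(text), n):
            index[gram].add(row_id)
    return index

index = build_index(rows, 3)
query = tuple(tokenize("Save the date"))
print(sorted(index.get(query, set())))  # [31, 451]
```

Building the index once makes repeated lookups cheap, at the cost of holding every n-gram in memory. For a different phrase length, build the index with a matching n.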
