Select IDs/rows in the data based on n-grams



I have the following dataset:

ID       Text
12     Coolest fan we’ve ever seen.
12     SHARE this with anyone you know who can use this tip!
31     Time for a Royal Celebration! Save the date.
54     The way to a sports fan’s heart? Behind-the-scenes content from their favourite teams.
419    Start asking your questions now for tomorrow’s LIVE Q&A on careers you can do without going to university.
451    Save the date, we’re hosting a fabulous & fun meetup at Coffee Bar Bryant on 9/20. Stay tuned 

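I am using ngrams to analyse the text and the frequency of words/sentences.

from nltk import ngrams

text = df.Text.tolist()
list_n = []

for i in text:
    # collect every trigram of the whitespace-separated tokens
    n_grams = ngrams(i.split(), 3)
    for grams in n_grams:
        list_n.append(grams)
list_n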

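Since I am interested in finding which texts use a specific word or sequence of words, I need to create an association between the texts (i.e. the ID) and the texts containing a specific n-gram. For example, I am interested in finding the texts that include "Save the date", i.e. ID=31 and ID=451. To find the n-grams that contain a given word, I have been using this:

def ngram_filter(col, word, n):
    tokens = col.split()
    all_ngrams = ngrams(tokens, n)
    # keep only the n-grams that contain the given word
    filtered_ngrams = [x for x in all_ngrams if word in x]
    return filtered_ngrams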

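However, I do not know how to find the ID associated with each text, or how to select more than one word in the function above.

How could I do this? Any ideas?

Please feel free to change the tags if needed. Thanks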

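I don't have much experience with ngrams, but you can get what you want with str.contains, for example:

import re

txt = 'save the date'
# case-insensitive search of the Text column, returning the matching IDs
print(f'The ngrams "{txt}" is in IDs: ',
      df.loc[df['Text'].str.contains(txt, flags=re.IGNORECASE), 'ID'].tolist())
The ngrams "save the date" is in IDs:  [31, 451]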
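One caveat with the str.contains approach is that the search string is treated as a regular expression and can match inside longer words. A minimal sketch of a stricter pattern, using only the standard re module, is shown here; the helper name phrase_pattern is hypothetical and not part of pandas or nltk.

```python
import re

def phrase_pattern(phrase):
    """Build a case-insensitive regex that matches the phrase on word boundaries."""
    # re.escape protects any regex metacharacters that appear in the phrase
    words = [re.escape(w) for w in phrase.split()]
    return re.compile(r"\b" + r"\s+".join(words) + r"\b", flags=re.IGNORECASE)

pattern = phrase_pattern("save the date")
print(bool(pattern.search("Time for a Royal Celebration! Save the date.")))  # True
print(bool(pattern.search("Unsave the dates")))  # False
```

If this suits your data, the pattern string (pattern.pattern) could presumably be passed to df['Text'].str.contains together with flags=re.IGNORECASE, in place of the raw txt.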

Another option could be the following, but performance may not be good:

txt = 'save the date'
df_ = (df.assign(ngrams=df.Text.str.replace(r'[^\w\s]+', '', regex=True)  # remove punctuation
                          .str.lower()  # lower case
                          .str.split()  # split on whitespace
                          .apply(lambda x: list(ngrams(x, 3))))
         .explode('ngrams'))  # create a row per ngram
print(df_.loc[df_['ngrams'].isin(list(ngrams(txt.lower().split(), 3)))])
ID                                               Text             ngrams
2   31       Time for a Royal Celebration! Save the date.  (save, the, date)
5  451  Save the date, we’re hosting a fabulous & fun ...  (save, the, date)
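To keep the link between each n-gram and its ID without exploding a DataFrame, one possibility is an inverted index that maps every n-gram to the set of IDs it occurs in. The sketch below is only an illustration under some assumptions: it uses plain (ID, Text) tuples instead of df, a small zip-based helper in place of nltk.ngrams, and hypothetical function names (tokenize, word_ngrams, build_index, ids_for_phrase).

```python
import re
from collections import defaultdict

# A few sample rows copied from the question as (ID, Text) pairs.
rows = [
    (12, "Coolest fan we’ve ever seen."),
    (12, "SHARE this with anyone you know who can use this tip!"),
    (31, "Time for a Royal Celebration! Save the date."),
    (451, "Save the date, we’re hosting a fabulous & fun meetup at Coffee Bar Bryant on 9/20. Stay tuned"),
]

def tokenize(text):
    """Lower-case the text and keep runs of word characters and apostrophes."""
    return re.findall(r"[\w’']+", text.lower())

def word_ngrams(tokens, n):
    """Return all n-grams of a token list as tuples (same idea as nltk.ngrams)."""
    return list(zip(*(tokens[i:] for i in range(n))))

def build_index(rows, n):
    """Map every n-gram to the set of IDs whose text contains it."""
    index = defaultdict(set)
    for row_id, text in rows:
        for gram in word_ngrams(tokenize(text), n):
            index[gram].add(row_id)
    return index

def ids_for_phrase(index, phrase):
    """Look up the IDs for a phrase whose word count matches the index's n."""
    return sorted(index.get(tuple(tokenize(phrase)), set()))

index = build_index(rows, 3)
print(ids_for_phrase(index, "Save the date"))  # [31, 451]
```

The index is built once and can then be queried for any number of phrases, which may help if you need to look up many word sequences. A phrase whose length differs from n simply returns an empty list.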
