Split column names and run wordnet.synsets() on multiple words instead of one



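I am trying to get a list of synonyms for each word in my column names. However, when I run wordnet.synsets(), it only handles column names that consist of a single word. How can I run it on multiple words and produce output like the desired output below? Also, is there a way to show only the first 4 results for better readability?

Code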

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import pandas as pd

df = ['Unnamed 0',
      'business id',
      'name',
      'postal code',
      ]
syns = {w: [] for w in df}
for k, v in syns.items():
    for synset in wordnet.synsets(k):
        for lemma in synset.lemmas():
            if lemma.name() not in syns:
                v.append(lemma.name())
pd.DataFrame([syns], columns=syns.keys())

Current output:

Unnamed 0   business id   name                                                postal code
[]          []            [gens, figure, public_figure, epithet, call, i...   []

Desired output:

Unnamed 0               business id               name                            postal code
Unnamed[definitions],   business[definitions],    [gens, figure, public_figure]   postal[definitions],
0[definitions]          id[definitions]                                           code[definitions]

A simpler, more usable approach

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import nltk
import numpy as np  # needed for np.unique below
import pandas as pd

df = ['Unnamed 0',
      'business id',
      'name',
      'postal code',
      ]
# one column per (column name, word) pair; rows are the unique single-word lemmas
df = pd.DataFrame(
    {tuple([k, t]): pd.Series(np.unique([l.name()
                                         for s in wordnet.synsets(t)
                                         for l in s.lemmas() if "_" not in l.name()])).to_dict()
     for k in df
     for t in nltk.word_tokenize(k)
     }).fillna("")
df.columns.set_names(["sentence", "word"], inplace=True)
df.loc[:4]  # just first 5 matches...
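
The question asked for only the first 4 results per word, while df.loc[:4] above keeps 5 rows. Below is a minimal sketch of one way to cap the list per word before building any DataFrame. The helper names first_n_synonyms and wordnet_lemma_names are hypothetical (not part of NLTK or pandas), and the lookup is passed in as a parameter so the splitting logic can be tried without the WordNet corpus installed. Unlike np.unique, this keeps WordNet's own ordering rather than sorting alphabetically.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def wordnet_lemma_names(word: str) -> List[str]:
    """Return every lemma name WordNet lists for `word` (may contain duplicates)."""
    # Imported lazily so first_n_synonyms can be used with another lookup
    # when NLTK or its WordNet corpus is not available.
    from nltk.corpus import wordnet
    return [lemma.name()
            for synset in wordnet.synsets(word)
            for lemma in synset.lemmas()]


def first_n_synonyms(
    column_names: Iterable[str],
    lookup: Callable[[str], List[str]] = wordnet_lemma_names,
    n: int = 4,
) -> Dict[Tuple[str, str], List[str]]:
    """Map (column name, word) to at most `n` unique names returned by `lookup`."""
    result: Dict[Tuple[str, str], List[str]] = {}
    for column in column_names:
        # Assumption: words in a column name are separated by whitespace.
        for word in column.split():
            # dict.fromkeys drops duplicates while keeping first-seen order.
            unique = list(dict.fromkeys(lookup(word)))
            result[(column, word)] = unique[:n]
    return result
```

With NLTK and the WordNet corpus installed, first_n_synonyms(['Unnamed 0', 'business id', 'name', 'postal code']) should return a dict keyed by (column, word) pairs. Because the lists can differ in length, wrapping each one in pd.Series before passing the dict to pd.DataFrame is one way to get the same two-level column layout as above.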

Just change the list/dict comprehension to meet the pandas format {"colA":[1,2], "colB":[3,4]}

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import nltk
import numpy as np  # needed for np.unique below
import pandas as pd

df = ['Unnamed 0',
      'business id',
      'name',
      'postal code',
      ]
# largest number of space-delimited words in any column name
mr = max([len(k.split(" ")) for k in df])
pd.DataFrame(
    # one column for each space-delimited column name
    # use an f-string to format each cell as "word:[synonyms]"
    {k: [f"{v}:{np.unique([l.name() for s in wordnet.synsets(v) for l in s.lemmas()]).tolist()}"
         # pad names with fewer tokens so every column has the same number of rows, as pandas requires
         for v in f"{k}{(mr-len(k.split(' ')))*' '}".split(" ")]
     for k in df}).replace({":[]": ""})

Output

Unnamed 0   business id name    postal code
0   Unnamed:['nameless', 'unidentified', 'unknown'...   business:['business', 'business_concern', 'bus...   name:['advert', 'appoint', 'bring_up', 'call',...   postal:['postal']
1   0:['0', 'cipher', 'cypher', 'nought', 'zero']   id:['Gem_State', 'I.D.', 'ID', 'Idaho', 'id']       code:['cipher', 'code', 'codification', 'compu...
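
The padding in the code above exists because pandas needs every column list in a dict such as {"colA":[1,2], "colB":[3,4]} to have the same length, which is why the 'name' column gets an empty second row. As a small sketch of that step in isolation, the hypothetical helper pad_tokens below (not from pandas or NLTK) does the same padding with lists instead of appended spaces, assuming words are separated by whitespace.

```python
from typing import Dict, Iterable, List


def pad_tokens(column_names: Iterable[str], fill: str = "") -> Dict[str, List[str]]:
    """Split each name into words and pad the shorter lists with `fill`."""
    tokens = {name: name.split() for name in column_names}
    # Length of the longest token list; 0 when no names are given.
    width = max((len(words) for words in tokens.values()), default=0)
    return {name: words + [fill] * (width - len(words))
            for name, words in tokens.items()}
```

Each padded list can then be turned into "word:[synonyms]" strings as in the answer, and the resulting dict passed straight to pd.DataFrame.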
