How to lemmatize a dataframe column in Python



How do I lemmatize a dataframe column? The CSV file "train.csv" looks like this:

id  tweet
1   retweet if you agree
2   happy birthday your majesty
3   essential oils are not made of chemicals

I did the following:

import pandas as pd
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
train_data = pd.read_csv('train.csv', error_bad_lines=False)
print(train_data)
# Removing stop words
stop = stopwords.words('english')
test = pd.DataFrame(train_data['tweet'])
test.columns = ['tweet']
test['tweet_without_stopwords'] = test['tweet'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
print(test['tweet_without_stopwords'])
# TOKENIZATION
tt = TweetTokenizer()
test['tokenised_tweet'] = test['tweet_without_stopwords'].apply(tt.tokenize)
print(test)

Output:

0 retweet if you agree ... [retweet, agree]
1 happy birthday your majesty ... [happy, birthday, majesty]
2 essential oils are not made of chemicals ... [essential, oils, made, chemicals]

I tried the following to lemmatize, but I get this error: TypeError: unhashable type: 'list'


lmtzr = WordNetLemmatizer()
lemmatized = [[lmtzr.lemmatize(word) for word in test['tokenised_tweet']]]
print(lemmatized)

The comprehension in your last snippet iterates over the Series itself, so each "word" is an entire list of tokens, and lemmatize raises the unhashable-type error when WordNet tries to hash that list. I would do the computation on the dataframe itself.
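A minimal reproduction of the failure (the toy Series here is my own illustration):

import pandas as pd
from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()
s = pd.Series([['retweet', 'agree'], ['happy', 'birthday', 'majesty']])

# Iterating a Series of lists yields the lists themselves, not the words,
# so each "word" is really a whole row:
for word in s:
    print(type(word))        # <class 'list'>

# lmtzr.lemmatize(s[0])      # TypeError: unhashable type: 'list'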

Change:

lmtzr = WordNetLemmatizer()
lemmatized = [[lmtzr.lemmatize(word) for word in test['tokenised_tweet']]]
print(lemmatized)

to:

lmtzr = WordNetLemmatizer()
test['lemmatize'] = test['tokenised_tweet'].apply(
    lambda lst: [lmtzr.lemmatize(word) for word in lst])

Full code:

from io import StringIO

import pandas as pd
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

data = StringIO(
"""id;tweet
1;retweet if you agree
2;happy birthday your majesty
3;essential oils are not made of chemicals"""
)
test = pd.read_csv(data, sep=";")

# Removing stop words
stop = stopwords.words('english')
test['tweet_without_stopwords'] = test['tweet'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in stop]))
print(test['tweet_without_stopwords'])

# TOKENIZATION
tt = TweetTokenizer()
test['tokenised_tweet'] = test['tweet_without_stopwords'].apply(tt.tokenize)
print(test)

# LEMMATIZATION: apply() hands each row's token list to the comprehension
lmtzr = WordNetLemmatizer()
test['lemmatize'] = test['tokenised_tweet'].apply(
    lambda lst: [lmtzr.lemmatize(word) for word in lst])
print(test['lemmatize'])

Output:

0                    [retweet, agree]
1          [happy, birthday, majesty]
2    [essential, oil, made, chemical]
Name: lemmatize, dtype: object
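Note that WordNetLemmatizer.lemmatize defaults to pos='n', which is why the nouns oils and chemicals are singularized above while the verb made is left untouched. Below is a minimal sketch that feeds a part-of-speech hint to the lemmatizer; it assumes the NLTK POS tagger resource (averaged_perceptron_tagger) has been downloaded, and wordnet_pos, lemmatize_with_pos and the lemmatize_pos column are names of my own:

import nltk
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()

def wordnet_pos(treebank_tag):
    # Map Penn Treebank tags to WordNet POS constants;
    # fall back to noun, which is also lemmatize()'s default.
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

def lemmatize_with_pos(tokens):
    return [lmtzr.lemmatize(word, wordnet_pos(tag))
            for word, tag in nltk.pos_tag(tokens)]

test['lemmatize_pos'] = test['tokenised_tweet'].apply(lemmatize_with_pos)

With the verb hint, made should now come out as make.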

Just for future reference, and not to revive an old thread.

This is how I did it; it could be improved, but it works:

import nltk

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

df['Summary'] = df['Summary'].apply(lemmatize_text)            # list of lemmas
df['Summary'] = df['Summary'].apply(lambda x: " ".join(x))     # rejoin into a string


Change the DF column name to whichever one you need; basically this tokenizes each text, lemmatizes the tokens, and rejoins them once done.
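If you prefer a single pass, the same steps can be fused into one apply. A minimal equivalent sketch (the df and its sample string are my own illustration):

import nltk
import pandas as pd

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

df = pd.DataFrame({'Summary': ['essential oils are not made of chemicals']})

# Tokenize, lemmatize and rejoin in one pass over the column.
df['Summary'] = df['Summary'].apply(
    lambda text: " ".join(lemmatizer.lemmatize(w)
                          for w in w_tokenizer.tokenize(text)))
print(df['Summary'][0])    # essential oil are not made of chemical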
