NLP stop word removal, stemming and lemmatization



def clean_text(text):
    # get English stop words
    english_stopwords = set(stopwords.words('english'))
    # change to lower case and remove punctuation
    #text = text.lower().translate(str.maketrans('', '', string.punctuation))
    text = text.map(lambda x: x.lower().translate(str.maketrans('', '', string.punctuation)))
    # divide string into individual words
    def custom_tokenize(text):
        if not text:
            #print('The text to be tokenized is a None type. Defaulting to blank string.')
            text = ''
        return word_tokenize(text)
    token = df['transcription'].apply(custom_tokenize)
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    clean_tokens = []
    for tok in tokens:
        tok = tok.strip("#")
        #tok = tok.strip() # remove space
        if tok not in english_stopwords:
            clean_tok = lemmatizer.lemmatize(tok) # lemmatization
            clean_tok = stemmer.stem(clean_tok) # stemming
            clean_tokens.append(clean_tok)
    return " ".join(clean_tokens)
     22     #tok = [[tok for tok in sent if tok not in stop] for sent in text]
     23     for tok in tokens:
---> 24         tok = tok.strip("#")
     25         #tok = tok.strip() # remove space
     26         if tok not in english_stopwords:

AttributeError: 'list' object has no attribute 'strip'

I keep getting this: AttributeError: 'list' object has no attribute 'strip'

As the message says, you are trying to strip a list. You can only strip a string, which is why Python throws this error.

Could it be that you mixed up the variables "token" and "tokens"?

Lemmatization already covers what stemming does, so there is no need to do both.
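Putting those two points together, here is a minimal sketch of what the fixed function might look like (an assumption about the intended behaviour, not the original author's code): clean_text takes a single string, drops the stemmer, and is meant to be applied row by row.

import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

def clean_text(text):
    english_stopwords = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    # guard against None/NaN rows by falling back to an empty string
    if not isinstance(text, str):
        text = ''
    # lower case and strip punctuation
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    # tokenize one string, remove stop words, lemmatize
    tokens = word_tokenize(text)
    clean_tokens = [lemmatizer.lemmatize(tok) for tok in tokens
                    if tok not in english_stopwords]
    return " ".join(clean_tokens)

# each call now receives one string instead of a Series of token lists
#df['transcription'] = df['transcription'].apply(clean_text)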

Stemming can change the meaning of a word. For example, "pies" will become "pi", whereas lemmatization preserves the meaning and identifies the root word "pie".
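A quick way to check that difference (a small sketch, assuming the NLTK wordnet data can be downloaded) is to run both on "pies" directly:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')   # needed by WordNetLemmatizer; comment out if already downloaded

print(PorterStemmer().stem('pies'))            # 'pi'  - not a real word
print(WordNetLemmatizer().lemmatize('pies'))   # 'pie' - the dictionary root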

I assume your data is in a pandas DataFrame. So if you are preprocessing text data for an NLP problem, here is my solution for doing stop word removal and lemmatization in a more elegant way:

import pandas as pd
import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim.utils import lemmatize

nltk.download('stopwords') # comment out if already downloaded
nltk.download('punkt')     # comment out if already downloaded

df = pd.read_csv('/path/to/text_dataset.csv')

# convert to lower case
df = df.apply(lambda x: x.str.lower())
# replace special characters (preserving only space)
df = df.apply(lambda x: [re.sub('[^a-z0-9]', ' ', i) for i in x])
# tokenize columns
df = df.apply(lambda x: [word_tokenize(i) for i in x])
# remove stop words from token list in each column
df = df.apply(
    lambda x: [
        [w for w in tokenlist if w not in stopwords.words('english')]
        for tokenlist in x])
# lemmatize columns
# the lemmatize method may fail during the first 3 to 4 iterations,
# so try running it several times
for attempt in range(1, 11):
    try:
        print(f'Lemmatize attempt: {attempt}')
        df = df.apply(
            lambda x: [[l.decode('utf-8').split('/', 1)[0]
                        for word in tokenlist for l in lemmatize(word)]
                       for tokenlist in x])
        print(f'Attempt {attempt} success!')
        break
    except:
        pass

gensim.utils needs the pattern package to run lemmatize(). If you don't have it yet, install it with

pip install pattern

The gensim lemmatizer returns a list of byte strings as output, together with the POS (part-of-speech) tag. For example, "finding" is converted to [b'find/VB']. I added an extra loop to decode the byte strings into plain text and strip off the POS tag.
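As a small illustration of that decoding step (this assumes a gensim version older than 4.0, where gensim.utils.lemmatize is still available):

from gensim.utils import lemmatize   # assumes gensim < 4.0 and the pattern package installed

raw = lemmatize('finding')                                # e.g. [b'find/VB']
words = [l.decode('utf-8').split('/', 1)[0] for l in raw]
print(words)                                              # ['find']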

If some of your columns contain non-text data, apply the transformations only to the text columns:

textcols = ['column1', 'column2', 'column3']
df[textcols] = df[textcols].apply(lambda x: ... )
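For instance, restricting just the lower-casing step to those columns could look like this (a small sketch reusing the textcols list above; the column names are hypothetical):

# apply the same per-column lambda as before, but only to the text columns
df[textcols] = df[textcols].apply(lambda x: x.str.lower())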

Note: if you are applying these to just one column, here is the modified version.

df['column'] = df['column'].apply(lambda x: x.lower())
df['column'] = df['column'].apply(lambda x: re.sub('[^a-z0-9]', ' ', x))
df['column'] = df['column'].apply(lambda x: word_tokenize(x))
df['column'] = df['column'].apply(
    lambda x: [token for token in x
               if token not in stopwords.words('english')])
for attempt in range(1, 11):
    try:
        print(f'Lemmatize attempt: {attempt}')
        df['column'] = df['column'].apply(
            lambda x: [l.decode('utf-8').split('/', 1)[0]
                       for word in x for l in lemmatize(word)])
        print(f'Attempt {attempt} success!')
        break
    except:
        pass
