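I have a dataset in which one column contains numbers mixed in with sentences. I want to remove the numbers, strip all punctuation, remove all stop words, and then return a list of the cleaned text.
I tried using a regex to remove the numbers: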
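import pandas as pd
import nltk
import re
import string  # needed for string.punctuation below

df = pd.read_excel("samplefinal.xlsx")
# Strip runs of digits from every comment
df['comments'] = df['comments'].str.replace(r'\d+', '', regex=True)
mess = df["comments"]

from nltk.corpus import stopwords

def text_process(mess):
    # Drop punctuation characters, then rebuild the string
    nopunc = [char for char in mess if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    # Keep only the words that are not English stop words
    return [word for word in nopunc.split()
            if word.lower() not in stopwords.words('english')]

df["comments"].apply(text_process)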
Data:
ID     Name  comments
28930  poil  The host canceled this reservation 24 days before arrival. This is an automated posting.
7389   opil  This apartment is very clean and is perfect for 2, is 10 mins walking from the Tabata
Error message from the code above: '''
TypeError                                 Traceback (most recent call last)
<ipython-input-22-ab6d2299296f> in <module>
----> 1 df["comments"].apply(text_process)

~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   3589             else:
   3590                 values = self.astype(object).values
-> 3591                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   3592
   3593             if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

<ipython-input-21-971b567ffb47> in text_process(mess)
      1 def text_process(mess):
----> 2     nopunc = [char for char in mess if char not in string.punctuation]
      3     #nopunc = [char for char in mess if char not in string.punctuation]
      4     nopunc = ''.join(nopunc)
      5     return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]

TypeError: 'float' object is not iterable
'''
Expected:
ID     Name  comments
28930  poil  [host, canceled, reservation, days, before, arrival, automated, posting]
7389   opil  [apartment, clean, perfect, mins, walking, Tabata]
My expected output may be slightly off, since I don't know every stop word that exists, but I hope you get the idea. Please help!
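I think this has to do with how the data is loaded or how text_process is called, because with the sample you provided, your original code works perfectly.
I tried: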
import pandas as pd
import string
from nltk.corpus import stopwords

df = pd.DataFrame({'ID': [28930, 7389], 'Name': ['poil', 'opil'], 'comments': [
    'The host canceled this reservation 24 days before arrival. This is an automated posting.',
    'This apartment is very clean and is perfect for 2, is 10 mins walking from the Tabata']})
# Strip runs of digits from every comment
df['comments'] = df['comments'].str.replace(r'\d+', '', regex=True)
mess = df["comments"]

def text_process(mess):
    # Drop punctuation characters, then rebuild the string
    nopunc = [char for char in mess if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    # Keep only the words that are not English stop words
    return [word for word in nopunc.split()
            if word.lower() not in stopwords.words('english')]

print("DATA:")
print(df)
print("RESULTS:")
for row in mess:
    print(text_process(row))
and got:
DATA:
ID Name comments
0 28930 poil The host canceled this reservation days befor...
1 7389 opil This apartment is very clean and is perfect fo...
RESULTS:
['host', 'canceled', 'reservation', 'days', 'arrival', 'automated', 'posting']
['apartment', 'clean', 'perfect', 'mins', 'walking', 'Tabata']
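Could you add the code that actually calls text_process to the question? Perhaps you are somehow passing a float argument instead of a string message?
Maybe add print(type(mess)) as the first line of the function to see when it is a float.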
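One likely cause, which is only a guess since the spreadsheet isn't shown, is empty cells in the comments column: pandas reads those as NaN, which is a float, and iterating over it raises exactly this TypeError. Here is a minimal sketch of a guard for that case. The STOP_WORDS set is a small hypothetical stand-in for stopwords.words('english') so the snippet runs without the NLTK corpus, and the two-row DataFrame is made up for illustration.

```python
import string

import numpy as np
import pandas as pd

# Hypothetical stand-in for stopwords.words('english'), so no NLTK download is needed
STOP_WORDS = {"the", "this", "is", "an", "and", "for", "from", "before", "very"}

def text_process(mess):
    # Missing cells arrive as float NaN (not str); return an empty list for them
    if not isinstance(mess, str):
        return []
    # Drop punctuation characters, then rebuild the string
    nopunc = "".join(ch for ch in mess if ch not in string.punctuation)
    # Keep only the words that are not in the stop-word set
    return [w for w in nopunc.split() if w.lower() not in STOP_WORDS]

# Made-up sample: one real comment and one missing value
df = pd.DataFrame({"comments": [
    "The host canceled this reservation 24 days before arrival.",
    np.nan,
]})
# Strip runs of digits; NaN entries stay NaN
df["comments"] = df["comments"].str.replace(r"\d+", "", regex=True)

result = df["comments"].apply(text_process)
print(result.tolist())
```

If you would rather drop those rows than get empty lists, calling df.dropna(subset=["comments"]) before apply should also work.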