I get the error "'float' object is not iterable" even though I replaced all the numbers in my dataset with spaces



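I have a dataset in which one column contains numbers mixed in with sentences. I want to remove the numbers, strip all punctuation, remove all stop words, and then return a list of the cleaned text.

I tried using a regex to replace the numbers with spaces.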

import pandas as pd
import nltk
import re
import string
df = pd.read_excel("samplefinal.xlsx")
# Remove every run of digits from the comments
df['comments'] = df['comments'].str.replace(r'\d+', '', regex=True)
mess = df["comments"]
from nltk.corpus import stopwords
def text_process(mess):
    # Drop punctuation characters, then rejoin into one string
    nopunc = [char for char in mess if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    # Keep only the words that are not English stop words
    return [word for word in nopunc.split() if word.lower() not in
    stopwords.words('english')]

df["comments"].apply(text_process)

Data:

    ID             Name                    comments
   28930           poil              The host canceled this reservation 24 
                                    days 
                                    before arrival. This is an automated 
                                    posting.
   7389             opil            This apartment is very clean and is 
                                    perfect for 2,  is 10 mins walking 
                                    from the Tabata 

Error message from the code above:

    TypeError                                 Traceback (most recent call last)
    <ipython-input-22-ab6d2299296f> in <module>
    ----> 1 df["comments"].apply(text_process)

    ~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
       3589             else:
       3590                 values = self.astype(object).values
    -> 3591                 mapped = lib.map_infer(values, f, convert=convert_dtype)
       3592
       3593         if len(mapped) and isinstance(mapped[0], Series):

    pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

    <ipython-input-21-971b567ffb47> in text_process(mess)
          1 def text_process(mess):
    ----> 2     nopunc = [char for char in mess if char not in string.punctuation]
          3     #nopunc = [char for char in mess if char not in string.punctuation]
          4     nopunc = ''.join(nopunc)
          5     return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]

    TypeError: 'float' object is not iterable

Expected:

   ID             Name                    comments
   28930           poil             [host, canceled, reservation, 
                                    days,
                                    before, arrival, automated 
                                    posting
   7389             opil            [apartment,clean, 
                                    perfect, mins, walking 
                                    Tabata 

My expected output may not be exactly right, because I don't know every stop word that exists, but I hope you get the idea. Please help!

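I think this has to do with how the data is loaded or how text_process is called, because with the sample you provided, your original code works perfectly.

I tried: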

import pandas as pd
import string
from nltk.corpus import stopwords

df = pd.DataFrame({'ID': [28930, 7389], 'Name': ['poil', 'opil'], 'comments': [
    'The host canceled this reservation 24 days before arrival. This is an automated posting.',
    'This apartment is very clean and is perfect for 2,  is 10 mins walking from the Tabata']})
# Remove every run of digits from the comments
df['comments'] = df['comments'].str.replace(r'\d+', '', regex=True)
mess = df["comments"]

def text_process(mess):
    # Drop punctuation characters, then rejoin into one string
    nopunc = [char for char in mess if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    # Keep only the words that are not English stop words
    return [word for word in nopunc.split() if word.lower() not in
            stopwords.words('english')]

print("DATA:")
print(df)
print("RESULTS:")
for row in mess:
    print(text_process(row))

and got:

DATA:
      ID  Name                                           comments
0  28930  poil  The host canceled this reservation  days befor...
1   7389  opil  This apartment is very clean and is perfect fo...
RESULTS:
['host', 'canceled', 'reservation', 'days', 'arrival', 'automated', 'posting']
['apartment', 'clean', 'perfect', 'mins', 'walking', 'Tabata']

Can you add the code that actually calls text_process to the question? Perhaps you are somehow passing a float argument instead of a string?
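For what it's worth, here is a minimal sketch of how a float could sneak in. I can't see your spreadsheet, so this assumes some cells in the comments column are empty: pandas represents missing values as NaN, which is a float, and apply would then hand that float to text_process. The two-row frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical data: the second comment is missing, like an empty Excel cell.
df = pd.DataFrame({"comments": ["Perfect for 2, is 10 mins walking", float("nan")]})

# .str.replace leaves missing values alone, so the NaN survives the digit removal.
df["comments"] = df["comments"].str.replace(r"\d+", "", regex=True)

# Record the Python type of every value in the column.
value_types = [type(v).__name__ for v in df["comments"]]
print(value_types)

# Iterating over the missing value raises the same kind of error as in the question.
try:
    [char for char in df["comments"].iloc[1]]
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

If your real data behaves the same way, the regex step is not the culprit: the digits are removed from the strings, but the missing entries stay floats.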

Maybe add print(type(mess)) as the first line of the function, to see when it is a float.
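Along the same lines, a rough sketch of checking the whole column at once rather than printing inside the function. The sample frame and the helper name non_string_rows are mine, not from your code, and filling the gaps with an empty string is only one option (dropping those rows with dropna would be another).

```python
import pandas as pd

# Hypothetical sample: one real comment and one missing value.
df = pd.DataFrame({
    "ID": [28930, 7389],
    "comments": ["This is an automated posting.", float("nan")],
})

def non_string_rows(frame, column):
    """Return the rows whose value in `column` is not a str."""
    is_text = frame[column].map(lambda value: isinstance(value, str))
    return frame[~is_text]

# Rows that would make text_process fail.
bad_rows = non_string_rows(df, "comments")
print(bad_rows)

# One possible fix: replace missing comments with an empty string
# so that text_process always receives a str.
cleaned = df["comments"].fillna("")
print([type(v).__name__ for v in cleaned])
```

After that, applying text_process to the cleaned column should return an empty list for the rows that had no comment, assuming missing values were the only non-string entries.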
