Unable to remove special characters with a regex while preprocessing text for NLTK




The code is:

X, y = data.comments, data.sentiment

followed by:

import re
from nltk.stem import WordNetLemmatizer

documents = []
stemmer = WordNetLemmatizer()

for sen in range(0, len(X)):
    # Remove all the special characters
    document = re.sub(r'\W', ' ', str(X[sen]))

    # Remove all single characters
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)

    # Remove single characters from the start
    document = re.sub(r'^[a-zA-Z]\s+', ' ', document)

    # Substitute multiple spaces with a single space
    document = re.sub(r'\s+', ' ', document, flags=re.I)

    # Remove prefixed 'b'
    document = re.sub(r'^b\s+', '', document)

    # Convert to lowercase
    document = document.lower()

    # Lemmatization
    document = document.split()
    document = [stemmer.lemmatize(word) for word in document]
    document = ' '.join(document)

    documents.append(document)

The error it returns is shown below:

KeyError                                  Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3079             try:
-> 3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 9

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-16-f9320c54a8cb> in <module>
      7 for sen in range(0, len(X)):
      8     # Remove all the special characters
----> 9     document = re.sub(r'\W', ' ', str(X[sen]))
     10 
     11     # remove all single characters

~\anaconda3\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
    822 
    823         elif key_is_scalar:
--> 824             return self._get_value(key)
    825 
    826         if is_hashable(key):

~\anaconda3\lib\site-packages\pandas\core\series.py in _get_value(self, label, takeable)
    930 
    931         # Similar to Index.get_value, but we do not fall back to positional
--> 932         loc = self.index.get_loc(label)
    933         return self.index._get_values_for_loc(self, loc, label)
    934 

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:
-> 3082                 raise KeyError(key) from err
   3083 
   3084         if tolerance is not None:

KeyError: 9

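I don't know why this isn't working; the line that removes special characters is the one that fails. I am trying to clean the data for a binary classification problem, where the target variable is y.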

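X[sen] looks the value up by index label, not by position, and the KeyError: 9 means the Series has no row labelled 9 (most likely because rows were dropped or filtered earlier, leaving gaps in the index). Change the indexing on this line from plain [] indexing to positional indexing with iloc: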

document = re.sub(r'\W', ' ', str(X.iloc[sen]))
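
To make the difference concrete, here is a minimal sketch that reproduces the error on made-up data. The Series contents and the index [0, 1, 3] are assumptions chosen only to imitate a DataFrame that has lost a row; they are not taken from the question's dataset.

```python
import re

import pandas as pd

# Hypothetical data: integer labels with a gap (no label 2),
# as happens after rows are dropped or filtered.
X = pd.Series(["Good!!", "bad :(", "so-so..."], index=[0, 1, 3])

# Label-based lookup: there is no label 2, so this raises KeyError.
try:
    X[2]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# Position-based lookup: valid for every i in range(len(X)).
cleaned = [re.sub(r"\W", " ", str(X.iloc[i])) for i in range(len(X))]

# Alternative: renumber the labels 0..n-1 so that X[i] works again.
X_reset = X.reset_index(drop=True)
```

If the original row labels are not needed later, resetting the index once right after loading or filtering the data may be the simpler option, since the loop in the question can then stay unchanged. Iterating over the values directly (for text in X) would also avoid indexing altogether.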
