I am trying to use the TF-IDF vectorizer from scikit-learn with the Spanish stop words from NLTK:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words=stopwords.words("spanish"))
The problem is that I get the following warning:
/home/---/.virtualenvs/thesis/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py:122: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
tokens = [w for w in tokens if w not in stop_words]
Is there a simple way to fix this?
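This actually turned out to be easier to solve than I expected. The problem is that NLTK returns str objects (byte strings on Python 2.7) rather than unicode objects, so the comparison against the unicode tokens fails. I need to decode the stop words from UTF-8 before using them: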
spanish_stopwords = [word.decode('utf-8') for word in stopwords.words('spanish')]
vectorizer = TfidfVectorizer(stop_words=spanish_stopwords)
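
If the same code may also run where the stop words are already text (for example on Python 3, where str has no decode method), a slightly more defensive variant could look like the sketch below. The helper name to_unicode is my own and is not part of NLTK or scikit-learn, and the sample list only stands in for the output of stopwords.words('spanish').

```python
# -*- coding: utf-8 -*-

def to_unicode(words, encoding="utf-8"):
    """Return a list in which every byte string has been decoded to text.

    Hypothetical helper, not an NLTK or scikit-learn function.
    """
    # On Python 2, bytes is an alias for str, so this check works on both versions.
    return [w.decode(encoding) if isinstance(w, bytes) else w for w in words]


# Stand-in for stopwords.words('spanish'): one UTF-8 byte string, one text string.
sample = [b"est\xc3\xa1", u"de"]
decoded = to_unicode(sample)
```

The result could then be passed along as TfidfVectorizer(stop_words=to_unicode(stopwords.words('spanish'))), assuming the corpus is encoded as UTF-8.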