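I have a scenario where I need to remove the unigrams of specific words from a text corpus while keeping the bigrams and trigrams that contain those words.
I am trying to pass text address data (a column in Excel) along with some other numerical features to a classification algorithm. I need to count-vectorize the text data, filter out specific unigrams, and append the result to the DataFrame so that the classifier can use it.
**Sample data in the Text column**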
TAJ MAHAL
TAJ MALABAR KOCHI
TAJ MALABAR KOCHI
TAJ RESIDENCY TVM
LEELA PALACE
PALACE ROAD
HILL VIEW ROAD
HILL AVENUE
HILL STATION
For TAJ and HILL I want only bigrams and trigrams; for all other words I want unigrams, bigrams and trigrams.
**Output bigrams and unigrams**
TAJ MAHAL
TAJ MALABAR
MALABAR KOCHI
TAJ RESIDENCY
KOCHI
LEELA
PALACE
LEELA PALACE
PALACE ROAD
HILL VIEW
HILL AVENUE
HILL STATION
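When I tried passing TAJ and HILL as stop words, the bigrams and trigrams containing them were not generated either.

cv = CountVectorizer(max_features=200, analyzer='word', ngram_range=(1, 3))
cv_txt = cv.fit_transform(data.pop('Txt'))
# get_feature_names() on scikit-learn < 1.0
for i, col in enumerate(cv.get_feature_names_out()):
    # pd.SparseSeries was removed in pandas 1.0; a SparseArray column is the replacement
    data[col] = pd.arrays.SparseArray(cv_txt[:, i].toarray().ravel(), fill_value=0)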
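After filtering out the specific unigrams, I want to append the remaining features to the DataFrame so that I can run the classification algorithm. The final output is a sparse matrix of the count-vectorized text data.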
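The stop-word behaviour described above can be reproduced in a few lines. As far as I understand scikit-learn's `CountVectorizer`, stop words are dropped from the token stream before n-grams are assembled, so every bigram or trigram containing `taj` or `hill` disappears along with the unigram. A minimal sketch on part of the sample data (the names `docs`, `cv_stop` and `features` are illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["TAJ MAHAL", "TAJ MALABAR KOCHI", "HILL VIEW ROAD", "LEELA PALACE"]

# Stop words are filtered out of the tokens before n-grams are built,
# so "taj mahal", "taj malabar", "hill view", ... are never generated.
cv_stop = CountVectorizer(analyzer="word", ngram_range=(1, 3),
                          stop_words=["taj", "hill"])
cv_stop.fit(docs)

# vocabulary_ maps each learned n-gram to its column index
features = sorted(cv_stop.vocabulary_)
print(features)
```

None of the printed features should contain `taj` or `hill`, which is why stop words alone cannot give the desired output.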
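If you only want to remove specific unigrams, you have to use a mask to remove them from the transformed data. If you use this for anything more complicated than a short one-off analysis, I suggest writing a wrapper class to manage it; otherwise it will be hard to keep track of.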
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import pandas as pd

X = """TAJ MAHAL
TAJ MALABAR KOCHI
TAJ MALABAR KOCHI
TAJ RESIDENCY TVM
LEELA PALACE
PALACE ROAD
HILL VIEW ROAD
HILL AVENUE
HILL STATION"""
# One document per line
X = X.split('\n')
df = pd.DataFrame(dict(txt=X))

cv = CountVectorizer(max_features=200, analyzer='word', ngram_range=(1, 3))
cv.fit(df.txt)
# get_feature_names() on scikit-learn < 1.0
feat_name = cv.get_feature_names_out()

# List of unigrams to remove (will work for n-grams too)
remove_list = ['taj', 'hill']

# This is the mask of features you want to keep
keep_mask = ~np.isin(feat_name, remove_list)

# Before the mask
X_transformed = cv.transform(df.txt)
print(X_transformed.shape)

# After the mask
X_transformed = X_transformed[:, keep_mask]
print(X_transformed.shape)
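The wrapper class suggested above might look roughly like the following. This is only a sketch: `MaskedCountVectorizer` and its attributes `keep_mask_` and `feature_names_` are hypothetical names, not part of scikit-learn, and only `fit`, `transform` and `fit_transform` are implemented.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


class MaskedCountVectorizer:
    """Hypothetical helper: a CountVectorizer that hides selected features."""

    def __init__(self, remove_list, **cv_kwargs):
        self.remove_list = list(remove_list)
        self.cv = CountVectorizer(**cv_kwargs)

    def fit(self, raw_documents):
        self.cv.fit(raw_documents)
        vocab = self.cv.vocabulary_
        # Sorting terms by their column index yields names in column order
        names = np.array(sorted(vocab, key=vocab.get))
        self.keep_mask_ = ~np.isin(names, self.remove_list)
        self.feature_names_ = names[self.keep_mask_]
        return self

    def transform(self, raw_documents):
        # Vectorize, then drop the masked columns
        return self.cv.transform(raw_documents)[:, self.keep_mask_]

    def fit_transform(self, raw_documents):
        return self.fit(raw_documents).transform(raw_documents)


docs = ["TAJ MAHAL", "TAJ MALABAR KOCHI", "TAJ MALABAR KOCHI",
        "TAJ RESIDENCY TVM", "LEELA PALACE", "PALACE ROAD",
        "HILL VIEW ROAD", "HILL AVENUE", "HILL STATION"]

mcv = MaskedCountVectorizer(["taj", "hill"], analyzer="word", ngram_range=(1, 3))
X_masked = mcv.fit_transform(docs)
print(X_masked.shape)
print(list(mcv.feature_names_))
```

Keeping the mask and the surviving feature names on one object means the same filtering is applied to training data and to any later data passed to `transform`.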
Edit for the updated question:
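# Code to do the pandas merge
# Keep only the names of the features that survived the mask
feat_name = np.array(feat_name)[keep_mask]
# pd.SparseDataFrame was removed in pandas 1.0; build sparse columns from the SciPy matrix instead
df_2 = pd.DataFrame.sparse.from_spmatrix(X_transformed, columns=feat_name)
df_merge = df.merge(df_2, left_index=True, right_index=True)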
Output:
(9, 13)
(9, 11)
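An alternative to masking after the fact is to never generate the unwanted unigrams. The sketch below is plain Python and is not taken from the answer above: `ngrams_without_unigrams`, `REMOVE_UNIGRAMS` and `TOKEN_RE` are hypothetical names, and the regular expression is assumed to match `CountVectorizer`'s default token pattern (tokens of two or more word characters).

```python
import re

# Assumption: words whose unigrams should be suppressed
REMOVE_UNIGRAMS = {"taj", "hill"}
# Assumed equivalent of CountVectorizer's default token pattern
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def ngrams_without_unigrams(doc, max_n=3, remove=REMOVE_UNIGRAMS):
    """Return the 1..max_n word n-grams of doc, skipping unigrams in remove."""
    tokens = TOKEN_RE.findall(doc.lower())
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            # Only single-word features are filtered; longer n-grams are kept
            if n == 1 and gram in remove:
                continue
            grams.append(gram)
    return grams


print(ngrams_without_unigrams("TAJ MALABAR KOCHI"))
# → ['malabar', 'kochi', 'taj malabar', 'malabar kochi', 'taj malabar kochi']
```

`CountVectorizer` accepts a callable for its `analyzer` parameter, so a function like this could be passed as `analyzer=ngrams_without_unigrams`; in that case the callable does all tokenization itself and `ngram_range` no longer has any effect.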
To get this into a tidy DataFrame, just