Scikit-learn: how to extract features from text



Suppose I have an array of strings:

['Laptop Apple Macbook Air A1465, Core i7, 8Gb, 256Gb SSD, 15"Retina, MacOS' ... 'another device description']

I want to extract some features from each description, such as:

item=Laptop
brand=Apple
model=Macbook Air A1465
cpu=Core i7
...

Should I prepare predefined lists of known feature values first? For example:

brands = ['apple', 'dell', 'hp', 'asus', 'acer', 'lenovo']
cpu = ['core i3', 'core i5', 'core i7', 'intel pdc', 'core m', 'intel pentium', 'intel core duo']

I'm not sure whether I need CountVectorizer or TfidfVectorizer here. DictVectorizer seems more appropriate, but how do I build dicts whose keys map to values extracted from the whole string?

Is this possible with scikit-learn's feature extraction? Or should I write my own .fit() and .transform() methods?

UPDATE: @sergzach, please check whether I understood you correctly:

data = ['Laptop Apple Macbook..', 'Laptop Dell Latitude...']  # ... more descriptions
for d in data:
    for brand in brands:
        if brand in d.lower():
            pass  # ok, brand is found
    for model in models:
        if model in d.lower():
            pass  # ok, model is found

So, one loop per feature, N loops in total? That might work, but I'm not sure it is correct or flexible.

Yes, something like the following.

Sorry, you may need to correct the code below.

import re

data = ['Laptop Apple Macbook..', 'Laptop Dell Latitude...']  # ... more descriptions
features = {
    'brand': [r'apple', r'dell', r'hp', r'asus', r'acer', r'lenovo'],
    'cpu': [r'core\s+i3', r'core\s+i5', r'core\s+i7', r'intel\s+pdc', r'core\s+m', r'intel\s+pentium', r'intel\s+core\s+duo']
    # and other features
}
cat_data = []  # one dict of categories per line; convert these into numbers
not_found_columns = []
for line in data:
    line_cats = {}
    for col, patterns in features.items():
        found = False
        for i, pattern in enumerate(patterns):
            if re.search(pattern, line.lower(), flags=re.UNICODE):
                line_cats[col] = i + 1  # numeric category in this column, e.g. 2 for dell, 5 for acer
                found = True
                break  # the category is determined by the first match
        # the loop ended without a match: use 0 as the default "feature not found" value
        if not found:
            line_cats[col] = 0
            not_found_columns.append((col, line))
    cat_data.append(line_cats)
# cat_data now holds, for each line, a category number (index + 1) per column, or 0 if no value was found

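Now not_found_columns holds the column names together with the lines in which nothing was found. Look through them; you may have forgotten some feature values.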

We could also store strings (instead of numbers) as the categories and then use DictVectorizer, so the two approaches are equivalent.
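As a minimal sketch of the string-category variant, assuming a shortened pattern table and two made-up sample descriptions (PATTERNS, extract_features, sample and the "unknown" default are illustrative names, not part of the code above), the matched text can be stored directly and passed to DictVectorizer, which one-hot encodes string values:

```python
import re
from sklearn.feature_extraction import DictVectorizer

# Hypothetical, shortened pattern table for illustration only.
PATTERNS = {
    "brand": [r"apple", r"dell", r"hp"],
    "cpu": [r"core\s+i3", r"core\s+i5", r"core\s+i7"],
}

def extract_features(line):
    """Return a dict mapping each column to the first matched text, or 'unknown'."""
    text = line.lower()
    row = {}
    for col, patterns in PATTERNS.items():
        row[col] = "unknown"  # assumed default when nothing matches
        for pattern in patterns:
            match = re.search(pattern, text)
            if match:
                row[col] = match.group(0)
                break  # first match wins, as in the loop above
    return row

# Made-up sample descriptions.
sample = [
    "Laptop Apple Macbook Air A1465, Core i7, 8Gb, 256Gb SSD",
    "Laptop Dell Latitude, Core i5, 4Gb",
]
rows = [extract_features(line) for line in sample]

# DictVectorizer turns each string value into its own "key=value" indicator column.
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(rows)
feature_names = list(vectorizer.feature_names_)
```

Here each row of X has a 1 in the columns for the values found in that description. Short patterns such as hp can match inside longer words, so adding word boundaries (\b) may be worth considering.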

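Scikit-learn's vectorizers convert an array of strings into a document-term matrix (a 2D array with one column for each term/word found). Each row of the original array (the first dimension) maps to one row of the output matrix. Each cell contains a count or a weight, depending on which vectorizer you use and its parameters.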
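To make that concrete, here is a small sketch using CountVectorizer with its default settings on two made-up descriptions (docs and the other variable names are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up input: one string per document.
docs = ["Laptop Apple Macbook Air", "Laptop Dell Latitude"]

vectorizer = CountVectorizer()  # defaults: lowercase, tokens of 2+ word characters
X = vectorizer.fit_transform(docs)  # sparse matrix, one row per input string

# vocabulary_ maps each term to its column index; sort terms by that index.
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
counts = X.toarray()
```

With this input, columns lists the six distinct lowercase words in alphabetical order, and counts has shape (2, 6), with each cell giving how often that word occurs in that description. Swapping in TfidfVectorizer would give weights instead of raw counts. Note that the columns are generic words, not named fields such as brand or cpu.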

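Judging by your code, I'm not sure this is what you need. Can you tell us where you intend to use the features you are looking for? Do you plan to train a classifier? What is the goal?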
