Python: MemoryError in MultinomialNB.partial_fit() with 17k classes

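Hi, I'm new to Python, scikit-learn and ML. I'm getting a MemoryError when calling partial_fit on MultinomialNB while trying to do multi-label classification on DMOZ directory data.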

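My questions:

What am I doing wrong? Am I running out of memory, or is my data wrong?
Is the approach I'm taking correct?
What can I do to improve my learner?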

Approach:

Store the DMOZ directory in MongoDB/TokuMX:

{
  "_id": {
    "$oid": "54e758c91d41c804d8ace196"
  },
  "docs": [
    {
      "url": "http://www.awn.com/",
      "description": "Provides information resources to the international animation community. Features include searchable database archives, monthly magazine, web animation guide, the Animation Village, discussion forums and other useful resources.",
      "title": "Animation World Network"
    }
  ],
  "labels": [
    "Top",
    "Arts",
    "Animation"
  ]
}

Iterate over the docs array and pass each docs element to my classifier function.

Vectorizer and classifier

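    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.naive_bayes import MultinomialNB

    classifier = MultinomialNB()
    vectorizer = HashingVectorizer(
        stop_words='english',
        strip_accents='unicode',
        norm='l2'
    )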

My classifier function

import lxml.html
import nltk
import requests

def classify(doc, labels, classifier, vectorizer, *args):
    # Fetch the page for this DMOZ entry (certificate checks disabled).
    r = requests.get(doc['url'], verify=False)
    print("Retrieving URL = {0}\n".format(doc['url']))
    if r.status_code == 200:
        html = lxml.html.fromstring(r.text)
        doc['content'] = []

        # Collect the nouns found in the text-bearing tags.
        tags = ['font', 'td', 'h1', 'h2', 'h3', 'p', 'title']
        for tag in tags:
            for x in html.xpath('//'+tag):
                try:
                    bag_of_words = nltk.word_tokenize(x.text_content())
                    pos_tagged = nltk.pos_tag(bag_of_words)
                    for word, pos in pos_tagged:
                        if pos[:2] == 'NN':
                            doc['content'].append(word)
                except AttributeError as e:
                    print(e)
        x_train = vectorizer.fit_transform(doc['content'])
        # if we are the first one to run partial_fit, pass all classes
        if len(args) == 1:
            classifier.partial_fit(x_train, labels, classes=args[0])
        else:
            classifier.partial_fit(x_train, labels)
        return doc

X: doc['content'] is an array of nouns (600).

Y: labels is the array of labels from the Mongo document shown above (3).

Classes: args[0] is an array of all (unique) labels in the database (17490).
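
The sizes listed above point to a shape mismatch worth checking: passing a list of 600 strings to the vectorizer produces 600 rows, while labels has 3 entries, and partial_fit expects one target per row. Below is a minimal sketch of one way to line these up, assuming one row per page and, purely for illustration, the joined category path used as a single class label. The noun list, the three-entry class list, `n_features=2 ** 16` and `alternate_sign=False` (a parameter in recent scikit-learn releases) are assumptions and not part of the original post.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-ins for doc['content'] and the label list of one page.
nouns = ["animation", "community", "database", "magazine", "forums"]
labels = ["Top", "Arts", "Animation"]

vectorizer = HashingVectorizer(
    stop_words='english',
    strip_accents='unicode',
    norm='l2',
    n_features=2 ** 16,      # assumption: smaller than the 2**20 default
    alternate_sign=False,    # keeps values non-negative for MultinomialNB
)

# A list of tokens yields one row per token ...
per_token = vectorizer.transform(nouns)
# ... while a single joined string yields one row for the whole page.
per_page = vectorizer.transform([" ".join(nouns)])

# Assumption: the full category path is treated as one class label.
target = ["/".join(labels)]
all_classes = ["Top/Arts/Animation", "Top/Arts/Music", "Top/Science"]

classifier = MultinomialNB()
classifier.partial_fit(per_page, target, classes=all_classes)

print(per_token.shape, per_page.shape, classifier.feature_count_.shape)
```

Whether a joined path, only the leaf category, or a true multi-label setup fits better depends on what the classifier is meant to predict, which the post does not settle.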

Running in VirtualBox on a quad-core laptop, with 4 GB of RAM allocated to the VM.

What are the 17490 unique labels? There is a coefficient for every label and every feature, which is probably the source of your memory error.
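
To put a rough number on that, here is a back-of-the-envelope sketch. It assumes the vectorizer is left at HashingVectorizer's default of 2**20 features (the post does not set n_features) and that the per-class, per-feature values are stored as a dense float64 array. The helper name `dense_gib` and the alternative hash size 2**16 are made up for illustration.

```python
# Size of one dense (n_classes, n_features) float64 array.
n_classes = 17490          # unique labels, from the question
n_features = 2 ** 20       # HashingVectorizer default when n_features is not set
bytes_per_value = 8        # float64

def dense_gib(n_rows, n_cols, itemsize=bytes_per_value):
    """Return the size in GiB of a dense n_rows x n_cols array."""
    return n_rows * n_cols * itemsize / float(2 ** 30)

default_size = dense_gib(n_classes, n_features)
reduced_size = dense_gib(n_classes, 2 ** 16)   # hypothetical smaller hash space

print("default n_features: %.1f GiB" % default_size)
print("n_features=2**16:   %.1f GiB" % reduced_size)
```

Under these assumptions a single array of that shape is far larger than the 4 GB given to the VM, and the classifier keeps more than one array of that shape, so reducing the number of classes, the number of hashed features, or both seems like the first thing to try.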
