I'm not trying to build a brand-new Naive Bayes classifier. Plenty already exist; for instance, scikit-learn has Naive Bayes implementations, and NLTK has its own NaiveBayesClassifier.
My language (one of the Indian languages) has 1000+ sentences for training and 300+ sentences for testing. All I need to do is pick a classifier (with Naive Bayes implemented), train it, and test its accuracy.
The problem is that the text is not in English but in Devanagari Unicode.
I'm looking for suggestions on which classifier is appropriate, because the main problem I've run into so far is Unicode handling.
Naive Bayes in scikit-learn operates on numeric vectors, which we can obtain by running the text through a vectorizer. For text classification I often use TfidfVectorizer: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
Among the parameters of the TfidfVectorizer constructor there is: encoding : string, 'utf-8' by default. If bytes or files are given to analyze, this encoding is used to decode.
You can use this parameter with your encoding, and you can also specify your own preprocessor and analyzer functions (which can be useful too).
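For example, here is a minimal sketch of that (the sentences and labels are invented placeholders; the key points are encoding='utf-8' for byte input and analyzer='char_wb', which extracts character n-grams and so needs no Devanagari-aware word tokenizer):
# -*- coding: utf-8 -*-
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Placeholder data; substitute your 1000+ training sentences and labels.
train_sents = [u'पूर्ण प्रत', u'सनातनवा']
train_labels = ['label_a', 'label_b']

# encoding='utf-8' decodes byte input; analyzer='char_wb' builds
# character n-gram features, so no word tokenizer is needed.
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(encoding='utf-8', analyzer='char_wb',
                              ngram_range=(1, 3))),
    ('nb', MultinomialNB()),
])
pipeline.fit(train_sents, train_labels)
print pipeline.predict([u'पूर्ण'])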
It would be nicer to build a language ID system with scikit-learn, as I have done before; see https://github.com/alvations/bayesline.
That said, it is entirely possible to build a language ID system with Unicode data using the simple classification modules from NLTK. There is no need to do anything special to the NLTK code; it can be used as-is. (For how to build a classifier in NLTK, this might be useful for you: NLTK NaiveBayesClassifier training for sentiment analysis.)
Now, to show that it is entirely possible to do language ID with Unicode data using NLTK out of the box, see below.
First, for language ID there is a minor difference between using Unicode character features and bytecodes in feature extraction:
from nltk.corpus import indian
# NLTK reads the corpus as bytecodes.
hindi = " ".join(indian.words('hindi.pos'))
bangla = " ".join(indian.words('bangla.pos'))
marathi = " ".join(indian.words('marathi.pos'))
telugu = " ".join(indian.words('telugu.pos'))
# Prints out first 10 bytes (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print
# Converts bytecodes to utf8.
hindi = hindi.decode('utf8')
bangla = bangla.decode('utf8')
marathi = marathi.decode('utf8')
telugu = telugu.decode('utf8')
# Prints out first 10 unicode char (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print
[out]:
hindi: पूर
bangla: মহি
marathi: '' सन
telugu: 4 . ఆడ
hindi: पूर्ण प्रत
bangla: মহিষের সন্
marathi: '' सनातनवा
telugu: 4 . ఆడిట్
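To make the difference concrete: each Devanagari code point takes 3 bytes in UTF-8, so slicing the undecoded byte string can cut characters in half. A quick illustration (Python 2, as in the code above):
# -*- coding: utf-8 -*-
s = 'पूर्ण'            # Python 2 byte string holding UTF-8 bytes
u = s.decode('utf8')   # the same text as a unicode string
print len(s)   # 15 -- each of the 5 Devanagari code points is 3 bytes
print len(u)   # 5  -- five code points (including combining marks)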
Now that you have seen the difference between using bytecodes and Unicode, let's train a classifier.
from itertools import chain
from nltk import NaiveBayesClassifier as nbc
from nltk.util import ngrams
# Allocate some sort of labels for the data.
training = [(hindi, 'hi'), (bangla, 'ba'), (marathi, 'ma'), (telugu, 'te')]
# This is how you can extract ngrams
print ngrams(telugu[:10], 2)
print
print ngrams(hindi[:10], 3)
print
vocabulary = set(chain(*[ngrams(txt, 2) for txt,tag in training]))
feature_set = [({i:(i in ngrams(sentence, 2)) for i in vocabulary},tag) for sentence, tag in training]
classifier = nbc.train(feature_set)
test1 = u'पूर्ण प्रत' # hindi
test2 = u'মহিষের সন্' # bangla
test3 = u'सनातनवा' # marathi
test4 = u'ఆడిట్ ' # telugu
for testdoc in [test1, test2, test3, test4]:
    featurized_test_sent = {i:(i in ngrams(testdoc, 2)) for i in vocabulary}
    print "test sent:", testdoc
    print "tag:", classifier.classify(featurized_test_sent)
    print
[out]:
[(u'4', u' '), (u' ', u'.'), (u'.', u' '), (u' ', u'\u0c06'), (u'\u0c06', u'\u0c21'), (u'\u0c21', u'\u0c3f'), (u'\u0c3f', u'\u0c1f'), (u'\u0c1f', u'\u0c4d'), (u'\u0c4d', u' ')]
[(u'\u092a', u'\u0942', u'\u0930'), (u'\u0942', u'\u0930', u'\u094d'), (u'\u0930', u'\u094d', u'\u0923'), (u'\u094d', u'\u0923', u' '), (u'\u0923', u' ', u'\u092a'), (u' ', u'\u092a', u'\u094d'), (u'\u092a', u'\u094d', u'\u0930'), (u'\u094d', u'\u0930', u'\u0924')]
test sent: पूर्ण प्रत
tag: hi
test sent: মহিষের সন্
tag: ba
test sent: सनातनवा
tag: ma
test sent: ఆడిట్
tag: te
Full code:
# -*- coding: utf-8 -*-
from itertools import chain
from nltk.corpus import indian
from nltk.util import ngrams
from nltk import NaiveBayesClassifier as nbc
# NLTK reads the corpus as bytecodes.
hindi = " ".join(indian.words('hindi.pos'))
bangla = " ".join(indian.words('bangla.pos'))
marathi = " ".join(indian.words('marathi.pos'))
telugu = " ".join(indian.words('telugu.pos'))
# Prints out first 10 bytes (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print
# Converts bytecodes to utf8.
hindi = hindi.decode('utf8')
bangla = bangla.decode('utf8')
marathi = marathi.decode('utf8')
telugu = telugu.decode('utf8')
# Prints out first 10 unicode char (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print
# Allocate some sort of labels for the data.
training = [(hindi, 'hi'), (bangla, 'ba'), (marathi, 'ma'), (telugu, 'te')]
# This is how you can extract ngrams
print ngrams(telugu[:10], 2)
print
print ngrams(hindi[:10], 3)
print
vocabulary = set(chain(*[ngrams(txt, 2) for txt,tag in training]))
feature_set = [({i:(i in ngrams(sentence, 2)) for i in vocabulary},tag) for sentence, tag in training]
classifier = nbc.train(feature_set)
test1 = u'पूर्ण प्रत' # hindi
test2 = u'মহিষের সন্' # bangla
test3 = u'सनातनवा' # marathi
test4 = u'ఆడిట్ ' # telugu
for testdoc in [test1, test2, test3, test4]:
    featurized_test_sent = {i:(i in ngrams(testdoc, 2)) for i in vocabulary}
    print "test sent:", testdoc
    print "tag:", classifier.classify(featurized_test_sent)
    print
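Since the OP actually has 1000+ training sentences and 300+ test sentences, here is a minimal sketch of how to evaluate accuracy with nltk.classify.accuracy. The train_sents and test_sents lists below are hypothetical placeholders; replace them with your real (unicode_sentence, label) pairs:
# -*- coding: utf-8 -*-
from itertools import chain
from nltk import NaiveBayesClassifier as nbc
from nltk.classify import accuracy
from nltk.util import ngrams

# Hypothetical data: replace with your own (unicode_sentence, label) pairs.
train_sents = [(u'पूर्ण प्रत', 'class_a'), (u'सनातनवा', 'class_b')]
test_sents = [(u'पूर्ण', 'class_a')]

# Build the character-bigram vocabulary from the training data only,
# so the test set does not leak into the features.
vocabulary = set(chain(*[ngrams(sent, 2) for sent, tag in train_sents]))

def featurize(sent):
    # Same boolean bag-of-bigrams features as in the answer above.
    sent_ngrams = set(ngrams(sent, 2))
    return {i: (i in sent_ngrams) for i in vocabulary}

train_set = [(featurize(sent), tag) for sent, tag in train_sents]
test_set = [(featurize(sent), tag) for sent, tag in test_sents]

classifier = nbc.train(train_set)
print "accuracy:", accuracy(classifier, test_set)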
This question is poorly worded, but there is a chance it is about language identification rather than sentence classification.
If that is the case, then there is a long way to go before applying Naive Bayes or any other classifier. Take a look at the character-gram approach used by Damir Cavar's LID, which is implemented in Python.
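For a rough idea of what a character-gram approach looks like (this is a sketch of the general technique, not Damir Cavar's actual LID code): build a relative-frequency profile of character trigrams per language, then score an unknown text against each profile. The training texts below are placeholders reusing the snippets above:
# -*- coding: utf-8 -*-
from collections import Counter

def char_ngrams(text, n=3):
    # Overlapping character n-grams of a unicode string.
    return [text[i:i+n] for i in range(len(text) - n + 1)]

def profile(text, n=3):
    # Relative frequency of each character n-gram in the text.
    counts = Counter(char_ngrams(text, n))
    total = float(sum(counts.values()))
    return {g: c / total for g, c in counts.items()}

def score(text, prof, n=3):
    # Sum the profile frequencies of the n-grams seen in the text.
    return sum(prof.get(g, 0.0) for g in char_ngrams(text, n))

# Placeholder training texts; substitute real corpora per language.
profiles = {'hi': profile(u'पूर्ण प्रत'),
            'te': profile(u'ఆడిట్ ')}

unknown = u'पूर्ण'
print max(profiles, key=lambda lang: score(unknown, profiles[lang]))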