I am trying to use the Stanford dependency parser (edu/stanford/nlp/models/parser/nndep/CTB_CoNLL_params.txt.gz) to parse Chinese data into CoNLL format,
but I seem to be running into an encoding problem.
My input file is UTF-8 and already segmented into words; a sentence looks like this: 那时 的 坎纳里 鲁夫 , 有着 西海岸 最大 的 工业化 罐头 工厂 。
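Before blaming the parser, it is worth confirming that the input file really does decode as UTF-8. A quick sketch using iconv as a validator (the path below matches the command further down; substitute your own file):

```shell
# Decode the file as UTF-8 and re-encode it unchanged; iconv exits with a
# non-zero status (and a "cannot convert" style error) on invalid byte sequences.
iconv -f UTF-8 -t UTF-8 ./ChineseCorpus/ChineseTestSegmented.txt > /dev/null \
  && echo "input is valid UTF-8" \
  || echo "input is NOT valid UTF-8"
```

If this reports invalid UTF-8, the problem is in the file itself, not in the parser's encoding option.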
The command I use to run the model is as follows:
java -mx2200m -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP
-language Chinese
-encoding utf-8
-props StanfordCoreNLP-chinese.properties
-annotators tokenize,ssplit,pos,depparse
-file ./ChineseCorpus/ChineseTestSegmented.txt
-outputFormat conll
Everything seems to work except that the Chinese characters are not encoded correctly in the output; this is what I get:
1 ?? _ NT _ 2 DEP
2 ? _ DEG _ 4 NMOD
3 ??? _ NR _ 4 NMOD
4 ?? _ NR _ 6 SUB
5 ? _ PU _ 6 P
6 ?? _ VE _ 0 ROOT
7 ??? _ NN _ 12 NMOD
8 ?? _ JJ _ 9 DEP
9 ? _ DEG _ 12 NMOD
10 ??? _ NN _ 12 NMOD
11 ?? _ NN _ 12 NMOD
12 ?? _ NN _ 6 OBJ
13 ? _ PU _ 6 P
According to the Stanford Parser FAQ, the standard encoding for Chinese is GB18030, but it also says "However, the parser is able to parse text in any encoding, provided you pass the correct encoding option on the command line", which I did.
I have looked at this question: How to process Chinese text with the Stanford parser? But their solution using iconv does not work for me: I get a "cannot convert"
error, and I have tried several possible combinations of encodings.
Does anyone have a suggestion as to what is going wrong?
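As a side check on the failing iconv step, a round trip between UTF-8 and GB18030 should succeed on clean input (a sketch using a hypothetical sample file; if the same commands fail on your real data, the file almost certainly contains bytes that are not valid UTF-8):

```shell
# Write a UTF-8 sample, convert it to GB18030 and back; the final file should
# be byte-identical to the original if both the data and iconv are healthy.
printf '那时 的 坎纳里 鲁夫 。\n' > sample-utf8.txt
iconv -f UTF-8 -t GB18030 sample-utf8.txt > sample-gb.txt
iconv -f GB18030 -t UTF-8 sample-gb.txt > sample-roundtrip.txt
diff sample-utf8.txt sample-roundtrip.txt && echo "round trip OK"
```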
Try something like this:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP
-language Chinese -props StanfordCoreNLP-chinese.properties
-annotators segment,ssplit,pos,parse -file chinese-in.txt -outputFormat conll
For example:
alvas@ubi:~/stanford-corenlp-full-2015-12-09$ cat chinese-in.txt
那时的坎纳里鲁夫,有着西海岸最大的工业化罐头工厂。
alvas@ubi:~/jose-stanford/stanford-corenlp-full-2015-12-09$
> java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP
> -language Chinese -props StanfordCoreNLP-chinese.properties
> -annotators segment,ssplit,pos,parse -file chinese-in.txt -outputFormat conll
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Registering annotator segment with class edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator segment
Loading classifier from edu/stanford/nlp/models/segmenter/chinese/ctb.gz ... [main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - Loading Chinese dictionaries from 1 file:
[main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
[main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - Done. Unique words in ChineseDictionary is: 423200.
done [14.4 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger ... done [1.4 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[main] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz ...
done [5.2 sec].
Processing file /home/alvas/jose-stanford/stanford-corenlp-full-2015-12-09/chinese-in.txt ... writing to /home/alvas/jose-stanford/stanford-corenlp-full-2015-12-09/chinese-in.txt.conll
Annotating file /home/alvas/jose-stanford/stanford-corenlp-full-2015-12-09/chinese-in.txt
[main] INFO edu.stanford.nlp.wordseg.TagAffixDetector - INFO: TagAffixDetector: useChPos=false | useCTBChar2=true | usePKChar2=false
[main] INFO edu.stanford.nlp.wordseg.TagAffixDetector - INFO: TagAffixDetector: building TagAffixDetector from edu/stanford/nlp/models/segmenter/chinese/dict/character_list and edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb
[main] INFO edu.stanford.nlp.wordseg.CorpusChar - Loading character dictionary file from edu/stanford/nlp/models/segmenter/chinese/dict/character_list
[main] INFO edu.stanford.nlp.wordseg.affDict - Loading affix dictionary from edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb
done.
Annotation pipeline timing information:
ChineseSegmenterAnnotator: 0.2 sec.
WordsToSentencesAnnotator: 0.0 sec.
POSTaggerAnnotator: 0.0 sec.
ParserAnnotator: 0.9 sec.
TOTAL: 1.2 sec. for 13 tokens at 11.0 tokens/sec.
Pipeline setup: 21.1 sec.
Total time for StanfordCoreNLP pipeline: 22.3 sec.
[out]:
http://pastebin.com/raw/Y9J0UBDF
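If characters still come out mangled with the command above, one common JVM-side workaround (my suggestion, not something the original answer mentions) is to pin the JVM's default charset, since output written through the platform encoding can turn CJK characters into `?` on a non-UTF-8 locale:

```shell
# -Dfile.encoding is a standard JVM system property; everything after it
# mirrors the suggested command above (needs the CoreNLP jars on the classpath).
java -Dfile.encoding=UTF-8 -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP \
  -language Chinese -props StanfordCoreNLP-chinese.properties \
  -annotators segment,ssplit,pos,parse -file chinese-in.txt -outputFormat conll
```

Checking that your shell locale is UTF-8 (`echo $LANG`) is also worth a moment before digging deeper.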