Stanford CoreNLP in R: Spanish does not work



I have started using the Stanford CoreNLP package in R to do some text analysis in Spanish. So I tried the following:

R
R version 3.2.2 (2015-08-14) -- "Fire Safety"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
  Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> install.packages("coreNLP")
Installing package into ‘/home/ach/R/x86_64-pc-linux-gnu-library/3.2’
(as ‘lib’ is unspecified)
--- Please select a CRAN mirror for use in this session ---
trying URL 'https://cran.rediris.es/src/contrib/coreNLP_0.4-1.tar.gz'
Content type 'application/x-gzip' length 17392 bytes (16 KB)
==================================================
downloaded 16 KB
* installing *source* package ‘coreNLP’ ...
** package ‘coreNLP’ successfully unpacked and MD5 sums checked
** R
** data
*** moving datasets to lazyload DB
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (coreNLP)
The downloaded source packages are in
    ‘/tmp/RtmpO3q77z/downloaded_packages’
> library(coreNLP)
> downloadCoreNLP(type="base")
trying URL 'http://nlp.stanford.edu/software//stanford-corenlp-full-2015-04-20.zip'
Content type 'application/zip' length 360824440 bytes (344.1 MB)
==================================================
downloaded 344.1 MB
[1] 0
> 
> downloadCoreNLP(type="spanish")
trying URL 'http://nlp.stanford.edu/software//stanford-spanish-corenlp-2015-01-08-models.jar'
Content type 'application/x-java-archive' length 25007256 bytes (23.8 MB)
==================================================
downloaded 23.8 MB
> initCoreNLP()
Searching for resource: config.properties
Adding annotator tokenize
TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.7 sec].
Adding annotator lemma
Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [3.5 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [1.2 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [2.3 sec].
Initializing JollyDayHoliday for SUTime from classpath: edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1.
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/defs.sutime.txt
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.sutime.txt
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
Adding annotator parse
Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.4 sec].
Adding annotator dcoref
Adding annotator sentiment
> sInes <- "Hola padre. Acabo de llegar a casa. Tengo ganas de cenar"
> annotation <- annotateString(sInes)
> token <- getToken(annotation)
> token[token$sentence==2,c(1:4,7)]
  sentence id  token  lemma POS
4        2  1  Acabo  Acabo NNP
5        2  2     de     de NNP
6        2  3 llegar llegar NNP
7        2  4      a      a  DT
8        2  5   casa   casa  FW
9        2  6      .      .   .

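Everything seems to run fine (as far as I can tell, there are no errors), but the results are wrong. For example, "casa" is incorrectly tagged as a foreign word (FW), and "de" and "llegar" are tagged as proper nouns (NNP).

So, does anyone have any idea what is going on?

Thanks a lot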

Agustin

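The startup log shows that initCoreNLP() loaded the English defaults (PTBTokenizer and the english-left3words tagger), so the Spanish text was tagged with English models. You not only need to download the Spanish models, you also need to set the tokenizer language to Spanish: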

props.setProperty("tokenize.language", "es");
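
For context, here is a minimal sketch of where that line sits in Java. Only tokenize.language = es comes from the answer above. The class name SpanishProps, the annotator list, and the pos.model path are assumptions for illustration, and the model path should be checked against the contents of the downloaded Spanish models jar.

```java
import java.util.Properties;

// Hypothetical helper class; the name is not part of CoreNLP.
final class SpanishProps {

    // Builds the configuration object that the one-liner above modifies.
    static Properties build() {
        Properties props = new Properties();
        // Assumed annotator list: just enough to reproduce the POS tagging step.
        props.setProperty("annotators", "tokenize, ssplit, pos");
        // From the answer: switch the tokenizer to Spanish.
        props.setProperty("tokenize.language", "es");
        // Assumed path of the Spanish tagger inside the models jar; verify before use.
        props.setProperty("pos.model",
                "edu/stanford/nlp/models/pos-tagger/spanish/spanish-distsim.tagger");
        return props;
    }

    public static void main(String[] args) {
        // Print the settings so they can be checked before starting a pipeline.
        build().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

The resulting Properties object would normally be passed to the pipeline constructor, new StanfordCoreNLP(props), with the CoreNLP and Spanish model jars on the classpath. Setting only the tokenizer language would likely leave the English tagger seen in the log in place, which is why the sketch also points pos.model at a Spanish model.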

The package's author recently released an update that makes changing the language setting very easy.

# update to newest version of the package
devtools::install_github("statsmaths/coreNLP")
# download base library (mandatory):
coreNLP::downloadCoreNLP()
# download desired language library:
coreNLP::downloadCoreNLP(type="spanish")
# attach package
library(coreNLP)
# run initCoreNLP specifying your language of choice
initCoreNLP(type="spanish")
