How to train a new parser model for Stanford NLP from a treebank



I have downloaded the UPDT (Uppsala Persian Dependency Treebank) and I am trying to build a dependency parser model from it with Stanford NLP. I have tried training the model both from the command line and from Java code, but I get an exception in both cases.

1 - Training the model from the command line:

java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser -train UPDTtrain.conll 0 -saveToSerializedFile UPDTupdt.model.ser.gz

When I run the above command, I get this exception:

done [read 26 trees]. Time elapsed: 0 ms
Options parameters:
useUnknownWordSignatures 0
smoothInUnknownsThreshold 100
smartMutation false
useUnicodeType false
unknownSuffixSize 1
unknownPrefixSize 1
flexiTag false
useSignatureForKnownSmoothing false
wordClassesFile null
parserParams edu.stanford.nlp.parser.lexparser.EnglishTreebankParserParams
forceCNF false
doPCFG true
doDep true
freeDependencies false
directional true
genStop true
distance true
coarseDistance false
dcTags true
nPrune false
Train parameters:
 smooth=false
 PA=true
 GPA=false
 selSplit=false
 (0.0)
 mUnary=0
 mUnaryTags=false
 sPPT=false
 tagPA=false
 tagSelSplit=false (0.0)
 rightRec=false
 leftRec=false
 collinsPunc=false
 markov=false
 mOrd=1
 hSelSplit=false (10)
 compactGrammar=0
 postPA=false
 postGPA=false
 selPSplit=false (0.0)
 tagSelPSplit=false (0.0)
 postSplitWithBase=false
 fractionBeforeUnseenCounting=0.5
 openClassTypesThreshold=50
 preTransformer=null
 taggedFiles=null
 predictSplits=false
 splitCount=1
 splitRecombineRate=0.0
 simpleBinarizedLabels=false
 noRebinarization=false
 trainingThreads=1
 dvKBest=100
 trainingIterations=40
 batchSize=25
 regCost=1.0E-4
 qnIterationsPerBatch=1
 qnEstimates=15
 qnTolerance=15.0
 debugOutputFrequency=0
 randomSeed=0
 learningRate=0.1
 deltaMargin=0.1
 unknownNumberVector=true
 unknownDashedWordVectors=true
 unknownCapsVector=true
 unknownChineseYearVector=true
 unknownChineseNumberVector=true
 unknownChinesePercentVector=true
 dvSimplifiedModel=false
 scalingForInit=0.5
 maxTrainTimeSeconds=0
 unkWord=*UNK*
 lowercaseWordVectors=false
 transformMatrixType=DIAGONAL
 useContextWords=false
 trainWordVectors=true
 stalledIterationLimit=12
 markStrahler=false
Using EnglishTreebankParserParams splitIN=0 sPercent=false sNNP=0 sQuotes=false sSFP=false rbGPA=false j#=false jJJ=false jNounTags=false sPPJJ=false sTRJJ=false sJJCOMP=false sMoreLess=false unaryDT=false unaryRB=false unaryPRP=false reflPRP=false unaryIN=false sCC=0 sNT=false sRB=false sAux=0 vpSubCat=false mDTV=0 sVP=0 sVPNPAgr=false sSTag=0 mVP=false sNP%=0 sNPPRP=false dominatesV=0 dominatesI=false dominatesC=false mCC=0 sSGapped=0 numNP=false sPoss=0 baseNP=0 sNPNNP=0 sTMP=0 sNPADV=0 cTags=false rightPhrasal=false gpaRootVP=false splitSbar=0 mPPTOiIN=0 cWh=0
Binarizing trees...done. Time elapsed: 12 ms
Extracting PCFG...PennTreeReader: warning: file has extra non-matching right parenthesis [ignored]
Exception in thread "main" java.lang.IllegalArgumentException: No head rule defined for _ using class edu.stanford.nlp.trees.ModCollinsHeadFinder in (_
  DELM
  DELM
  DELM
  13
  punct
  _
  _
  15
  تلفیقی
  _
  N
  N_SING
  SING
  13
  appos
  _
  _
  16
  طنزآمیز
  _
  ADJ
  ADJ
  ADJ
  15
  amod
  _
  _
  17
  از
  _
  P
  P
  P
  15
  prep
  _
  _
  18
  اسم
  _
  N
  N_SING
  SING
  17
  pobj
  _
  _
  19
  و
  _
  CON
  CON
  CON
  18
  cc
  _
  _
  20
  شیوه
  _
  N
  N_SING
  SING
  18
  conj
  _
  _
  21
  کارش
  _
  N
  N_SING
  SING
  20
  poss/pc
  _
  _
  22)
    at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineNonTrivialHead(AbstractCollinsHeadFinder.java:242)
    at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineHead(AbstractCollinsHeadFinder.java:189)
    at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineHead(AbstractCollinsHeadFinder.java:140)
    at edu.stanford.nlp.parser.lexparser.TreeAnnotator.transformTreeHelper(TreeAnnotator.java:145)
    at edu.stanford.nlp.parser.lexparser.TreeAnnotator.transformTree(TreeAnnotator.java:51)
    at edu.stanford.nlp.parser.lexparser.TreeAnnotatorAndBinarizer.transformTree(TreeAnnotatorAndBinarizer.java:104)
    at edu.stanford.nlp.trees.CompositeTreeTransformer.transformTree(CompositeTreeTransformer.java:30)
    at edu.stanford.nlp.trees.TransformingTreebank$TransformingTreebankIterator.next(TransformingTreebank.java:195)
    at edu.stanford.nlp.trees.TransformingTreebank$TransformingTreebankIterator.next(TransformingTreebank.java:176)
    at edu.stanford.nlp.trees.FilteringTreebank$FilteringTreebankIterator.primeNext(FilteringTreebank.java:100)
    at edu.stanford.nlp.trees.FilteringTreebank$FilteringTreebankIterator.<init>(FilteringTreebank.java:85)
    at edu.stanford.nlp.trees.FilteringTreebank.iterator(FilteringTreebank.java:72)
    at edu.stanford.nlp.parser.lexparser.AbstractTreeExtractor.tallyTrees(AbstractTreeExtractor.java:64)
    at edu.stanford.nlp.parser.lexparser.AbstractTreeExtractor.extract(AbstractTreeExtractor.java:89)
    at edu.stanford.nlp.parser.lexparser.LexicalizedParser.getParserFromTreebank(LexicalizedParser.java:881)
    at edu.stanford.nlp.parser.lexparser.LexicalizedParser.main(LexicalizedParser.java:1394)

2 - Training the model from Java code:

import java.io.File;
import java.io.IOException;
import java.util.Collection;
import java.util.List;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.parser.lexparser.Options;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.trees.GrammaticalStructure;
import edu.stanford.nlp.trees.GrammaticalStructureFactory;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.Treebank;
import edu.stanford.nlp.trees.TreebankLanguagePack;
import edu.stanford.nlp.trees.TypedDependency;

public class FromTreeBank {
    public static void main(String[] args) throws IOException {
        String treebankPathUPDT = "src/model/UPDT.1.2/train.conll";
        String persianFilePath  = "src/txt/persianSentences.txt";
        File file = new File(treebankPathUPDT);
        Options op = new Options();
        Treebank tr = op.tlpParams.diskTreebank();
        tr.loadPath(file);
        LexicalizedParser lpc = LexicalizedParser.trainFromTreebank(tr, op);
        // Once lpc is trained, use it to parse a file containing Persian text
        //demoDP(lpc, persianFilePath);
    }

    public static void demoDP(LexicalizedParser lp, String filename) {
        // This method shows loading, sentence-segmenting and tokenizing
        // a file using DocumentPreprocessor.
        TreebankLanguagePack tlp = lp.treebankLanguagePack();
        GrammaticalStructureFactory gsf = null;
        if (tlp.supportsGrammaticalStructures()) {
            gsf = tlp.grammaticalStructureFactory();
        }
        // You could also create a tokenizer here and pass it
        // to DocumentPreprocessor.
        for (List<HasWord> sentence : new DocumentPreprocessor(filename)) {
            Tree parse = lp.apply(sentence);
            parse.pennPrint();
            System.out.println();
            if (gsf != null) {
                GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
                Collection<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
                System.out.println(tdl);
                System.out.println();
            }
        }
    }
}

The Java program above throws this exception as well:

(same Options and Train parameters output as above, then:)
Binarizing trees...done. Time elapsed: 122 ms
Extracting PCFG...PennTreeReader: warning: file has extra non-matching right parenthesis [ignored]
java.lang.IllegalArgumentException: No head rule defined for _ using class edu.stanford.nlp.trees.ModCollinsHeadFinder in (_
  DELM
  DELM
  DELM
  13
  punct
  _
  _
  15
  تلفیقی
  _
  N
  N_SING
  SING
  13
  appos
  _
  _
  16
  طنزآمیز
  _
  ADJ
  ADJ
  ADJ
  15
  amod
  _
  _
  17
  از
  _
  P
  P
  P
  15
  prep
  _
  _
  18
  اسم
  _
  N
  N_SING
  SING
  17
  pobj
  _
  _
  19
  و
  _
  CON
  CON
  CON
  18
  cc
  _
  _
  20
  شیوه
  _
  N
  N_SING
  SING
  18
  conj
  _
  _
  21
  کارش
  _
  N
  N_SING
  SING
  20
  poss/pc
  _
  _
  22)

    at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineNonTrivialHead(AbstractCollinsHeadFinder.java:242)
    at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineHead(AbstractCollinsHeadFinder.java:189)
    at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineHead(AbstractCollinsHeadFinder.java:140)
    at edu.stanford.nlp.parser.lexparser.TreeAnnotator.transformTreeHelper(TreeAnnotator.java:145)
    at edu.stanford.nlp.parser.lexparser.TreeAnnotator.transformTree(TreeAnnotator.java:51)
    at edu.stanford.nlp.parser.lexparser.TreeAnnotatorAndBinarizer.transformTree(TreeAnnotatorAndBinarizer.java:104)
    at edu.stanford.nlp.trees.CompositeTreeTransformer.transformTree(CompositeTreeTransformer.java:30)
    at edu.stanford.nlp.trees.TransformingTreebank$TransformingTreebankIterator.next(TransformingTreebank.java:195)
    at edu.stanford.nlp.trees.TransformingTreebank$TransformingTreebankIterator.next(TransformingTreebank.java:176)
    at edu.stanford.nlp.trees.FilteringTreebank$FilteringTreebankIterator.primeNext(FilteringTreebank.java:100)
    at edu.stanford.nlp.trees.FilteringTreebank$FilteringTreebankIterator.<init>(FilteringTreebank.java:85)
    at edu.stanford.nlp.trees.FilteringTreebank.iterator(FilteringTreebank.java:72)
    at edu.stanford.nlp.parser.lexparser.AbstractTreeExtractor.tallyTrees(AbstractTreeExtractor.java:64)
    at edu.stanford.nlp.parser.lexparser.AbstractTreeExtractor.extract(AbstractTreeExtractor.java:89)
    at edu.stanford.nlp.parser.lexparser.LexicalizedParser.getParserFromTreebank(LexicalizedParser.java:881)
    at edu.stanford.nlp.parser.lexparser.LexicalizedParser.trainFromTreebank(LexicalizedParser.java:267)
    at edu.stanford.nlp.parser.lexparser.LexicalizedParser.trainFromTreebank(LexicalizedParser.java:278)
    at FromTreeBank.main(FromTreeBank.java:46)

Actually, I am not sure whether the command line or the Java code is even correct; I cannot figure out what is missing in either. I would appreciate it if someone could tell me why I get these exceptions and what is wrong, or suggest a better way to train a model from a treebank.

Thanks

The biggest problem here is that you are trying to train a constituency parser (a.k.a. phrase-structure parser) on a dependency treebank, and that simply won't work.

CoreNLP also ships with a neural-network-based dependency parser, which you can train on the UPDT data. See the parser's project page for instructions on how to train a model.
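As a rough sketch, training that neural dependency parser is a single command-line invocation; the file names below are placeholders, and the embeddings file (pre-trained Persian word vectors) is optional but generally improves accuracy:

```shell
# Train CoreNLP's neural dependency parser directly on CoNLL-format data.
# UPDTtrain.conll / UPDTdev.conll / persian-embeddings.txt are placeholder names.
java -Xmx4g edu.stanford.nlp.parser.nndep.DependencyParser \
  -trainFile UPDTtrain.conll \
  -devFile UPDTdev.conll \
  -embedFile persian-embeddings.txt \
  -model UPDT-nndep.model.txt.gz
```

Unlike LexicalizedParser, this class reads CoNLL dependency files natively, so no format conversion is needed.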

If you are still wondering why you get this particular error: it is exactly what the message says. No head rule is defined for the label "_" (an underscore) in the edu.stanford.nlp.trees.ModCollinsHeadFinder class.
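The underscores come from the treebank file itself: LexicalizedParser reads its training data with a Penn-treebank-style bracketed-tree reader, so the tab-separated CoNLL columns (including the "_" placeholder fields) are misread as tree labels, and "_" becomes a node label the head finder has no rule for. The two lines below are illustrative examples of the formats, not taken from UPDT:

```
# CoNLL dependency format: one token per line, "_" fills empty columns
1   از   _   P   P   _   3   prep   _   _

# Penn bracketed format, which LexicalizedParser actually expects
(ROOT (PP (P از) (NP (N اسم))))
```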

I had the same issue with parenthesis characters. After removing the sentences that contained parentheses, I can now train the Stanford parser without errors. I have not tried to fix it properly by changing the code; the simplest workaround is to drop the data containing such characters, as I did.

If you have already solved the problem, could you share the solution? I also need to learn more about the Stanford parser.

You can simply replace every "(" with "-LRB-" and every ")" with "-RRB-" in "trainFile.conll" (or whatever format you use) and rerun the parser. That worked for me.
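A minimal sketch of that substitution with sed (the file name is a placeholder; on macOS/BSD sed use `sed -i ''` for in-place editing):

```shell
# Replace bare parentheses with the Penn-treebank escape tokens
sed 's/(/-LRB-/g; s/)/-RRB-/g' trainFile.conll > trainFile.fixed.conll
```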
