Text tokenization with Stanford NLP: filtering out unwanted words and characters

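I use Stanford NLP for string tokenization in my classification tool. I want to get only meaningful words, but I also get non-word tokens (such as ---, >, . etc.) and unimportant words like am, is, to (stop words). Does anybody know a way to solve this problem?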

With Stanford CoreNLP you can use a stopword-removal annotator, which removes the standard stopwords. You can also define custom stopwords to suit your needs (e.g. ---, <, . etc.).

Here is an example:

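   import java.util.*;
   import edu.stanford.nlp.ling.*;
   import edu.stanford.nlp.pipeline.*;

   Properties props = new Properties();
   // Run the tokenizer and sentence splitter, then the custom stopword annotator.
   props.setProperty("annotators", "tokenize, ssplit, stopword");
   // Register the class that implements the "stopword" annotator.
   props.setProperty("customAnnotatorClass.stopword", "intoxicant.analytics.coreNlp.StopwordAnnotator");
   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
   Annotation document = new Annotation(example);  // example: the input text (a String)
   pipeline.annotate(document);
   List<CoreLabel> tokens = document.get(CoreAnnotations.TokensAnnotation.class);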

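In the example above, "tokenize, ssplit, stopword" is the list of annotators to run; "stopword" is the custom annotator registered through the customAnnotatorClass.stopword property.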

Hope this helps!

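This is a very domain-specific task that we don't perform for you in CoreNLP. You should be able to make this work with a regular-expression filter and a stopword filter on top of the CoreNLP tokenizer.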

Here is an example list of English stopwords.
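As a rough sketch of the regex-plus-stopword filtering suggested above, something like the following might work. It assumes the token strings have already been pulled out of the tokenizer output (for instance via CoreLabel.word()); the class name TokenFilter, the tiny stopword set, and the letters-only pattern are illustrative assumptions, not part of CoreNLP.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: filters token strings after tokenization.
public class TokenFilter {
    // Illustrative stopword set only; swap in a fuller list for real use.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("am", "is", "are", "to", "the", "a", "an"));

    // Keep tokens made up entirely of letters, so "---", ">" and "." are dropped.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    public static List<String> filter(List<String> tokens) {
        return tokens.stream()
                // Regex filter: discard punctuation and other non-word tokens.
                .filter(t -> WORD.matcher(t).matches())
                // Stopword filter: case-insensitive lookup in the stopword set.
                .filter(t -> !STOPWORDS.contains(t.toLowerCase(Locale.ROOT)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens =
                Arrays.asList("I", "am", "going", "to", "---", "Paris", ">", ".");
        System.out.println(filter(tokens)); // prints [I, going, Paris]
    }
}
```

The letters-only pattern is deliberately strict: it also drops numbers and hyphenated words, so it may need loosening depending on what counts as a meaningful word for the classifier.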
