Requirement failed: Wrong or missing inputCols annotators in johnsnowlabs.nlp



I am using com.johnsnowlabs.nlp 2.2.2 with spark-2.4.4 to process some articles. Those articles contain some very long words that I am not interested in and that slow down POS tagging considerably. I would like to exclude them after tokenization and before POS tagging.

I tried to write a smaller piece of code that reproduces my problem:

import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.{Normalizer, Tokenizer}
import org.apache.spark.sql.functions._
import spark.implicits._  // spark is the active SparkSession

val documenter = new DocumentAssembler().setInputCol("text").setOutputCol("document").setIdCol("id")
val tokenizer = new Tokenizer().setInputCols(Array("document")).setOutputCol("token")
val normalizer = new Normalizer().setInputCols("token").setOutputCol("normalized").setLowercase(true)

val df = Seq("This is a very useless/ugly sentence").toDF("text")
val document = documenter.transform(df.withColumn("id", monotonically_increasing_id()))
val token = tokenizer.fit(document).transform(document)

// Explode the token annotations and re-collect them per id
// (the length-based filter would go between explode and groupBy)
val token_filtered = token
  .drop("token")
  .join(
    token
      .select(col("id"), col("token"))
      .withColumn("tmp", explode(col("token")))
      .groupBy("id")
      .agg(collect_list(col("tmp")).as("token")),
    Seq("id"))

token_filtered.select($"token").show(false)

val normal = normalizer.fit(token_filtered).transform(token_filtered)

Here is the show output, followed by the error I get when transforming token_filtered:

+--------------------+---+--------------------+--------------------+--------------------+
|                text| id|            document|            sentence|               token|
+--------------------+---+--------------------+--------------------+--------------------+
|This is a very us...|  0|[[document, 0, 35...|[[document, 0, 35...|[[token, 0, 3, Th...|
+--------------------+---+--------------------+--------------------+--------------------+

Exception in thread "main" java.lang.IllegalArgumentException:
requirement failed: Wrong or missing inputCols annotators in NORMALIZER_4bde2f08742a.
Received inputCols: token.
Make sure such annotators exist in your pipeline, with the right output
names and that they have following annotator types: token

If I fit and transform token directly with the normalizer, it works fine. It seems that some information is lost during the explode/groupBy/collect_list step, even though the schema and the data look the same.

Any idea?

However, to update the correct answer given by @ticapix: in more recent versions two parameters, minLength and maxLength, have been added to SentenceDetector and Tokenizer:

  • https://github.com/JohnSnowLabs/spark-nlp/pull/712
  • https://github.com/JohnSnowLabs/spark-nlp/pull/711

Just filter out the tokens that you do not want to feed through the pipeline:

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")
  .setMinLength(4)
  .setMaxLength(10)
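
For the original use case (speeding up POS tagging), here is a rough sketch of how the length-limited Tokenizer could sit in a full pipeline; the PerceptronModel.pretrained() POS stage and the exact length bounds are illustrative assumptions, not taken from the question:

import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.{Normalizer, Tokenizer}
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import org.apache.spark.ml.Pipeline

val documenter = new DocumentAssembler().setInputCol("text").setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")
  .setMinLength(1)
  .setMaxLength(15)   // tokens longer than this never reach the POS tagger

val normalizer = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")
  .setLowercase(true)

// Assumed pretrained POS model; any PerceptronModel is wired in the same way
val pos = PerceptronModel.pretrained()
  .setInputCols("document", "normalized")
  .setOutputCol("pos")

val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, normalizer, pos))
val result = pipeline.fit(df).transform(df)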

References

  • https://github.com/JohnSnowLabs/spark-nlp
  • https://github.com/JohnSnowLabs/spark-nlp-models
  • https://github.com/JohnSnowLabs/spark-nlp-workshop

The answer is: it is not feasible (https://github.com/JohnSnowLabs/spark-nlp/issues/653).

The annotations are destroyed during the groupBy operation.
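
The struct schema itself survives, but Spark NLP validates annotators through Spark column metadata, and the column rebuilt by collect_list comes back without it. A quick way to see this with the names from the question's code:

// Column produced by the Tokenizer: carries the annotator type in its metadata
println(token.schema("token").metadata)           // e.g. {"annotatorType":"token"}

// Column rebuilt via explode/groupBy/collect_list: same struct type, empty metadata
println(token_filtered.schema("token").metadata)  // {}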

The solutions are:

  • Implement a custom Transformer
  • Use a UDF (see the sketch after this list)
  • Pre-process the data before feeding it into the pipeline
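
A minimal sketch of the UDF route, assuming the goal from the question (dropping tokens longer than some limit) and the token DataFrame from the code above; the key point is to reuse the original column's DataType and metadata so downstream annotators still accept the column:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// Keep the exact array<struct<...>> type and column metadata of the existing token column
val tokenField = token.schema("token")

// Hypothetical filter: keep only annotations whose text is at most 10 characters long
val dropLongTokens = udf(
  (annotations: Seq[Row]) => annotations.filter(_.getAs[String]("result").length <= 10),
  tokenField.dataType)

val tokenFiltered = token.withColumn("token",
  dropLongTokens(col("token")).as("token", tokenField.metadata))

Because the column keeps its original metadata, normalizer.fit(tokenFiltered).transform(tokenFiltered) should then pass the inputCols check.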
