Errors when using spark-nlp with Scala

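I am a beginner with Spark NLP, and I am learning it by working through the examples from John Snow Labs. I am using Scala in Databricks.

When I follow the example below,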

import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler().
setInputCol("text").
setOutputCol("document")
val regexTokenizer = new Tokenizer().
setInputCols(Array("sentence")).
setOutputCol("token")
val sentenceDetector = new SentenceDetector().
setInputCols(Array("document")).
setOutputCol("sentence")
val finisher = new Finisher()
.setInputCols("token")
.setIncludeMetadata(true)

finisher.withColumn("newCol", explode(arrays_zip($"finished_token", $"finished_ner")))

When I run the last line, I get the following error:

command-786892578143744:2: error: value withColumn is not a member of com.johnsnowlabs.nlp.Finisher
finisher.withColumn("newCol", explode(arrays_zip($"finished_token", $"finished_ner")))

What could be causing this?

When I tried the example without that line, I added the following extra lines of code:

val pipeline = new Pipeline().
setStages(Array(
documentAssembler,
sentenceDetector,
regexTokenizer,
finisher
))
val data1 = Seq("hello, this is an example sentence").toDF("text")
pipeline.fit(data1).transform(data1).toDF("text")

When I run the last line, I get a different error:

java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.

Can anyone help me solve this?

Thanks

Here is what your code should look like. First, build the pipeline:

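import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

// Turns the raw "text" column into a "document" annotation.
val documentAssembler = new DocumentAssembler().
  setInputCol("text").
  setOutputCol("document")

// Splits each document into sentences.
val sentenceDetector = new SentenceDetector().
  setInputCols(Array("document")).
  setOutputCol("sentence")

// Splits each sentence into tokens.
val regexTokenizer = new Tokenizer().
  setInputCols(Array("sentence")).
  setOutputCol("token")

// Converts the token annotations into plain columns (finished_token and its metadata).
val finisher = new Finisher().
  setInputCols("token").
  setIncludeMetadata(true)

// Each stage reads the output column of an earlier stage, so the order matters.
val pipeline = new Pipeline().
  setStages(Array(
    documentAssembler,
    sentenceDetector,
    regexTokenizer,
    finisher
  ))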

Create a simple DataFrame for testing:

val data1 = Seq("hello, this is an example sentence").toDF("text")

Now fit the pipeline on this DataFrame and use it to transform the DataFrame:

val prediction = pipeline.fit(data1).transform(data1)

The variable prediction is a DataFrame, and that is where you can explode the token column. Let's look inside the prediction DataFrame:

scala> prediction.show
+--------------------+--------------------+-----------------------+
|                text|      finished_token|finished_token_metadata|
+--------------------+--------------------+-----------------------+
|hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|
+--------------------+--------------------+-----------------------+
scala> prediction.withColumn("newCol", explode($"finished_token")).show
+--------------------+--------------------+-----------------------+--------+
|                text|      finished_token|finished_token_metadata|  newCol|
+--------------------+--------------------+-----------------------+--------+
|hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|   hello|
|hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|       ,|
|hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|    this|
|hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|      is|
|hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|      an|
|hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...| example|
|hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|sentence|
+--------------------+--------------------+-----------------------+--------+

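The first issue, as Alberto mentioned, is treating finisher as a DataFrame. It is an annotator, that is, a pipeline stage; only the result of calling transform on the fitted pipeline is a DataFrame, so withColumn has to be called on that result.
The second issue is that you called .toDF("text") where you don't need it (after the pipeline transform). The transformed DataFrame has three columns, and you passed a single column name.
Your explode call is also in the wrong place, and you are zipping a column that doesn't even exist in your pipeline: finished_ner. The pipeline has no NER stage.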
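
If a pipeline did produce a second array column aligned with the tokens, the zip-and-explode step would be applied to the transformed DataFrame, not to a stage. The following is only a minimal sketch in plain Spark (no Spark NLP), assuming Spark 2.4 or later for arrays_zip; the column names tokens and tags and the sample values are hypothetical and stand in for whatever columns your own pipeline outputs.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{arrays_zip, explode}

// Reuses the notebook's session if one already exists; local[*] is only a fallback.
val spark = SparkSession.builder().master("local[*]").appName("zip-explode-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data: two array columns with the same length in each row.
val df = Seq(
  (Seq("hello", ",", "this"), Seq("O", "O", "O"))
).toDF("tokens", "tags")

// Zip the two arrays element-wise into an array of structs,
// then emit one output row per struct.
val exploded = df.withColumn("pair", explode(arrays_zip($"tokens", $"tags")))

exploded.select("pair").show(false)
```

Both columns passed to arrays_zip must exist in the DataFrame, which is why the original call failed to make sense even on the right object: nothing in the pipeline produced finished_ner.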

Feel free to ask any questions and I will update the answer accordingly.

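I think you have two problems. 1. First, you are trying to apply withColumn to an annotator; you should do that on a DataFrame. 2. I think the second error comes from the toDF() call after the transform. It expects one name per column, and you provided only one. Also, you probably don't need this toDF() at all.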
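
To illustrate the second point, here is a small sketch of how toDF behaves when the number of names does not match the number of columns. It uses plain Spark with made-up sample values; the three-column DataFrame only mimics the shape of the transformed output and is not produced by Spark NLP.

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one already exists; local[*] is only a fallback.
val spark = SparkSession.builder().master("local[*]").appName("toDF-sketch").getOrCreate()
import spark.implicits._

// Stand-in for the transformed DataFrame: three columns.
val threeCols = Seq(
  ("hello there", Seq("hello", "there"), Seq("0", "0"))
).toDF("text", "finished_token", "finished_token_metadata")

// One name for three columns: toDF rejects this as soon as it is called.
val wrong = Try(threeCols.toDF("text"))
println(s"single name failed: ${wrong.isFailure}")

// Renaming works only when every column gets a name.
val renamed = threeCols.toDF("text", "tokens", "tokens_metadata")
println(renamed.columns.mkString(", "))
```

Since the result of transform is already a DataFrame, the simplest fix is to drop the toDF call altogether.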

Alberto.
