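I have a DataFrame full of IP addresses, and a list of IP addresses that I want to remove from it. I want a new DataFrame, "filtered_list", that has every IP address in "lista" removed.
I saw an example in "How to use NOT IN clause in filter condition in Spark", but I can't get it to work even before applying the "not" to the filter. Please help.
Example: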
var df = Seq("119.73.148.227", "42.61.124.218", "42.61.66.174", "118.201.94.2","118.201.149.146", "119.73.234.82", "42.61.110.239", "58.185.72.118", "115.42.231.178").toDF("ipAddress")
var lista = List("119.73.148.227", "118.201.94.2")
var filtered_list = df.filter(col("ipAddress").isin(lista))
I get the following error message:
java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.$colon$colon List(119.73.148.227, 118.201.94.2)
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:77)
at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:163)
at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:163)
at scala.util.Try.getOrElse(Try.scala:79)
at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:162)
at org.apache.spark.sql.functions$.typedLit(functions.scala:113)
at org.apache.spark.sql.functions$.lit(functions.scala:96)
at org.apache.spark.sql.Column$$anonfun$isin$1.apply(Column.scala:787)
at org.apache.spark.sql.Column$$anonfun$isin$1.apply(Column.scala:787)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.Column.isin(Column.scala:787)
... 52 elided
You can use the except method on the DataFrame.
var df = Seq("119.73.148.227", "42.61.124.218", "42.61.66.174", "118.201.94.2","118.201.149.146", "119.73.234.82", "42.61.110.239", "58.185.72.118", "115.42.231.178").toDF("ipAddress")
var lista = Seq("119.73.148.227", "118.201.94.2").toDF("ipAddress")
var onlyWantedIp = df.except(lista)
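One caveat with except is that it follows EXCEPT DISTINCT semantics, so duplicate addresses in df are collapsed in the result. If duplicates need to be kept, a left anti join is one possible alternative. The sketch below assumes a local SparkSession (the master and app name are placeholders, and the duplicated address is added only for illustration; none of this is from the original post):

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("anti-join-example").getOrCreate()
import spark.implicits._

// "42.61.124.218" appears twice on purpose, to show the difference.
val df = Seq("119.73.148.227", "42.61.124.218", "42.61.124.218", "118.201.94.2").toDF("ipAddress")
val lista = Seq("119.73.148.227", "118.201.94.2").toDF("ipAddress")

// except removes the listed addresses and also de-duplicates the remaining rows.
val viaExcept = df.except(lista)

// A left anti join keeps every row of df that has no match in lista, duplicates included.
val viaAntiJoin = df.join(lista, Seq("ipAddress"), "left_anti")

viaExcept.show()
viaAntiJoin.show()
```

Which of the two fits depends on whether repeated addresses in the input carry meaning.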
isin takes varargs, not a List. You have to use the : _* type ascription to expand the list into separate elements:
var filtered_list = df.filter(col("ipAddress").isin(lista: _*))
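Since the question ultimately asks for NOT IN, the same condition can be negated once the varargs issue is fixed. A minimal sketch, assuming a local SparkSession (the master and app name are placeholders, and the data is a shortened version of the question's sample):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("not-in-example").getOrCreate()
import spark.implicits._

val df = Seq("119.73.148.227", "42.61.124.218", "118.201.94.2").toDF("ipAddress")
val lista = List("119.73.148.227", "118.201.94.2")

// `lista: _*` passes the List elements as varargs; `!` negates the Column, giving NOT IN.
val filtered_list = df.filter(!col("ipAddress").isin(lista: _*))

filtered_list.show()
```

On Spark 2.4 or later, col("ipAddress").isInCollection(lista) should accept the collection directly, which avoids the : _* expansion altogether.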