Filter by array value in a Spark DataFrame

I am using Apache Spark 1.5 DataFrames with Elasticsearch, and I am trying to filter on an id from a column that contains a list (array) of ids.

For example, the Elasticsearch mapping for the column looks like this:

    {
        "people": {
            "properties": {
                "artist": {
                    "properties": {
                        "id": {
                            "index": "not_analyzed",
                            "type": "string"
                        },
                        "name": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            }
        }
    }

The sample data looks like this:

{
    "people": {
        "artist": [
            {
                "id": "153",
                "name": "Tom"
            },
            {
                "id": "15389",
                "name": "Cok"
            }
        ]
    }
},
{
    "people": {
        "artist": [
            {
                "id": "369",
                "name": "Carl"
            },
            {
                "id": "15389",
                "name": "Cok"
            },
            {
                "id": "698",
                "name": "Sol"
            }
        ]
    }
}

In Spark, I tried this:

val peopleId  = 152
val dataFrame = sqlContext.read
     .format("org.elasticsearch.spark.sql")
     .load("index/type")
dataFrame.filter(dataFrame("people.artist.id").contains(peopleId))
    .select("people_sequence.artist.id")

This returns every id that contains 152, such as 1523 and 152978, not only the ids equal to 152.

Then I tried:

dataFrame.filter(dataFrame("people.artist.id").equalTo(peopleId))
    .select("people.artist.id")

This returns an empty result, and I understand why: people.artist.id is an array of ids, not a single value.

Can anyone tell me how to filter when the column holds a list of ids?

In Spark 1.5+ you can use the array_contains function:

import org.apache.spark.sql.functions.array_contains
df.where(array_contains($"people.artist.id", "153"))
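To illustrate why an element-wise check behaves differently from the substring-style match reported in the question, here is a minimal sketch in plain Scala, without Spark. It models the two sample documents as sequences of ids; the object and method names are hypothetical and chosen only for this illustration.

```scala
// Plain-Scala model of the two sample documents: each row holds its artist ids.
object MembershipSketch {
  val rows: Seq[Seq[String]] = Seq(
    Seq("153", "15389"),
    Seq("369", "15389", "698")
  )

  // Substring-style match: true if any id merely contains the text v.
  def substringMatch(ids: Seq[String], v: String): Boolean =
    ids.exists(_.contains(v))

  // Element-wise match: true only if some id equals v exactly.
  def exactMatch(ids: Seq[String], v: String): Boolean =
    ids.contains(v)

  def main(args: Array[String]): Unit = {
    // Both rows match, because "15389" contains "153".
    println(rows.count(substringMatch(_, "153"))) // prints 2
    // Only the first row holds the id "153" itself.
    println(rows.count(exactMatch(_, "153")))     // prints 1
  }
}
```

The second predicate is the behaviour you want from the filter: the id has to match a whole array element, not part of one.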

If you are on an earlier version, you can try a UDF like this:

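import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{lit, udf}
// True when any struct in the artist array has an id equal to v.
val containsId = udf(
  (rs: Seq[Row], v: String) => rs.map(_.getAs[String]("id")).exists(_ == v))
df.where(containsId($"people.artist", lit("153")))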
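One thing to watch with the UDF approach: a document without any artists may reach the function as null, in which case calling map on it would throw. Below is a sketch of a null-safe variant of the predicate, written as a plain Scala function so it can be tried without Spark. The helper name is hypothetical, and the commented-out Spark wiring is an assumption about how it would be used on the projected people.artist.id column.

```scala
object NullSafeContains {
  // Hypothetical helper: true only if ids is present and holds v as a whole element.
  def containsId(ids: Seq[String], v: String): Boolean =
    ids != null && ids.contains(v)

  // With Spark on the classpath this could be wrapped as follows
  // (assumed usage, not executed in this sketch):
  //   val containsIdUdf = udf(containsId _)
  //   df.where(containsIdUdf($"people.artist.id", lit("153")))

  def main(args: Array[String]): Unit = {
    println(containsId(Seq("153", "15389"), "153"))        // prints true
    println(containsId(Seq("369", "15389", "698"), "153")) // prints false
    println(containsId(null, "153"))                       // prints false
  }
}
```

Keeping the predicate as an ordinary function also makes it easy to unit test separately from the DataFrame code.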
