我有一个如下所示的数据帧。
ID, details_Json
1 {"name":"Anne","Age":"12","country":"Denmark"}
2 {"name":"Zen","Age":"24"}
3 {"name":"Fred","Age":"20","country":"France"}
4 {"name":"Mona","Age":"18","country":"Denmark"}
如您所见,json 中的字段不是固定的。它可以包含多个给定字段。我的意思是有时name, Age, country
,有时可能是name, Age, country, University
或name, Age, university
我想过滤包含country
的行,这些行在其 json 和国家/地区等于丹麦。
我的输出应如下所示。
ID, details_Json
1 {"name":"Anne","Age":"12","country":"Denmark"}
4 {"name":"Mona","Age":"18","country":"Denmark"}
有什么办法可以做到这一点吗?
谢谢:)
这里有一种方法:
//Construct dataframe
val df = sc.parallelize(Seq((1,"{"name":"Anne","Age":"12","country":"Denmark"}"),
(2, "{"name":"Zen","Age":"24"}"),
(3, "{"name":"Fred","Age":"20","country":"France"}"),
(4, "{"name":"Mona","Age":"18","country":"Denmark"}"))).toDF("ID", "details_Json")
df.show
+---+--------------------+
| ID| details_Json|
+---+--------------------+
| 1|{"name":"Anne","A...|
| 2|{"name":"Zen","Ag...|
| 3|{"name":"Fred","A...|
| 4|{"name":"Mona","A...|
+---+--------------------+
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
val struct =
StructType(
StructField("name", StringType, true) ::
StructField("Age", StringType, true) ::
StructField("country", StringType, true) :: Nil)
val df2 = df.withColumn("details_Struct", from_json($"details_Json", struct)).withColumn("country", $"details_Struct".getField("country")).filter($"country".equalTo("Denmark")).drop("country", "details_Struct")
df2.show
+---+--------------------+
| ID| details_Json|
+---+--------------------+
| 1|{"name":"Anne","A...|
| 4|{"name":"Mona","A...|
+---+--------------------+
上面的答案在Apache Spark 2.3.1中。您使用哪个版本?在版本 2.4.1 中,有一个自动推断架构的schema_of_json
函数。您可能还想检查一下。https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$@schema_of_json(json:String):org.apache.spark.sql.Column