PySpark: iteratively get a value from a column containing JSON strings



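I would like to know how to iteratively extract values from a JSON string in PySpark. My data has the following format, and I want to create the "value" column, which holds the entry of json_string whose key matches id_2: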

id_1 | id_2 | json_string             | value
1    | 1001 | {"1001":106,"2200":101} | 106
1    | 2200 | {"1001":106,"2200":101} | 101

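You can use get_json_object inside expr(), which lets you concatenate the string "$." with id_2 to build the JSON path for each row: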

from pyspark.sql import SparkSession, functions as func

spark = SparkSession.builder.getOrCreate()

data_ls = [
    ("1", "1001", '''{"1001":106, "2200":101}'''),
    ("1", "2200", '''{"1001":106, "2200":101}''')
]
data_sdf = spark.createDataFrame(data_ls, ("id1", "id2", "jstr"))
# +---+----+--------------------+
# |id1| id2|                jstr|
# +---+----+--------------------+
# |  1|1001|{"1001":106, "220...|
# |  1|2200|{"1001":106, "220...|
# +---+----+--------------------+
# build the JSON path "$.<id2>" per row and extract that key from jstr
(
    data_sdf
    .withColumn('val', func.expr('get_json_object(jstr, concat("$.", id2))'))
    .show(truncate=False)
)
# +---+----+------------------------+---+
# |id1|id2 |jstr                    |val|
# +---+----+------------------------+---+
# |1  |1001|{"1001":106, "2200":101}|106|
# |1  |2200|{"1001":106, "2200":101}|101|
# +---+----+------------------------+---+
