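I'm wondering how to get values out of a JSON string in PySpark row by row, where the key to look up comes from another column. My data has the following format, and I would like to create the "value" column: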
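+----+----+-----------------------+-----+
|id_1|id_2|json_string            |value|
+----+----+-----------------------+-----+
|1   |1001|{"1001":106,"2200":101}|106  |
|1   |2200|{"1001":106,"2200":101}|101  |
+----+----+-----------------------+-----+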
You can use get_json_object inside expr(), which lets you build the JSON path by concatenating the string "$." with the id_2 column.
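from pyspark.sql import SparkSession, functions as func

# reuse the active session if there is one (e.g. in a notebook), else create it
spark = SparkSession.builder.getOrCreate()

data_ls = [
    ("1", "1001", '''{"1001":106, "2200":101}'''),
    ("1", "2200", '''{"1001":106, "2200":101}''')
]
data_sdf = spark.createDataFrame(data_ls, ("id1", "id2", "jstr"))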
# +---+----+--------------------+
# |id1| id2| jstr|
# +---+----+--------------------+
# | 1|1001|{"1001":106, "220...|
# | 1|2200|{"1001":106, "220...|
# +---+----+--------------------+
# build the path "$.<id2>" per row and extract that key from the JSON string
data_sdf. \
    withColumn('val', func.expr('get_json_object(jstr, concat("$.", id2))')). \
    show(truncate=False)
# +---+----+------------------------+---+
# |id1|id2 |jstr |val|
# +---+----+------------------------+---+
# |1 |1001|{"1001":106, "2200":101}|106|
# |1 |2200|{"1001":106, "2200":101}|101|
# +---+----+------------------------+---+
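
For intuition, the per-row logic of that expression can be sketched in plain Python, without Spark. This is only an illustration under simplifying assumptions: the helper name `lookup_by_key` is hypothetical, and it handles only flat JSON objects with numeric or string values like the ones in the question. It mirrors two behaviours of get_json_object: the result comes back as a string, and a missing key gives a null.

```python
import json


def lookup_by_key(jstr, key):
    """Rough stand-in for get_json_object(jstr, concat("$.", key)) on a flat JSON object."""
    path = "$." + key            # the JSON path the Spark expression builds for each row
    parsed = json.loads(jstr)    # parse the JSON string into a dict
    value = parsed.get(path[2:])  # drop the "$." prefix and look up the top-level key
    # get_json_object yields a string column, and null when the key is absent
    return None if value is None else str(value)


rows = [
    ("1", "1001", '{"1001":106, "2200":101}'),
    ("1", "2200", '{"1001":106, "2200":101}'),
]
vals = [lookup_by_key(jstr, id2) for _, id2, jstr in rows]
print(vals)  # ['106', '101']
```

Because the result is a string, cast the val column (for example with cast('int')) if you need a numeric type downstream.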