我想更新json中存储为字符串列的键名,并将其保存回字符串类型列。我正在从我的csv&存储为csv。
这就是我的输入csv的样子。
candidate_email,transactions
cust2@email.com,"[{'transaction_id':'12', 'transaction_amount':'$23.43'},{'transaction_id':'15', 'transaction_amount':'$723.41'}]"
cust1@email.com,"[{'transaction_id':'10', 'transaction_amount':'$55.99'},{'transaction_id':'11', 'transaction_amount':'$20.46'},{'transaction_id':'13', 'transaction_amount':'$5.89'},{'transaction_id':'14', 'transaction_amount':'$35.61'}]"
我想在json中用id
替换transaction_id
密钥,用amount
替换transaction_amount
密钥,并将其保存回csv。
input_df = spark.read.csv('transactions/*.csv', header='true', inferSchema = True)
input_df.printSchema()
# root
# |-- candidate_email: string (nullable = true)
# |-- transactions: string (nullable = true)
input_df.show(10, False)
# +-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
# |candidate_email|transactions |
# +-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
# |cust2@email.com |[{'transaction_id':'12', 'transaction_amount':'$23.43'},{'transaction_id':'15', 'transaction_amount':'$723.41'}] |
# |cust1@email.com |[{'transaction_id':'10', 'transaction_amount':'$55.99'},{'transaction_id':'11', 'transaction_amount':'$20.46'},{'transaction_id':'13', 'transaction_amount':'$5.89'},{'transaction_id':'14', 'transaction_amount':'$35.61'}]|
# +-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
如何替换密钥以获得以下输出?
output_df.show(10,False)
# +---------------+----------------------------------------------------------------------------------------------------------------------------+
# |candidate_email|transactions |
# +---------------+----------------------------------------------------------------------------------------------------------------------------+
# |cust1@email.com|[{'id':'10', 'amount':'$55.99'},{'id':'11', 'amount':'$20.46'},{'id':'13', 'amount':'$5.89'},{'id':'14', 'amount':'$35.61'}]|
# |cust2@email.com|[{'id':'12', 'amount':'$23.43'},{'id':'15', 'amount':'$723.41'}] |
# +---------------+----------------------------------------------------------------------------------------------------------------------------+
注意:这两列都是字符串类型的列。
output_df.printSchema()
# root
# |-- candidate_email: string (nullable = true)
# |-- transactions: string (nullable = true)
使用from_json
将transactions列读取为array(struct...)
,然后强制转换为所需的字段名。
- 然后
explode + to_json + groupBy + collect_list
获取所需的json
Example:
df.show()
#+---------------+----------------------------------------------------------------------------------------------------------------+
#|candidate_email|transactions |
#+---------------+----------------------------------------------------------------------------------------------------------------+
#|cust2@email.com|[{'transaction_id':'12', 'transaction_amount':'$23.43'},{'transaction_id':'15', 'transaction_amount':'$723.41'}]|
#+---------------+----------------------------------------------------------------------------------------------------------------+
st=ArrayType(StructType([StructField("transaction_id", StringType()),StructField("transaction_amount", StringType())]))
df.withColumn("jsn",from_json(col("transactions"),st).cast("array<struct<id:string,amount:string>>")).
selectExpr("*","explode(jsn)").
select("*","col.*").
drop(*drop_cols).
selectExpr("candidate_email","to_json(struct(id,amount)) as trans").
groupBy("candidate_email").
agg(collect_list("trans").alias("transactions")).
show(10,False)
#+---------------+---------------------------------------------------------------+
#|candidate_email|transactions |
#+---------------+---------------------------------------------------------------+
#|cust2@email.com|[{"id":"12","amount":"$23.43"}, {"id":"15","amount":"$723.41"}]|
#+---------------+---------------------------------------------------------------+