How to remove double quotes and ';' from the header in PySpark



I am trying to remove the double quotes (") and the ; from my CSV file in PySpark. The data in the CSV looks like this:

age;"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"

The code I am using is:

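from pyspark.sql import functions as F
df = spark.read.options(delimiter=';').csv("C:/Project_bankdata.csv", header=True)
# Strip double quotes from every value (the column names are left unchanged)
df1 = df.select([F.regexp_replace(c, '"', '').alias(c) for c in df.columns])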
df1.show(10,truncate=0)

Output:

|"age;""job""   |""marital""|""education""|""default""|""balance""|""housing""|""loan""|""contact""|""day""|""month""|""duration""|""campaign""|""pdays""|""previous""|""poutcome""|""y"""|
+---------------+-----------+-------------+-----------+-----------+-----------+--------+-----------+-------+---------+------------+------------+---------+------------+------------+------+
|58;management  |married    |tertiary     |no         |2143       |yes        |no      |unknown    |5      |may      |261         |1           |-1       |0           |unknown     |no    |

I can remove the quotes from the data, but not from the header. How can I remove the double quotes from the header?

I could only reproduce your output when I used this input CSV:

"age;""job"";""marital"";""education"";""default"";""balance"";""housing"";""loan"";""contact"";""day"";""month"";""duration"";""campaign"";""pdays"";""previous"";""poutcome"";""y"""
"58;"management"";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"

You can read the CSV as a text file, remove all double quotes (") from each line, and then build a DataFrame.

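# Read the file as plain text, one line per record
rdd = spark.sparkContext.textFile(r"C:\temp\temp.csv")
# Drop every double quote, then split each line on the delimiter
rdd = rdd.map(lambda line: line.replace('"', '').split(';'))
# The first record holds the column names
header = rdd.first()
# Keep only the data rows and use the header as column names
df = rdd.filter(lambda line: line != header).toDF(header)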
df.show()
# +---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
# |age|       job|marital|education|default|balance|housing|loan|contact|day|month|duration|campaign|pdays|previous|poutcome|  y|
# +---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
# | 58|management|married| tertiary|     no|   2143|    yes|  no|unknown|  5|  may|     261|       1|   -1|       0| unknown| no|
# +---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+

Note that this effectively strips the string quoting from the CSV file, so it only works well if none of your values contain a ; character.
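
To make that caveat concrete, here is a small plain-Python sketch that runs without Spark. The helper name split_line is hypothetical and simply mirrors the lambda passed to rdd.map; the sample row with a ; inside a quoted value is invented for illustration and does not come from the original data. Python's standard csv module is shown only for comparison, since it respects quoting.

```python
import csv

def split_line(line):
    # Same logic as the lambda used in rdd.map: drop quotes, split on ';'
    return line.replace('"', '').split(';')

# Shortened header line from the question: splits cleanly
header = 'age;"job";"marital"'
print(split_line(header))

# Hypothetical row whose quoted value contains the delimiter
row = '58;"management;senior";"married"'

# Naive approach: the quoted value is broken into two fields
naive = split_line(row)
print(naive)

# Quote-aware parsing keeps the quoted value as a single field
quoted = next(csv.reader([row], delimiter=';', quotechar='"'))
print(quoted)
```

If your values may contain the delimiter, a quote-aware parser is the safer choice; the strip-and-split approach is fine for data like the sample in the question.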
