I am trying to remove the double quotes (") from a CSV file that I read in PySpark. The data in the CSV looks like this:
age;"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"
The code I am using is:
from pyspark.sql import functions as F
df = spark.read.options(delimiter=';').csv("C:/Project_bankdata.csv", header=True)
df1 = df.select([F.regexp_replace(c, '"', '').alias(c) for c in df.columns])
df1.show(10,truncate=0)
Output:
|"age;""job"" |""marital""|""education""|""default""|""balance""|""housing""|""loan""|""contact""|""day""|""month""|""duration""|""campaign""|""pdays""|""previous""|""poutcome""|""y"""|
+---------------+-----------+-------------+-----------+-----------+-----------+--------+-----------+-------+---------+------------+------------+---------+------------+------------+------+
|58;management |married |tertiary |no |2143 |yes |no |unknown |5 |may |261 |1 |-1 |0 |unknown |no |
I can remove the quotes from the data, but not from the header. How can I remove the double quotes from the header?
I can only reproduce your output when I use this input CSV:
"age;""job"";""marital"";""education"";""default"";""balance"";""housing"";""loan"";""contact"";""day"";""month"";""duration"";""campaign"";""pdays"";""previous"";""poutcome"";""y"""
"58;"management"";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"
You can read the CSV as a text file, remove all double quotes (") from every line, and then build a DataFrame from the result.
rdd = spark.sparkContext.textFile(r"C:\temp\temp.csv")
rdd = rdd.map(lambda line: line.replace('"', '').split(';'))
header = rdd.first()
df = rdd.filter(lambda line: line != header).toDF(header)
df.show()
# +---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
# |age| job|marital|education|default|balance|housing|loan|contact|day|month|duration|campaign|pdays|previous|poutcome| y|
# +---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
# | 58|management|married| tertiary| no| 2143| yes| no|unknown| 5| may| 261| 1| -1| 0| unknown| no|
# +---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
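Note that this effectively strips the string quoting from the CSV file, so quoted values are no longer protected from the delimiter. It therefore only works well if none of your values contain a ;.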
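To make that caveat concrete, here is a small plain-Python sketch (no Spark needed). The sample line is hypothetical and not taken from your file; it only shows what happens when a quoted value contains the delimiter. The comparison uses Python's standard csv module as one quote-aware way to parse the same line.

```python
import csv
import io

# Hypothetical sample line: the quoted job value contains the ';' delimiter.
line = '58;"management; senior";"married"'

# Approach from the answer above: strip all quotes, then split on ';'.
naive = line.replace('"', '').split(';')

# Quote-aware parsing with the standard library csv module.
parsed = next(csv.reader(io.StringIO(line), delimiter=';', quotechar='"'))

print(len(naive), naive)    # the quoted value is split into two fields
print(len(parsed), parsed)  # the quoted value stays in one field
```

If your data can contain ; inside quoted values, the strip-and-split approach will shift the columns for those rows, so a quote-aware parser is the safer choice there.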