csv文件包含带有特殊字符的数据,包括逗号(,),(\)和(" ")。无法创建具有正确列数的 df?-py-spark



我有一个CSV文件,我想用py spark中的CSV文件创建一个数据帧,但无法做到这一点,因为有些行包含带有特殊字符的数据,其中一半列是双引号的。以下是我迄今为止尝试的数据。

sample_row

"ABG090D",2019-03-03 00:00:00.0000000,"A","some Data C" AB01","Some Data","LOS","NEW",2019-04-11 00:00:00.0000000,"GHYTR","7860973478","0989","A",2019-03-03 00:00:00.0000000,"Y","N","N","N",1,"N","D016619",,"$,$#,&","Y",
"69901",,,,"FGF",89.00,"W",,"N","R","F",5.00,6.00,6.00,9.00,2.00,0,0,"9090",,"N",,,"1","N",,,"F",,2019-03-03 00:00:00.0000000,,,,,"N","A","N","N","N","N","N",,,,,,,"H",,,,,,,,,,"N","A","0","0","0",,0,0,0,0,0,0,0,"N","00","USA",
"C","I",0,,,,"FGF",0,,,"N","UOIU","5",,0,,0,0,,,"878","N",2019-04-11 09:44:00.0000000,"8980909","H",,,,"N","2","T","SomeData",
2020-03-12 09:24:52.0000000

在上述数据中,我面临的两个主要问题是:

1."some Data C\"AB01">=>因为它包含反斜杠((和引号("(作为数据的一部分。

2."$,$#,&">=>因为它包含逗号(,(作为数据的一部分

df = spark.read.option("quote",""").option("escape",""").option("escape","\").option("delimiter" , ",").option("ignoreLeadingWhiteSpace", "true").csv("/path/file.csv",customSchema)

使用上面的代码,我能够解决"一些数据C\"AB01">,但第二列,即"$,$#,&">在这里制造了一个问题。

甚至我也试着使用下面链接中给出的答案。但这对我来说也不起作用。如何在Scala中删除TextQualifier文件的双引号中的双引号和额外分隔符

在您的情况下,您最好构建自己的解析器。我写了一个简单的例子,如下所示,使用正则表达式解析文件并将值存储在values列表中。

希望这种方法对你有用。

import re
regex = r"("([^"]+)",?|([^,]+),?|,)"
test_str = ""ABG090D",2019-03-03 00:00:00.0000000,"A","some Data C\" AB01","Some Data","LOS","NEW",2019-04-11 00:00:00.0000000,"GHYTR","7860973478","0989","A",2019-03-03 00:00:00.0000000,"Y","N","N","N",1,"N","D016619",,"$,$#,&","Y", "69901",,,,"FGF",89.00,"W",,"N","R","F",5.00,6.00,6.00,9.00,2.00,0,0,"9090",,"N",,,"1","N",,,"F",,2019-03-03 00:00:00.0000000,,,,,"N","A","N","N","N","N","N",,,,,,,"H",,,,,,,,,,"N","A","0","0","0",,0,0,0,0,0,0,0,"N","00","USA", "C","I",0,,,,"FGF",0,,,"N","UOIU","5",,0,,0,0,,,"878","N",2019-04-11 09:44:00.0000000,"8980909","H",,,,"N","2","T","SomeData", 2020-03-12 09:24:52.0000000"
matches = re.finditer(regex, test_str, re.MULTILINE)
values = []
for matchNum, match in enumerate(matches, start=1):
if match.group(3) != None:
values.append(match.group(3))
elif match.group(2) != None:
values.append(match.group(2))
else:
values.append(None)
print(values)

最新更新