I'm reading a CSV file into a Spark DataFrame with the code below, but the output is a mess:
df = spark.read.format('csv').options(header=True, inferSchema=True).csv('spark.csv')
Output:
[wrapped df.show() output: a 24-column table (PARID, PROPERTYHOUSENUM, PROPERTYFRACTION, PROPERTYADDRESSDIR, PROPERTYADDRESSSTREET, PROPERTYADDRESSSUF, PROPERTYADDRESSUNITDESC, PROPERTYUNITNO, PROPERTYCITY, PROPERTYSTATE, PROPERTYZIP, SCHOOLCODE, SCHOOLDESC, MUNICODE, MUNIDESC, RECORDDATE, SALEDATE, PRICE, DEEDBOOK, DEEDPAGE, SALECODE, SALEDESC, INSTRTYP, INSTRTYPDESC) followed by five rows of property-sale records (e.g. PARID 1075F000108000000, 4720 HIGHPOINT DR, GIBSONIA PA 15044, SALEDATE 2012-09-27, PRICE 1120000), all wrapped across terminal lines into one unreadable block and ending with "only showing top 5 rows"]
I'm new to big-data problems and am trying to learn how to use Spark correctly. How do I read this DataFrame properly? Am I missing some option?
You have read the DataFrame correctly. It is just too wide (too many columns) to fit in your terminal window, so each row wraps onto several lines, which makes the output look like a mess.
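As a quick sanity check (a minimal sketch, assuming the df from the question), you can confirm that the file really was parsed into separate, typed columns and that only the display is the problem:

# Number of columns parsed from the CSV header
print(len(df.columns))

# One line per column with its inferred type; much narrower than show()
df.printSchema()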
If you want tidier output, try df.show(vertical=True), which prints each row as a vertical list of field/value pairs, or select just a few columns to display, e.g. df.select(df.columns[:2]).show() to show the first two columns.
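Putting that together, here is a minimal sketch of both options (it assumes the df from the question; the column names PARID, SALEDATE, and PRICE are taken from the printed header above):

# Print each of the first 5 rows as a vertical field/value listing
# instead of one very wide table row
df.show(n=5, vertical=True)

# Or project down to a handful of columns so the table fits the window
df.select(df.columns[:2]).show(5)

# Selecting columns by name also works and is easier to read
df.select('PARID', 'SALEDATE', 'PRICE').show(5, truncate=False)

Both approaches only change how the DataFrame is displayed; the underlying data is unaffected.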