How to define a custom schema to read a CSV file

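I am trying to define a custom schema that does not include all of the columns in the CSV file. Is this possible?

When I try this, the values do not correspond to the columns I specified, because they keep coming out in the column order of the CSV file.

Dataset: https://data.cityofnewyork.us/dataset/GreenThumb-Garden-Info/p78i-pat6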

https://i.stack.imgur.com/jMHFT.png

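Based on the sample data link you provided, I have added a complete schema that should let you read the sample CSV data without any problems. If you do not need some of the fields, you can use select to keep only the fields relevant to your further processing. Example schema: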

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

customSchema = StructType([
    StructField("assemblydist", IntegerType(), True),
    StructField("borough", StringType(), True),
    StructField("communityboard", IntegerType(), True),
    StructField("congressionaldist", IntegerType(), True),
    StructField("coundist", IntegerType(), True),
    StructField("gardenname", StringType(), True),
    StructField("juris", StringType(), True),
    StructField("multipolygon", StringType(), True),
    StructField("openhrsf", StringType(), True),
    StructField("openhrsm", StringType(), True),
    StructField("openhrssa", StringType(), True),
    StructField("openhrssu", StringType(), True),
    StructField("openhrsth", StringType(), True),
    StructField("openhrstu", StringType(), True),
    StructField("openhrsw", StringType(), True),
    StructField("parksid", StringType(), True),
    StructField("policeprecinct", StringType(), True),
    StructField("statesenatedist", IntegerType(), True),
    StructField("status", StringType(), True),
    StructField("zipcode", IntegerType(), True)
])

If you want to decode the Multi-Polygon type, that will require extra steps to parse the data. For ease of use, I simply read it as a plain string.
