How to read multiple CSV files in PySpark and combine them into a single DataFrame



I have 4 CSV files with different columns. Some of the CSVs also share column names. The details of the CSVs are:

capstone_customers.csv: [customer_id, customer_type, repeat_customer]
capstone_invoices.csv: [invoice_id, product_id, customer_id, days_until_shipped, product_line, total]
capstone_recent_customers.csv: [customer_id, customer_type]
capstone_recent_invoices.csv: [invoice_id, product_id, customer_id, days_until_shipped, product_line, total]

My code is:

df1 = spark.read.options(inferSchema='True',header='True',delimiter=',').csv("capstone_customers.csv")
df2 = spark.read.options(inferSchema='True',header='True',delimiter=',').csv("capstone_invoices.csv")
df3 = spark.read.options(inferSchema='True',header='True',delimiter=',').csv("capstone_recent_customers.csv")
df4 = spark.read.options(inferSchema='True',header='True',delimiter=',').csv("capstone_recent_invoices.csv")

from functools import reduce

def unite_dfs(df1, df2):
    return df2.union(df1)

list_of_dfs = [df1, df2, df3, df4]
united_df = reduce(unite_dfs, list_of_dfs)

But I get this error:

Union can only be performed on tables with the same number of columns, but the first table has 6 columns and the second table has 3 columns;;
'Union
:- Relation[invoice_id#234,product_id#235,customer_id#236,days_until_shipped#237,product_line#238,total#239] csv
+- Relation[customer_id#218,customer_type#219,repeat_customer#220] csv

How can I combine them into a single DataFrame and deduplicate the shared column names using PySpark?

To read multiple files in Spark, you can list all the files you need and read them in a single call rather than reading them one by one.

Here is a code sample you can use:

path = ['file1.csv', 'file2.csv']

df = spark.read.options(header=True).csv(path)
df.show()

Instead of reading the files one by one, you can pass a list of files, or a directory path, to a single read. Don't forget the mergeSchema option:

files = [
    "capstone_customers.csv",
    "capstone_invoices.csv",
    "capstone_recent_customers.csv",
    "capstone_recent_invoices.csv",
]
df = spark.read.options(inferSchema='True', header='True', delimiter=',', mergeSchema='True').csv(files)
# or
df = spark.read.options(inferSchema='True', header='True', delimiter=',', mergeSchema='True').csv('/path/to/files/')
