Convert a list to a DataFrame, then join it with a different DataFrame in PySpark



I am working with PySpark DataFrames.

I have a list of date values:

date_list = ['2018-01-19', '2018-01-20', '2018-01-17']

I also have a DataFrame (mean_df) with just one column (mean):

+----+
|mean|
+----+
|67  |
|78  |
|98  |
+----+

Now I want to convert date_list into a column and join it with mean_df.

Expected output:

+------------+----+
|dates       |mean|
+------------+----+
|2018-01-19  |  67|
|2018-01-20  |  78|
|2018-01-17  |  98|
+------------+----+
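For a tiny sample like this, the target shape can also be built directly on the driver by zipping the list with the mean values (a minimal sketch, assuming an active SparkSession named spark and that the means are already available as plain Python numbers):

date_list = ['2018-01-19', '2018-01-20', '2018-01-17']
means = [67, 78, 98]

# build the expected table straight from Python objects
expected_df = spark.createDataFrame(list(zip(date_list, means)), ['dates', 'mean'])
expected_df.show()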

I tried converting the list to a DataFrame (date_df):

date_df = spark.createDataFrame([(l,) for l in date_list], ['dates'])
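(As an aside, createDataFrame infers this column as a plain string; if a true DateType column is needed, to_date can convert it:)

import pyspark.sql.functions as F

# cast the inferred string column to DateType
date_df = date_df.withColumn('dates', F.to_date('dates'))
date_df.printSchema()  # dates: date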

Then I added a new column named "idx" to both date_df and mean_df and used a join:

date_df = mean_df.join(date_df, mean_df.idx == date_df.idx).drop("idx")
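For reference, neither DataFrame has an idx column out of the box; one hypothetical way to create a positional index on both sides is row_number over a single-partition window (note that this pulls the data into one partition, and row order after a union is not guaranteed, so the pairing can be nondeterministic):

from pyspark.sql import Window
import pyspark.sql.functions as F

# number the rows 1..n on each side before joining
w = Window.orderBy(F.monotonically_increasing_id())
date_df = date_df.withColumn('idx', F.row_number().over(w))
mean_df = mean_df.withColumn('idx', F.row_number().over(w))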

I got a timeout error, so I changed the default broadcastTimeout from 300s to 6000s:

spark.conf.set("spark.sql.broadcastTimeout", 6000)

But it did not work at all. And right now I am working with a very small data sample; the actual data is considerably larger.
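For reference, another knob is spark.sql.autoBroadcastJoinThreshold; setting it to -1 disables automatic broadcast joins entirely (mentioned as a possibility, not a verified fix for this case):

# -1 turns off automatic broadcast joins altogether
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)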

Code snippet:

from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql.functions import mean as _mean, col

date_list = ['2018-01-19', '2018-01-20', '2018-01-17']
mean_list = []

# hypo_2 returns two DataFrames for the given date; only the first is used here
for d in date_list:
    h2_df1, h2_df2 = hypo_2(h2_df, d, 2)
    mean1 = h2_df1.select(_mean(col('count_before')).alias('mean_before'))
    mean_list.append(mean1)

mean_df = reduce(DataFrame.unionAll, mean_list)

You can add the dates to the DataFrame inside the loop using withColumn with lit:

from functools import reduce
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

date_list = ['2018-01-19', '2018-01-20', '2018-01-17']
mean_list = []

for d in date_list:
    h2_df1, h2_df2 = hypo_2(h2_df, d, 2)
    # attach the loop's date as a literal column next to the computed mean
    mean1 = h2_df1.select(F.mean(F.col('count_before')).alias('mean_before')) \
                  .withColumn('date', F.lit(d))
    mean_list.append(mean1)

mean_df = reduce(DataFrame.unionAll, mean_list)
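If the column names should match the expected output exactly, a final select can rename them (assuming the mean_df built by the loop above):

result = mean_df.select(F.col('date').alias('dates'),
                        F.col('mean_before').alias('mean'))
result.show()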
