Spark | PySpark with FB Prophet: parallel processing not working with rdd.map

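I'm trying to implement FB Prophet with PySpark, but I can't get the code to parallelize across all available cores (running locally on my machine).

I've searched through a variety of articles trying to understand why this happens.

The code block that should be parallelized is below. All of the mapping functions are already defined.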

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list, struct

if __name__ == '__main__':
    conf = (SparkConf()
            .setMaster("local[*]")
            .setAppName("SparkFBProphet Example"))
    spark = (SparkSession
             .builder
             .config(conf=conf)
             .getOrCreate())
    # Removes some of the logging after session creation so we can still see output
    # Doesn't remove logs before/during session creation
    # To edit more logging you will need to set in log4j.properties on cluster
    sc = spark.sparkContext
    sc.setLogLevel("ERROR")
    # Retrieve data from local csv datastore
    print(compiling_pickle())
    df = retrieve_data()
    # Group data by app and metric_type to aggregate data for each app-metric combo
    df = df.groupBy('column1', 'column2')
    df = df.agg(collect_list(struct('ds', 'y')).alias('data'))

    # saveAsTextFile returns None, so the result is not assigned back to df
    (df.rdd
     .map(lambda r: transform_data(r))
     .map(lambda d: partition_data(d))
     .map(lambda d: create_model(d))
     .map(lambda d: train_model(d))
     .map(lambda d: make_forecast(d))
     .map(lambda d: imp_predictions(d))
     .saveAsTextFile("../data_spark_t/results"))
    spark.stop()

In this section:

print(compiling_pickle())
df = retrieve_data()

the pickle is loaded and compiled, and the csv is generated. In the retrieve function, I only do this:

df = (spark.read.option("header", "true")
      .option("inferSchema", value=True)
      .csv("../data_spark_t/database_created.csv"))

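So I don't understand why my code doesn't use all available cores when it runs.

Just to point out some things that have already been tested:

My number of partitions is 500. I have also set it to the number of rows in df (after the "collect_list"), but it didn't work;
All possible setMaster() combinations have been tried;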
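One way to investigate a setup like this is to look at how the grouped rows are actually distributed across partitions, since only partitions that hold records give a task real work. The following is a minimal sketch, not a diagnosis of the problem above: `partition_sizes` and `summarize` are hypothetical helper names, and the usage lines assume the `df` and `sc` variables from the question.

```python
def partition_sizes(rdd):
    """Return the number of records held by each partition of a Spark RDD."""
    # glom() turns every partition into one list; len() then counts its records.
    return rdd.glom().map(len).collect()


def summarize(sizes, cores):
    """Compare the partitions that hold data with the number of available cores."""
    non_empty = sum(1 for n in sizes if n > 0)
    return {
        "partitions": len(sizes),
        "non_empty": non_empty,
        "records": sum(sizes),
        # Only non-empty partitions give a task real work,
        # and at most `cores` tasks run at the same time.
        "busy_tasks_upper_bound": min(non_empty, cores),
    }


# Hypothetical usage with the names from the question (df after groupBy/agg):
#   sizes = partition_sizes(df.rdd)
#   print(summarize(sizes, sc.defaultParallelism))
#   rdd = df.rdd.repartition(sc.defaultParallelism)  # spread rows out if skewed
```

If the summary shows far fewer non-empty partitions than cores, calling `repartition` before the `map` chain may be worth trying; whether that is the cause here is an assumption that the output of the check would have to confirm.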

Can anyone help?

Problem solved:

from pyspark.sql.types import (StructType, StructField, StringType,
                               TimestampType, FloatType)

# Explicit schema for the forecast output
schema = StructType([
    StructField("column 1", StringType(), True),
    StructField("column 2", StringType(), True),
    StructField("column 3", TimestampType(), True),
    StructField("yhat", FloatType(), True),
    StructField("yhat_lower", FloatType(), True),
    StructField("yhat_upper", FloatType(), True),
])
df = spark.createDataFrame(df, schema)
df.write.options(header=True).csv(
    'dbfs:/mnt/location/output_teste_1', mode='overwrite')

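All that was needed was to save the results using the structure above.

I ran this on Azure Databricks, and the code did the job, using all available nodes.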
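For `spark.createDataFrame` to accept the results with the schema above, each record has to be a flat row whose values line up with the six fields. The post does not show what `imp_predictions` returns, so the following is only a sketch under assumptions: `forecast_to_rows` is a hypothetical helper, the forecast is assumed to be a list of dicts with `ds`, `yhat`, `yhat_lower` and `yhat_upper` keys, and `ds` is assumed to map to the timestamp field.

```python
from datetime import datetime


def forecast_to_rows(column1, column2, forecast_records):
    """Flatten one group's forecast into tuples ordered like the schema fields."""
    rows = []
    for rec in forecast_records:
        rows.append((
            str(column1),
            str(column2),
            rec["ds"],                 # assumed to be a datetime (TimestampType)
            float(rec["yhat"]),        # plain Python floats for the FloatType fields
            float(rec["yhat_lower"]),
            float(rec["yhat_upper"]),
        ))
    return rows


# Hypothetical usage: forecasts_rdd holds one dict per app-metric group.
#   rows_rdd = forecasts_rdd.flatMap(
#       lambda d: forecast_to_rows(d["column1"], d["column2"], d["forecast"]))
#   df = spark.createDataFrame(rows_rdd, schema)

example = forecast_to_rows(
    "app", "cpu",
    [{"ds": datetime(2020, 1, 1), "yhat": 1, "yhat_lower": 0, "yhat_upper": 2}],
)
print(example)
```

Using `flatMap` rather than `map` in the usage lines reflects the assumption that each group produces many forecast rows; if the mapping functions already emit one row per record, a plain `map` would do.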
