Error reading an Athena table in Spark



I have the following code snippet in PySpark:

import pandas as pd
from pyspark import SparkContext, SparkConf
from pyspark.sql import Row, SQLContext, SparkSession
import pyspark.sql.dataframe

def validate_data():
    conf = SparkConf().setAppName("app")
    spark = SparkContext(conf=conf)
    config = {
        "val_path": "s3://forecasting/data/validation.csv"
    }
    data1_df = spark.read.table("db1.data_dest")
    data2_df = spark.read.table("db2.data_source")
    print(data1_df.count())
    print(data2_df.count())

if __name__ == "__main__":
    validate_data()

This code works fine when run in a Jupyter notebook on SageMaker (connected to EMR).

But when we run it as a Python script from the terminal, it throws the following error.

Error message:

AttributeError: 'SparkContext' object has no attribute 'read'

We need to automate these notebooks, so we are trying to convert them into Python scripts.

read can only be called on a SparkSession, not on a SparkContext:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().setAppName("app")
# builder.config(...) only returns a Builder; getOrCreate() creates the session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
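Applied to the script from the question, a minimal sketch could look like the following. The database/table names are copied from the question; enableHiveSupport() is an assumption here (it makes the Glue/Hive catalog visible to spark.read.table and may be unnecessary if the cluster already configures the catalog):

from pyspark import SparkConf
from pyspark.sql import SparkSession

def validate_data():
    conf = SparkConf().setAppName("app")
    # read lives on the session, not on the context
    spark = (SparkSession.builder
             .config(conf=conf)
             .enableHiveSupport()  # assumption: tables live in the Glue/Hive metastore
             .getOrCreate())
    data1_df = spark.read.table("db1.data_dest")
    data2_df = spark.read.table("db2.data_source")
    print(data1_df.count())
    print(data2_df.count())

if __name__ == "__main__":
    validate_data()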

Alternatively, you can wrap an existing SparkContext in a SparkSession:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().setAppName("app")
sc = SparkContext(conf=conf)
# wrap the existing context so that spark.read becomes available
spark = SparkSession(sc)
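With the session wrapping the existing context, the read calls from the question work unchanged, for example (continuing from the snippet above, table name copied from the question):

# spark is the SparkSession created from sc above
data1_df = spark.read.table("db1.data_dest")
print(data1_df.count())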
