Installing AWS Glue ETL Library



Problem

After setting up the AWS Glue libraries, I'm facing the following error:

PS C:\Users\[user]\Documents\[company]\projects\code\data-lake\etl\tealium> python visitor.py
20/04/05 19:33:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
File "visitor.py", line 9, in <module>
glueContext = GlueContext(sc.getOrCreate())
File "C:Users[user]Documents[company]projectscodeaws-glue-libs-glue-1.0PyGlue.zipawsgluecontext.py", line 45, in __init__
File "C:Users[user]Documents[company]projectscodeaws-glue-libs-glue-1.0PyGlue.zipawsgluecontext.py", line 66, in _get_glue_scala_context
TypeError: 'JavaPackage' object is not callable
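
From what I can tell, this TypeError comes from py4j: when the JVM cannot load a requested class, the attribute lookup falls back to a plain JavaPackage object, which is not callable. A quick way to check whether the Glue Scala classes are visible from Python (the fully qualified class name com.amazonaws.services.glue.GlueContext is an assumption about the Glue Scala library):

from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
# py4j returns a JavaClass when the class can be loaded from the JVM
# classpath, and a plain JavaPackage when it cannot -- the latter case
# reproduces the error above.
print(type(sc._jvm.com.amazonaws.services.glue.GlueContext))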

Scenario

I'm trying to install the AWS Glue ETL library in a virtual environment using Pipenv, so I have the following .env file with environment variables:

HADOOP_HOME="C:\Users\[user]\AppData\Local\Spark\winutils"
SPARK_HOME="C:\Users\[user]\AppData\Local\Spark\spark-2.4.3-bin-hadoop2.8\spark-2.4.3-bin-hadoop2.8\spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8"
JAVA_HOME="C:\Program Files\Java\jdk1.8.0_231"
PATH="${HADOOP_HOME}\bin"
PATH="${SPARK_HOME}\bin:${PATH}"
PATH="${JAVA_HOME}\bin:${PATH}"
SPARK_CONF_DIR="C:\Users\[user]\Documents\[company]\projects\code\aws-glue-libs-glue-1.0\conf"
PYTHONPATH="${SPARK_HOME}/python/:${PYTHONPATH}"
PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.7-src.zip:${PYTHONPATH}"
PYTHONPATH="C:/Users/[user]/Documents/[company]/projects/code/aws-glue-libs-glue-1.0/PyGlue.zip:${PYTHONPATH}"

My initial code is very simple; I'm only creating a Glue context, as shown below:

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.conf import SparkConf
sc = SparkContext()
glueContext = GlueContext(sc.getOrCreate())
print(glueContext)
print(sc)

Do you have any idea what the problem might be?

Try this:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

# Parse the job arguments (expects --JOB_NAME on the command line)
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext.getOrCreate()
sc.setLogLevel('INFO')
glueContext = GlueContext(sc)
logger = glueContext.get_logger()
spark = glueContext.spark_session  # spark_session is a property, not a method
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
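
Note that getResolvedOptions reads named arguments from the command line, so when running this locally you'd invoke the script with a --JOB_NAME argument (the value here is just a placeholder):

python visitor.py --JOB_NAME local_test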

Also, if you create a new Glue job, it will give you boilerplate code that solves your problem...
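
For reference, the console-generated script looks roughly like this (a sketch of the Glue 1.0 PySpark job template, quoted from memory rather than verbatim):

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# ... your transformations go here ...

job.commit()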
