获取错误为:
回溯(最近一次调用):文件"/opt/application/main.py ",第6行从pyspark导入SparkConf, SparkContext ModuleNotFoundError: No module named ' pyspark '
在docker中运行pyspark。
和我的dockerfile如下:
FROM centos
ENV DAEMON_RUN=true
ENV SPARK_VERSION=2.4.7
ENV HADOOP_VERSION=2.7
WORKDIR /opt/application
RUN yum -y install python36
RUN yum -y install wget
ENV PYSPARK_PYTHON python3.6
ENV PYSPARK_DRIVER_PYTHON python3.6
RUN ln -s /usr/bin/python3.6 /usr/local/bin/python
RUN wget https://bootstrap.pypa.io/get-pip.py
RUN python get-pip.py
RUN pip3.6 install numpy
RUN pip3.6 install pandas
RUN wget --no-verbose http://apache.mirror.iphh.net/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz && tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
&& mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark
&& rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
ENV SPARK_HOME=/usr/local/bin/spark
RUN yum -y install java-1.8.0-openjdk
ENV JAVA_HOME /usr/lib/jvm/jre
COPY main.py .
RUN chmod +x /opt/application/main.py
CMD ["/opt/application/main.py"]
您忘记在dockerfile中安装pyspark
。
FROM centos
ENV DAEMON_RUN=true
ENV SPARK_VERSION=2.4.7
ENV HADOOP_VERSION=2.7
WORKDIR /opt/application
RUN yum -y install python36
RUN yum -y install wget
ENV PYSPARK_PYTHON python3.6
ENV PYSPARK_DRIVER_PYTHON python3.6
RUN ln -s /usr/bin/python3.6 /usr/local/bin/python
RUN wget https://bootstrap.pypa.io/get-pip.py
RUN python get-pip.py
RUN pip3.6 install numpy
RUN pip3.6 install pandas
RUN pip3.6 install pyspark # add this line.
RUN wget --no-verbose http://apache.mirror.iphh.net/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz && tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
&& mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark
&& rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
ENV SPARK_HOME=/usr/local/bin/spark
RUN yum -y install java-1.8.0-openjdk
ENV JAVA_HOME /usr/lib/jvm/jre
COPY main.py .
RUN chmod +x /opt/application/main.py
CMD ["/opt/application/main.py"]
编辑: dockerfile改进:
FROM centos
ENV DAEMON_RUN=true
ENV SPARK_VERSION=2.4.7
ENV HADOOP_VERSION=2.7
WORKDIR /opt/application
RUN yum -y install python36 wget java-1.8.0-openjdk # you could install python36 and wget in once
ENV PYSPARK_PYTHON python3.6
ENV PYSPARK_DRIVER_PYTHON python3.6
RUN ln -s /usr/bin/python3.6 /usr/local/bin/python
RUN wget https://bootstrap.pypa.io/get-pip.py
&& python get-pip.py
&& pip3.6 install numpy==1.19 pandas==1.1.5 pyspark==3.0.2 # you should also pin the version you need, pandas 1.2.x does not support python 3.6
RUN wget --no-verbose http://apache.mirror.iphh.net/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz && tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
&& mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark
&& rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
ENV SPARK_HOME=/usr/local/bin/spark
ENV JAVA_HOME /usr/lib/jvm/jre
COPY main.py .
RUN chmod +x /opt/application/main.py
CMD ["/opt/application/main.py"]