Error: ModuleNotFoundError: No module named 'pyspark' when running pyspark in Docker



I am getting the following error:

Traceback (most recent call last):
  File "/opt/application/main.py", line 6, in <module>
    from pyspark import SparkConf, SparkContext
ModuleNotFoundError: No module named 'pyspark'

This happens when running pyspark inside Docker.
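For reference, a minimal main.py that reproduces the failing import might look like this (only the import line is taken from the traceback; the rest is an illustrative sketch):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("app")  # app name is a placeholder
sc = SparkContext(conf=conf)
print(sc.version)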

My Dockerfile is as follows:

FROM centos
ENV DAEMON_RUN=true
ENV SPARK_VERSION=2.4.7
ENV HADOOP_VERSION=2.7
WORKDIR /opt/application
RUN yum -y install python36
RUN yum -y install wget
ENV PYSPARK_PYTHON python3.6
ENV PYSPARK_DRIVER_PYTHON python3.6
RUN ln -s /usr/bin/python3.6 /usr/local/bin/python
RUN wget https://bootstrap.pypa.io/get-pip.py
RUN python get-pip.py
RUN pip3.6 install numpy
RUN pip3.6 install pandas
RUN wget --no-verbose http://apache.mirror.iphh.net/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz && tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
 && mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark \
 && rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
ENV SPARK_HOME=/usr/local/bin/spark
RUN yum -y install java-1.8.0-openjdk
ENV JAVA_HOME /usr/lib/jvm/jre
COPY main.py .
RUN chmod +x /opt/application/main.py
CMD ["/opt/application/main.py"]

You forgot to install pyspark in your Dockerfile. Downloading the Spark tarball by itself does not put the pyspark package on Python's import path; installing it with pip does:

FROM centos
ENV DAEMON_RUN=true
ENV SPARK_VERSION=2.4.7
ENV HADOOP_VERSION=2.7
WORKDIR /opt/application
RUN yum -y install python36
RUN yum -y install wget
ENV PYSPARK_PYTHON python3.6
ENV PYSPARK_DRIVER_PYTHON python3.6
RUN ln -s /usr/bin/python3.6 /usr/local/bin/python
RUN wget https://bootstrap.pypa.io/get-pip.py
RUN python get-pip.py
RUN pip3.6 install numpy
RUN pip3.6 install pandas
RUN pip3.6 install pyspark  # add this line.
RUN wget --no-verbose http://apache.mirror.iphh.net/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz && tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
 && mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark \
 && rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
ENV SPARK_HOME=/usr/local/bin/spark
RUN yum -y install java-1.8.0-openjdk
ENV JAVA_HOME /usr/lib/jvm/jre
COPY main.py .
RUN chmod +x /opt/application/main.py
CMD ["/opt/application/main.py"]
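To check that the fix works, you could rebuild the image and try the import directly (the "pyspark-app" tag below is just an example):

docker build -t pyspark-app .
docker run --rm pyspark-app python3.6 -c "from pyspark import SparkConf, SparkContext; print('pyspark import OK')"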

EDIT: an improved Dockerfile:

FROM centos
ENV DAEMON_RUN=true
ENV SPARK_VERSION=2.4.7
ENV HADOOP_VERSION=2.7
WORKDIR /opt/application
RUN yum -y install python36 wget java-1.8.0-openjdk  # python36, wget and java can be installed in a single layer
ENV PYSPARK_PYTHON python3.6
ENV PYSPARK_DRIVER_PYTHON python3.6
RUN ln -s /usr/bin/python3.6 /usr/local/bin/python
RUN wget https://bootstrap.pypa.io/get-pip.py \
 && python get-pip.py \
 && pip3.6 install numpy==1.19 pandas==1.1.5 pyspark==3.0.2  # pin the versions you need; pandas 1.2.x does not support Python 3.6
RUN wget --no-verbose http://apache.mirror.iphh.net/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz && tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
 && mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark \
 && rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
ENV SPARK_HOME=/usr/local/bin/spark
ENV JAVA_HOME /usr/lib/jvm/jre
COPY main.py .
RUN chmod +x /opt/application/main.py
CMD ["/opt/application/main.py"]
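A quick sanity check for the pinned versions, reusing the same illustrative "pyspark-app" tag:

docker build -t pyspark-app .
docker run --rm pyspark-app python3.6 -c "import numpy, pandas, pyspark; print(numpy.__version__, pandas.__version__, pyspark.__version__)"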
