Docker NLTK Download



I am building a docker container with the following Dockerfile:

FROM ubuntu:14.04
RUN apt-get update
RUN apt-get install -y python python-dev python-pip
ADD . /app
RUN apt-get install -y python-scipy
RUN pip install -r /arrc/requirements.txt
EXPOSE 5000
WORKDIR /app
CMD python app.py

Everything goes well until I run the image, at which point I get the following error:

**********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/root/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - u''
**********************************************************************

I have run into this problem before, and it is discussed here, but I am not sure how to go about solving it when using Docker. I have tried:

CMD python
CMD import nltk
CMD nltk.download()

As well as:

CMD python -m nltk.downloader -d /usr/share/nltk_data popular

But I still get the error.

In your Dockerfile, try adding:

RUN python -m nltk.downloader punkt

This will run the command and install the requested files to //nltk_data/
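
As a sketch, the asker's Dockerfile with that line added could look like the following (assuming nltk is listed in requirements.txt, and using /app/requirements.txt to match the ADD . /app line; the downloader must run after pip install so that the nltk module is available):

FROM ubuntu:14.04
RUN apt-get update
RUN apt-get install -y python python-dev python-pip python-scipy
ADD . /app
RUN pip install -r /app/requirements.txt
RUN python -m nltk.downloader punkt
EXPOSE 5000
WORKDIR /app
CMD python app.py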

The issue is most likely down to the difference between CMD and RUN in a Dockerfile. From the documentation for CMD:

The main purpose of a CMD is to provide defaults for an executing container.

It is used during docker run <image>, not during the build. So the other CMD lines are most likely being overridden by the last CMD python app.py line.
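
One quick way to confirm that only the final CMD is stored in the built image (my addition, not part of the original answer) is to inspect it:

docker inspect --format '{{.Config.Cmd}}' <image>

RUN lines, by contrast, are executed while the image is built, which is why the downloader call needs to go in a RUN instruction.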

I tried all of the suggested approaches, but nothing worked, so I realized that the nltk module searches in /root/nltk_data.

Step 1: I downloaded punkt on my machine by running

python3
>>> import nltk
>>> nltk.download('punkt')

punkt is now in /root/nltk_data/tokenizers
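
If you want to check where NLTK actually looks for data on your machine (a quick sanity check, not part of the original answer), you can print its search path:

python3 -c "import nltk; print(nltk.data.path)"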

Step 2: I copied the tokenizers folder into my project. My directory now looks like this:

.
|-app/
|-tokenizers/
|--punkt/
|---all those pkl files
|--punkt.zip

Step 3: Then I modified the Dockerfile so that it copies the folder into my docker instance:

COPY ./tokenizers /root/nltk_data/tokenizers

Step 4: The new instance now has punkt.
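
As an optional extra (my addition, not part of the original answer), a build-time check makes the build fail fast if the copied data cannot actually be found; adjust python3/python to whatever interpreter your image uses:

RUN python3 -c "import nltk; nltk.data.find('tokenizers/punkt')"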

I ran into the same problem when I was creating a docker image for a django application based on an ubuntu image with python3.

I resolved it as shown below.

# start from an official image
FROM ubuntu:16.04
RUN apt-get update \
  && apt-get install -y python3-pip python3-dev \
  && apt-get install -y libmysqlclient-dev python3-virtualenv
# arbitrary location choice: you can change the directory
RUN mkdir -p /opt/services/djangoapp/src
WORKDIR /opt/services/djangoapp/src
# copy our project code
COPY . /opt/services/djangoapp/src
# install dependency for running service
RUN pip3 install -r requirements.txt
RUN python3 -m nltk.downloader punkt
RUN python3 -m nltk.downloader wordnet
# Setup supervisord
RUN mkdir -p /var/log/supervisor
COPY supervisord.conf /etc/supervisor/conf.d/supervisord.conf
# Start processes
CMD ["/usr/bin/supervisord"]

I got this working for Google Cloud Build by specifying the download destination inside the container.

RUN [ "python3", "-c", "import nltk; nltk.download('punkt', download_dir='/usr/local/nltk_data')" ]

Full Dockerfile:

FROM python:3.8.3
WORKDIR /app
ADD . /app
# install requirements
RUN pip3 install --upgrade pip
RUN pip3 install --no-cache-dir --compile -r requirements.txt
RUN [ "python3", "-c", "import nltk; nltk.download('punkt', download_dir='/usr/local/nltk_data')" ]
CMD exec uvicorn --host 0.0.0.0 --port $PORT main:app
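
As far as I can tell, download_dir='/usr/local/nltk_data' works here because the official python image installs Python under /usr/local, and NLTK searches sys.prefix/nltk_data. If you want to use a directory that is not on the default search path, NLTK also honours the NLTK_DATA environment variable; a sketch under that assumption (the /opt/nltk_data path is just an example of mine, not from the original answer):

ENV NLTK_DATA=/opt/nltk_data
RUN python3 -m nltk.downloader -d /opt/nltk_data punkt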

For now I had to resort to the following (note the RUN cp -r /root/nltk_data /usr/local/share/nltk_data line):

FROM ubuntu:latest
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get clean && apt-get update && apt-get install -y locales
RUN locale-gen en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
ENV LC_ALL en_US.UTF-8
RUN apt-get -y update && apt-get install -y --no-install-recommends \
    sudo \
    python3 \
    build-essential \
    python3-pip \
    python3-setuptools \
    python3-dev \
    && rm -rf /var/lib/apt/lists/*
 
RUN pip3 install --upgrade pip
ENV PYTHONPATH "${PYTHONPATH}:/app"
ADD requirements.txt .
# in requirements.txt: pandas, numpy, wordcloud, matplotlib, nltk, sklearn
RUN pip3 install -r requirements.txt 
RUN [ "python3", "-c", "import nltk; nltk.download('stopwords')" ]
RUN [ "python3", "-c", "import nltk; nltk.download('punkt')" ]
RUN cp -r /root/nltk_data /usr/local/share/nltk_data 
RUN addgroup --system app \
    && adduser --system --ingroup app app
WORKDIR /home/app
ADD inputfile .
ADD script.py . 
# the script uses the python modules: pandas, numpy, wordcloud, matplotlib, nltk, sklearn
RUN chown app:app -R /home/app
USER app
RUN python3 script.py inputfile outputfile
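
The cp step could probably be avoided by downloading straight into one of the system-wide directories NLTK already searches, for example (a sketch of my own, not the original answer's approach):

RUN python3 -m nltk.downloader -d /usr/local/share/nltk_data stopwords punkt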
