The following runs successfully on a Cloudera CDSW cluster gateway.
import pyspark
from pyspark.sql import SparkSession

spark = (SparkSession
    .builder
    .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:1.2.3")
    .getOrCreate()
)
which produces this output:
Ivy Default Cache set to: /home/cdsw/.ivy2/cache
The jars for the packages stored in: /home/cdsw/.ivy2/jars
:: loading settings :: url = jar:file:/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
JohnSnowLabs#spark-nlp added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found JohnSnowLabs#spark-nlp;1.2.3 in spark-packages
found com.typesafe#config;1.3.0 in central
found org.fusesource.leveldbjni#leveldbjni-all;1.8 in central
downloading http://dl.bintray.com/spark-packages/maven/JohnSnowLabs/spark-nlp/1.2.3/spark-nlp-1.2.3.jar ...
[SUCCESSFUL ] JohnSnowLabs#spark-nlp;1.2.3!spark-nlp.jar (3357ms)
downloading https://repo1.maven.org/maven2/com/typesafe/config/1.3.0/config-1.3.0.jar ...
[SUCCESSFUL ] com.typesafe#config;1.3.0!config.jar(bundle) (348ms)
downloading https://repo1.maven.org/maven2/org/fusesource/leveldbjni/leveldbjni-all/1.8/leveldbjni-all-1.8.jar ...
[SUCCESSFUL ] org.fusesource.leveldbjni#leveldbjni-all;1.8!leveldbjni-all.jar(bundle) (382ms)
:: resolution report :: resolve 3836ms :: artifacts dl 4095ms
:: modules in use:
JohnSnowLabs#spark-nlp;1.2.3 from spark-packages in [default]
com.typesafe#config;1.3.0 from central in [default]
org.fusesource.leveldbjni#leveldbjni-all;1.8 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 3 | 3 | 3 | 0 || 3 | 3 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
3 artifacts copied, 0 already retrieved (5740kB/37ms)
Setting default log level to "ERROR".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
However, when I try to import sparknlp as described in the John Snow Labs instructions for PySpark ...
import sparknlp
# or
from sparknlp.annotator import *
I get:
ImportError: No module named sparknlp
ImportError: No module named sparknlp.annotator
What do I need to do to use spark-nlp? Presumably this generalizes to any Spark package.
You can launch PySpark with the spark-nlp package from the command line:
pyspark --packages JohnSnowLabs:spark-nlp:1.3.0
but that does not tell Python where to find the bindings. Following the instructions in a similar report here, it can be fixed by adding the downloaded JARs to your PYTHONPATH:

export PYTHONPATH="~/.ivy2/jars/JohnSnowLabs_spark-nlp-1.3.0.jar:$PYTHONPATH"

or
import sys, glob, os
# Add every JAR that Ivy downloaded to Python's module search path
sys.path.extend(glob.glob(os.path.join(os.path.expanduser("~"), ".ivy2/jars/*.jar")))
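If you cannot pass --packages on the command line (as in a CDSW session), a rough equivalent is to set PYSPARK_SUBMIT_ARGS before the first SparkSession is created. This is only a sketch, assuming the Ivy cache lands in ~/.ivy2/jars as in the log above:

import os, sys, glob

# Ask spark-submit to resolve the package; the trailing "pyspark-shell"
# token is required when PYSPARK_SUBMIT_ARGS is consumed from Python.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages JohnSnowLabs:spark-nlp:1.3.0 pyspark-shell"
)

# Make the downloaded JARs visible to Python as well.
sys.path.extend(glob.glob(os.path.join(os.path.expanduser("~"), ".ivy2/jars/*.jar")))

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()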
I figured it out. The JAR files that were loading correctly contain only the compiled Scala classes. I still had to put the Python files containing the wrapper code somewhere I could import them from. Once I did that, everything worked.
Thanks, Clay. Here is how I set my PYTHONPATH:
git clone --branch 3.0.3 https://github.com/JohnSnowLabs/spark-nlp
export PYTHONPATH="./spark-nlp/python:$PYTHONPATH"
It then worked for me, because my ./spark-nlp/python folder now contains the elusive sparknlp module.
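The same thing can be done from inside Python, which is handy when exporting environment variables before launch is awkward. A minimal sketch, assuming the repository was cloned into the current working directory as above:

import sys, os

# ./spark-nlp/python is where the clone above keeps the sparknlp package.
sys.path.insert(0, os.path.abspath("./spark-nlp/python"))

import sparknlp  # now resolves against the cloned wrapper code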
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.0.3
>>> import sparknlp
>>>
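As a side note, the spark-nlp Python wrappers are also published to PyPI, so the clone can usually be replaced by a pip install. A sketch, assuming pip install spark-nlp==3.0.3 has been run and the machine can reach the package repositories:

import sparknlp

# sparknlp.start() builds a SparkSession and pulls the matching
# spark-nlp JAR via spark.jars.packages on your behalf.
spark = sparknlp.start()
print(sparknlp.version())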