PySpark: mapping words with a Tokenizer

I'm just getting started with PySpark and I'm stuck at one point. I have code like this (taken from https://spark.apache.org/docs/2.1.0/ml-features.html):

from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType
sentenceDataFrame = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic,regression,models,are,neat")
], ["id", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
regexTokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="\\W")
# alternatively, pattern="\\w+", gaps(False)
countTokens = udf(lambda words: len(words), IntegerType())
tokenized = tokenizer.transform(sentenceDataFrame)
tokenized.select("sentence", "words") \
    .withColumn("tokens", countTokens(col("words"))).show(truncate=False)
regexTokenized = regexTokenizer.transform(sentenceDataFrame)
regexTokenized.select("sentence", "words") \
    .withColumn("tokens", countTokens(col("words"))).show(truncate=False)

I'm adding something like this:

test = sqlContext.createDataFrame([
    (0, "spark"),
    (1, "java"),
    (2, "i")
], ["id", "word"])

The output is:

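+---+-----------------------------------+------------------------------------------+------+
|id |sentence                           |words                                     |tokens|
+---+-----------------------------------+------------------------------------------+------+
|0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |5     |
|1  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7     |
|2  |Logistic,regression,models,are,neat|[logistic, regression, models, are, neat] |5     |
+---+-----------------------------------+------------------------------------------+------+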

Is it possible to get something like this: [id from 'test', id from 'regexTokenized']

2, 0
2, 1
1, 1
0, 1

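That is, for each word in the 'test' list, can I get the ids of the rows in 'regexTokenized' whose tokenized 'words' contain it, so that the two datasets are mapped to each other? Or should I go for a different solution?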

Thanks in advance for any help :)

Use explode and join:

from pyspark.sql.functions import explode

# testTokenized / trainTokenized: tokenized DataFrames with "id" and "words" columns
(testTokenized.alias("train")
    .select("id", explode("words").alias("word"))
    .join(
        trainTokenized.select("id", explode("words").alias("word")).alias("test"),
        "word"))
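
To see what explode followed by a join on the word would give for the data in the question, here is a plain-Python sketch of the same logic that runs without a Spark session. The token lists are copied from the "words" column of the output above; the names `regex_tokenized`, `exploded` and `pairs` are only illustrative, and this is a model of the idea, not the Spark implementation.

```python
# Plain-Python sketch of "explode + join on word" (no Spark required).
# Tokens are copied from the "words" column shown in the question's output.
regex_tokenized = [
    (0, ["hi", "i", "heard", "about", "spark"]),
    (1, ["i", "wish", "java", "could", "use", "case", "classes"]),
    (2, ["logistic", "regression", "models", "are", "neat"]),
]
test = [(0, "spark"), (1, "java"), (2, "i")]

# "explode": one (sentence_id, word) row per token.
exploded = [(sid, word) for sid, words in regex_tokenized for word in words]

# "join" on the word: pair each test id with every sentence id containing it.
# The set removes duplicate pairs, which a plain join would keep if a word
# occurred more than once in the same sentence.
pairs = sorted({(tid, sid)
                for tid, test_word in test
                for sid, word in exploded
                if test_word == word})

for tid, sid in pairs:
    print(tid, sid)
```

With this data, "spark" occurs only in sentence 0, so the pair for test id 0 comes out as (0, 0) rather than (0, 1). In Spark itself, the deduplication done by the set here would correspond to calling `.distinct()` on the joined result.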
