Accessing items in a list column of a PySpark DataFrame



I have the following DataFrame:

+------------------+----------+------------------+
|        antecedent|consequent|        confidence|
+------------------+----------+------------------+
|         [7, 2, 0]|       [8]|0.6237623762376238|
|         [7, 2, 0]|       [1]|               1.0|
|         [7, 2, 0]|       [5]|0.9975247524752475|
|         [7, 2, 0]|       [3]|0.9975247524752475|
|         [7, 2, 0]|       [4]|0.9975247524752475|
|         [7, 2, 0]|       [6]| 0.995049504950495|
|      [6, 5, 3, 4]|       [8]| 0.623721881390593|
|      [6, 5, 3, 4]|       [1]|               1.0|
|      [6, 5, 3, 4]|       [2]|               1.0|
|      [6, 5, 3, 4]|       [0]|               1.0|
|      [6, 5, 3, 4]|       [7]| 0.820040899795501|
|[9, 8, 6, 5, 1, 2]|       [0]|               1.0|
|[9, 8, 6, 5, 1, 2]|       [3]|               1.0|
|[9, 8, 6, 5, 1, 2]|       [4]|               1.0|
|         [7, 3, 1]|       [8]|0.6228287841191067|
|         [7, 3, 1]|       [5]|               1.0|
|         [7, 3, 1]|       [2]|               1.0|
|         [7, 3, 1]|       [0]|               1.0|
|         [7, 3, 1]|       [4]|               1.0|
|         [7, 3, 1]|       [6]|0.9950372208436724|
+------------------+----------+------------------+

I want to run some queries on it, for example, selecting the rows where antecedent does not contain [7, 3]. I tried this query, but it seems to be wrong since 7 and 3 are integers:

from pyspark.sql.functions import *
q = r.filter(~col('antecedent').isin([7,3])).show()

Error:

"condition should be string or Column"

You can write a udf function to check the condition (whether 7 or 3 is present in the antecedent column):

from pyspark.sql import functions as F
from pyspark.sql import types as T
def checkIsIn(array):
    # True if at least one of 7 or 3 appears in the array
    return any(x in array for x in [7, 3])
udfCheckIsIn = F.udf(checkIsIn, T.BooleanType())

Then use it as a filter:

r.filter(udfCheckIsIn(r.antecedent)).show()

You should get the output:

+------------+----------+------------------+
|  antecedent|consequent|        confidence|
+------------+----------+------------------+
|   [7, 2, 0]|       [8]|0.6237623762376238|
|   [7, 2, 0]|       [1]|               1.0|
|   [7, 2, 0]|       [5]|0.9975247524752475|
|   [7, 2, 0]|       [3]|0.9975247524752475|
|   [7, 2, 0]|       [4]|0.9975247524752475|
|   [7, 2, 0]|       [6]| 0.995049504950495|
|[6, 5, 3, 4]|       [8]| 0.623721881390593|
|[6, 5, 3, 4]|       [1]|               1.0|
|[6, 5, 3, 4]|       [2]|               1.0|
|[6, 5, 3, 4]|       [0]|               1.0|
|[6, 5, 3, 4]|       [7]| 0.820040899795501|
|   [7, 3, 1]|       [8]|0.6228287841191067|
|   [7, 3, 1]|       [5]|               1.0|
|   [7, 3, 1]|       [2]|               1.0|
|   [7, 3, 1]|       [0]|               1.0|
|   [7, 3, 1]|       [4]|               1.0|
|   [7, 3, 1]|       [6]|0.9950372208436724|
+------------+----------+------------------+

To check the condition that both 7 and 3 are present in the column:

from pyspark.sql import functions as F
from pyspark.sql import types as T
def checkIsIn(array):
    # True only if both 7 and 3 appear in the array
    return all(x in array for x in [7, 3])
udfCheckIsIn = F.udf(checkIsIn, T.BooleanType())

Then use it as a filter:

r.filter(udfCheckIsIn(r.antecedent)).show()

You should get the output:

+----------+----------+------------------+
|antecedent|consequent|        confidence|
+----------+----------+------------------+
| [7, 3, 1]|       [8]|0.6228287841191067|
| [7, 3, 1]|       [5]|               1.0|
| [7, 3, 1]|       [2]|               1.0|
| [7, 3, 1]|       [0]|               1.0|
| [7, 3, 1]|       [4]|               1.0|
| [7, 3, 1]|       [6]|0.9950372208436724|
+----------+----------+------------------+
