I have the following dataframe:
+------------------+----------+------------------+
| antecedent|consequent| confidence|
+------------------+----------+------------------+
| [7, 2, 0]| [8]|0.6237623762376238|
| [7, 2, 0]| [1]| 1.0|
| [7, 2, 0]| [5]|0.9975247524752475|
| [7, 2, 0]| [3]|0.9975247524752475|
| [7, 2, 0]| [4]|0.9975247524752475|
| [7, 2, 0]| [6]| 0.995049504950495|
| [6, 5, 3, 4]| [8]| 0.623721881390593|
| [6, 5, 3, 4]| [1]| 1.0|
| [6, 5, 3, 4]| [2]| 1.0|
| [6, 5, 3, 4]| [0]| 1.0|
| [6, 5, 3, 4]| [7]| 0.820040899795501|
|[9, 8, 6, 5, 1, 2]| [0]| 1.0|
|[9, 8, 6, 5, 1, 2]| [3]| 1.0|
|[9, 8, 6, 5, 1, 2]| [4]| 1.0|
| [7, 3, 1]| [8]|0.6228287841191067|
| [7, 3, 1]| [5]| 1.0|
| [7, 3, 1]| [2]| 1.0|
| [7, 3, 1]| [0]| 1.0|
| [7, 3, 1]| [4]| 1.0|
| [7, 3, 1]| [6]|0.9950372208436724|
+------------------+----------+------------------+
I want to run some queries on it. For example, to get the rows whose antecedent does not contain [7, 3], I tried this query, but it seems to be wrong because 7 and 3 are integers:
from pyspark.sql.functions import *
q = r.filter(~col('antecedent').isin([7,3])).show()
Error:
"condition should be string or Column"
You can write a UDF to check the condition (whether 7 or 3 is present in the antecedent column), like this:
from pyspark.sql import functions as F
from pyspark.sql import types as T

def checkIsIn(array):
    # True if at least one of 7 or 3 occurs in the array
    return True in [x in array for x in [7, 3]]

udfCheckIsIn = F.udf(checkIsIn, T.BooleanType())
Then use it as a filter:

r.filter(udfCheckIsIn(r.antecedent)).show()
You should get the following output:
+------------+----------+------------------+
| antecedent|consequent| confidence|
+------------+----------+------------------+
| [7, 2, 0]| [8]|0.6237623762376238|
| [7, 2, 0]| [1]| 1.0|
| [7, 2, 0]| [5]|0.9975247524752475|
| [7, 2, 0]| [3]|0.9975247524752475|
| [7, 2, 0]| [4]|0.9975247524752475|
| [7, 2, 0]| [6]| 0.995049504950495|
|[6, 5, 3, 4]| [8]| 0.623721881390593|
|[6, 5, 3, 4]| [1]| 1.0|
|[6, 5, 3, 4]| [2]| 1.0|
|[6, 5, 3, 4]| [0]| 1.0|
|[6, 5, 3, 4]| [7]| 0.820040899795501|
| [7, 3, 1]| [8]|0.6228287841191067|
| [7, 3, 1]| [5]| 1.0|
| [7, 3, 1]| [2]| 1.0|
| [7, 3, 1]| [0]| 1.0|
| [7, 3, 1]| [4]| 1.0|
| [7, 3, 1]| [6]|0.9950372208436724|
+------------+----------+------------------+
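As a side note, the `True in [...]` check above can be written more idiomatically with Python's built-in `any`, which short-circuits as soon as a match is found. This is a behavior-preserving rewrite of `checkIsIn`:

```python
def checkIsIn(array):
    # True if at least one of 7 or 3 occurs in the array
    return any(x in array for x in [7, 3])

print(checkIsIn([7, 2, 0]))     # True  (contains 7)
print(checkIsIn([9, 8, 6, 5]))  # False (contains neither)
```

Wrapping this version with `F.udf(checkIsIn, T.BooleanType())` gives exactly the same filter as before.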
To check the condition that both 7 and 3 are present in the column:
from pyspark.sql import functions as F
from pyspark.sql import types as T

def checkIsIn(array):
    # True only if both 7 and 3 occur in the array
    return False not in [x in array for x in [7, 3]]

udfCheckIsIn = F.udf(checkIsIn, T.BooleanType())
Then use it as a filter:

r.filter(udfCheckIsIn(r.antecedent)).show()
You should get the following output:
+----------+----------+------------------+
|antecedent|consequent| confidence|
+----------+----------+------------------+
| [7, 3, 1]| [8]|0.6228287841191067|
| [7, 3, 1]| [5]| 1.0|
| [7, 3, 1]| [2]| 1.0|
| [7, 3, 1]| [0]| 1.0|
| [7, 3, 1]| [4]| 1.0|
| [7, 3, 1]| [6]|0.9950372208436724|
+----------+----------+------------------+