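I'm trying to create a new column in a DataFrame that is "true" when the value of another column appears in a column of a different DataFrame. I tried the following, but I think my isin() syntax is wrong because I'm passing it a single-column DataFrame.
customers: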
customer_id name
1 John
2 Mary
3 Jane
4 Jack
5 Emma
customer_referred_customer:
from to
1 3
2 4
Result:
customer_id name is_referral
1 John false
2 Mary false
3 Jane true
4 Jack true
5 Emma false
Attempt:
customers.withColumn(
    "is_referral",
    F.when(
        F.col("customer_id").isin(
            customer_referred_customer.select("to")
        ),
        F.lit("true"),
    ).otherwise(F.lit("false")),
)
How can I fix this?
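I would do it like this:
from pyspark.sql import functions as F

(
    customers.join(
        customer_referred_customer,
        # bracket access: newer PySpark releases also define a DataFrame.to() method
        customers.customer_id == customer_referred_customer["to"],
        "left",
    )
    # "to" is null when no referral row matched this customer
    .withColumn(
        "is_referral",
        F.when(customer_referred_customer["to"].isNull(), F.lit("false"))
        .otherwise(F.lit("true")),
    )
    .select(customers["customer_id"], customers["name"], "is_referral")
)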
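One caveat with the left join above: if the same customer appears more than once in the "to" column, the join returns that customer once per matching row. The following is a minimal plain-Python model of left-join semantics (not Spark code; the helper left_join_flag and the extra referral pair (5, 3) are made up for illustration) that shows the effect and how deduplicating the lookup values first avoids it. In Spark, the analogous step would be customer_referred_customer.select("to").distinct() before joining.

```python
# Plain-Python model of a left join followed by a null check.
# left_join_flag is a hypothetical helper, not part of PySpark.
customers = [(1, "John"), (2, "Mary"), (3, "Jane"), (4, "Jack"), (5, "Emma")]
# (from, to) pairs; (5, 3) is an extra illustrative row so that 3 appears twice
referrals = [(1, 3), (2, 4), (5, 3)]


def left_join_flag(customers, to_values):
    """Emit one row per match, or a single row flagged False if nothing matches."""
    out = []
    for cid, name in customers:
        matches = [t for t in to_values if t == cid]
        if matches:
            out.extend((cid, name, True) for _ in matches)
        else:
            out.append((cid, name, False))
    return out


to_column = [to for _, to in referrals]
with_duplicates = left_join_flag(customers, to_column)
deduplicated = left_join_flag(customers, sorted(set(to_column)))

print(len(with_duplicates))  # 6 rows: Jane appears twice
print(len(deduplicated))     # 5 rows: one per customer
```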
Use a semi join and an anti join. You didn't provide data, so I couldn't test this, but the idea is:
customers = customers.join(
    customer_referred_customer,
    # bracket access: newer PySpark releases also define a DataFrame.to() method
    customers.customer_id == customer_referred_customer["to"],
    'left_semi'
).withColumn(
    'is_referral',
    F.lit('true')
).unionAll(
    customers.join(
        customer_referred_customer,
        customers.customer_id == customer_referred_customer["to"],
        'left_anti'
    ).withColumn(
        'is_referral',
        F.lit('false')
    )
)
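To see why the two halves of this answer cover every customer exactly once, here is a small plain-Python model (an illustration only, not Spark code; the variable names are assumptions). A left semi join keeps customers with at least one match and a left anti join keeps those with none, so neither side duplicates rows even when an id is referred several times.

```python
customers = [(1, "John"), (2, "Mary"), (3, "Jane"), (4, "Jack"), (5, "Emma")]
to_column = [3, 4]              # values of the "to" column
referred_ids = set(to_column)   # membership is all a semi/anti join needs

# left_semi: customers with at least one match, each kept once
semi = [(cid, name, "true") for cid, name in customers if cid in referred_ids]
# left_anti: customers with no match at all
anti = [(cid, name, "false") for cid, name in customers if cid not in referred_ids]

# unionAll is modeled here as plain concatenation
combined = semi + anti
for row in combined:
    print(row)
```

In this model the referred customers come first. Row order after a union in Spark should not be relied on either, so add an orderBy on customer_id if the order matters.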
Collect the lookup column into a list and use .isin() (here df is customers and df1 is customer_referred_customer):
df.withColumn('is_referral', df.customer_id.isin(df1.select("to").rdd.flatMap(list).collect())).show()
+-----------+----+-----------+
|customer_id|name|is_referral|
+-----------+----+-----------+
|          1|John|      false|
|          2|Mary|      false|
|          3|Jane|       true|
|          4|Jack|       true|
|          5|Emma|      false|
+-----------+----+-----------+
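The expression df1.select("to").rdd.flatMap(list).collect() turns the single-column DataFrame into an ordinary Python list on the driver, which is the kind of argument isin() accepts. A rough plain-Python picture of those steps, using tuples as stand-ins for Spark Row objects (no Spark session involved; the variable names are assumptions), might look like this:

```python
from itertools import chain

# Stand-ins for the Row objects that df1.select("to") would yield
to_rows = [(3,), (4,)]

# flatMap(list) converts each row to a list and concatenates the results
to_values = list(chain.from_iterable(list(row) for row in to_rows))
print(to_values)  # [3, 4]

customer_ids = [1, 2, 3, 4, 5]
# isin(...) then amounts to a per-row membership test against that list
is_referral = [cid in to_values for cid in customer_ids]
print(is_referral)  # [False, False, True, True, False]
```

Because collect() pulls every value to the driver, this approach is best suited to small lookup tables; for large ones, a join keeps the work distributed.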
Use a full outer join, then derive the new column from whether the joined "to" value is null.
Check the code below.
(
    customers
    .join(customer_referred_customer, customers.customer_id == customer_referred_customer["to"], "full")
    .withColumn("is_referral", F.col("to").isNotNull())
    .select("customer_id", "name", "is_referral")
    .orderBy(F.col("customer_id").asc())
    .show(truncate=False)
)
+-----------+----+-----------+
|customer_id|name|is_referral|
+-----------+----+-----------+
|1          |John|false      |
|2          |Mary|false      |
|3          |Jane|true       |
|4          |Jack|true       |
|5          |Emma|false      |
+-----------+----+-----------+
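A full outer join also keeps rows from customer_referred_customer that match no customer, so a dangling "to" value would show up as an extra row with a null customer_id and name; a left join does not have that side effect. The sketch below models the difference in plain Python (not Spark; flag_rows is a hypothetical helper, the referral to the non-existent id 9 is invented for illustration, and None stands in for null).

```python
customers = {1: "John", 2: "Mary", 3: "Jane", 4: "Jack", 5: "Emma"}
to_column = [3, 4, 9]  # 9 is a made-up id with no matching customer


def flag_rows(customers, to_column, how):
    """Model join + isNotNull flag; assumes to_column holds no repeated ids."""
    rows = [(cid, name, cid in to_column) for cid, name in customers.items()]
    if how == "full":
        # unmatched right-side rows survive with nulls on the customer side
        rows += [(None, None, True) for t in to_column if t not in customers]
    return rows


left_rows = flag_rows(customers, to_column, "left")
full_rows = flag_rows(customers, to_column, "full")

print(len(left_rows))  # 5
print(len(full_rows))  # 6
print(full_rows[-1])   # (None, None, True)
```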