I want to do a subtract between two DataFrames in PySpark. The challenge is that I have to ignore some columns while subtracting, but the final DataFrame should still contain all the columns, including the ignored ones.
Here is an example:
from pyspark.sql import Row

userLeft = sc.parallelize([
    Row(id=u'1',
        first_name=u'Steve',
        last_name=u'Kent',
        email=u's.kent@email.com',
        date1=u'2017-02-08'),
    Row(id=u'2',
        first_name=u'Margaret',
        last_name=u'Peace',
        email=u'marge.peace@email.com',
        date1=u'2017-02-09'),
    Row(id=u'3',
        first_name=None,
        last_name=u'hh',
        email=u'marge.hh@email.com',
        date1=u'2017-02-10')
]).toDF()
userRight = sc.parallelize([
    Row(id=u'2',
        first_name=u'Margaret',
        last_name=u'Peace',
        email=u'marge.peace@email.com',
        date1=u'2017-02-11'),
    Row(id=u'3',
        first_name=None,
        last_name=u'hh',
        email=u'marge.hh@email.com',
        date1=u'2017-02-12')
]).toDF()
Expected:
ActiveDF = userLeft.subtract(userRight)  # but ignore the "date1" column while subtracting
The end result should look like this, and it should include the "date1" column.
+----------+--------------------+----------+---+---------+
| date1| email|first_name| id|last_name|
+----------+--------------------+----------+---+---------+
|2017-02-08| s.kent@email.com| Steve| 1| Kent|
+----------+--------------------+----------+---+---------+
It seems you need an anti-join:
userLeft.join(userRight, ["id"], "leftanti").show()
+----------+----------------+----------+---+---------+
| date1| email|first_name| id|last_name|
+----------+----------------+----------+---+---------+
|2017-02-08|s.kent@email.com| Steve| 1| Kent|
+----------+----------------+----------+---+---------+
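Note that the join above keys on "id" only, which happens to be enough for this sample data. A more general sketch (the ignore_cols list below is just illustrative) builds the join columns from everything you are not ignoring; the null caveat discussed further down still applies:

ignore_cols = ["date1"]  # hypothetical: columns to leave out of the comparison
join_cols = [c for c in userLeft.columns if c not in ignore_cols]

# leftanti keeps only the userLeft rows that have no match in userRight on join_cols.
# Caveat: rows whose join columns contain null never match a null on the other side.
userLeft.join(userRight, join_cols, "leftanti").show()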
You can also use a full join and keep only the rows with null values:
import pyspark.sql.functions as psf

userLeft.join(
    userRight,
    [c for c in userLeft.columns if c != "date1"],
    "full"
).filter(psf.isnull(userLeft.date1) | psf.isnull(userRight.date1)).show()
+------------------+----------+---+---------+----------+----------+
| email|first_name| id|last_name| date1| date1|
+------------------+----------+---+---------+----------+----------+
|marge.hh@email.com| null| 3| hh|2017-02-10| null|
|marge.hh@email.com| null| 3| hh| null|2017-02-12|
| s.kent@email.com| Steve| 1| Kent|2017-02-08| null|
+------------------+----------+---+---------+----------+----------+
If you want to use a join, whether leftanti or full, you will need to find default values to replace the nulls in your join columns (I think we discussed that in a previous thread).
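One alternative to picking default values, assuming Spark 2.3 or later, is to build the join condition with null-safe equality (eqNullSafe), so that a null on both sides counts as a match, the same way subtract treats it. A minimal sketch:

from functools import reduce

join_cols = [c for c in userLeft.columns if c != "date1"]

# Null-safe equality: None <=> None evaluates to True rather than null,
# so the id=3 rows (first_name=None on both sides) are treated as a match.
cond = reduce(
    lambda a, b: a & b,
    [userLeft[c].eqNullSafe(userRight[c]) for c in join_cols]
)

userLeft.join(userRight, cond, "leftanti").show()

This keeps only the id=1 row, date1 included, without modifying the data.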
You can also just drop the columns that get in the way of your subtract and join them back afterwards:
df = userLeft.drop("date1").subtract(userRight.drop("date1"))
userLeft.join(df, df.columns).show()
+----------------+----------+---+---------+----------+
| email|first_name| id|last_name| date1|
+----------------+----------+---+---------+----------+
|s.kent@email.com| Steve| 1| Kent|2017-02-08|
+----------------+----------+---+---------+----------+
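If this comes up often, the drop/subtract/join steps can be wrapped in a small helper (subtract_ignoring is just an illustrative name, not a built-in, and it assumes a Spark version where drop accepts several columns):

def subtract_ignoring(left, right, ignore_cols):
    # left.subtract(right) on all columns except ignore_cols,
    # then re-attach the ignored columns by joining back onto left.
    # Same caveat as above: nulls in the remaining join columns will not match.
    kept = left.drop(*ignore_cols).subtract(right.drop(*ignore_cols))
    return left.join(kept, kept.columns)

subtract_ignoring(userLeft, userRight, ["date1"]).show()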