Storing dataframe differences when the dataframes have similar column names



I am trying to verify that the count columns of two dataframes match. Wherever there is a mismatch, the rows should be stored in a separate dataframe per source dataframe: one for rows of the first dataframe that are missing from the second or whose counts differ, and one for the reverse.

SQL solution

Data creation:

CREATE TABLE A (
a_row VARCHAR(60) NOT NULL,
counting INT NOT NULL
);
INSERT INTO A 
(a_row, counting) 
VALUES 
("abc|def|2022-05-27|09:05:17.023|12345",2),
("xyz|def|2022-05-27|09:05:17.023|13432",2),
("lkj|def|2022-05-07|09:05:17.023|14362",3);

CREATE TABLE B (
b_row VARCHAR(60) NOT NULL,
counting INT NOT NULL
);
INSERT INTO B 
(b_row, counting) 
VALUES 
("abc|def|2022-05-27|09:05:17.023|12345",2),
("xyz|def|2022-05-27|09:05:17.023|13432",1),
("poi|def|2022-05-27|09:50:17.023|450545",2);

Queries:

-- rows of A not matched exactly in B
select a.a_row, a.counting
from A as a
left outer join B as b
on b.b_row = a.a_row
where b.counting != a.counting or b.counting is null;

-- rows of B not matched exactly in A
select b.b_row, b.counting
from B as b
left outer join A as a
on a.a_row = b.b_row
where b.counting != a.counting or a.counting is null;

Results:

a_row                                      counting
xyz|def|2022-05-27|09:05:17.023|13432      2
lkj|def|2022-05-07|09:05:17.023|14362      3

b_row                                      counting
xyz|def|2022-05-27|09:05:17.023|13432      1
poi|def|2022-05-27|09:50:17.023|450545     2

These are the results I want to store in separate dataframes.
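As a sanity check of the comparison logic outside Spark, the same two-way diff can be sketched in plain Python with dictionaries keyed by the row string. This is only an illustration using the sample data above; the `diff` helper is hypothetical and not part of the Spark job:

```python
# Plain-Python sketch of the two-way comparison (illustration only).
A = {
    "abc|def|2022-05-27|09:05:17.023|12345": 2,
    "xyz|def|2022-05-27|09:05:17.023|13432": 2,
    "lkj|def|2022-05-07|09:05:17.023|14362": 3,
}
B = {
    "abc|def|2022-05-27|09:05:17.023|12345": 2,
    "xyz|def|2022-05-27|09:05:17.023|13432": 1,
    "poi|def|2022-05-27|09:50:17.023|450545": 2,
}

def diff(left, right):
    # keep rows of `left` that are missing from `right` or whose count differs
    return {row: n for row, n in left.items() if right.get(row) != n}

A_difference = diff(A, B)  # rows of A not matched exactly in B
B_difference = diff(B, A)  # rows of B not matched exactly in A
print(A_difference)
print(B_difference)
```

Running this reproduces the two result sets from the SQL queries above, each row paired with the count from its own side.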

Now, creating the dataframes in PySpark.

DataFrame: A

+--------------------------------------------+-----+
|a_row                                       |count|
+--------------------------------------------+-----+
|abc|def|2022-05-27|09:05:17.023|12345       |2    |
|xyz|def|2022-05-27|09:05:17.023|13432       |2    |
|lkj|def|2022-05-07|09:05:17.023|14362       |3    |
+--------------------------------------------+-----+

DataFrame: B

+--------------------------------------------+-----+
|b_row                                       |count|
+--------------------------------------------+-----+
|abc|def|2022-05-27|09:05:17.023|12345       |2    |
|xyz|def|2022-05-27|09:05:17.023|13432       |1    |
|poi|def|2022-05-27|09:50:17.023|450545      |2    |
+--------------------------------------------+-----+

What I have so far, in Python 3.7:

# finding the rows of A not found in B or count is not equal
A_difference = A.join(B, how='left', on='A.a_row = B.b_row').where('A.counting  != A.counting or B.counting is Null')
# finding the rows of B not found in A or count is not equal
B_difference = B.join(A, how='left', on='A.a_row = B.b_row').where('A.counting  != A.counting or B.counting is Null')

Error:

pyspark.sql.utils.AnalysisException: USING column `A.a_row = B.b_row` cannot be resolved on the left side of the join. The left-side columns: [a_row, count]

I cannot figure out how to get the expected output, i.e. the same results I got from the SQL queries.

The main issue is that you reference A.col, but Spark does not know what A is. You first have to create aliases for the dataframes. In addition, the syntax of the on clause is wrong.

A_difference = (
    A.alias('A')
    .join(B.alias('B'), how='left', on=A.a_row == B.b_row)
    .where('B.counting != A.counting or B.counting is null')
    .select('A.*')
)
B_difference = (
    B.alias('B')
    .join(A.alias('A'), how='left', on=A.a_row == B.b_row)
    .where('B.counting != A.counting or A.counting is null')
    .select('B.*')
)
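As a side note, when the goal is "rows of one dataframe with no exact match in the other", Spark also offers a left anti-join on both columns (`how='left_anti'`) and `DataFrame.exceptAll`, which avoid the where clause entirely (assuming the column names are aligned between the two dataframes). The anti-join semantics amount to a set difference over (row, count) pairs, sketched here in plain Python as an illustration only, not Spark code:

```python
# Semantics of a left anti-join on BOTH columns, sketched in plain Python.
# In Spark this would be A.exceptAll(B) or a join with how='left_anti'.
A_rows = [
    ("abc|def|2022-05-27|09:05:17.023|12345", 2),
    ("xyz|def|2022-05-27|09:05:17.023|13432", 2),
    ("lkj|def|2022-05-07|09:05:17.023|14362", 3),
]
B_rows = [
    ("abc|def|2022-05-27|09:05:17.023|12345", 2),
    ("xyz|def|2022-05-27|09:05:17.023|13432", 1),
    ("poi|def|2022-05-27|09:50:17.023|450545", 2),
]

def anti_join(left, right):
    # keep (row, count) pairs of `left` with no exact match in `right`
    right_set = set(right)
    return [pair for pair in left if pair not in right_set]

print(anti_join(A_rows, B_rows))  # rows of A not matched exactly in B
print(anti_join(B_rows, A_rows))  # rows of B not matched exactly in A
```

Unlike the left-join-plus-where approach, the anti-join returns only the left side's columns, which matches the `.select('A.*')` / `.select('B.*')` in the fix above.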
