Join two pyarrow tables



I have an ORC file with the following data.

Table A:

Name    age     school      address      phone
tony    12      havard      UUU          666
tommy   13      abc         Null         Null
john    14      cde         Null         Null
john    14      cde         Null         Null

Table B:

Name    address     phone
tommy   USD         345
john    ASA         444

Expected table after the join:

Name    age     school      address      phone
tony    12      havard      UUU          666
tommy   13      abc         USD          345
john    14      cde         ASA          444
john    14      cde         ASA          444

How can I do this with pyarrow or pandas? The names in Table A are not unique; the names in Table B are unique.

Try this:

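# Index both frames by "Name" so that update() aligns rows on it
dfA.set_index('Name', inplace=True)
# Overwrite values in dfA with the non-NA values from dfB (in place)
dfA.update(dfB.set_index('Name'))
# reset_index() returns a new frame, so assign the result back
dfA = dfA.reset_index()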

Note: the "Name" column should have unique values, as @Antti Haapala -- Слава Україні mentioned.

When the "Address" or "Phone" value for a given "Name" differs between A and B, the value in Table A is updated with the value from Table B.
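Since Table A in the question contains duplicate names, a merge-based variant may be easier to reason about than an index-based update. The following is only a minimal sketch, not the answerer's method: the sample frames are rebuilt from the tables in the question, and the `_b` suffix is an arbitrary choice made for illustration.

```python
import pandas as pd

# Sample data rebuilt from the question (Table A has a duplicated "john" row)
dfA = pd.DataFrame({
    "Name": ["tony", "tommy", "john", "john"],
    "age": [12, 13, 14, 14],
    "school": ["havard", "abc", "cde", "cde"],
    "address": ["UUU", None, None, None],
    "phone": [666, None, None, None],
})
dfB = pd.DataFrame({
    "Name": ["tommy", "john"],
    "address": ["USD", "ASA"],
    "phone": [345, 444],
})

# Left merge keeps every row of dfA; columns coming from dfB get the "_b" suffix
merged = dfA.merge(dfB, on="Name", how="left", suffixes=("", "_b"))

# Prefer the value from dfB and fall back to dfA where dfB has no match
for col in ["address", "phone"]:
    merged[col] = merged[col + "_b"].combine_first(merged[col])

result = merged.drop(columns=["address_b", "phone_b"])
print(result)
```

Because the null phone values force a floating-point column here, `phone` comes back as floats; casting it with `astype(int)` afterwards is one option once no missing values remain.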

In pyarrow, starting with 8.0.0, you can do this with a combination of join and coalesce.

import pyarrow as pa
import pyarrow.compute as pc

table_a = pa.Table.from_pydict({
    "name": ["tony", "tommy", "john"],
    "age": [12, 13, 14],
    "school": ["havard", "abc", "cde"],
    "address": ["UUU", None, None],
    "phone": [666, None, None]
})
table_b = pa.Table.from_pydict({
    "name": ["tommy", "john"],
    "address": ["USD", "ASA"],
    "phone": [345, 444]
})

# Left outer join (the default); colliding columns from table_b get the '_r' suffix
combined = table_a.join(table_b, 'name', right_suffix='_r')

# Take the value from table_b when it is not null, otherwise keep table_a's value
coalesced_addrs = pc.coalesce(combined.column('address_r'), combined.column('address'))
coalesced_phone = pc.coalesce(combined.column('phone_r'), combined.column('phone'))

# Rebuild the table with the original column layout
result = pa.Table.from_pydict({
    'name': combined.column('name'),
    'age': combined.column('age'),
    'school': combined.column('school'),
    'address': coalesced_addrs,
    'phone': coalesced_phone
})
print(result)
