我有两个大的csv文件(两个文件都有大约一百万行不同的列名,一个文件中大约有70列(。我想使用python-panda执行左联接(类似sql(,并创建一个新的csv文件。
使用sql和下面的查询可以实现相同的操作
select opportunities.* , data_dump.OpportunityID
from opportunities
left join data_dump on (opportunities.LeadIdentifier=data_dump.LeadId and opportunities.ProductSku=data_dump.ProductName)
我本来想做这样的事情,但这对这个大数据来说效率很低——
fetched_opportunities = pd.read_csv(path + "/data_dump.csv").fillna('')
data_obj = fetched_opportunities.to_dict(orient='records')
fetched_opportunities2 = pd.read_csv(path + "/opportunities.csv").fillna('')
data_obj2 = fetched_opportunities2.to_dict(orient='records')
for opportunity_detail2 in data_obj:
for opportunity_detail1 in data_obj:
if opportunity_detail2['LeadIdentifier'] == opportunity_detail1['LeadId'] & opportunity_detail2['ProductSku'] == opportunity_detail1['ProductName']:
尝试使用merge
函数,如下所示:
fetched_opportunities = pd.read_csv(path + "/data_dump.csv").fillna('')
fetched_opportunities2 = pd.read_csv(path + "/opportunities.csv").fillna('')
out=fetched_opportunities[["OpportunityID","LeadId","ProductName"]].merge(fetched_opportunities2,how='left',left_on=['LeadId','ProductName'],right_on=['LeadIdentifier','ProductSku']).drop(["LeadId","ProductName"],axis=1)