I'm trying to write a script that uses dask to scrub information from a CSV. I have a dask DataFrame created from a CSV, which looks like this:
CUSTOMER ORDERS
   hashed_customer  firstname    lastname    email    order_id  status    timestamp
0  eater 1_uuid     1_firstname  1_lastname  1_email  12345     OPTED_IN  2020-05-14 20:45:15
1  eater 2_uuid     2_firstname  2_lastname  2_email  23456     OPTED_IN  2020-05-14 20:29:22
2  eater 3_uuid     3_firstname  3_lastname  3_email  34567     OPTED_IN  2020-05-14 19:31:55
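I have another CSV containing the hashed_customers that need to be scrubbed from this file. So if a hashed_customer from that file appears in the customer orders, I need to remove the first name, last name, and email from that row while keeping everything else, like this: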
CUSTOMER ORDERS
   hashed_customer  firstname    lastname    email    order_id  status    timestamp
0  eater 1_uuid     NULL         NULL        NULL     12345     OPTED_IN  2020-05-14 20:45:15
1  eater 2_uuid     2_firstname  2_lastname  2_email  23456     OPTED_IN  2020-05-14 20:29:22
2  eater 3_uuid     3_firstname  3_lastname  3_email  34567     OPTED_IN  2020-05-14 19:31:55
My current script looks like this:
print('FIND ORDERS FROM OPT-OUT CUSTOMERS')
# Collect the order ids that belong to opted-out customers.
cust_opt_out_order = []
for index, row in df_in.iterrows():
    if row.hashed_eater_uuid in cust_opt_out_id:
        cust_opt_out_order.append(row.order_id)
print('REMOVE OPT-OUT FROM OPT-IN FILE')
# Keep only the rows whose customer id is NOT in the opt-out list.
df_cust_out = df_in[~df_in['hashed_eater_uuid'].isin(cust_opt_out_id)]
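But this drops the entire row. What I need now is to keep the row and remove only the name and email values from it. How can I remove elements from a row using pandas?
I'm trying to get the dask equivalent of this pandas statement: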
df_cust_out.loc[df_in['hashed_eater_uuid'].isin(cust_opt_out_id), ['firstname', 'lastname', 'email']] = np.nan
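To make the target behaviour concrete, here is a minimal pandas-only sketch of that assignment. The sample rows are copied from the table above; the opt-out list contents and the `.copy()` call are assumptions for illustration, not part of the original script.

```python
import numpy as np
import pandas as pd

# Sample rows taken from the question (timestamp omitted for brevity).
df_in = pd.DataFrame({
    "hashed_eater_uuid": ["eater 1_uuid", "eater 2_uuid", "eater 3_uuid"],
    "firstname": ["1_firstname", "2_firstname", "3_firstname"],
    "lastname": ["1_lastname", "2_lastname", "3_lastname"],
    "email": ["1_email", "2_email", "3_email"],
    "order_id": [12345, 23456, 34567],
    "status": ["OPTED_IN", "OPTED_IN", "OPTED_IN"],
})

# Assumed opt-out list: only the first customer has opted out.
cust_opt_out_id = ["eater 1_uuid"]

# Work on a copy so the input frame is left untouched.
df_cust_out = df_in.copy()

# Boolean mask of rows that belong to opted-out customers.
opted_out = df_cust_out["hashed_eater_uuid"].isin(cust_opt_out_id)

# Blank out only the personal columns on the masked rows.
df_cust_out.loc[opted_out, ["firstname", "lastname", "email"]] = np.nan
```

The rows themselves stay in place, so `order_id` and `status` are preserved for every customer.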
I'd suggest taking a look at the DataFrame.where or Series.where methods:
https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.where