Pandas: mapping a multi-key dictionary onto a DataFrame



I have the following dataset:

+────────────────────────────────+─────────────────+─────────────────+
| product_name                   | column_value_1  | column_value_2  |
+────────────────────────────────+─────────────────+─────────────────+
| Coca Cola 1L                   | 1               | 1.8             |
| Carbonated drink Coca Cola 1L  | NaN             | 1.9             |
| Кола Кола                      | 2               | NaN             |
| Yellow Cheese                  | NaN             | 4.2             |
| Packed Yellow Cheese 1KG       | 4               | 5               |
| Packed Yellow Cheese           | 4.8             | 5               |
| Yogurt                         | 2.2             | 2               |
+────────────────────────────────+─────────────────+─────────────────+

and the following dictionary:

product_map = {
    ("Кола Кола", "Coca Cola 1L", "Carbonated drink Coca Cola 1L"): "Coca Cola 1L",
    ("Yellow Cheese", "Packed Yellow Cheese 1KG", "Packed Yellow Cheese"): "Packed Yellow Chees",
}

Note that the last product (Yogurt) does not appear in the dictionary, so it should get no value (NaN) in group_by_column.

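I want to produce the following table, and then group by the new column. How can I do this? A different approach that does not use a dictionary and map is also welcome.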

+────────────────────────────────+─────────────────+─────────────────+──────────────────────+
| product_name                   | column_value_1  | column_value_2  | group_by_column      |
+────────────────────────────────+─────────────────+─────────────────+──────────────────────+
| Coca Cola 1L                   | 1               | 1.8             | Coca Cola 1L         |
| Carbonated drink Coca Cola 1L  | NaN             | 1.9             | Coca Cola 1L         |
| Кола Кола                      | 2               | NaN             | Coca Cola 1L         |
| Yellow Cheese                  | NaN             | 4.2             | Packed Yellow Chees  |
| Packed Yellow Cheese 1KG       | 4               | 5               | Packed Yellow Chees  |
| Packed Yellow Cheese           | 4.8             | 5               | Packed Yellow Chees  |
| Yogurt                         | 2.2             | 2               | NaN                  |
+────────────────────────────────+─────────────────+─────────────────+──────────────────────+

Expanding product_map so that every product name becomes its own key should simplify things a lot:

product_map_expanded = {i: val for key, val in product_map.items() for i in key}

product_map_expanded now looks like:

{'Кола Кола': 'Coca Cola 1L',
'Coca Cola 1L': 'Coca Cola 1L',
'Carbonated drink Coca Cola 1L': 'Coca Cola 1L',
'Yellow Cheese': 'Packed Yellow Chees',
'Packed Yellow Cheese 1KG': 'Packed Yellow Chees',
'Packed Yellow Cheese': 'Packed Yellow Chees'}
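One thing the comprehension does not guard against: if the same product name appears in two tuples, the later one silently wins. A minimal sketch of a stricter expansion, where `expand_map` is a hypothetical helper name of my own and not part of pandas:

```python
product_map = {
    ("Кола Кола", "Coca Cola 1L", "Carbonated drink Coca Cola 1L"): "Coca Cola 1L",
    ("Yellow Cheese", "Packed Yellow Cheese 1KG", "Packed Yellow Cheese"): "Packed Yellow Chees",
}


def expand_map(grouped):
    """Flatten {(alias, ...): group} into {alias: group}.

    Hypothetical helper: raises if one alias is assigned to two different groups.
    """
    expanded = {}
    for aliases, group in grouped.items():
        for alias in aliases:
            if alias in expanded and expanded[alias] != group:
                raise ValueError(
                    f"{alias!r} is mapped to both {expanded[alias]!r} and {group!r}"
                )
            expanded[alias] = group
    return expanded


product_map_expanded = expand_map(product_map)
```

For the dictionary in the question this gives the same result as the one-line comprehension; it only behaves differently when the input contains a conflicting name.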

Now you can map it onto the product_name column:

df['group_by_column'] = df.product_name.map(product_map_expanded)

which makes your DataFrame:

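                    product_name  column_value_1  column_value_2      group_by_column
0                   Coca Cola 1L             1.0             1.8         Coca Cola 1L
1  Carbonated drink Coca Cola 1L             NaN             1.9         Coca Cola 1L
2                      Кола Кола             2.0             NaN         Coca Cola 1L
3                  Yellow Cheese             NaN             4.2  Packed Yellow Chees
4       Packed Yellow Cheese 1KG             4.0             5.0  Packed Yellow Chees
5           Packed Yellow Cheese             4.8             5.0  Packed Yellow Chees
6                         Yogurt             2.2             2.0                  NaN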
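Putting the pieces together, here is a self-contained sketch of the whole flow, including the group-by step the question asks about. The DataFrame is rebuilt by hand from the table in the question, and summing the two value columns is only an assumption about what the grouping is for; swap in whatever aggregation you need.

```python
import numpy as np
import pandas as pd

# Dataset from the question, rebuilt by hand.
df = pd.DataFrame({
    "product_name": [
        "Coca Cola 1L",
        "Carbonated drink Coca Cola 1L",
        "Кола Кола",
        "Yellow Cheese",
        "Packed Yellow Cheese 1KG",
        "Packed Yellow Cheese",
        "Yogurt",
    ],
    "column_value_1": [1, np.nan, 2, np.nan, 4, 4.8, 2.2],
    "column_value_2": [1.8, 1.9, np.nan, 4.2, 5, 5, 2],
})

product_map = {
    ("Кола Кола", "Coca Cola 1L", "Carbonated drink Coca Cola 1L"): "Coca Cola 1L",
    ("Yellow Cheese", "Packed Yellow Cheese 1KG", "Packed Yellow Cheese"): "Packed Yellow Chees",
}

# One key per product name instead of one key per tuple of names.
product_map_expanded = {
    name: group for names, group in product_map.items() for name in names
}

# Names missing from the dictionary (Yogurt) become NaN.
df["group_by_column"] = df["product_name"].map(product_map_expanded)

# groupby skips NaN keys by default, so Yogurt is left out of the totals.
# sum() was chosen for illustration only.
totals = df.groupby("group_by_column")[["column_value_1", "column_value_2"]].sum()
print(totals)
```

If the unmapped rows should form their own group, recent pandas versions accept `dropna=False` in `groupby`; alternatively, fill the NaN values with a placeholder label before grouping.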

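Alternatively, you can use a dictionary that maps each group name to a list of product names, and build the new column with apply and a lambda. The lambda calls a function that searches the lists for the row's index label and returns the matching key.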

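import pandas as pd

# df is assumed to already hold the dataset above; index it by product name
df = pd.DataFrame(df).set_index('product_name')
product_map = {
    "Coca Cola 1L": ["Кола Кола", "Coca Cola 1L", "Carbonated drink Coca Cola 1L"],
    "Packed Yellow Chees": ["Yellow Cheese", "Packed Yellow Cheese 1KG", "Packed Yellow Cheese"],
}

def get_key(k):
    # Return the group whose list contains k; falls through to None if no list matches
    for key, item_list in product_map.items():
        if k in item_list:
            return key

# x.name is the row's index label, i.e. the product name
df['group_by_column'] = df.apply(lambda x: get_key(x.name), axis=1)
print(df)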

                               column_value_1  column_value_2      group_by_column
product_name
Coca Cola 1L                              1.0             1.8         Coca Cola 1L
Carbonated drink Coca Cola 1L             NaN             1.9         Coca Cola 1L
Кола Кола                                 2.0             NaN         Coca Cola 1L
Yellow Cheese                             NaN             4.2  Packed Yellow Chees
Packed Yellow Cheese 1KG                  4.0             5.0  Packed Yellow Chees
Packed Yellow Cheese                      4.8             5.0  Packed Yellow Chees
Yogurt                                    2.2             2.0                 None
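Since the question also asks for an approach that does not rely on map, here is a sketch that turns the same key-to-list dictionary into a small lookup table and left-joins it. Unlike the snippet above, it keeps product_name as an ordinary column; the DataFrame is rebuilt by hand from the question, and the names `lookup` and `result` are my own.

```python
import numpy as np
import pandas as pd

# Dataset from the question, rebuilt by hand.
df = pd.DataFrame({
    "product_name": [
        "Coca Cola 1L",
        "Carbonated drink Coca Cola 1L",
        "Кола Кола",
        "Yellow Cheese",
        "Packed Yellow Cheese 1KG",
        "Packed Yellow Cheese",
        "Yogurt",
    ],
    "column_value_1": [1, np.nan, 2, np.nan, 4, 4.8, 2.2],
    "column_value_2": [1.8, 1.9, np.nan, 4.2, 5, 5, 2],
})

product_map = {
    "Coca Cola 1L": ["Кола Кола", "Coca Cola 1L", "Carbonated drink Coca Cola 1L"],
    "Packed Yellow Chees": ["Yellow Cheese", "Packed Yellow Cheese 1KG", "Packed Yellow Cheese"],
}

# One row per (product name, group) pair.
lookup = pd.DataFrame(
    [(name, group) for group, names in product_map.items() for name in names],
    columns=["product_name", "group_by_column"],
)

# A left join keeps every row of df; products without a match get NaN.
# validate="many_to_one" makes pandas raise if a product name is listed twice in lookup.
result = df.merge(lookup, on="product_name", how="left", validate="many_to_one")
print(result)
```

The join does the matching in one vectorized step instead of scanning the lists once per row, which may matter on larger frames. Note that unmatched rows come back as NaN here rather than None.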
