Python: most efficient way to categorize transactions



I have a large list of transactions that I want to categorize. It looks like this:

{
  "transactions": [
    {
      "id": "20200117-16045-0",
      "date": "2020-01-17",
      "creationTime": null,
      "text": "SuperB Vesterbro T 74637",
      "originalText": "SuperB Vesterbro T 74637",
      "details": null,
      "category": null,
      "amount": {
        "value": -160.45,
        "currency": "DKK"
      },
      "balance": {
        "value": 12572.68,
        "currency": "DKK"
      },
      "type": "Card",
      "state": "Booked"
    },
    {
      "id": "20200117-4800-0",
      "date": "2020-01-17",
      "creationTime": null,
      "text": "Rent        45228",
      "originalText": "Rent        45228",
      "details": null,
      "category": null,
      "amount": {
        "value": -48.00,
        "currency": "DKK"
      },
      "balance": {
        "value": 12733.13,
        "currency": "DKK"
      },
      "type": "Card",
      "state": "Booked"
    },
    {
      "id": "20200114-1200-0",
      "date": "2020-01-14",
      "creationTime": null,
      "text": "Superbest          86125",
      "originalText": "SUPERBEST          86125",
      "details": null,
      "category": null,
      "amount": {
        "value": -12.00,
        "currency": "DKK"
      },
      "balance": {
        "value": 12781.13,
        "currency": "DKK"
      },
      "type": "Card",
      "state": "Booked"
    }
  ]
}

I load the data like this:

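import json
import pandas as pd

def load_transactions(path='transactions.json'):  # function name is illustrative; the original omits the def line
    with open(path) as transactions:
        file = json.load(transactions)
    data = pd.json_normalize(file)['transactions'][0]  # the list of transaction dicts
    return pd.DataFrame(data)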
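
As a side note on the loading step, the nested "amount" and "balance" objects stay as dictionaries inside single columns when the list is passed straight to pd.DataFrame. A minimal sketch, assuming the file has the shape shown above, of letting pd.json_normalize flatten them through its record_path parameter (the inline JSON is a shortened copy of the sample; the variable names raw and flat are made up for illustration):

```python
import json
import pandas as pd

# Shortened copy of the sample data, inlined so the sketch runs without a file
raw = json.loads('''
{"transactions": [
  {"id": "20200117-16045-0", "text": "SuperB Vesterbro T 74637",
   "amount": {"value": -160.45, "currency": "DKK"}},
  {"id": "20200117-4800-0", "text": "Rent        45228",
   "amount": {"value": -48.00, "currency": "DKK"}}
]}
''')

# record_path points at the list of records; nested dicts become dotted columns
flat = pd.json_normalize(raw, record_path='transactions')
print(flat.columns.tolist())
```

With this layout the amount is available as the column 'amount.value', which may be handy if you later want to sum spending per category.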

So far I have the following categories, and I want to group the transactions by them:

CATEGORIES = {
    'Groceries': ['SuperB', 'Superbest'],
    'Housing': ['Insurance', 'Rent']
}

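Now I want to go through every row in the DataFrame and categorize each transaction. I want to do this by checking whether text contains one of the values in the CATEGORIES dictionary.

If it does, the transaction should be assigned the corresponding key of the CATEGORIES dictionary, e.g. Groceries.

How can I do this most efficiently?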

IIUC,

We can build a pipe-delimited pattern from each list of values in your dictionary and assign the category with .loc:

print(df)
for k, v in CATEGORIES.items():
    pat = '|'.join(v)  # e.g. 'SuperB|Superbest'
    df.loc[df['text'].str.contains(pat), 'category'] = k
print(df[['text', 'category']])

                       text   category
0  SuperB Vesterbro T 74637  Groceries
1         Rent        45228    Housing
2  Superbest          86125  Groceries
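
The loop above treats every keyword as a regular expression and is case-sensitive. Here is a hedged, self-contained variant of the same idea, assuming keywords should match literally and regardless of case, and that some rows may have missing text. The rows 'Unknown shop 123' and None are made up for illustration; re.escape and the case and na parameters of str.contains are standard.

```python
import re
import pandas as pd

CATEGORIES = {
    'Groceries': ['SuperB', 'Superbest'],
    'Housing': ['Insurance', 'Rent'],
}

# First three rows come from the question; the last two are made-up edge cases
df = pd.DataFrame({'text': ['SuperB Vesterbro T 74637',
                            'Rent        45228',
                            'Superbest          86125',
                            'Unknown shop 123',
                            None]})
df['category'] = None

for category, keywords in CATEGORIES.items():
    # Escape each keyword so characters like '.' or '+' are matched literally
    pattern = '|'.join(re.escape(word) for word in keywords)
    # case=False ignores letter case; na=False treats missing text as "no match"
    mask = df['text'].str.contains(pattern, case=False, na=False)
    df.loc[mask, 'category'] = category

print(df)
```

Rows that match no keyword keep an empty category, so they are easy to find afterwards with df['category'].isna().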

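A more efficient solution:

We build a list of all the values and extract them with str.extract, re-creating the dictionary as we go so that each value becomes a key that we then map onto the target DataFrame. This scans the column once instead of once per category.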

words = []
mapping_dict = {}
for k, v in CATEGORIES.items():
    for item in v:
        words.append(item)
        mapping_dict[item] = k  # keyword -> category

# One capture group containing every keyword, e.g. '(SuperB|Superbest|Insurance|Rent)'
ext = df['text'].str.extract(f"({'|'.join(words)})")
df['category'] = ext[0].map(mapping_dict)
print(df)

                       text   category
0  SuperB Vesterbro T 74637  Groceries
1         Rent        45228    Housing
2  Superbest          86125  Groceries
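
For reference, a compact, self-contained sketch of the extract-and-map approach. It makes a few assumptions that are not in the answer above: keywords are escaped with re.escape, longer keywords are tried first so that a keyword which is a prefix of another does not shadow it, and unmatched rows get a placeholder label. The label 'Uncategorized' and the row 'Unknown shop 123' are made up for illustration.

```python
import re
import pandas as pd

CATEGORIES = {
    'Groceries': ['SuperB', 'Superbest'],
    'Housing': ['Insurance', 'Rent'],
}

# First three rows come from the question; the last one is a made-up non-match
df = pd.DataFrame({'text': ['SuperB Vesterbro T 74637',
                            'Rent        45228',
                            'Superbest          86125',
                            'Unknown shop 123']})

# Invert the dictionary: keyword -> category
keyword_to_category = {word: category
                       for category, words in CATEGORIES.items()
                       for word in words}

# Try longer keywords first; regex alternation picks the first alternative that matches
ordered = sorted(keyword_to_category, key=len, reverse=True)
pattern = '(' + '|'.join(re.escape(word) for word in ordered) + ')'

# expand=False returns a Series for a single capture group
df['category'] = (df['text'].str.extract(pattern, expand=False)
                  .map(keyword_to_category)
                  .fillna('Uncategorized'))  # hypothetical placeholder label

print(df)
```

Note that str.extract only returns the first match in each text, so a transaction whose text contains keywords from two categories is assigned to whichever keyword appears first.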
