Python: matching DataFrames with a classification tree



Given the following DataFrames, one holding transaction data and the other holding categorization rules:

import pandas as pd
data = {'Transaction_description': ['sfdsjk fsjfdkj;f sfsdf RESTARANT', 'fsdk ;kjf;lskf;m gjkf NL111111111111 klkfdlo', 'golf kjnfksdn DE111111111112 fkdkk', 'jhfjd jhfj Jumbo jhf'], 'Amount': [-20, -21, -30, 10]} 
Transactions = pd.DataFrame(data)  
data = {
'Priority': [1, 1, 2, 2, 3, 3], 
'Type': ['IBAN', 'IBAN', 'Company', 'Company', 'Keyword','Keyword'],
'Value': ['NL111111111111', 'DE111111111112', 'AMAZON', 'JUMBO','Restaurant','Golf'],
'Description': ['', '', '', '','',''],
'MappingCode': ['A1', 'A2', 'B1', 'B2','B1','B2']
} 
Categorization = pd.DataFrame(data)

I want to classify every [Transaction_description] by searching for the rules in [Priority] order:
(1) IBAN
(2) Company
(3) Keyword

What is the most elegant way to get the following expected result?

data = {
'Transaction_description': ['sfdsjk fsjfdkj;f sfsdf RESTARANT', 'fsdk ;kjf;lskf;m gjkf NL111111111111 klkfdlo', 'golf kjnfksdn DE111111111112 fkdkk', 'jhfjd jhfj Jumbo jhf'], 
'Amount': [-20, -21, -30, 10],
'MappingCode': ['B1','A1','A2','B2']
} 
TransactionsClassified = pd.DataFrame(data) 

Thanks and best regards, 海布里

Your data is somewhat unstructured, which makes this a bit harder.

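  • First, clean up your data: fix the typos (for example RESTARANT) and convert everything to lowercase.
  • Create a column in your Transactions DataFrame that holds the string to merge on. To do this, build a list of candidate strings and apply np.where() once per string. The order of the list determines the classification priority: later entries overwrite earlier ones, so the highest-priority type goes last.
  • Then merge your Categorization DataFrame on this string.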

Your code could look like this:

import numpy as np
import pandas as pd
# Data input
data = {'Transaction_description': ['sfdsjk fsjfdkj;f sfsdf RESTAURANT', 'fsdk ;kjf;lskf;m gjkf NL111111111111 klkfdlo', 'golf kjnfksdn DE111111111112 fkdkk', 'jhfjd jhfj Jumbo jhf'], 'Amount': [-20, -21, -30, 10]} 
Transactions = pd.DataFrame(data)  
data = {
'Priority': [1, 1, 2, 2, 3, 3], 
'Type': ['IBAN', 'IBAN', 'Company', 'Company', 'Keyword','Keyword'],
'Value': ['NL111111111111', 'DE111111111112', 'AMAZON', 'JUMBO','Restaurant','Golf'],
'Description': ['', '', '', '','',''],
'MappingCode': ['A1', 'A2', 'B1', 'B2','B1','B2']
} 
Categorization = pd.DataFrame(data)
# Make everything lowercase
Transactions["Transaction_description"] =  Transactions["Transaction_description"].str.lower()
Categorization["Value"] = Categorization["Value"].str.lower()
# Create the column you can merge on
keywordList = list(Categorization[Categorization["Type"] == "Keyword"]["Value"])
ibanList = list(Categorization[Categorization["Type"] == "IBAN"]["Value"])
companyList = list(Categorization[Categorization["Type"] == "Company"]["Value"])
allList = keywordList + companyList + ibanList
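# Start from an object column so that string values can be written into it
Transactions["Value"] = None
for element in allList:
    # Later elements overwrite earlier ones, so IBAN matches (last in the list) win
    Transactions["Value"] = np.where(Transactions["Transaction_description"].str.contains(element, regex=False), element, Transactions["Value"])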
# Merge the dataframes
# how="left" keeps transactions that matched no rule (their MappingCode is NaN)
TransactionsClassified = Transactions[["Transaction_description", "Value", "Amount"]].merge(Categorization[["Value", "MappingCode"]], on="Value", how="left")
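
If you would rather keep the original casing of the descriptions and not rely on list order, a variant is to walk the rules in Priority order and only fill rows that are still unclassified. This is a minimal sketch, not a definitive solution: the helper name classify is hypothetical, the match is a plain case-insensitive substring test, and it assumes the RESTARANT typo has already been corrected in the data.

```python
import pandas as pd

transactions = pd.DataFrame({
    "Transaction_description": [
        "sfdsjk fsjfdkj;f sfsdf RESTAURANT",
        "fsdk ;kjf;lskf;m gjkf NL111111111111 klkfdlo",
        "golf kjnfksdn DE111111111112 fkdkk",
        "jhfjd jhfj Jumbo jhf",
    ],
    "Amount": [-20, -21, -30, 10],
})
categorization = pd.DataFrame({
    "Priority": [1, 1, 2, 2, 3, 3],
    "Type": ["IBAN", "IBAN", "Company", "Company", "Keyword", "Keyword"],
    "Value": ["NL111111111111", "DE111111111112", "AMAZON", "JUMBO", "Restaurant", "Golf"],
    "MappingCode": ["A1", "A2", "B1", "B2", "B1", "B2"],
})


def classify(transactions, categorization):
    """Hypothetical helper: the first rule (lowest Priority) found in a description wins."""
    rules = categorization.sort_values("Priority", kind="stable")
    result = transactions.copy()
    result["MappingCode"] = None
    # Lowercase a temporary copy only, so the original text is left untouched
    descriptions = result["Transaction_description"].str.lower()
    for value, code in zip(rules["Value"], rules["MappingCode"]):
        hit = descriptions.str.contains(value.lower(), regex=False)
        unset = result["MappingCode"].isna()
        # Only rows without a code yet are filled, so higher-priority rules are kept
        result.loc[hit & unset, "MappingCode"] = code
    return result


transactions_classified = classify(transactions, categorization)
print(transactions_classified)
```

Rows that match no rule keep an empty MappingCode, which makes them easy to inspect afterwards. For the third transaction both the IBAN and the keyword golf match, and the IBAN rule is applied because it has the lower Priority number.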
