How to label a pandas Series using dictionary keys



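I have a pandas Series (data) taken from a column named description, and I created a new column, Label, that simply checks whether a dictionary key appears in the description column. If it does, the description is labelled according to the key that was found, e.g.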

         description               Label
427096   alat airtime recharge     bills
1093255  alat nip transfer         transfers
549792   alat transfer             transfers
1163429  wema ussd transfer        transfers

The dictionary:

labels = {
    # transfer
    "tnf": "transfers", "trsf": "transfers", "trtr": "transfers", "trans": "transfers",
    # bills
    "otp": "bills", "fee": "bills", "charge": "bills",
    # airtime
    "recharge": "airtime", "airtime": "airtime", "top-up": "airtime",
}

Below is the code that does the check:

labs = []
# Labelling the transaction according to the dictionary defined
for i in data:
    f = 0
    # check if key j occurs in description i
    for j in list(labels.keys()):
        if j in i:
            labs.append(labels[j])
            f = 1
            break
    if f == 0:
        labs.append("others")
df["Label"] = pd.DataFrame(labs)
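The main problem here is that the code does not check for an exact match: a description like airtime recharge should be labelled airtime, while the dictionary key trans should still label a description containing transfer as transfers.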

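The problem is that you are not checking for an exact match; you are only checking whether a substring occurs in the string. So 'charge' in 'recharge' returns True, and because "charge" comes before "recharge" and "airtime" in the dictionary, the row is labelled bills.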
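To make the difference concrete, here is a small sketch comparing the three kinds of test on descriptions from the question. The variable names are only for illustration.

```python
description = "alat airtime recharge"

# Substring test: 'charge' occurs inside 'recharge', so this is True
substring_hit = "charge" in description

# Whole-word test: no word is exactly 'charge', so this is False
words = description.split()
word_hit = "charge" in words

# Prefix test: no word starts with 'charge', so this is False as well
prefix_hit = any(w.startswith("charge") for w in words)

# The prefix test still lets the short key 'trans' match 'transfer'
trans_hit = any(w.startswith("trans") for w in "alat nip transfer".split())

print(substring_hit, word_hit, prefix_hit, trans_hit)
# prints: True False False True
```

A whole-word test alone would be too strict here, since keys such as "trans" are meant to match longer words like "transfer"; a prefix test covers both cases.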

So you can either use a regular expression, or simply split the description into a list of words and check each word against the keys.

It is not the most efficient approach, but you could do this:

import pandas as pd

df = pd.DataFrame([['alat airtime recharge'],
                   ['alat nip transfer'],
                   ['alat transfer'],
                   ['wema ussd transfer']], columns=['description'])

labels = {
    # transfer
    "tnf": "transfers", "trsf": "transfers", "trtr": "transfers", "trans": "transfers",
    # bills
    "otp": "bills", "fee": "bills", "charge": "bills",
    # airtime
    "recharge": "airtime", "airtime": "airtime", "top-up": "airtime",
}

labs = []
data = df['description']
# Label each transaction according to the dictionary defined above
for i in data:
    check_list = i.split()  # split the description into words
    label = "others"        # default when no key matches
    for j in labels:
        # match only when a word starts with the key, so 'charge'
        # no longer matches inside 'recharge'
        if any(x.startswith(j) for x in check_list):
            label = labels[j]
            break
    labs.append(label)
# assign the list directly so the labels line up row by row,
# whatever the index of df is
df["Label"] = labs

Output:

print(df)
             description      Label
0  alat airtime recharge    airtime
1      alat nip transfer  transfers
2          alat transfer  transfers
3     wema ussd transfer  transfers
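The regular-expression route mentioned above could look roughly like the sketch below. It is not part of the original answer: it assumes that matching a key at the start of a word (`\b`) is what is wanted, and it adds one hypothetical row ('pos purchase') only to show the "others" fallback.

```python
import re
import pandas as pd

labels = {
    "tnf": "transfers", "trsf": "transfers", "trtr": "transfers", "trans": "transfers",
    "otp": "bills", "fee": "bills", "charge": "bills",
    "recharge": "airtime", "airtime": "airtime", "top-up": "airtime",
}

df = pd.DataFrame({"description": ["alat airtime recharge",
                                   "alat nip transfer",
                                   "alat transfer",
                                   "wema ussd transfer",
                                   "pos purchase"]})  # hypothetical row, no key matches

# Build one alternation of all keys, anchored at a word start with \b.
# Longer keys go first so that, at the same position, the longer key wins.
keys = sorted(labels, key=len, reverse=True)
pattern = r"\b(" + "|".join(re.escape(k) for k in keys) + ")"

# str.extract returns the first match in each row (NaN when nothing matches)
matched = df["description"].str.extract(pattern, expand=False)

# Translate the matched key into its label; unmatched rows become "others"
df["Label"] = matched.map(labels).fillna("others")
print(df)
```

One difference from the loop version is worth noting: here the key that appears earliest in the description wins, whereas the loop picks the first matching key in dictionary order. For the sample rows both give the same labels, but a description containing two different keys could be labelled differently.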
