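I have a pandas Series (data) taken from a column named description, and I created a new column, Label, that simply checks whether any dictionary key occurs in the description column and, if so, labels the description according to the key found. For example: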
         description              Label
427096   alat airtime recharge    bills
1093255  alat nip transfer        transfers
549792   alat transfer            transfers
1163429  wema ussd transfer       transfers
The dictionary:
labels = {
    #transfer
    "tnf": "transfers", "trsf": "transfers", "trtr": "transfers", "trans": "transfers",
    #bills
    "otp": "bills", "fee": "bills", "charge": "bills",
    #airtime
    "recharge": "airtime", "airtime": "airtime", "top-up": "airtime",
}
Here is the code that does the check:
labs = []
# Labelling the transaction according to the dictionary defined
for i in data:
    f = 0
    #check if j is in data[i]
    for j in list(labels.keys()):
        if j in i:
            labs.append(labels[j])
            f = 1
            break
    if f == 0:
        labs.append("others")
df["Label"] = pd.DataFrame(labs)
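The main problem is that the code does not check for an exact match: a description like airtime recharge should be labelled airtime (it currently comes out as bills), while the dictionary key trans should still label a transfer as transfers.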
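The problem is that you are not checking for an exact match; you are only checking whether a substring occurs in the string. So 'charge' in 'recharge' returns True.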
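A minimal sketch of the difference, using one description from the question (the variable names are only illustrative):

```python
description = "alat airtime recharge"

# Substring test: 'charge' is found inside the word 'recharge'
substring_hit = "charge" in description

# Whole-word test: split into words first, then test list membership
word_hit = "charge" in description.split()

print(substring_hit)  # True
print(word_hit)       # False
```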
So you can either use a regular expression or simply split the description into a list of words and check each word against the keys.
It is not the most efficient approach, but you could do this:
import pandas as pd

df = pd.DataFrame([['alat airtime recharge'],
                   ['alat nip transfer'],
                   ['alat transfer'],
                   ['wema ussd transfer']], columns=['description'])

labels = {
    #transfer
    "tnf": "transfers", "trsf": "transfers", "trtr": "transfers", "trans": "transfers",
    #bills
    "otp": "bills", "fee": "bills", "charge": "bills",
    #airtime
    "recharge": "airtime", "airtime": "airtime", "top-up": "airtime",
}
labs = []
data = df['description']
# Label each transaction according to the dictionary defined above
for i in data:
    check_list = i.split()
    label = "others"  # default when no key matches any word
    for j in labels:
        # A key only counts if some word starts with it, so 'charge'
        # no longer matches inside 'recharge' but 'trans' still matches 'transfer'
        if any(x.startswith(j) for x in check_list):
            label = labels[j]
            break
    labs.append(label)
# Assign the list directly so the labels line up by position
df["Label"] = labs
Output:
print(df)
             description      Label
0  alat airtime recharge    airtime
1      alat nip transfer  transfers
2          alat transfer  transfers
3     wema ussd transfer  transfers
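The regular-expression alternative mentioned above could look roughly like the sketch below. It is one possible way to do it, not a drop-in replacement: label_description and pattern are hypothetical names, the "pos purchase" and "service charge" descriptions are made up for illustration, and the pattern assumes a key should match only at the start of a word.

```python
import re

labels = {
    "tnf": "transfers", "trsf": "transfers", "trtr": "transfers", "trans": "transfers",
    "otp": "bills", "fee": "bills", "charge": "bills",
    "recharge": "airtime", "airtime": "airtime", "top-up": "airtime",
}

# A word boundary followed by any of the keys; re.escape protects 'top-up'
pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in labels) + ")")

def label_description(text):
    """Hypothetical helper: label text by the first key that starts a word."""
    match = pattern.search(text)
    return labels[match.group(1)] if match else "others"

descriptions = [
    "alat airtime recharge",
    "alat nip transfer",
    "wema ussd transfer",
    "pos purchase",  # made-up example with no matching key
]
results = [label_description(d) for d in descriptions]
print(results)
```

With a DataFrame, the same helper can be applied with df["Label"] = df["description"].map(label_description). One difference from the loop version: when several keys match, the regex picks the key that appears earliest in the description, whereas the loop picks the key that comes first in the dictionary.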