我在pandas数据帧中有这种格式的数据:
Customer_ID Location_ID
Alpha A
Alpha B
Alpha C
Beta A
Beta B
Beta D
我想研究客户的流动模式。我的目标是确定客户最常光顾的地点集群。我认为以下矩阵可以提供这样的信息:
A B C D
A 0 2 1 1
B 2 0 1 1
C 1 1 0 0
D 1 1 0 0
在Python中如何做到这一点?
我的数据集相当大(数十万客户和大约一百个位置)。
这里有一种方法考虑了访问的多样性(例如,如果客户X同时访问LocA和LocB两次,他将为最终矩阵中的相应位置贡献2
)。
理念:
- 对于每个地点,按客户统计访问次数
- 对于每个位置对,找到访问过这两个位置的每个客户的最小访问次数的总和
- 使用
unstack
和清理
Counter
在这里发挥得很好,因为计数器支持许多自然算术运算,如add
、max
等。
import pandas as pd
from collections import Counter
from itertools import product
df = pd.DataFrame({
'Customer_ID': ['Alpha', 'Alpha', 'Alpha', 'Beta', 'Beta'],
'Location_ID': ['A', 'B', 'C', 'A', 'B'],
})
ctrs = {location: Counter(gp.Customer_ID) for location, gp in df.groupby('Location_ID')}
# In [7]: q.ctrs
# Out[7]:
# {'A': Counter({'Alpha': 1, 'Beta': 1}),
# 'B': Counter({'Alpha': 1, 'Beta': 1}),
# 'C': Counter({'Alpha': 1})}
ctrs = list(ctrs.items())
overlaps = [(loc1, loc2, sum(min(ctr1[k], ctr2[k]) for k in ctr1))
for i, (loc1, ctr1) in enumerate(ctrs, start=1)
for (loc2, ctr2) in ctrs[i:] if loc1 != loc2]
overlaps += [(l2, l1, c) for l1, l2, c in overlaps]
df2 = pd.DataFrame(overlaps, columns=['Loc1', 'Loc2', 'Count'])
df2 = df2.set_index(['Loc1', 'Loc2'])
df2 = df2.unstack().fillna(0).astype(int)
# Count
# Loc2 A B C
# Loc1
# A 0 2 1
# B 2 0 1
# C 1 1 0
如果要忽略乘法,请将Counter(gp.Customer_ID)
替换为Counter(set(gp.Customer_ID))
。
我相信还有一种更优雅的方法,但这是我在运行中提出的解决方案。基本上,你为每个客户建立一个邻接列表,然后相应地更新邻接矩阵:
import pandas as pd
#I'm assuming you can get your data into a pandas data frame:
data = {'Customer_ID':[1,1,1,2,2],'Location':['A','B','C','A','B']}
df = pd.DataFrame(data)
#Initialize an empty matrix
matrix_size = len(df.groupby('Location'))
matrix = [[0 for col in range(matrix_size)] for row in range(matrix_size)]
#To make life easier, I made a map to go from locations
#to row/col positions in the matrix
location_set = list(set(df['Location'].tolist()))
location_set.sort()
location_map = dict(zip(location_set,range(len(location_set))))
#Group data by customer, and create an adjacency list (dyct) for each
#Update the matrix accordingly
for name,group in df.groupby('Customer_ID'):
locations = set(group['Location'].tolist())
dyct = {}
for i in locations:
dyct[i] = list(locations.difference(i))
#Loop through the adjacency list and update matrix
for node, edges in dyct.items():
for edge in edges:
matrix[location_map[edge]][location_map[node]] +=1