Count unique IDs in a pandas column based on a nested dictionary



I have the following pandas DataFrame.

import numpy as np
import pandas as pd

d = {
    'id1': ['85643', '85644', '85643', '8564312', '8564314', '85645', '8564316', '85646', '8564318',
            '85647', '85648', '85649', '85655', '56731', '34566', '78931', '78931'],
    'ID': ['G-00001', 'G-00001', 'G-00002', 'G-00002', 'G-00002', 'G-00001', 'vaasd', 'aasd', 'aasd',
           'vaasd', 'aasd', 'vaasd', 'aasd', 'aasd2', 'aasd2', 'aasd2', 'aasd2'],
    'col1': [671, 2, 5, 3, 4, 5, 60, 0, 0, 6, 3, 2, 4, 32, 3, 1, 23],
    'Goal': [np.nan, 56, 78, np.nan, 89, 73, np.nan, np.nan, np.nan, np.nan, np.nan, 34, np.nan, 7, 84, np.nan, 5],
    'col2': [793, 4, 8, 32, 43, 55, 610, 0, 0, 16, 23, 72, 48, 3, 28, 5, 3],
    'col3': [500, 22, 89, 33, 44, 55, 60, 1, 5, 6, 3, 2, 4, 13, 12, 14, 98],
    'Date': ['2021-06-13', '2021-06-13', '2021-06-14', '2021-06-13', '2021-06-14', '2021-06-15',
             '2021-06-15', '2021-06-13', '2021-06-16', '2021-06-13', '2021-06-13', '2021-06-13',
             '2021-06-16', '2021-05-23', '2021-05-13', '2021-03-26', '2021-05-13'],
}
dff = pd.DataFrame(data=d)
dff
    id1      ID       col1  Goal     col2  col3  Date
0   85643    G-00001  671   NaN      793   500   2021-06-13
1   85644    G-00001  2     56.0000  4     22    2021-06-13
2   85643    G-00002  5     78.0000  8     89    2021-06-14
3   8564312  G-00002  3     NaN      32    33    2021-06-13
4   8564314  G-00002  4     89.0000  43    44    2021-06-14
5   85645    G-00001  5     73.0000  55    55    2021-06-15
6   8564316  vaasd    60    NaN      610   60    2021-06-15
7   85646    aasd     0     NaN      0     1     2021-06-13
8   8564318  aasd     0     NaN      0     5     2021-06-16
9   85647    vaasd    6     NaN      16    6     2021-06-13
10  85648    aasd     3     NaN      23    3     2021-06-13
11  85649    vaasd    2     34.0000  72    2     2021-06-13
12  85655    aasd     4     NaN      48    4     2021-06-16
13  56731    aasd2    32    7.0000   3     13    2021-05-23
14  34566    aasd2    3     84.0000  28    12    2021-05-13
15  78931    aasd2    1     NaN      5     14    2021-03-26
16  78931    aasd2    23    5.0000   3     98    2021-05-13

In addition, I have the following dictionary:

dic = {'G-0001': {'aasd': ['G-0001','85646','85648','345_2','85655','85659'],
                  'vaasd': ['G-0001','85649','34554','85655','22183','45335','8564316']},
       'G-0002': {'aasd2': ['G-0002','85343','78931','45121','56731']}}

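I want to get a count of unique id1 values based on the lists in the dictionary. For example, consider the list 'aasd2': ['G-0002','85343','78931','45121','56731']. I want to know how many unique id1 values from this list appear in the pandas DataFrame. For aasd2 the answer should be 2 ('78931' and '56731'), because those are the only two values from the aasd2 list that occur in the DataFrame.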
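To make the aasd2 example concrete, here is a minimal sketch of the count for that single list. It assumes only the id1 and ID columns matter, so the DataFrame is rebuilt with just those two columns; aasd2_list, mask and n_unique are illustrative names, not part of the original code.

```python
import pandas as pd

# Only the two columns relevant to the count, copied from the data above.
dff = pd.DataFrame({
    'id1': ['85643', '85644', '85643', '8564312', '8564314', '85645', '8564316', '85646', '8564318',
            '85647', '85648', '85649', '85655', '56731', '34566', '78931', '78931'],
    'ID': ['G-00001', 'G-00001', 'G-00002', 'G-00002', 'G-00002', 'G-00001', 'vaasd', 'aasd', 'aasd',
           'vaasd', 'aasd', 'vaasd', 'aasd', 'aasd2', 'aasd2', 'aasd2', 'aasd2'],
})

aasd2_list = ['G-0002', '85343', '78931', '45121', '56731']

# Rows whose ID is 'aasd2' and whose id1 appears in the aasd2 list.
mask = dff['ID'].eq('aasd2') & dff['id1'].isin(aasd2_list)

# '78931' occurs twice in the data but is counted once.
n_unique = dff.loc[mask, 'id1'].nunique()
print(n_unique)  # 2
```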

So I would like to create a table like the one below, with the list name and the count of IDs:

listName    count of ids
aasd            3
vaasd           2
aasd2           2

Is this possible in Python? Any suggestions would be much appreciated. Thanks in advance!

Create a DataFrame from the dict, merge it with dff, then use value_counts after drop_duplicates on ('id1', 'ID'):

data = []
for d in dic.values():
    for k, l in d.items():
        # one (id1, list name) pair for every entry of every inner list
        data.extend([(v, k) for v in l])
df1 = pd.DataFrame(data, columns=['id1', 'ID'])
out = (
    dff.merge(df1, on=['id1', 'ID'])
       .drop_duplicates(['id1', 'ID'])
       .value_counts('ID')
)

Output:

>>> out
ID
aasd     3
aasd2    2
vaasd    2
dtype: int64
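If the result should have the exact shape of the table in the question (a listName column and a count of ids column, in dictionary order), the merge idea can be extended as in the sketch below. It assumes only id1 and ID are needed from dff; pairs, list_names and table are illustrative names.

```python
import pandas as pd

# Only the two columns relevant to the count, copied from the question.
dff = pd.DataFrame({
    'id1': ['85643', '85644', '85643', '8564312', '8564314', '85645', '8564316', '85646', '8564318',
            '85647', '85648', '85649', '85655', '56731', '34566', '78931', '78931'],
    'ID': ['G-00001', 'G-00001', 'G-00002', 'G-00002', 'G-00002', 'G-00001', 'vaasd', 'aasd', 'aasd',
           'vaasd', 'aasd', 'vaasd', 'aasd', 'aasd2', 'aasd2', 'aasd2', 'aasd2'],
})
dic = {'G-0001': {'aasd': ['G-0001', '85646', '85648', '345_2', '85655', '85659'],
                  'vaasd': ['G-0001', '85649', '34554', '85655', '22183', '45335', '8564316']},
       'G-0002': {'aasd2': ['G-0002', '85343', '78931', '45121', '56731']}}

# Flatten the nested dict into (id1, list name) pairs.
pairs = [(v, name) for inner in dic.values() for name, ids in inner.items() for v in ids]
df1 = pd.DataFrame(pairs, columns=['id1', 'ID'])

# Keep only (id1, ID) pairs present in both frames, then count unique id1 per ID.
counts = (
    dff[['id1', 'ID']]
    .drop_duplicates()
    .merge(df1.drop_duplicates(), on=['id1', 'ID'])
    .groupby('ID')['id1']
    .nunique()
)

# Follow the dictionary order; lists with no match get a count of 0.
list_names = [name for inner in dic.values() for name in inner]
table = (
    counts.reindex(list_names, fill_value=0)
    .rename_axis('listName')
    .reset_index(name='count of ids')
)
print(table)
```

Using reindex with fill_value=0 is a design choice: a list whose IDs never occur in dff still shows up in the table instead of silently disappearing.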

Here is another approach:

data = []
for d in dic.values():
    for k, l in d.items():
        data.extend([(v, k) for v in l])
# keep rows whose (id1, ID) pair appears in the dictionary, then count unique id1 per ID
out = dff[dff[['id1', 'ID']].apply(tuple, axis=1).isin(data)].groupby("ID")["id1"].nunique()

One option is to run (almost) the entire process in plain Python before building the final DataFrame:

from collections import ChainMap
# merge the inner dicts into a single {list name: ids} dictionary
content = dict(ChainMap(*dic.values()))
uniq = dff.id1.unique()
# for each list, count how many of its entries occur anywhere in dff.id1
content = [(key, len(set(val).intersection(uniq)))
           for key, val in content.items()]
pd.DataFrame(content, columns = ['listName', 'count of Ids'])
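
Output:

  listName  count of Ids
0    aasd2             2
1     aasd             3
2    vaasd             3

Note that this intersects each list with every id1 in dff, regardless of the ID column, so vaasd comes out as 3 rather than the 2 in the requested table: 85655 is in the vaasd list but appears in dff under aasd.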
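If the count should only consider rows whose ID matches the list name, as in the table requested in the question, the same set-based idea can be applied per ID group. This is only a sketch under that assumption; it rebuilds dff with just the id1 and ID columns, and ids_by_group, rows and result are illustrative names.

```python
from collections import ChainMap

import pandas as pd

# Only the two columns relevant to the count, copied from the question.
dff = pd.DataFrame({
    'id1': ['85643', '85644', '85643', '8564312', '8564314', '85645', '8564316', '85646', '8564318',
            '85647', '85648', '85649', '85655', '56731', '34566', '78931', '78931'],
    'ID': ['G-00001', 'G-00001', 'G-00002', 'G-00002', 'G-00002', 'G-00001', 'vaasd', 'aasd', 'aasd',
           'vaasd', 'aasd', 'vaasd', 'aasd', 'aasd2', 'aasd2', 'aasd2', 'aasd2'],
})
dic = {'G-0001': {'aasd': ['G-0001', '85646', '85648', '345_2', '85655', '85659'],
                  'vaasd': ['G-0001', '85649', '34554', '85655', '22183', '45335', '8564316']},
       'G-0002': {'aasd2': ['G-0002', '85343', '78931', '45121', '56731']}}

# Merge the inner dicts into a single {list name: ids} dictionary.
content = dict(ChainMap(*dic.values()))

# Set of id1 values seen under each ID in the DataFrame.
ids_by_group = {name: set(group) for name, group in dff.groupby('ID')['id1']}

# Intersect each list only with the id1 values of the matching ID group.
rows = [
    (name, len(set(ids) & ids_by_group.get(name, set())))
    for name, ids in content.items()
]
result = pd.DataFrame(rows, columns=['listName', 'count of ids'])
print(result)
```

With the data from the question this gives 3 for aasd, 2 for vaasd and 2 for aasd2.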
