Is there a better way to summarize table information in pandas than using iloc and append with nested loops?



I have a table of grant application numbers and their corresponding disciplines, where a cell value of 1 marks the discipline.

import pandas as pd
import numpy as np
data={'Application number':[0,1,2,3,4,5,6,7,8,9],
      'Physics':[0,0,1,0,0,0,0,1,1,0],
     'Chemistry':[1,0,0,0,0,0,0,0,0,1],
     'Biology':[0,1,0,1,0,0,1,0,0,0],
     'Mathematics':[0,0,0,0,1,1,0,0,0,0]}
#creation of dataframe
df=pd.DataFrame(data)
#column counting all disciplines per grant
df['All_Discipline_count']=df.loc[:,'Physics' : 'Mathematics'].sum(axis=1)
df.head(10)

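I want to summarize, for each grant application, the list of disciplines and the number of disciplines. I implemented this with iloc and several nested loops.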

# Creation of resulting dataframe
dfA = pd.DataFrame(columns = ['Application number', 'Discipline_list', 'All_Discipline_count'])
# Note how iloc selects a cell: 'Application number' is column 0.
i=0 #starts from the 0th row
j=1 #starts from the 1st column
Aanvraag_nummer=0
k=df.columns.get_loc("All_Discipline_count") #column number where the All_Discipline_count is
l=len(df.index)#number of rows
for i in range (0,l):
    Discipline_count=0 #introducing zero discipline count
    Discipline_list=" " #introducing empty discipline list
    for j in range (1,k): #counting columns of disciplines
        if (df.iloc[i,j]==1) & (Discipline_count<df.iloc[i,k]): #if the given cell has 1 as value
            Discipline_list=Discipline_list+ df.columns[j] #adds a column name to discipline list
            Discipline_count+=1 #counts the number of disciplines with 1 as value
            if Discipline_count==df.iloc[i,k]:#if all disciplines are counted
                Aanvraag_nummer=df.iloc[i,0]
                new_row = {'Application number':Aanvraag_nummer, 'Discipline_list':Discipline_list, 'All_Discipline_count':df.iloc[i,k]}
                dfA = dfA.append(new_row, ignore_index=True)
dfA.head(10)

The script works for 10 to 100 applications with 20 disciplines as columns. It also works when a grant application has more than one discipline.

However, I noticed that I get a warning when running the code.

/tmp/ipykernel_26718/1290491379.py:19: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.

The code is also slow. Is there a better way to get the same result?

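In this case you only need pandas' idxmax (DataFrame.idxmax). Note: this only works when each row has a single discipline. If a row has several, it picks only the first one it finds.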

# iloc skips the 'Application number' column, so idxmax only looks at the columns after it
df['Application Count'] = df.iloc[:,1:].idxmax(axis=1)
# or, as an alternative to the line above (not in addition to it):
df['Application Count'] = df.set_index('Application number').idxmax(axis=1)
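
To make the caveat concrete, here is a minimal sketch on a made-up three-row frame. The `toy` data and the names `first_found` and `masked` are illustrative assumptions, not part of the question. It shows that idxmax returns only the first column holding the row maximum, and that a row of all zeros still receives a label unless it is masked out.

```python
import pandas as pd

# Hypothetical toy data: application 1 has two disciplines, application 2 has none
toy = pd.DataFrame({
    'Application number': [0, 1, 2],
    'Physics':   [0, 1, 0],
    'Chemistry': [1, 0, 0],
    'Biology':   [0, 1, 0],
})

flags = toy.set_index('Application number')

# idxmax returns the first column that holds the row maximum
first_found = flags.idxmax(axis=1)
print(first_found.tolist())  # ['Chemistry', 'Physics', 'Physics']

# Replace the label with NaN where the row has no discipline at all
has_any = flags.sum(axis=1) > 0
masked = first_found.where(has_any)
print(masked)
```

Application 1 loses 'Biology', and application 2 would be labelled 'Physics' without the mask, so idxmax is only a safe shortcut when every row has exactly one discipline.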

If you have several disciplines per row, here is another solution. I changed the input a little for this: rows 0 and 8 have 2 disciplines, and row 4 has 4.

data={'Application number':[0,1,2,3,4,5,6,7,8,9],
      'Physics':[1,0,1,0,1,0,0,1,1,0],
     'Chemistry':[1,0,0,0,1,0,0,0,0,1],
     'Biology':[0,1,0,1,1,0,1,0,0,0],
     'Mathematics':[0,0,0,0,1,1,0,0,1,0]}
df=pd.DataFrame(data)
print(df)
   Application number  Physics  Chemistry  Biology  Mathematics
0                   0        1          1        0            0
1                   1        0          0        1            0
2                   2        1          0        0            0
3                   3        0          0        1            0
4                   4        1          1        1            1
5                   5        0          0        0            1
6                   6        0          0        1            0
7                   7        1          0        0            0
8                   8        1          0        0            1
9                   9        0          1        0            0
df = df.set_index('Application number')
out = (
    df[df==1]
    .stack()
    .reset_index()
    .drop(0, axis=1)
    .rename(columns={'level_1': 'Discipline_list'})
    .groupby('Application number', as_index=False)
    .agg(Discipline_list=('Discipline_list', lambda x: ', '.join(x)), All_Discipline_count=('Discipline_list', 'count'))
)
print(out)
   Application number                           Discipline_list  All_Discipline_count
0                   0                        Physics, Chemistry                     2
1                   1                                   Biology                     1
2                   2                                   Physics                     1
3                   3                                   Biology                     1
4                   4  Physics, Chemistry, Biology, Mathematics                     4
5                   5                               Mathematics                     1
6                   6                                   Biology                     1
7                   7                                   Physics                     1
8                   8                      Physics, Mathematics                     2
9                   9                                 Chemistry                     1
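
The same summary can also be reached through melt, which some readers may find easier to follow than the stack chain, and which does not rely on stack dropping the missing values. This is a sketch of an alternative, not the method used above; the names `df_multi`, `long_form`, `summary`, `Discipline` and `flag` are assumptions chosen for illustration.

```python
import pandas as pd

# Same modified input as above (rows 0 and 8 have 2 disciplines, row 4 has 4)
data = {'Application number': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        'Physics':     [1, 0, 1, 0, 1, 0, 0, 1, 1, 0],
        'Chemistry':   [1, 0, 0, 0, 1, 0, 0, 0, 0, 1],
        'Biology':     [0, 1, 0, 1, 1, 0, 1, 0, 0, 0],
        'Mathematics': [0, 0, 0, 0, 1, 1, 0, 0, 1, 0]}
df_multi = pd.DataFrame(data)

# One row per (application, discipline) pair, with the 0/1 marker in 'flag'
long_form = df_multi.melt(id_vars='Application number',
                          var_name='Discipline', value_name='flag')

# Keep only the marked pairs, then collect names and counts per application
summary = (
    long_form[long_form['flag'] == 1]
    .groupby('Application number', as_index=False)
    .agg(Discipline_list=('Discipline', lambda s: ', '.join(s)),
         All_Discipline_count=('Discipline', 'count'))
)
print(summary)
```

As with the stack version, an application with no discipline at all would not appear in the result, because its rows are filtered out before grouping.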

You can build Discipline_list with the following code:

df.loc[:,'Physics' : 'Mathematics'].apply(lambda x: '/'.join(x[x > 0].index), axis=1)

Output:

0      Chemistry
1        Biology
2        Physics
3        Biology
4    Mathematics
5    Mathematics
6        Biology
7        Physics
8        Physics
9      Chemistry
dtype: object

If a row has more than one discipline, the names are joined with "/" in column order, for example "Physics/Biology".


Assign the result to Discipline_list:

df['Discipline_list'] = df.loc[:,'Physics' : 'Mathematics'].apply(lambda x: '/'.join(x[x > 0].index), axis=1)
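
Combined with the count column from the question, this replaces the whole nested loop and never calls DataFrame.append, so the FutureWarning does not appear. A minimal sketch, assuming the discipline columns are contiguous from 'Physics' to 'Mathematics' as in the question; the names `disciplines` and `result` are hypothetical.

```python
import pandas as pd

# Original input from the question
data = {'Application number': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        'Physics':     [0, 0, 1, 0, 0, 0, 0, 1, 1, 0],
        'Chemistry':   [1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
        'Biology':     [0, 1, 0, 1, 0, 0, 1, 0, 0, 0],
        'Mathematics': [0, 0, 0, 0, 1, 1, 0, 0, 0, 0]}
df = pd.DataFrame(data)

# Assumption: every discipline column lies between these two labels
disciplines = df.loc[:, 'Physics':'Mathematics']

result = pd.DataFrame({
    'Application number': df['Application number'],
    # Join the names of the columns marked with 1 in each row
    'Discipline_list': disciplines.apply(lambda row: '/'.join(row[row > 0].index), axis=1),
    # Count the marked columns in each row
    'All_Discipline_count': disciplines.sum(axis=1),
})
print(result)
```

One difference from the loop in the question: an application with no discipline is kept here with an empty list and a count of 0, whereas the loop never appends a row for it. Filtering with `result[result['All_Discipline_count'] > 0]` would restore the original behaviour.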
