更短、更高效的熊猫代码,用于基于累积的数据选择和基于列的数据选择



以下是要求。

共有两个表:brand_df(品牌价值(和score_df(包含每个品牌的受试者得分(。【生成以下样本】

brand_dict = {'Brand': ['a', 'b', 'c', 'g', 'e', 'i', 'd', 'h', 'f'], 'Value': [2, 1, 3, 1, 4, 1, 3, 2, 3]}
brand_df = pd.DataFrame(brand_dict, columns=['Brand', 'Value'])
score_dict = {'a' : [4,4,3,4,0], 'b':[0,2,0,1,0], 'c':[4,0,0,4,3], 'd':[3,2,0,3,1], 'e':[0,0,2,1,0], 'f':[2,0,3,0,0], 'g':[2,3,0,0,1], 'h':[3,0,0,1,3], 'i':[0,3,3,1,0] }
score_df = pd.DataFrame(score_dict, columns=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'])
score_df.index = ['sub1', 'sub2', 'sub3', 'sub4', 'sub5']
# brand_df output
#   Brand  Value
# 0     a      2
# 1     b      1
# 2     c      3
# 3     g      1
# 4     e      4
# 5     i      1
# 6     d      3
# 7     h      2
# 8     f      3
# score_df output 
#       a  b  c  d  e  f  g  h  i
# sub1  4  0  4  3  0  2  2  3  0
# sub2  4  2  0  2  0  0  3  0  3
# sub3  3  0  0  0  2  3  0  0  3
# sub4  4  1  4  3  1  0  0  1  1
# sub5  0  0  3  1  0  0  1  3  0

正在进行的操作:-

  1. 只挑选占累计价值75%的顶级品牌
brand_df.sort_values("Value", axis=0, ascending=False, inplace=True, na_position='first')  # ordered Desc to get biggest brands
brand_df['cum_percent'] = (brand_df.Value.cumsum() / brand_df.Value.sum()) * 100
# brand_df Output
# Brand  Value  cum_percent
#    e      4         20.0
#    c      3         35.0
#    d      3         50.0
#    f      3         65.0
#    a      2         75.0
#    h      2         85.0
#    b      1         90.0
#    g      1         95.0
#    i      1        100.0
selbrand = []
for index, row in brand_df.iterrows():
if row['cum_percent'] <= 75:
selbrand.append(row['Brand'])
# selbrand output
# ['e', 'c', 'd', 'f', 'a']
  1. 选择75%所选品牌得分(即>0(的受试者
# Setting threshold for subject selection. [75% percent of the selected brands must have a score] 
threshold = math.trunc(len(selbrand)*0.75) # In this case 75% of 5 is 3.75 and hence rounded to 3
selsub = []
for index, row in score_df.iterrows():
count = 0
for col in selbrand:
if row[col] > 0:
count += 1
if count >= threshold:
selsub.append(index)

我使用上面的脚本得到了下面的输出。

print(selsub)
['sub1', 'sub3', 'sub4']  # Only these subjects have score for at least 3 of the selected brands

实现这一目标的更有效或更短的方法是什么

到目前为止,您创建的cum_percent很好。下一步是删除两个循环:

selbrand = brand_df.loc[brand_df['cum_percent']<=75,'Brand']
# I tend to not use `math` package
threshold = int(len(selbrand)*0.75)
# extract the brands with [selbrand]
# compare with 0 and count along the rows
# then compare with threshold
s = score_df[selbrand].gt(0).sum(1) >= threshold
selsub = s[s].index
# print(selsub)
Index(['sub1', 'sub3', 'sub4'], dtype='object')

最新更新