为每个ID映射2df但列到值而不是值到值



我有一个包含前3个原因的表(表1)和另一个包含每个变量所属类别的表(表2)。我正在尝试将类别箱匹配到表3中的原因表中。

Table 1: top 3 reasons, basically which 3 variable was the most important for that Register number
RegNo    Reason1  Reason2 Reason3
1111         v3       v2     v6
2222         v2       v3     v5
3333         v3       v2     v6
4444         v3       v6     v2
5555         v3       v2     v5

Table 2: The category bin for each variable 
RegNo           v2              v3        v4          v5                 v6
1111    (0.44, 0.64]    (0.0, 60.0]     missing    (-inf, 3.0]   (102.0, 175.0]
2222    (-inf, 0.33]    (0.0, 60.0]     welldone   missing       missing
3333    (0.44, 0.64]    (0.0, 60.0]     rare       missing       missing
4444    (0.44, 0.64]    (0.0, 60.0]     missing    missing       (20.0, 102.0]
5555     (0.64, inf]    (0.0, 60.0]     missing    missing       (-inf, 20.0]

Table 3: For each ID, find category bin label and replace reason1,2 & 3 with the labels
RegNo            Reason1         Reason2         Reason3
1111         (0.0, 60.0]    (0.44, 0.64]     (102.0, 175.0]
2222         (-inf, 0.33]    (0.0, 60.0]       missing
3333         (0.0, 60.0]    (0.44, 0.64]       missing
4444         (0.0, 60.0]    (20.0, 102.0]    (0.44, 0.64]
5555         (0.0, 60.0]     (0.64, inf]        missing

方法

  • join()
  • 的方式索引两个数据帧
  • 则是join
  • 输出的每个原因的pd.concat()
df1 = pd.read_csv(io.StringIO("""      RegNo    Reason1  Reason2 Reason3
1111         v3       v2     v6
2222         v2       v3     v5
3333         v3       v2     v6
4444         v3       v6     v2
5555         v3       v2     v5
"""), sep="s+")
df2 = pd.read_csv(io.StringIO("""    RegNo           v2              v3        v4          v5                 v6
1111    (0.44, 0.64]    (0.0, 60.0]     missing    (-inf, 3.0]   (102.0, 175.0]
2222    (-inf, 0.33]    (0.0, 60.0]     welldone   missing       missing
3333    (0.44, 0.64]    (0.0, 60.0]     rare       missing       missing
4444    (0.44, 0.64]    (0.0, 60.0]     missing    missing       (20.0, 102.0]
5555     (0.64, inf]    (0.0, 60.0]     missing    missing       (-inf, 20.0]
"""),sep="ss+", engine="python")
# index main df
df1 = df1.set_index("RegNo")
# reshape and index reasons
dfm = df2.set_index("RegNo").stack()
df3 = pd.concat([
# add apprpriate column into index,  don't want othert columns
(df1.set_index(c, append="True").loc[:,[]]
# now it's a straight forward join
.join(dfm.rename_axis(["RegNo",c]).to_frame())
# cleanup index and rename columns of joined DF
.droplevel(1).rename(columns={0:c}))
for c in ["Reason1","Reason2","Reason3"]], axis=1)
Reason31111"text-align:左;"(0.0,60.0)"text-align:左;"(0.44,0.64)"text-align:左;"(102.0,175.0)(负无穷,0.33](0.0,60.0)3333"text-align:左;"(0.0,60.0)"text-align:左;"(0.44,0.64)4444"text-align:左;"(0.0,60.0)"text-align:左;"(20.0,102.0)"text-align:左;"(0.44,0.64)5555"text-align:左;"(0.0,60.0)"text-align:左;"(0.64,正]

相关内容

  • 没有找到相关文章

最新更新