How to separate data by class label using pandas and Python



I have a DataFrame with class labels, and I want to separate the rows that belong to each class label. The code is given below:

import pandas as pd
df = [[0.572,0.845,-1.616,-0.827,-0.158,-0.097,0],
[0.572,0.845,-1.616,-0.827,-0.158,-0.097,2],
[0.572,0.845,-1.616,-0.827,-0.158,-0.097,1],
[0.572,0.845,-1.616,-0.827,-0.158,-0.097,2],
[0.572,0.845,-1.616,-0.827,-0.158,-0.097,3],
[0.572,0.845,-1.616,-0.827,-0.158,-0.097,0],
[0.572,0.845,-1.616,-0.827,-0.158,-0.097,1]]
df = pd.DataFrame(df, columns=["a","b","c","d","e","f","class_label"])
l = list(set(df["class_label"]))
ls = list(df["class_label"])
for l in l:
    for n,ls in enumerate(ls):
      if l == ls:
            print df[n:n+1]

The program prints the rows of the first class and then terminates with the error shown below:

       a      b      c      d      e      f  class_label
0  0.572  0.845 -1.616 -0.827 -0.158 -0.097            0
       a      b      c      d      e      f  class_label
5  0.572  0.845 -1.616 -0.827 -0.158 -0.097            0
Traceback (most recent call last):
  File "sample.py", line 19, in <module>
    for n,ls in enumerate(ls):
TypeError: 'numpy.int64' object is not iterable

The expected output is:

class_1
0.572,0.845,-1.616,-0.827,-0.158,-0.097,0
0.572,0.845,-1.616,-0.827,-0.158,-0.097,0
class_2
0.572,0.845,-1.616,-0.827,-0.158,-0.097,1
0.572,0.845,-1.616,-0.827,-0.158,-0.097,1
class_3
0.572,0.845,-1.616,-0.827,-0.158,-0.097,2
0.572,0.845,-1.616,-0.827,-0.158,-0.097,2
class_4
0.572,0.845,-1.616,-0.827,-0.158,-0.097,3

I think you need to loop over the output of groupby on the class_label column:

# i is the class label, g is the sub-DataFrame holding that label's rows
for i, g in df.groupby('class_label'):
    print('class_' + str(i + 1))
    print(g)
class_1
       a      b      c      d      e      f  class_label
0  0.572  0.845 -1.616 -0.827 -0.158 -0.097            0
5  0.572  0.845 -1.616 -0.827 -0.158 -0.097            0
class_2
       a      b      c      d      e      f  class_label
2  0.572  0.845 -1.616 -0.827 -0.158 -0.097            1
6  0.572  0.845 -1.616 -0.827 -0.158 -0.097            1
class_3
       a      b      c      d      e      f  class_label
1  0.572  0.845 -1.616 -0.827 -0.158 -0.097            2
3  0.572  0.845 -1.616 -0.827 -0.158 -0.097            2
class_4
       a      b      c      d      e      f  class_label
4  0.572  0.845 -1.616 -0.827 -0.158 -0.097            3    
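
The TypeError in the question comes from reusing names: `for n,ls in enumerate(ls)` rebinds `ls` to each element in turn, so after the first pass `ls` is a single integer rather than the list, and the next outer iteration tries to enumerate that integer. A minimal sketch that avoids the clash and keeps each class in its own DataFrame follows. The dict name `frames` and the CSV-style printing are illustrative assumptions, not part of the question.

```python
import pandas as pd

# Rebuild the question's data: seven identical rows with different labels
row = [0.572, 0.845, -1.616, -0.827, -0.158, -0.097]
labels = [0, 2, 1, 2, 3, 0, 1]
df = pd.DataFrame([row + [lab] for lab in labels],
                  columns=["a", "b", "c", "d", "e", "f", "class_label"])

# One DataFrame per class label, keyed by "class_<label + 1>"
frames = {"class_" + str(label + 1): group
          for label, group in df.groupby("class_label")}

for name, group in frames.items():
    print(name)
    # Comma-separated rows without header or index, close to the requested layout
    print(group.to_csv(header=False, index=False), end="")
```

Each value in `frames` is an ordinary DataFrame that keeps its original row index, so `frames["class_1"]` can be processed further on its own.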

If you need the output as a single DataFrame and the original index is not important:

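# make class_label the index and sort so rows of the same class are adjacent
df = df.set_index('class_label').sort_index()
print(df)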
                 a      b      c      d      e      f
class_label                                          
0            0.572  0.845 -1.616 -0.827 -0.158 -0.097
0            0.572  0.845 -1.616 -0.827 -0.158 -0.097
1            0.572  0.845 -1.616 -0.827 -0.158 -0.097
1            0.572  0.845 -1.616 -0.827 -0.158 -0.097
2            0.572  0.845 -1.616 -0.827 -0.158 -0.097
2            0.572  0.845 -1.616 -0.827 -0.158 -0.097
3            0.572  0.845 -1.616 -0.827 -0.158 -0.097
print(['class_' + str(x + 1) for x in df.index])
['class_1', 'class_1', 'class_2', 'class_2', 'class_3', 'class_3', 'class_4']
# change index
df.index = ['class_' + str(x + 1) for x in df.index]
print(df)
             a      b      c      d      e      f
class_1  0.572  0.845 -1.616 -0.827 -0.158 -0.097
class_1  0.572  0.845 -1.616 -0.827 -0.158 -0.097
class_2  0.572  0.845 -1.616 -0.827 -0.158 -0.097
class_2  0.572  0.845 -1.616 -0.827 -0.158 -0.097
class_3  0.572  0.845 -1.616 -0.827 -0.158 -0.097
class_3  0.572  0.845 -1.616 -0.827 -0.158 -0.097
class_4  0.572  0.845 -1.616 -0.827 -0.158 -0.097

If the original index is important, you need to work with a MultiIndex:

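# start again from the original df, where class_label is still a column
df = df.set_index(['class_label'], append=True).sort_index(level=1)
# put class_label first and the original row number second
df.index = df.index.swaplevel(0, 1)
print(df)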
                   a      b      c      d      e      f
class_label                                            
0           0  0.572  0.845 -1.616 -0.827 -0.158 -0.097
            5  0.572  0.845 -1.616 -0.827 -0.158 -0.097
1           2  0.572  0.845 -1.616 -0.827 -0.158 -0.097
            6  0.572  0.845 -1.616 -0.827 -0.158 -0.097
2           1  0.572  0.845 -1.616 -0.827 -0.158 -0.097
            3  0.572  0.845 -1.616 -0.827 -0.158 -0.097
3           4  0.572  0.845 -1.616 -0.827 -0.158 -0.097
names = df.index.get_level_values('class_label').tolist()
print(['class_' + str(x + 1) for x in names])
['class_1', 'class_1', 'class_2', 'class_2', 'class_3', 'class_3', 'class_4']
# change MultiIndex
new_index = list(zip(['class_' + str(x + 1) for x in names], df.index.get_level_values(1)))
df.index = pd.MultiIndex.from_tuples(new_index, names=df.index.names)
print(df)
                   a      b      c      d      e      f
class_label                                            
class_1     0  0.572  0.845 -1.616 -0.827 -0.158 -0.097
            5  0.572  0.845 -1.616 -0.827 -0.158 -0.097
class_2     2  0.572  0.845 -1.616 -0.827 -0.158 -0.097
            6  0.572  0.845 -1.616 -0.827 -0.158 -0.097
class_3     1  0.572  0.845 -1.616 -0.827 -0.158 -0.097
            3  0.572  0.845 -1.616 -0.827 -0.158 -0.097
class_4     4  0.572  0.845 -1.616 -0.827 -0.158 -0.097
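
Once the class names form the first index level, the rows of one class can be selected by that name. The sketch below rebuilds the frame from the question's data so that it runs on its own. It uses `DataFrame.swaplevel` and `pd.MultiIndex.from_arrays` as a shorter route to the same index, which is a substitution for the steps above rather than the answer's exact method, and the variable name `class_3` is illustrative.

```python
import pandas as pd

row = [0.572, 0.845, -1.616, -0.827, -0.158, -0.097]
labels = [0, 2, 1, 2, 3, 0, 1]
df = pd.DataFrame([row + [lab] for lab in labels],
                  columns=["a", "b", "c", "d", "e", "f", "class_label"])

# Two-level index: class label first, original row number second
df = df.set_index("class_label", append=True).swaplevel(0, 1).sort_index()
names = ["class_" + str(x + 1)
         for x in df.index.get_level_values("class_label")]
df.index = pd.MultiIndex.from_arrays(
    [names, df.index.get_level_values(1)], names=df.index.names)

# Select every row of one class; the result is indexed by the original row number
class_3 = df.loc["class_3"]
print(class_3)
```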

Depending on what you want to do with the data, groupby may be useful:

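# df is the original frame, with class_label as a regular column
grouped = df.groupby("class_label")
# statistic names passed as strings; pandas resolves them to its groupby methods
grouped.aggregate(["min", "mean", "max", "std"])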
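
To make the shape of that result concrete, here is a small self-contained sketch on the question's data. The variable names `stats` and `class_label_2` are illustrative assumptions. The aggregated frame has one row per class label and two-level columns (original column, statistic), and `get_group` returns the raw rows of a single class.

```python
import pandas as pd

row = [0.572, 0.845, -1.616, -0.827, -0.158, -0.097]
labels = [0, 2, 1, 2, 3, 0, 1]
df = pd.DataFrame([row + [lab] for lab in labels],
                  columns=["a", "b", "c", "d", "e", "f", "class_label"])

grouped = df.groupby("class_label")
stats = grouped.aggregate(["min", "mean", "max", "std"])

# Statistics for column "a" only: one row per class label
print(stats["a"])

# Raw rows of the class whose label is 2
class_label_2 = grouped.get_group(2)
print(class_label_2)
```

Because label 3 has a single row, its sample standard deviation is undefined and pandas reports it as NaN.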
