I have a DataFrame with a class label column, and I want to separate the rows that belong to each class label. The code is given below:
import pandas as pd
df = [[0.572,0.845,-1.616,-0.827,-0.158,-0.097,0],
      [0.572,0.845,-1.616,-0.827,-0.158,-0.097,2],
      [0.572,0.845,-1.616,-0.827,-0.158,-0.097,1],
      [0.572,0.845,-1.616,-0.827,-0.158,-0.097,2],
      [0.572,0.845,-1.616,-0.827,-0.158,-0.097,3],
      [0.572,0.845,-1.616,-0.827,-0.158,-0.097,0],
      [0.572,0.845,-1.616,-0.827,-0.158,-0.097,1]]
df = pd.DataFrame(df, columns=["a","b","c","d","e","f","class_label"])
l = list(set(df["class_label"]))
ls = list(df["class_label"])
for l in l:
    for n,ls in enumerate(ls):
        if l == ls:
            print(df[n:n+1])
The program terminates with the error shown below:
a b c d e f class_label
0 0.572 0.845 -1.616 -0.827 -0.158 -0.097 0
a b c d e f class_label
5 0.572 0.845 -1.616 -0.827 -0.158 -0.097 0
Traceback (most recent call last):
  File "sample.py", line 19, in <module>
    for n,ls in enumerate(ls):
TypeError: 'numpy.int64' object is not iterable
The expected output is:
class_1
0.572,0.845,-1.616,-0.827,-0.158,-0.097,0
0.572,0.845,-1.616,-0.827,-0.158,-0.097,0
class_2
0.572,0.845,-1.616,-0.827,-0.158,-0.097,1
0.572,0.845,-1.616,-0.827,-0.158,-0.097,1
class_3
0.572,0.845,-1.616,-0.827,-0.158,-0.097,2
0.572,0.845,-1.616,-0.827,-0.158,-0.097,2
class_4
0.572,0.845,-1.616,-0.827,-0.158,-0.097,3
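On the error itself: the loops reuse l and ls both as the lists and as the loop variables. After the first pass of the inner loop, ls no longer refers to the list but to a single label (a numpy.int64 according to the traceback), so the next enumerate(ls) fails. A minimal sketch of the same loop with distinct names follows; the names rows, labels, values, label, value and matches are my own and not from the question.

```python
import pandas as pd

# Same data as in the question; only the class label differs per row.
rows = [[0.572, 0.845, -1.616, -0.827, -0.158, -0.097, c]
        for c in [0, 2, 1, 2, 3, 0, 1]]
df = pd.DataFrame(rows, columns=["a", "b", "c", "d", "e", "f", "class_label"])

labels = sorted(set(df["class_label"]))   # unique labels
values = list(df["class_label"])          # label of every row, in row order

matches = {}                              # label -> positions of matching rows
for label in labels:                      # loop variable no longer overwrites the list
    for n, value in enumerate(values):
        if label == value:
            matches.setdefault(label, []).append(n)
            print(df[n:n + 1])
```

This only removes the name clash; it still scans every row once per label.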
I think you need to loop over the output of groupby on the class_label column:
for i, g in df.groupby('class_label'):
    # i is the label value, g is the sub-DataFrame for that label
    print('class_' + str(i + 1))
    print(g)
class_1
a b c d e f class_label
0 0.572 0.845 -1.616 -0.827 -0.158 -0.097 0
5 0.572 0.845 -1.616 -0.827 -0.158 -0.097 0
class_2
a b c d e f class_label
2 0.572 0.845 -1.616 -0.827 -0.158 -0.097 1
6 0.572 0.845 -1.616 -0.827 -0.158 -0.097 1
class_3
a b c d e f class_label
1 0.572 0.845 -1.616 -0.827 -0.158 -0.097 2
3 0.572 0.845 -1.616 -0.827 -0.158 -0.097 2
class_4
a b c d e f class_label
4 0.572 0.845 -1.616 -0.827 -0.158 -0.097 3
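If the goal is the comma-separated layout from the question rather than the default DataFrame printout, one possible sketch is to let to_csv render each group. This assumes the same data as in the question; the names rows, blocks, header and body are illustrative only.

```python
import pandas as pd

# Same data as in the question; only the class label differs per row.
rows = [[0.572, 0.845, -1.616, -0.827, -0.158, -0.097, c]
        for c in [0, 2, 1, 2, 3, 0, 1]]
df = pd.DataFrame(rows, columns=["a", "b", "c", "d", "e", "f", "class_label"])

blocks = []
for label, group in df.groupby("class_label"):
    header = "class_" + str(label + 1)
    # With no path argument, to_csv returns the rows as comma-separated text.
    body = group.to_csv(header=False, index=False).strip()
    blocks.append(header + "\n" + body)

print("\n".join(blocks))
```

Each entry of blocks holds one class header followed by its rows, so the pieces can also be written to separate files if needed.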
If you need the output as a DataFrame and the original index is not important:
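# Assumed step (not shown in the original): move class_label into the index and sort by it
df = df.set_index('class_label').sort_index()
print(df)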
a b c d e f
class_label
0 0.572 0.845 -1.616 -0.827 -0.158 -0.097
0 0.572 0.845 -1.616 -0.827 -0.158 -0.097
1 0.572 0.845 -1.616 -0.827 -0.158 -0.097
1 0.572 0.845 -1.616 -0.827 -0.158 -0.097
2 0.572 0.845 -1.616 -0.827 -0.158 -0.097
2 0.572 0.845 -1.616 -0.827 -0.158 -0.097
3 0.572 0.845 -1.616 -0.827 -0.158 -0.097
print(['class_' + str(x + 1) for x in df.index])
['class_1', 'class_1', 'class_2', 'class_2', 'class_3', 'class_3', 'class_4']
# change index: replace the numeric labels with class_N strings
df.index = ['class_' + str(x + 1) for x in df.index]
print(df)
a b c d e f
class_1 0.572 0.845 -1.616 -0.827 -0.158 -0.097
class_1 0.572 0.845 -1.616 -0.827 -0.158 -0.097
class_2 0.572 0.845 -1.616 -0.827 -0.158 -0.097
class_2 0.572 0.845 -1.616 -0.827 -0.158 -0.097
class_3 0.572 0.845 -1.616 -0.827 -0.158 -0.097
class_3 0.572 0.845 -1.616 -0.827 -0.158 -0.097
class_4 0.572 0.845 -1.616 -0.827 -0.158 -0.097
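With string labels as the index, a single class can be pulled out with .loc. The sketch below rebuilds the relabeled frame from the question's data; the set_index/sort_index line is my assumption about how the sorted frame was produced, and the names class_1 and class_4 are illustrative.

```python
import pandas as pd

# Same data as in the question; only the class label differs per row.
rows = [[0.572, 0.845, -1.616, -0.827, -0.158, -0.097, c]
        for c in [0, 2, 1, 2, 3, 0, 1]]
df = pd.DataFrame(rows, columns=["a", "b", "c", "d", "e", "f", "class_label"])

# Assumption: class_label becomes the (sorted) index before relabeling.
df = df.set_index("class_label").sort_index()
df.index = ["class_" + str(x + 1) for x in df.index]

class_1 = df.loc["class_1"]      # two matching rows, so this is a DataFrame
class_4 = df.loc[["class_4"]]    # list selector keeps a DataFrame even for one row
print(class_1)
print(class_4)
```

Passing a list to .loc is a deliberate choice here: a scalar label that matches exactly one row would return a Series instead of a DataFrame.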
If the original index is important, you need to switch to a MultiIndex:
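# start again from the original df, where class_label is still a column
df = df.set_index(['class_label'], append=True).sort_index(level=1)
# put class_label first, original row number second
df.index = df.index.swaplevel(0,1)
print(df)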
a b c d e f
class_label
0 0 0.572 0.845 -1.616 -0.827 -0.158 -0.097
5 0.572 0.845 -1.616 -0.827 -0.158 -0.097
1 2 0.572 0.845 -1.616 -0.827 -0.158 -0.097
6 0.572 0.845 -1.616 -0.827 -0.158 -0.097
2 1 0.572 0.845 -1.616 -0.827 -0.158 -0.097
3 0.572 0.845 -1.616 -0.827 -0.158 -0.097
3 4 0.572 0.845 -1.616 -0.827 -0.158 -0.097
names = df.index.get_level_values('class_label').tolist()
print(['class_' + str(x + 1) for x in names])
['class_1', 'class_1', 'class_2', 'class_2', 'class_3', 'class_3', 'class_4']
# change multiindex: pair each class_N string with the original row number
new_index = list(zip(['class_' + str(x + 1) for x in names], df.index.get_level_values(1)))
df.index = pd.MultiIndex.from_tuples(new_index, names=df.index.names)
print(df)
a b c d e f
class_label
class_1 0 0.572 0.845 -1.616 -0.827 -0.158 -0.097
5 0.572 0.845 -1.616 -0.827 -0.158 -0.097
class_2 2 0.572 0.845 -1.616 -0.827 -0.158 -0.097
6 0.572 0.845 -1.616 -0.827 -0.158 -0.097
class_3 1 0.572 0.845 -1.616 -0.827 -0.158 -0.097
3 0.572 0.845 -1.616 -0.827 -0.158 -0.097
class_4 4 0.572 0.845 -1.616 -0.827 -0.158 -0.097
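Once the class is the first level of the MultiIndex, the rows of one class can be taken as a cross-section. A sketch, assuming the same data and the same index construction as above; the name class_3 is illustrative.

```python
import pandas as pd

# Same data as in the question; only the class label differs per row.
rows = [[0.572, 0.845, -1.616, -0.827, -0.158, -0.097, c]
        for c in [0, 2, 1, 2, 3, 0, 1]]
df = pd.DataFrame(rows, columns=["a", "b", "c", "d", "e", "f", "class_label"])

# Build the (class_label, original row number) MultiIndex.
df = df.set_index(["class_label"], append=True).sort_index(level=1)
df.index = df.index.swaplevel(0, 1)
names = df.index.get_level_values("class_label").tolist()
new_index = list(zip(["class_" + str(x + 1) for x in names],
                     df.index.get_level_values(1)))
df.index = pd.MultiIndex.from_tuples(new_index, names=df.index.names)

# xs drops the selected level, leaving the original row numbers as the index.
class_3 = df.xs("class_3", level="class_label")
print(class_3)
```

The result keeps the original row numbers, which is the reason for using a MultiIndex in the first place.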
Depending on what you want to do with the data, groupby may be useful.
import numpy as np
grouped = df.groupby("class_label")
# per-class min, mean, max and std of every column
grouped.aggregate([np.min, np.mean, np.max, np.std])
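The same summary can also be requested with string names for the reducers, which agg accepts as well. A sketch, assuming the question's data with class_label as an ordinary column; the name summary is illustrative.

```python
import pandas as pd

# Same data as in the question; only the class label differs per row.
rows = [[0.572, 0.845, -1.616, -0.827, -0.158, -0.097, c]
        for c in [0, 2, 1, 2, 3, 0, 1]]
df = pd.DataFrame(rows, columns=["a", "b", "c", "d", "e", "f", "class_label"])

# One row per class; columns are (original column, statistic) pairs.
summary = df.groupby("class_label").agg(["min", "mean", "max", "std"])
print(summary)

# A single statistic is addressed by label and by (column, statistic).
print(summary.loc[0, ("a", "mean")])
```

Note that std is NaN for a class with a single row (class 3 here), because the default sample standard deviation needs at least two values.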