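I use pandas for various applications and am very grateful for it, since it makes my life easier.
In most cases I work with homogeneous data and know which data structure fits best. So far I have mainly used (MultiIndex) DataFrames and Series side by side, which works fine.
But I am a bit stuck in my current project, where it would be helpful to handle heterogeneous data (1D and 2D) in one common object.
Here is my attempt with a Panel (3D) object, which hopefully shows what I am looking for: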
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
# dataframes
df = pd.DataFrame(np.random.randn(6, 3))
df['concept'] = np.repeat(np.repeat(['A', 'B', 'C'], 2), 1)
df.set_index(['concept'], inplace=True)
df.sort_index(inplace=True)
df.columns = ['C1', 'C2', 'C3']
df
               C1        C2        C3
concept
A       -0.555291 -1.026308 -0.016192
A       -1.759410  0.023008 -0.168303
B       -0.471165  1.160105  0.862017
B       -2.583058  0.595113  0.729354
C        0.706030  1.518058 -1.760176
C       -0.290667 -0.737529 -0.177824
df2 = pd.DataFrame(np.random.randn(6, 3))
df2['concept'] = np.repeat(np.repeat(['A', 'B', 'C'], 2), 1)
df2.set_index(['concept'], inplace=True)
df2.sort_index(inplace=True)
df2.columns = ['C4', 'C5', 'C6']
df2
               C4        C5        C6
concept
A        0.784534 -0.590447 -0.661132
A       -0.443176  0.423495 -1.171204
B        1.103484  1.295225  0.112374
B        0.097899 -0.879873  0.213401
C       -1.117570 -0.577390  1.714902
C        1.476986  1.191201  0.973319
# combine dataframes in a panel object (combine homogeneous data)
data = {'Item1': df, 'Item2': df2}
my_panel = pd.Panel(data)
my_panel.describe
my_panel.ix['Item2', 'A', 'C4']
concept
A    0.784534
A   -0.443176
# add a series to the dataframe (combine heterogenous data)
s = pd.Series(['gpsol', 125, 'my_simulation_x'],
              index=['solver', 'runtime', 'simulation_name'])
s
solver                       gpsol
runtime                        125
simulation_name    my_simulation_x
# this doesn't work and throws an error as a panel is not the right
# data structure
# "AssertionError: Length of data and index must match"
data = {'Item1': df, 'Item2': df2, 'Item3': s}
my_panel = pd.Panel(data)
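I know a Panel is not meant to hold data of different dimensions, but it would be great to have a (sliceable) data structure that can integrate 1D and 2D objects.
Is there something like this in pandas, or do I have to use separate pandas objects for this?
If the answer is "No, pandas is not made for this", that is fine too. I just want to know whether there is something suited to this purpose.
Thanks in advance!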
The solution that works for my case: simply add the (dict of) Series as an attribute to the DataFrame/Panel object.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
# dataframes
df = pd.DataFrame(np.random.randn(6, 3))
df['concept'] = np.repeat(np.repeat(['A', 'B', 'C'], 2), 1)
df.set_index(['concept'], inplace=True)
df.sort_index(inplace=True)
df.columns = ['C1', 'C2', 'C3']
df
df2 = pd.DataFrame(np.random.randn(6, 3))
df2['concept'] = np.repeat(np.repeat(['A', 'B', 'C'], 2), 1)
df2.set_index(['concept'], inplace=True)
df2.sort_index(inplace=True)
df2.columns = ['C4', 'C5', 'C6']
df2
# combine dataframes in a panel object (combine homogeneous data)
data = {'Item1': df, 'Item2': df2}
opt_results = pd.Panel(data)
# add a series to the dataframe (combine heterogenous data)
opt_params = pd.Series(['gpsol', 125, 'my_simulation_x'],
                       index=['solver', 'runtime', 'simulation_name'])
# this doesn't work and throws an error because of different indexes/dimensions
# data = {'Item1': df, 'Item2': df2, 'Item3': opt_params}
# my_panel = pd.Panel(data)
# but setting the series as an attribute is sufficient for me
opt_results.info = opt_params
opt_results.info
solver                       gpsol
runtime                        125
simulation_name    my_simulation_x
dtype: object
opt_results.ix['Item2', 'A', 'C4']
concept
A   -0.660582
A   -1.174828
Name: C4, dtype: float64
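A note for readers on current pandas: pd.Panel was removed in pandas 0.25 and .ix in 1.0, so the snippet above only runs on old versions. A rough sketch of the same "attach the metadata to the object" idea, using DataFrame.attrs (available since pandas 1.0 and marked experimental in the docs), might look like the following. The seeded random generator and the choice to store the parameters as a plain dict under the key 'opt_params' are assumptions made for illustration, not part of the original solution.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible (an illustrative choice).
rng = np.random.default_rng(0)

# Same shape as df2 in the post: six rows, with 'concept' repeated twice per label.
concept = pd.Index(np.repeat(['A', 'B', 'C'], 2), name='concept')
df2 = pd.DataFrame(rng.standard_normal((6, 3)),
                   index=concept, columns=['C4', 'C5', 'C6'])

# The 1D metadata from the post.
opt_params = pd.Series(['gpsol', 125, 'my_simulation_x'],
                       index=['solver', 'runtime', 'simulation_name'])

# Store the metadata in the attrs dict instead of a free-form attribute.
# A plain dict is used here rather than the Series itself (assumption:
# simple Python values are the safer thing to keep in attrs).
df2.attrs['opt_params'] = opt_params.to_dict()

# Label-based selection replaces .ix; a duplicated row label returns a Series.
subset = df2.loc['A', 'C4']
print(df2.attrs['opt_params'])
print(subset)
```

Since attrs is experimental, whether the metadata survives a given operation is not something to rely on without checking. Also, on a DataFrame the name info is already taken by the DataFrame.info() method, so assigning to df.info as in the Panel example would shadow it.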
Maybe this was a bit confusing, because the answer is so obvious.
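If attaching attributes to a pandas object feels too implicit, another option is a small wrapper that holds the 2D and 1D pieces side by side. The sketch below is only one way to do it: the class name OptResults and its select method are hypothetical helpers invented for this example, and a plain dict of DataFrames stands in for the Panel.

```python
from dataclasses import dataclass
from typing import Dict

import numpy as np
import pandas as pd


@dataclass
class OptResults:
    """Hypothetical container: named 2D frames plus 1D run parameters."""
    frames: Dict[str, pd.DataFrame]
    params: pd.Series

    def select(self, item, row, col):
        # Mimics the (item, row, column) lookup from the Panel example.
        return self.frames[item].loc[row, col]


rng = np.random.default_rng(0)
concept = pd.Index(np.repeat(['A', 'B', 'C'], 2), name='concept')

df = pd.DataFrame(rng.standard_normal((6, 3)),
                  index=concept, columns=['C1', 'C2', 'C3'])
df2 = pd.DataFrame(rng.standard_normal((6, 3)),
                   index=concept, columns=['C4', 'C5', 'C6'])
opt_params = pd.Series(['gpsol', 125, 'my_simulation_x'],
                       index=['solver', 'runtime', 'simulation_name'])

results = OptResults(frames={'Item1': df, 'Item2': df2}, params=opt_params)

selected = results.select('Item2', 'A', 'C4')  # Series with the two 'A' rows
print(selected)
print(results.params['solver'])
```

The trade-off is that the wrapper is not itself sliceable the way a pandas object is; slicing happens on the frames it holds, while the parameters stay attached to the container rather than to any single frame.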