Divide each column in one df by each column in another df in Python

Good day,

Problem: I have two dataframes, one with the performance of each firm (the outputs) and one with the inputs of each firm:

`import pandas as pd

firms = ['1', '2', '3']
df = pd.DataFrame(firms)
output = {'firms': ['1', '2', '3'],
          'Sales': [150, 200, 50],
          'Profit': [200, 210, 90]}
df1 = pd.DataFrame.from_dict(output)
inputs = {'firms': ['1', '2', '3'],
          'Salary': [10000, 20000, 500],
          'employees': [2, 4, 5]}
df2 = pd.DataFrame.from_dict(inputs)`

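What I need is to divide every column of the output table by every column of the input table. So far I do this in a very ugly way: I divide the whole output table by each column of the input table and then join the results together. That is fine when I have two columns, but I wonder whether there is a better way, since I might have 100 columns in one table and 50 in the other. Also, it matters that the sizes can differ, e.g. 50 columns in the input table and 100 columns in the output table.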

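# Select the output columns by name: a positional slice such as iloc[:, 0:2]
# depends on column order and can pick up the non-numeric 'firms' column.
out = df1[['Profit', 'Sales']]
frst = out.divide(df2.Salary, axis=0)
frst.columns = ['y1-x1', 'y2-x1']
sec = out.divide(df2.employees, axis=0)
sec.columns = ['y1-x2', 'y2-x2']
complete = pd.DataFrame(df).join(frst).join(sec)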

Output:

| firms | y1-x1 | y2-x1 | y1-x2 | y2-x2 |
|-------|--------|-------|-------|-------|
| 1 | 0.0200 | 0.015 | 100.0 | 75.0 |
| 2 | 0.0105 | 0.010 | 52.5 | 50.0 |
| 3 | 0.1800 | 0.100 | 18.0 | 10.0 |

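I also tried a loop, but if I remember correctly it did not work, because in my real example the tables have different sizes. I would be very grateful for your suggestions!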

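I don't see why you can't just use a simple loop. It seems you want to align everything on firms, so setting that column as the index will take care of any join or division where the lengths are unequal.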

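# Index both frames by firm so that rows are matched by label.
df1 = df1.set_index('firms')
df2 = df2.set_index('firms')
l = []
for col in df2.columns:
    # Divide every output column by this input column and tag the result.
    l.append(df1.div(df2[col], axis=0).add_suffix(f'_by_{col}'))
pd.concat(l, axis=1)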

Output:

       Sales_by_Salary  Profit_by_Salary  Sales_by_employees  Profit_by_employees
firms
1                0.015            0.0200                75.0                100.0
2                0.010            0.0105                50.0                 52.5
3                0.100            0.1800                10.0                 18.0
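
The alignment point can be checked on a small case. The sketch below assumes a hypothetical extra firm '4' that appears only in the inputs table; it is not part of the original data. Because both frames are indexed by firms, the division matches rows by label, and the firm that has no outputs should simply end up with NaN values.

```python
import pandas as pd

# Outputs for firms 1-3 (data from the question).
df1 = pd.DataFrame(
    {"firms": ["1", "2", "3"], "Sales": [150, 200, 50], "Profit": [200, 210, 90]}
).set_index("firms")

# Inputs with one extra firm '4' (hypothetical, added only for illustration).
df2 = pd.DataFrame(
    {
        "firms": ["1", "2", "3", "4"],
        "Salary": [10000, 20000, 500, 800],
        "employees": [2, 4, 5, 1],
    }
).set_index("firms")

# Divide every output column by each input column; rows align on the index.
parts = [df1.div(df2[col], axis=0).add_suffix(f"_by_{col}") for col in df2.columns]
ratios = pd.concat(parts, axis=1)

print(ratios)
```

The number of columns in each table does not matter here, since the loop only iterates over the input columns and each division is applied to the whole output table.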

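So I think the point is that your data is essentially three-dimensional: you have the dimensions (firms, cost components, income components), and you want the ratio for every combination in the outer product of those dimensions.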

There are certainly ways to do what you want with DataFrames, but they are messy.

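Pandas did have a 3D object called Panel, but it was deprecated in favor of a more complete solution for indexing higher-dimensional data structures, called xarray. Think of it as pandas for N-dimensional arrays.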

We can convert your data into xarray DataArrays by labeling and stacking the indices:

In [2]: income = df1.set_index('firms').rename_axis(['income'], axis=1).stack('income').to_xarray()

In [3]: income
Out[3]:
<xarray.DataArray (firms: 3, income: 2)>
array([[150, 200],
       [200, 210],
       [ 50,  90]])
Coordinates:
  * firms    (firms) object '1' '2' '3'
  * income   (income) object 'Sales' 'Profit'

In [4]: costs = df2.set_index('firms').rename_axis(['costs'], axis=1).stack('costs').to_xarray()

In [5]: costs
Out[5]:
<xarray.DataArray (firms: 3, costs: 2)>
array([[10000,     2],
       [20000,     4],
       [  500,     5]])
Coordinates:
  * firms    (firms) object '1' '2' '3'
  * costs    (costs) object 'Salary' 'employees'

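You now have two DataArrays, each with two dimensions, but the dimensions do not match. Both are indexed by firms, but income is also indexed by income and costs by costs.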

When you perform an operation on them, they are automatically broadcast against each other:

In [6]: income / costs
Out[6]:
<xarray.DataArray (firms: 3, income: 2, costs: 2)>
array([[[1.50e-02, 7.50e+01],
        [2.00e-02, 1.00e+02]],

       [[1.00e-02, 5.00e+01],
        [1.05e-02, 5.25e+01]],

       [[1.00e-01, 1.00e+01],
        [1.80e-01, 1.80e+01]]])
Coordinates:
  * firms    (firms) object '1' '2' '3'
  * income   (income) object 'Sales' 'Profit'
  * costs    (costs) object 'Salary' 'employees'
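
For intuition, the same broadcast can be written by hand in plain NumPy. This is only a sketch of what happens conceptually, not a description of xarray's internals, and the variable names are chosen for illustration.

```python
import numpy as np

# Same numbers as above: rows are firms, columns are components.
income = np.array([[150, 200], [200, 210], [50, 90]])   # (firms, income)
costs = np.array([[10000, 2], [20000, 4], [500, 5]])    # (firms, costs)

# Insert length-1 axes so the shapes become (3, 2, 1) and (3, 1, 2);
# NumPy then broadcasts both to (firms, income, costs) = (3, 2, 2).
ratios = income[:, :, np.newaxis] / costs[:, np.newaxis, :]

print(ratios.shape)
print(ratios[0])  # firm '1': Sales and Profit, each divided by Salary and employees
```

The difference with xarray is that the axes are matched by name rather than by position, so there is no need to insert the placeholder axes manually.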

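The data now has the structure you were trying to achieve, and the division is done with vectorized array operations rather than an explicit Python loop.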

You can convert the data back to a DataFrame using the built-in DataArray.to_series method:

In [7]: (income / costs).to_series().to_frame(name='income to cost ratio')
Out[7]:
                        income to cost ratio
firms income costs
1     Sales  Salary                   0.0150
             employees               75.0000
      Profit Salary                   0.0200
             employees              100.0000
2     Sales  Salary                   0.0100
             employees               50.0000
      Profit Salary                   0.0105
             employees               52.5000
3     Sales  Salary                   0.1000
             employees               10.0000
      Profit Salary                   0.1800
             employees               18.0000
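
If a wide table like the one in the question is preferred, the long result can be unstacked. The sketch below rebuilds the long Series by hand using pandas only, so it runs without xarray; in practice it would be the result of `(income / costs).to_series()`. The names `long` and `wide` and the flattened column names are arbitrary choices for illustration.

```python
import pandas as pd

# Long-form ratios with a (firms, income, costs) index, values as in Out[7].
index = pd.MultiIndex.from_product(
    [["1", "2", "3"], ["Sales", "Profit"], ["Salary", "employees"]],
    names=["firms", "income", "costs"],
)
long = pd.Series(
    [0.015, 75.0, 0.02, 100.0, 0.01, 50.0, 0.0105, 52.5, 0.1, 10.0, 0.18, 18.0],
    index=index,
    name="income to cost ratio",
)

# Move the income and costs levels into the columns: one row per firm.
wide = long.unstack(["income", "costs"])

# Flatten the two-level column index into single names (naming is arbitrary).
wide.columns = [f"{inc}_by_{cost}" for inc, cost in wide.columns]

print(wide)
```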
