Creating a new DataFrame based on an existing DataFrame



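I want to reorganize some data from an existing pandas DataFrame so that a subset of the values in one column become columns of their own. In the minimal example below, the 'City' values become columns, and rows whose city is not one of the selected values ('New York' and 'Amsterdam') are dropped.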

To illustrate what I mean, I have a DataFrame like this:

# Id |  Start_Time                       | End_Time                           | City       | Price
--------------------------------------
1 | 2022-01-01 00:00:00.0000000 +01:00 | 2022-01-01 01:00:00.0000000 +01:00 | New York   | 100
2 | 2022-01-01 01:00:00.0000000 +01:00 | 2022-01-01 02:00:00.0000000 +01:00 | New York   | 90
3 | 2022-01-01 02:00:00.0000000 +01:00 | 2022-01-01 03:00:00.0000000 +01:00 | New York   | 95
4 | 2022-01-01 01:00:00.0000000 +01:00 | 2022-01-01 02:00:00.0000000 +01:00 | Amsterdam  | 300
5 | 2022-01-01 02:00:00.0000000 +01:00 | 2022-01-01 03:00:00.0000000 +01:00 | Amsterdam  | 250

I want to create a DataFrame that looks like this:

# Period                                                                | New York | Amsterdam | Difference
------------------------------------------------------------------------------------------
2022-01-01 00:00:00.0000000 +01:00 - 2022-01-01 01:00:00.0000000 +01:00 | 100      | NaN       | NaN
2022-01-01 01:00:00.0000000 +01:00 - 2022-01-01 02:00:00.0000000 +01:00 | 90       | 300       | -210
2022-01-01 02:00:00.0000000 +01:00 - 2022-01-01 03:00:00.0000000 +01:00 | 95       | 250       | -155

Here is my current code:

import pandas as pd
data = [
[1, '2022-01-01 00:00:00.0000000 +01:00', '2022-01-01 01:00:00.0000000 +01:00', 'New York', 100],
[2, '2022-01-01 01:00:00.0000000 +01:00', '2022-01-01 02:00:00.0000000 +01:00', 'New York', 90], 
[3, '2022-01-01 02:00:00.0000000 +01:00', '2022-01-01 03:00:00.0000000 +01:00', 'New York', 95],
[4, '2022-01-01 01:00:00.0000000 +01:00', '2022-01-01 02:00:00.0000000 +01:00', 'Amsterdam', 300], 
[5, '2022-01-01 02:00:00.0000000 +01:00', '2022-01-01 03:00:00.0000000 +01:00', 'Amsterdam', 250]
]
df = pd.DataFrame(data, columns=['Id', 'Start_Time', 'End_Time', 'City', 'Price'])
ny = df[df['City'] == 'New York']
amsterdam = df[df['City'] == 'Amsterdam']
# Here I naively try to create a new DataFrame with the price in the two cities as columns.
# I figure I can then add a new column with the difference with df_new['Difference'] = df_new['New York'] - df_new['Amsterdam']
df_new = pd.DataFrame().assign(ny=ny['Price'], amsterdam=amsterdam['Price'])

How do I get from a DataFrame that looks like the first one to a DataFrame that looks like the second?
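
As background on why the attempt in the question does not produce the desired table: pandas aligns an assigned Series on its row labels, not on its position. The sketch below assumes both filters select on the City column and uses only the prices from the question. The two filtered frames keep their original row labels (0 to 2 and 3 to 4), so the second column finds no matching labels.

```python
import pandas as pd

# Same cities and prices as in the question; the default index gives rows 0..4.
df = pd.DataFrame({
    "City": ["New York", "New York", "New York", "Amsterdam", "Amsterdam"],
    "Price": [100, 90, 95, 300, 250],
})
ny = df[df["City"] == "New York"]          # row labels 0, 1, 2
amsterdam = df[df["City"] == "Amsterdam"]  # row labels 3, 4

# The empty frame takes its index from the first Series (0, 1, 2).
# The second Series shares none of those labels, so it is filled with NaN.
df_new = pd.DataFrame().assign(ny=ny["Price"], amsterdam=amsterdam["Price"])
print(df_new)
```

This is why the rows need a shared key, such as the time period, before the two cities can sit side by side.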

Since the column order matters, we first need to turn City into an ordered categorical column.

Then we can use pivot_table to create the pivot:
# Assuming these are already strings; if not, you'll need to cast them first.
df['period'] = df['Start_Time'] + " - " + df['End_Time']
df['City'] = pd.Categorical(df['City'],
                            ordered=True, categories=['New York', 'Amsterdam'])

df1 = df.pivot_table(index='period',
                     columns='City', values='Price', aggfunc='first')
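
To see what the categorical step changes, the following sketch pivots the same data twice, once with City as plain strings and once as an ordered categorical. The names plain and ordered are only illustrative, and observed=False is passed explicitly on the assumption that newer pandas versions may otherwise warn about categorical groupers.

```python
import pandas as pd

data = [
    [1, '2022-01-01 00:00:00.0000000 +01:00', '2022-01-01 01:00:00.0000000 +01:00', 'New York', 100],
    [2, '2022-01-01 01:00:00.0000000 +01:00', '2022-01-01 02:00:00.0000000 +01:00', 'New York', 90],
    [3, '2022-01-01 02:00:00.0000000 +01:00', '2022-01-01 03:00:00.0000000 +01:00', 'New York', 95],
    [4, '2022-01-01 01:00:00.0000000 +01:00', '2022-01-01 02:00:00.0000000 +01:00', 'Amsterdam', 300],
    [5, '2022-01-01 02:00:00.0000000 +01:00', '2022-01-01 03:00:00.0000000 +01:00', 'Amsterdam', 250],
]
df = pd.DataFrame(data, columns=['Id', 'Start_Time', 'End_Time', 'City', 'Price'])
df['period'] = df['Start_Time'] + " - " + df['End_Time']

# With plain strings, the pivoted column labels come out sorted alphabetically.
plain = df.pivot_table(index='period', columns='City', values='Price', aggfunc='first')

# With an ordered categorical, the columns follow the category order instead.
df['City'] = pd.Categorical(df['City'], ordered=True,
                            categories=['New York', 'Amsterdam'])
ordered = df.pivot_table(index='period', columns='City', values='Price',
                         aggfunc='first', observed=False)

print(list(plain.columns))
print(list(ordered.columns))
```

The first list starts with Amsterdam and the second with New York, which is the order the question asks for.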

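Finally, add your difference column. Note that diff(axis=1) subtracts the column to the left from each column, so sub holds Amsterdam minus New York; negate it if you need New York minus Amsterdam as in the question.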

df1['sub'] = df1.diff(axis=1).iloc[:,-1]

print(df1)
City                                                New York  Amsterdam    sub
period
2022-01-01 00:00:00.0000000 +01:00 - 2022-01-01...     100.0        NaN    NaN
2022-01-01 01:00:00.0000000 +01:00 - 2022-01-01...      90.0      300.0  210.0
2022-01-01 02:00:00.0000000 +01:00 - 2022-01-01...      95.0      250.0  155.0
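
If the sign and column names should match the table in the question exactly, one possible variant is sketched below. It is an alternative rather than the method above: it swaps in DataFrame.pivot plus reindex to fix the column order instead of a categorical, and it subtracts the two columns explicitly. The names result, Period and Difference are chosen here to mirror the question.

```python
import pandas as pd

data = [
    [1, '2022-01-01 00:00:00.0000000 +01:00', '2022-01-01 01:00:00.0000000 +01:00', 'New York', 100],
    [2, '2022-01-01 01:00:00.0000000 +01:00', '2022-01-01 02:00:00.0000000 +01:00', 'New York', 90],
    [3, '2022-01-01 02:00:00.0000000 +01:00', '2022-01-01 03:00:00.0000000 +01:00', 'New York', 95],
    [4, '2022-01-01 01:00:00.0000000 +01:00', '2022-01-01 02:00:00.0000000 +01:00', 'Amsterdam', 300],
    [5, '2022-01-01 02:00:00.0000000 +01:00', '2022-01-01 03:00:00.0000000 +01:00', 'Amsterdam', 250],
]
df = pd.DataFrame(data, columns=['Id', 'Start_Time', 'End_Time', 'City', 'Price'])

# Build the shared row key from the two timestamp strings.
df['Period'] = df['Start_Time'] + " - " + df['End_Time']

# pivot() assumes each (Period, City) pair occurs at most once, as it does here.
# reindex() puts the city columns in the requested order.
result = (
    df.pivot(index='Period', columns='City', values='Price')
      .reindex(columns=['New York', 'Amsterdam'])
)

# Explicit subtraction makes the sign unambiguous: New York minus Amsterdam.
result['Difference'] = result['New York'] - result['Amsterdam']
result.columns.name = None  # drop the leftover 'City' label on the column axis
print(result)
```

The first period has no Amsterdam price, so its Difference is NaN; the other two rows come out as -210 and -155.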
