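I want to reorganize some data from an existing Pandas DataFrame so that a subset of row values becomes columns (in the minimal example below, the 'City' values 'New York' and 'Amsterdam'), dropping the rows whose city is not one of the selected values.
To illustrate what I mean, I have a DataFrame like this: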
# Id | Start_Time | End_Time | City | Price
--------------------------------------
1 | 2022-01-01 00:00:00.0000000 +01:00 | 2022-01-01 01:00:00.0000000 +01:00 | New York | 100
2 | 2022-01-01 01:00:00.0000000 +01:00 | 2022-01-01 02:00:00.0000000 +01:00 | New York | 90
3 | 2022-01-01 02:00:00.0000000 +01:00 | 2022-01-01 03:00:00.0000000 +01:00 | New York | 95
4 | 2022-01-01 01:00:00.0000000 +01:00 | 2022-01-01 02:00:00.0000000 +01:00 | Amsterdam | 300
5 | 2022-01-01 02:00:00.0000000 +01:00 | 2022-01-01 03:00:00.0000000 +01:00 | Amsterdam | 250
I want to create a DataFrame that looks like this:
# Period | New York | Amsterdam | Difference
------------------------------------------------------------------------------------------
2022-01-01 00:00:00.0000000 +01:00 - 2022-01-01 01:00:00.0000000 +01:00 | 100 | NaN | NaN
2022-01-01 01:00:00.0000000 +01:00 - 2022-01-01 02:00:00.0000000 +01:00 | 90 | 300 | -210
2022-01-01 02:00:00.0000000 +01:00 - 2022-01-01 03:00:00.0000000 +01:00 | 95 | 250 | -155
Here is my current code:
import pandas as pd
data = [
[1, '2022-01-01 00:00:00.0000000 +01:00', '2022-01-01 01:00:00.0000000 +01:00', 'New York', 100],
[2, '2022-01-01 01:00:00.0000000 +01:00', '2022-01-01 02:00:00.0000000 +01:00', 'New York', 90],
[3, '2022-01-01 02:00:00.0000000 +01:00', '2022-01-01 03:00:00.0000000 +01:00', 'New York', 95],
[4, '2022-01-01 01:00:00.0000000 +01:00', '2022-01-01 02:00:00.0000000 +01:00', 'Amsterdam', 300],
[5, '2022-01-01 02:00:00.0000000 +01:00', '2022-01-01 03:00:00.0000000 +01:00', 'Amsterdam', 250]
]
df = pd.DataFrame(data, columns=['Id', 'Start_Time', 'End_Time', 'City', 'Price'])
ny = df[df['City'] == 'New York']
amsterdam = df[df['City'] == 'Amsterdam']
# Here I naively try to create a new DataFrame with the price in the two cities as columns.
# I figure I can then add a new column with the difference with df_new['Difference'] = df_new['New York'] - df_new['Amsterdam']
df_new = pd.DataFrame().assign(ny=ny['Price'], amsterdam=amsterdam['Price'])
How do I get from a DataFrame that looks like the first one to a DataFrame that looks like the second?
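A note on why the assign attempt cannot line the prices up, even once both filters use the City column: pandas aligns Series on their index labels, and the New York rows (labels 0 to 2) share no labels with the Amsterdam rows (labels 3 and 4). Below is a minimal sketch of that effect, and of aligning on a shared key instead. The Period column and the shortened timestamps are assumptions made for illustration only.

```python
import pandas as pd

# Shortened timestamps, for illustration only.
df = pd.DataFrame({
    'Start_Time': ['00:00', '01:00', '02:00', '01:00', '02:00'],
    'End_Time':   ['01:00', '02:00', '03:00', '02:00', '03:00'],
    'City':  ['New York', 'New York', 'New York', 'Amsterdam', 'Amsterdam'],
    'Price': [100, 90, 95, 300, 250],
})
ny = df[df['City'] == 'New York']
amsterdam = df[df['City'] == 'Amsterdam']

# Index labels 0-2 and 3-4 never match, so the amsterdam column is all NaN.
naive = pd.DataFrame().assign(ny=ny['Price'], amsterdam=amsterdam['Price'])
print(naive)

# Giving both Series the same key (the period) lets pandas align them.
df['Period'] = df['Start_Time'] + ' - ' + df['End_Time']
ny_by_period = df[df['City'] == 'New York'].set_index('Period')['Price']
ams_by_period = df[df['City'] == 'Amsterdam'].set_index('Period')['Price']
aligned = pd.DataFrame({'New York': ny_by_period, 'Amsterdam': ams_by_period})
print(aligned)
```

This works for two cities, but it needs one filtered Series per city, so a pivot scales better as the number of cities grows.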
Since the column order matters, we first need to turn City into an ordered categorical column.
We can then use pivot_table to create the pivot:
df['period'] = df['Start_Time'] + " - " + df['End_Time']
#assuming these are already strings, if not you'll need to cast them.
df['City'] = pd.Categorical(df['City'],
ordered=True, categories=['New York', 'Amsterdam'])
df1 = df.pivot_table(index='period',
columns='City', values='Price',aggfunc='first')
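The comment on the period line assumes Start_Time and End_Time already hold strings. If they are datetime columns instead, the cast could look roughly like the sketch below. The frame df_dt and its values are hypothetical, and the exact string format produced by astype(str) may differ from the format shown in the question.

```python
import pandas as pd

# Hypothetical frame whose time columns are timezone-aware datetimes, not strings.
df_dt = pd.DataFrame({
    'Start_Time': pd.to_datetime(['2022-01-01 00:00:00+01:00', '2022-01-01 01:00:00+01:00']),
    'End_Time': pd.to_datetime(['2022-01-01 01:00:00+01:00', '2022-01-01 02:00:00+01:00']),
})

# Cast both columns to str first, so that + performs string concatenation.
df_dt['period'] = df_dt['Start_Time'].astype(str) + ' - ' + df_dt['End_Time'].astype(str)
print(df_dt['period'].tolist())
```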
Finally, your difference column:
df1['sub'] = df1.diff(axis=1).iloc[:,-1]
print(df1)
City New York Amsterdam sub
period
2022-01-01 00:00:00.0000000 +01:00 - 2022-01-01... 100.0 NaN NaN
2022-01-01 01:00:00.0000000 +01:00 - 2022-01-01... 90.0 300.0 210.0
2022-01-01 02:00:00.0000000 +01:00 - 2022-01-01... 95.0 250.0 155.0
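Note that diff(axis=1) subtracts the previous column from the current one, so sub is Amsterdam minus New York, which is the opposite sign of the Difference column requested in the question. Here is a minimal variant that subtracts explicitly. It swaps in pivot and reindex for the categorical column, and it assumes each period and city pair occurs only once.

```python
import pandas as pd

data = [
    [1, '2022-01-01 00:00:00.0000000 +01:00', '2022-01-01 01:00:00.0000000 +01:00', 'New York', 100],
    [2, '2022-01-01 01:00:00.0000000 +01:00', '2022-01-01 02:00:00.0000000 +01:00', 'New York', 90],
    [3, '2022-01-01 02:00:00.0000000 +01:00', '2022-01-01 03:00:00.0000000 +01:00', 'New York', 95],
    [4, '2022-01-01 01:00:00.0000000 +01:00', '2022-01-01 02:00:00.0000000 +01:00', 'Amsterdam', 300],
    [5, '2022-01-01 02:00:00.0000000 +01:00', '2022-01-01 03:00:00.0000000 +01:00', 'Amsterdam', 250],
]
df = pd.DataFrame(data, columns=['Id', 'Start_Time', 'End_Time', 'City', 'Price'])
df['Period'] = df['Start_Time'] + ' - ' + df['End_Time']

# pivot raises if a Period/City pair is duplicated; reindex fixes the column order.
wide = (
    df.pivot(index='Period', columns='City', values='Price')
      .reindex(columns=['New York', 'Amsterdam'])
)

# Explicit subtraction gives New York minus Amsterdam, as in the desired output.
wide['Difference'] = wide['New York'] - wide['Amsterdam']
print(wide)
```

If duplicate period and city pairs can occur, pivot_table with an explicit aggfunc, as used above, is the safer choice.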