我有如下的pandas数据框架:
import pandas as pd
import numpy as np
data = [['Apple', 1, 1, 1 ,1,], ['Orange', np.nan, 1, 1, np.nan], ['Banana', 1, np.nan, 1, np.nan]]
df = pd.DataFrame(data, columns = ['Type of fruit', 'Paris', "Boston", "Austin", "New York"])
输出:
Type of fruit Paris Boston Austin New York
0 Apple 1.0 1.0 1 1.0
1 Orange NaN 1.0 1 NaN
2 Banana 1.0 NaN 1 NaN
我想创建一个名为"Location"的新列,并基于四个列Paris, Boston, Austin, New York
创建新的索引,例如:
理想输出:
Location Type of fruit
0 Paris Apple
1 Boston Apple
2 Austin Apple
3 New York Apple
4 Boston Orange
5 Austin Orange
6 Paris Banana
7 Austin Banana
我可以过滤每个位置列以保持非空索引(例如Paris):
df_paris = df.loc[df["Paris"].notna(),["Type of fruit"]]
df_paris["Location"] = "Paris"
,然后连接每个位置的数据帧:
pd.concat([df_paris, df_boston, df_austin, df_new_york])
但是我确信有更好的方法使用pandas函数来做这些事情。
使用DataFrame.set_index
和DataFrame.stack
,默认会删除缺失值:
df1 = (df.set_index('Type of fruit')
.rename_axis('Location', axis=1)
.stack()
.reset_index()[['Location','Type of fruit']])
或将MultiIndex
转换为新的DataFrame
:
df1 = (df.set_index('Type of fruit')
.rename_axis('Location', axis=1)
.stack()
.swaplevel(1,0)
.index
.to_frame(index=False))
或使用DataFrame.melt
为unpivot删除缺失行DataFrame.dropna
:
df1 = (df.melt('Type of fruit', var_name='Location', ignore_index=False)
.sort_index()
.dropna(subset=['value'])[['Location','Type of fruit']]
.reset_index(drop=True))
print (df1)
Location Type of fruit
0 Paris Apple
1 Boston Apple
2 Austin Apple
3 New York Apple
4 Boston Orange
5 Austin Orange
6 Paris Banana
7 Austin Banana
(
df
.set_index("Type of fruit") # Moves the 'Type of fruit' column to the index
.rename_axis('Location', axis=1) # Sets the name of the column index to 'Location'
.stack() # Moves the remaining columns to be a second index level
.index # Select just the index (which is a MultiIndex with `Type of fruit` and `Location` as levels)
.swaplevel() # Swaps the level order so that `Location` is first
.to_frame(index=False) # Convert the MultiIndex to a DataFrame.
# `index=False` means that the resulting DF will just have a numeric index.
# If index=True, the input MultiIndex would be used as both the index and content of the DataFrame
)