I'm trying to "split/duplicate" a row when a condition is true (for example, when a cell contains a value)...
For example, I have this table:
import pandas as pd

d = {'Invited_guest': ["Max", "Luca", "John", "Brian", "Ian"], 'Age': [19, 21, 32, 45, 34], 'Origin': ['US', 'UK', 'GER', 'ITA', 'FRA'], 'FamilyMember_1': ["Paul", "Anna", "Peter", "Lewis", "Jeremy"], 'FamilyMember_2': ['Rene', 'Ruben', 'Calvin', 'George', 'Silke'], 'FamilyMember_3': ['', 'Olivia', '', '', 'Selina']}
df = pd.DataFrame(data=d)
df
Index | Invited_guest | Age | Origin | FamilyMember_1 | FamilyMember_2 | FamilyMember_3
---|---|---|---|---|---|---
0 | Max | 19 | US | Paul | Rene |
1 | Luca | 21 | UK | Anna | Ruben | Olivia
2 | John | 32 | GER | Peter | Calvin |
3 | Brian | 45 | ITA | Lewis | George |
4 | Ian | 34 | FRA | Jeremy | Silke | Selina
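You can use a combination of melt, groupby, and agg(list) to convert each row's FamilyMember names into a list (with pipe to remove the empty names), then assign the result back to the dataframe and explode that column: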
exploded = df.assign(names=df.filter(like='FamilyMember_').T.melt().pipe(lambda x: x[x['value'] != '']).groupby('variable')['value'].agg(list)).explode('names').drop(df.filter(like='FamilyMember_'), axis=1).reset_index(drop=True)
Output:
>>> exploded
Invited_guest Age Origin names
0 Max 19 US Paul
1 Max 19 US Rene
2 Luca 21 UK Anna
3 Luca 21 UK Ruben
4 Luca 21 UK Olivia
5 John 32 GER Peter
6 John 32 GER Calvin
7 Brian 45 ITA Lewis
8 Brian 45 ITA George
9 Ian 34 FRA Jeremy
10 Ian 34 FRA Silke
11 Ian 34 FRA Selina
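For readability, the same chain can be sketched as a small multi-line function. The helper name explode_family and its prefix parameter are illustrative assumptions, not part of the original answer, and drop(columns=...) is used here in place of drop(..., axis=1).

```python
import pandas as pd

d = {
    'Invited_guest': ["Max", "Luca", "John", "Brian", "Ian"],
    'Age': [19, 21, 32, 45, 34],
    'Origin': ['US', 'UK', 'GER', 'ITA', 'FRA'],
    'FamilyMember_1': ["Paul", "Anna", "Peter", "Lewis", "Jeremy"],
    'FamilyMember_2': ['Rene', 'Ruben', 'Calvin', 'George', 'Silke'],
    'FamilyMember_3': ['', 'Olivia', '', '', 'Selina'],
}
df = pd.DataFrame(data=d)


def explode_family(frame, prefix='FamilyMember_'):
    """Hypothetical helper: return one row per (guest, family member) pair."""
    family = frame.filter(like=prefix)          # the FamilyMember_* columns
    melted = family.T.melt()                    # 'variable' = original row label
    melted = melted[melted['value'] != '']      # discard the empty names
    names = melted.groupby('variable')['value'].agg(list)
    return (
        frame.assign(names=names)               # aligned on the row index
        .explode('names')                       # one row per list element
        .drop(columns=family.columns)
        .reset_index(drop=True)
    )


exploded = explode_family(df)
print(exploded)
```

Splitting the chain into named steps makes each intermediate value easy to inspect, at the cost of no longer being a single expression.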
Explanation
First, we select the columns whose names start with FamilyMember_:
>>> family_members = df.filter(like='FamilyMember_')
>>> family_members
   FamilyMember_1  FamilyMember_2  FamilyMember_3
0            Paul            Rene
1            Anna           Ruben          Olivia
2           Peter          Calvin
3           Lewis          George
4          Jeremy           Silke          Selina
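One detail worth knowing: filter(like=...) keeps every column whose name contains the given text, not only those that start with it. The sketch below uses a made-up frame (the Old_FamilyMember_1 column is purely hypothetical) to contrast it with an anchored regex.

```python
import pandas as pd

# Hypothetical frame with one column that merely contains the prefix.
demo = pd.DataFrame({
    'FamilyMember_1': ['Paul'],
    'Old_FamilyMember_1': ['Pat'],
    'Age': [19],
})

contains = demo.filter(like='FamilyMember_')    # substring match
starts = demo.filter(regex=r'^FamilyMember_')   # anchored at the start of the name

print(list(contains.columns))  # ['FamilyMember_1', 'Old_FamilyMember_1']
print(list(starts.columns))    # ['FamilyMember_1']
```

For the sample data in the question both forms select the same three columns, so the difference only matters when other column names embed the same text.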
Next, we rotate it 90 degrees (also known as transposing) so that it can be used with melt later:
>>> family_members.T
                   0       1       2       3       4
FamilyMember_1  Paul    Anna   Peter   Lewis  Jeremy
FamilyMember_2  Rene   Ruben  Calvin  George   Silke
FamilyMember_3        Olivia                  Selina
Then we melt it:
>>> family_members.T.melt()
variable value
0 0 Paul
1 0 Rene
2 0
3 1 Anna
4 1 Ruben
5 1 Olivia
6 2 Peter
7 2 Calvin
8 2
9 3 Lewis
10 3 George
11 3
12 4 Jeremy
13 4 Silke
14 4 Selina
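To see why the transpose matters, the following sketch melts a reduced two-guest frame (an assumption made to keep the output short) both with and without .T and compares what ends up in the variable column.

```python
import pandas as pd

family_members = pd.DataFrame({
    'FamilyMember_1': ['Paul', 'Anna'],
    'FamilyMember_2': ['Rene', 'Ruben'],
})

by_column = family_members.melt()    # 'variable' holds the column names
by_row = family_members.T.melt()     # 'variable' holds the original row labels

print(by_column['variable'].tolist())
print(by_row['variable'].tolist())
print(by_row['value'].tolist())
```

Without the transpose, variable identifies the FamilyMember_* column; with it, variable identifies the guest's row, which is the key needed for the later groupby.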
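Now we need to remove the empty items. We could do it like this:
x = family_members.T.melt()
x = x[x['value'] != '']
But that takes multiple statements, so it won't fit in a one-liner. Instead, we can use pipe with a lambda function to do it inline: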
>>> family_members.T.melt().pipe(lambda x: x[x['value'] != ''])
variable value
0 0 Paul
1 0 Rene
3 1 Anna
4 1 Ruben
5 1 Olivia
6 2 Peter
7 2 Calvin
9 3 Lewis
10 3 George
12 4 Jeremy
13 4 Silke
14 4 Selina
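As a side note, .loc also accepts a callable, which gives another inline way to write the same filter. The sketch below, on a small made-up melted frame, checks that both forms return the same rows.

```python
import pandas as pd

# Small stand-in for the melted frame (values chosen for illustration).
melted = pd.DataFrame({
    'variable': [0, 0, 0, 1],
    'value': ['Paul', 'Rene', '', 'Anna'],
})

via_pipe = melted.pipe(lambda x: x[x['value'] != ''])
via_loc = melted.loc[lambda x: x['value'] != '']

print(via_pipe.equals(via_loc))
print(via_pipe.index.tolist())
```

Either form keeps the original row labels, which is why the filtered output above skips index values such as 2, 8, and 11.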
Then we can group by the variable column, since it neatly gathers the names that need to be combined:
>>> g = family_members.T.melt().pipe(lambda x: x[x['value'] != '']).groupby('variable')
>>> g
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x12b131e50>
# That's not very useful, so we can convert it to a list to have a peek at what's inside:
>>> list(g)
[(0,
variable value
0 0 Paul
1 0 Rene),
(1,
variable value
3 1 Anna
4 1 Ruben
5 1 Olivia),
(2,
variable value
6 2 Peter
7 2 Calvin),
(3,
variable value
9 3 Lewis
10 3 George),
(4,
variable value
12 4 Jeremy
13 4 Silke
14 4 Selina)]
We need to convert each group into a list of the names it contains. That is what agg(list) does:
>>> g['value'].agg(list)
variable
0 [Paul, Rene]
1 [Anna, Ruben, Olivia]
2 [Peter, Calvin]
3 [Lewis, George]
4 [Jeremy, Silke, Selina]
Name: value, dtype: object
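For comparison, the same per-row lists can be built without melt and groupby by iterating over the rows directly. This is an alternative sketch, not the approach the answer uses, and it is written for clarity rather than speed.

```python
import pandas as pd

family_members = pd.DataFrame({
    'FamilyMember_1': ["Paul", "Anna", "Peter", "Lewis", "Jeremy"],
    'FamilyMember_2': ['Rene', 'Ruben', 'Calvin', 'George', 'Silke'],
    'FamilyMember_3': ['', 'Olivia', '', '', 'Selina'],
})

# The melt/groupby/agg(list) route from the answer.
grouped = (
    family_members.T.melt()
    .pipe(lambda x: x[x['value'] != ''])
    .groupby('variable')['value']
    .agg(list)
)

# Alternative: build one list per row, skipping empty strings.
direct = pd.Series(
    [[name for name in row if name != ''] for row in family_members.to_numpy()],
    index=family_members.index,
)

print(grouped.tolist() == direct.tolist())
```

One behavioural difference: a row with no family members at all yields an empty list in the direct version, whereas it would be missing from the grouped result entirely.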
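Perfect. Now we need to put that column back into the dataframe. We could assign it the usual way:
df['names'] = g['value'].agg(list)
But again, that would make a one-liner impossible. Fortunately, there is the assign method, which is built for this use case: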
>>> df.assign(names=g['value'].agg(list))
   Invited_guest  Age  Origin  FamilyMember_1  FamilyMember_2  FamilyMember_3                    names
0            Max   19      US            Paul            Rene                             [Paul, Rene]
1           Luca   21      UK            Anna           Ruben          Olivia    [Anna, Ruben, Olivia]
2           John   32     GER           Peter          Calvin                          [Peter, Calvin]
3          Brian   45     ITA           Lewis          George                          [Lewis, George]
4            Ian   34     FRA          Jeremy           Silke          Selina  [Jeremy, Silke, Selina]
(Note that assign does not work in place. It returns a new, modified copy of the dataframe and leaves the original untouched.)
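A quick way to confirm this is to check the columns before and after. The two-row frame and the with_names variable below are assumptions made only for this illustration.

```python
import pandas as pd

guests = pd.DataFrame({'Invited_guest': ['Max', 'Luca']})
name_lists = pd.Series([['Paul', 'Rene'], ['Anna', 'Ruben', 'Olivia']])

with_names = guests.assign(names=name_lists)   # returns a new frame

print(list(guests.columns))       # ['Invited_guest']
print(list(with_names.columns))   # ['Invited_guest', 'names']
```

Because the original is untouched, the result has to be either chained onward (as in the one-liner) or bound to a name.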
Finally, we use the magical explode (only available in pandas 0.25 and later):
>>> df.assign(names=g['value'].agg(list)).explode('names')
   Invited_guest  Age  Origin  FamilyMember_1  FamilyMember_2  FamilyMember_3   names
0            Max   19      US            Paul            Rene                    Paul
0            Max   19      US            Paul            Rene                    Rene
1           Luca   21      UK            Anna           Ruben          Olivia    Anna
1           Luca   21      UK            Anna           Ruben          Olivia   Ruben
1           Luca   21      UK            Anna           Ruben          Olivia  Olivia
2           John   32     GER           Peter          Calvin                   Peter
2           John   32     GER           Peter          Calvin                  Calvin
3          Brian   45     ITA           Lewis          George                   Lewis
3          Brian   45     ITA           Lewis          George                  George
4            Ian   34     FRA          Jeremy           Silke          Selina  Jeremy
4            Ian   34     FRA          Jeremy           Silke          Selina   Silke
4            Ian   34     FRA          Jeremy           Silke          Selina  Selina
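Notice that explode repeats the original index label for every element of a list, which is why the one-liner ends with reset_index(drop=True). A minimal sketch on a made-up two-row frame:

```python
import pandas as pd

guests = pd.DataFrame({
    'Invited_guest': ['Max', 'Luca'],
    'names': [['Paul', 'Rene'], ['Anna']],
})

exploded = guests.explode('names')
renumbered = exploded.reset_index(drop=True)

print(exploded.index.tolist())     # [0, 0, 1]
print(renumbered.index.tolist())   # [0, 1, 2]
```

In pandas 1.1 and later, explode also accepts ignore_index=True, which should give the same renumbering in one step.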
And of course, drop just the FamilyMember_* columns:
>>> family_member_columns = df.filter(like='FamilyMember_').columns
>>> family_member_columns
Index(['FamilyMember_1', 'FamilyMember_2', 'FamilyMember_3'], dtype='object')
>>> df.assign(names=g['value'].agg(list)).explode('names').drop(family_member_columns, axis=1)
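First, we get all the columns that start with FamilyMember_.
Then we can use pandas.melt to get the expected result.
To get a clean output, we drop the variable column that melt creates and then drop the missing values, since some Invited_guest have no FamilyMember_3. The sample data stores a missing name as an empty string rather than NaN, so the empty strings are replaced with NaN first; otherwise dropna would leave those rows in place. Finally, we sort the values by Invited_guest and reset the index to get a clean, ordered final DataFrame:
>>> keys = [c for c in df if c.startswith('FamilyMember_')]
>>> pd.melt(df, id_vars=['Invited_guest', 'Age', 'Origin'], value_vars=keys, value_name='key').drop('variable', axis=1).replace('', float('nan')).dropna().sort_values('Invited_guest').reset_index(drop=True)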
Invited_guest Age Origin key
0 Brian 45 ITA Lewis
1 Brian 45 ITA George
2 Ian 34 FRA Jeremy
3 Ian 34 FRA Silke
4 Ian 34 FRA Selina
5 John 32 GER Peter
6 John 32 GER Calvin
7 Luca 21 UK Anna
8 Luca 21 UK Ruben
9 Luca 21 UK Olivia
10 Max 19 US Paul
11 Max 19 US Rene
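The melt-based approach can also be sketched as a multi-line chain. Two choices here are assumptions rather than part of the answer: missing names are filtered by comparing against the empty string (which is how the sample data marks them), and kind='stable' is passed to sort_values so that family members keep their column order within each guest.

```python
import pandas as pd

d = {
    'Invited_guest': ["Max", "Luca", "John", "Brian", "Ian"],
    'Age': [19, 21, 32, 45, 34],
    'Origin': ['US', 'UK', 'GER', 'ITA', 'FRA'],
    'FamilyMember_1': ["Paul", "Anna", "Peter", "Lewis", "Jeremy"],
    'FamilyMember_2': ['Rene', 'Ruben', 'Calvin', 'George', 'Silke'],
    'FamilyMember_3': ['', 'Olivia', '', '', 'Selina'],
}
df = pd.DataFrame(data=d)

keys = [c for c in df.columns if c.startswith('FamilyMember_')]

result = (
    pd.melt(df, id_vars=['Invited_guest', 'Age', 'Origin'],
            value_vars=keys, value_name='key')
    .drop(columns='variable')                      # column names are no longer needed
    .pipe(lambda x: x[x['key'] != ''])             # drop guests' missing family members
    .sort_values('Invited_guest', kind='stable')   # keep melt order within each guest
    .reset_index(drop=True)
)
print(result)
```

Unlike the explode version, this result is ordered alphabetically by guest rather than by the original row order.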