Select the top X rows by an exploded criterion in a Python DataFrame



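Hello!

Another question from your friendly neighbour.

In a previous post I saw the explode function used (I'm new to Python), and it solved that problem. I've since been trying to use it in a different way, but I can't seem to make it work.

I have this: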

[{'Name': 'Andes, The',
'Year': 2021,
'Score': '8',
'2nd Score': 8.8,
'% of People': '87%',
'Country': 'The Netherlands',
'Fruit': 'The Apple',
'Export Countries': 'United States,United Kingdom',
'Language': 'English,Japanese,French',
'Transit Duration': 148.0,
'Quality': 1.0,
'Taste': 0.0,
'Freshness': 0.0,
'Packaging': 0.0},
{'Name': 'Phil',
'Year': 2021,
'Score': '8.5',
'2nd Score': 8.8,
'% of People': '87%',
'Country': 'Spain',
'Fruit': 'The Banana',
'Export Countries': 'United Kingdom, Germany',
'Language': 'English,German,French,Italian',
'Transit Duration': 118.0,
'Quality': 1.0,
'Taste': 0.0,
'Freshness': 0.0,
'Packaging': 0.0},
{'Name': 'Sarah',
'Year': 2021,
'Score': '9',
'2nd Score': 8.8,
'% of People': '89%',
'Country': 'Greece',
'Fruit': 'The Plum',
'Export Countries': 'Germany,United States',
'Language': 'English,German,French,Italian',
'Transit Duration': 165.0,
'Quality': 1.0,
'Taste': 0.0,
'Freshness': 0.0,
'Packaging': 0.0},
{'Name': 'William',
'Year': 2021,
'Score': '6',
'2nd Score': 8.8,
'% of People': '65%',
'Country': 'Brazil',
'Fruit': 'Strawberries',
'Export Countries': 'Spain,Greece',
'Language': 'English,Spanish,French',
'Transit Duration': 153.0,
'Quality': 1.0,
'Taste': 0.0,
'Freshness': 0.0,
'Packaging': 0.0}]

Or, put more simply, this:

Name | Year | Score | 2nd Score | % of People | Country | Fruit | Export Countries | Language | Transit Duration | Quality | Taste | Freshness | Packaging
Andes, The | 2021 | 8 | 8.8 | 87% | The Netherlands | The Apple | United States,United Kingdom | English,Japanese,French | 148.0 | 1.0 | 0.0 | 0.0 | 0.0
Phil | 2021 | 8.5 | 8.4 | 87% | Spain | The Banana | United Kingdom, Germany | English,German,French,Italian | 165.0 | 1.0 | 0.0 | 0.0 | 0.0
Sarah | 2021 | 9 | 8.3 | 89% | Greece | The Plum | Germany,United States | English,German,French,Italian | 153.0 | 1.0 | 0.0 | 0.0 | 0.0
William | 2021 | 6 | 8.8 | 65% | Brazil | Strawberries | Spain,Greece | English,Spanish,French | 153.0 | 1.0 | 0.0 | 0.0 | 0.0

In the previous post, I was helped to split the languages apart and then group by them while applying the mean:

(df[['Score', 'Language']]
.assign(Language=lambda x: x.Language.str.split(','))
.explode('Language')
.groupby('Language')
.Score.mean()
.reset_index())

Which produced:

Language     Score
0   English  8.333333
1    French  8.333333
2    German  8.500000
3   Italian  8.500000
4  Japanese  8.000000

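I then tried to apply the same logic to the Name column, but selecting only the top X rows for each language. To be clear, I don't want every language handled in a single statement; my goal is to run it for one language at a time.

So, for English, it would pick the top X names with Score sorted from highest to lowest. The expected output looks like this: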

Language  Top X
0   English  Phil
1   English  Sarah
2   English  Andes, The
3   English  William

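I believe that if the criteria weren't joined with ',' I could use a combination of sort_values(ascending=False) and head(x). So the problem I'm running into is the need for .assign(Language=lambda x: x.Language.str.split(',')), which I think is required (but happy to be wrong!).

Can anyone help me?

Cheers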

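I think explode is necessary as the first step, after splitting the values by ','. The next step is to sort by both columns with DataFrame.sort_values, and then take the first rows per Language with GroupBy.head: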

df1 = df.assign(Language=lambda x: x.Language.str.split(',')).explode('Language')
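
To make this step concrete, here is a minimal sketch on a cut-down copy of the question's data. Only three columns and two rows are kept, and Score is assumed to be numeric here, which may not match the real frame.

```python
import pandas as pd

# Cut-down, illustrative copy of the question's data (Score assumed numeric).
df = pd.DataFrame({
    "Name": ["Andes, The", "Phil"],
    "Score": [8.0, 8.5],
    "Language": ["English,Japanese,French", "English,German,French,Italian"],
})

# str.split turns each cell into a list of languages;
# explode then gives every list item its own row, repeating Name and Score.
df1 = df.assign(Language=lambda x: x.Language.str.split(",")).explode("Language")

print(df1)
```

Note that explode repeats the original index label on every row it creates, so chain .reset_index(drop=True) if a unique index is needed later.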

Then your solution can be simplified:

df0 = df1.groupby('Language', as_index=False).Score.mean()
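
One thing that may be worth checking before averaging or sorting: in the sample records Score is written as a string ('8', '8.5'). If it is also stored that way in the DataFrame, sorting compares text rather than numbers. The sketch below shows the difference with pd.to_numeric; the value "10" is made up for illustration and does not appear in the question's data.

```python
import pandas as pd

# Illustrative scores stored as text; "10" is an invented value.
scores = pd.Series(["8", "8.5", "9", "10"])

# As strings, comparison is character by character, so "10" lands after "8".
as_text = scores.sort_values(ascending=False).tolist()

# pd.to_numeric converts to floats, so sorting (and mean) behave numerically.
as_numbers = pd.to_numeric(scores).sort_values(ascending=False).tolist()

print(as_text)
print(as_numbers)
```

Under that assumption, something like df1['Score'] = pd.to_numeric(df1['Score']) before the groupby or sort would avoid the problem.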

For the top N names per language, use:

N = 5
df2 = (df1.sort_values(['Language', 'Score'], ascending=False)
.set_index('Language')
.groupby('Language')['Name']
.head(N)
.reset_index(name=f'Top {N}'))
print(df2)
Language       Top 5
0    Spanish     William
1   Japanese  Andes, The
2    Italian       Sarah
3    Italian        Phil
4     German       Sarah
5     German        Phil
6     French       Sarah
7     French        Phil
8     French  Andes, The
9     French     William
10   English       Sarah
11   English        Phil
12   English  Andes, The
13   English     William
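
Since the question asks to run this for one language at a time, a possible variation is to filter the exploded frame first and then sort and take the head. This is only a sketch: top_names is a hypothetical helper name, the frame is a cut-down copy of the question's data, and Score is assumed to be numeric.

```python
import pandas as pd

def top_names(exploded, language, n):
    """Hypothetical helper: top-n names by Score for a single language."""
    subset = exploded[exploded["Language"] == language]
    top = subset.sort_values("Score", ascending=False).head(n)
    return (top[["Language", "Name"]]
            .rename(columns={"Name": f"Top {n}"})
            .reset_index(drop=True))

# Cut-down copy of the question's data (Score assumed numeric).
df = pd.DataFrame({
    "Name": ["Andes, The", "Phil", "Sarah", "William"],
    "Score": [8.0, 8.5, 9.0, 6.0],
    "Language": [
        "English,Japanese,French",
        "English,German,French,Italian",
        "English,German,French,Italian",
        "English,Spanish,French",
    ],
})
df1 = df.assign(Language=lambda x: x.Language.str.split(",")).explode("Language")

print(top_names(df1, "English", 3))
```

Filtering before sorting keeps each call limited to the rows for one language, which matches the one-at-a-time usage described in the question.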
