我只需要每个航空公司的前5个原因。我设法得到了所有航空公司的交叉表,但它没有排序,它显示了所有的原因。我怎样才能缩小搜索范围?
pd.crosstab(df.airline, df.negativereason).apply(lambda x: x, axis=1)
>negativereason Bad Flight Can't Tell Cancelled Flight Customer Service Issue Damaged Luggage Flight Attendant Complaints Flight Booking Problems Late Flight Lost Luggage longlines
airline
>American 87 198 246 768 12 87 130 249 149 34
>Delta 64 186 51 199 11 60 44 269 57 14
>Southwest 90 159 162 391 14 38 61 152 90 29
>US Airways 104 246 189 811 11 123 122 453 154 50
>United 216 379 181 681 22 168 144 525 269 48
想要的结果
>American
>Customer Service Issue 768
>Late Flight 249
>Cancelled Flight 246
>Can't Tell 198
>Lost Luggage 149
这是数据集
>tweet_id airline_sentiment airline_sentiment_confidence negativereason negativereason_confidence airline airline_sentiment_gold name negativereason_gold retweet_count text tweet_coord tweet_created tweet_location user_timezone
>0 570306133677760513 neutral 1.0000 NaN NaN Virgin America NaN cairdin NaN 0 @VirginAmerica What @dhepburn said. NaN 2015-02-24 11:35:52 -0800 NaN Eastern Time (US & Canada)
>1 570301130888122368 positive 0.3486 NaN 0.0000 Virgin America NaN jnardino NaN 0 @VirginAmerica plus you've added commercials t... NaN 2015-02-24 11:15:59 -0800 NaN Pacific Time (US & Canada)
>2 570301083672813571 neutral 0.6837 NaN NaN Virgin America NaN yvonnalynn NaN 0 @VirginAmerica I didn't today... Must mean I n... NaN 2015-02-24 11:15:48 -0800 Lets Play Central Time (US & Canada)
>3 570301031407624196 negative 1.0000 Bad Flight 0.7033 Virgin America NaN jnardino NaN 0 @VirginAmerica it's really aggressive to blast... NaN 2015-02-24 11:15:36 -0800 NaN Pacific Time (US & Canada)
>4 570300817074462722 negative 1.0000 Can't Tell 1.0000 Virgin America NaN jnardino NaN 0 @VirginAmerica and it's a really big bad thing... NaN 2015-02-24 11:14:45 -0800 NaN Pacific Time (US & Canada)
这不是最好的解决方案,但它确实起到了作用。
top_n = 5
gb = df.groupby(['airline', 'negativereason']).size().reset_index(name='freq')
df_tops = gb.groupby('airline').apply(lambda x: x.nlargest(top_n, ['freq'])).reset_index(drop=True)
需要两个步骤。首先是计算每家航空公司每个负面原因的频率,其次是根据频率取top_n个原因。
我设法得到了每个航班的每个负面原因的计数。但是我仍然不能得到每个航空公司的前5个结果,从高到低排序。
count = df.groupby(['airline','negativereason']).size()
print(count)
>airline negativereason
>American Bad Flight 87
> Can't Tell 198
> Cancelled Flight 246
> Customer Service Issue 768
> Damaged Luggage 12
> Flight Attendant Complaints 87
> Flight Booking Problems 130
> Late Flight 249
> Lost Luggage 149
> longlines 34
>Delta Bad Flight 64
> Can't Tell 186
> Cancelled Flight 51
> Customer Service Issue 199
> Damaged Luggage 11
> Flight Attendant Complaints 60
> Flight Booking Problems 44
> Late Flight 269
> Lost Luggage 57
> longlines 14
一种方法:
数据集
,Bad Flight,Cant Tell, Cancelled Flight,Customer Service Issue,Damaged Luggage,Flight Attendant Complaints,Flight Booking Problems,Late Flight,Lost Luggage,Longlines Airline
American,87,198,246,768,12,87,130,249,149,34
Delta,64,186,51,199,11,60,44,269,57,14
Southwest,90,159,162,391,14,38,61,152,90,29
US Airways,104,246,189,811,11,123,122,453,154,50
United,216,379,181,681,22,168,144,525,269,48
import pandas as pd
air = pd.read_csv("airlines.csv", index_col = 0)
print(air)
print(" ")
american5 = air.loc["American"].sort_values(ascending = False).get(range(5))
print(american5)
Bad Flight Cant Tell Cancelled Flight Customer Service Issue Damaged Luggage Flight Attendant Complaints Flight Booking Problems Late Flight Lost Luggage Longlines Airline
American 87 198 246 768 12 87 130 249 149 34
Delta 64 186 51 199 11 60 44 269 57 14
Southwest 90 159 162 391 14 38 61 152 90 29
US Airways 104 246 189 811 11 123 122 453 154 50
United 216 379 181 681 22 168 144 525 269 48
Customer Service Issue 768
Late Flight 249
Canceled Flight 246
Cant Tell 198
Lost Luggage 149
Name: American, dtype: int64