检查数组中是否有两个或两个以上的项相同,并根据排序号删除其中一个- Python



我有一个数据框架,它根据数量和交易对代码进行了排序。

1000 5007501500

使用df_companiesdataframe对相同的行进行分组,并只保留每组的最佳排名。

输入数据:

>>> df
Code  Volume  Trade  Volume Order  Trade Order  Max Ordered Number  Final Sorted Number
0     ApplA     500   1000           2.0          2.0                 2.0                  4.0
1    Amazon    1000    500           1.0          4.0                 4.0                  2.0
2  Facebook     250    750           3.0          3.0                 3.0                  3.0
3     ApplE     100   1500           4.0          1.0                 4.0                  1.0
>>> df_companies
Codes      Code Shares
0     ApplA     Apple      A
1    Amazon    Amazon  Empty
2  Facebook  Facebook  Empty
3     ApplE     Apple      E
4     AmazA    Amazon      A
5   AmazonB    Amazon      B
out = df.sort_values('Final Sorted Number') 
.merge(df_companies[['Code', 'Codes']], how='left', 
left_on='Code', right_on='Codes', suffixes=('', '2')) 
.drop_duplicates('Code2') 
.drop(columns=['Code2', 'Codes'])

输出结果:

>>> out
Code  Volume  Trade  Volume Order  Trade Order  Max Ordered Number  Final Sorted Number
0     ApplE     100   1500           4.0          1.0                 4.0                  1.0
1    Amazon    1000    500           1.0          4.0                 4.0                  2.0
2  Facebook     250    750           3.0          3.0                 3.0                  3.0

Step by Step:

# Reduce number of columns for readability
>>> df = df[['Code', 'Final Sorted Number']]
Code  Final Sorted Number
0     ApplA                  4.0
1    Amazon                  2.0
2  Facebook                  3.0
3     ApplE                  1.0

# Sort rows by 'Final Sorted Number'
>>> df = df.sort_values('Final Sorted Number')
Code  Final Sorted Number
3     ApplE                  1.0
1    Amazon                  2.0
2  Facebook                  3.0
0     ApplA                  4.0
# Merge dataframes on a common key: 'Code' for left, 'Codes' for right
>>> df = df.merge(df_companies[['Code', 'Codes']], how='left', 
left_on='Code', right_on='Codes', suffixes=('', '2'))
Code  Final Sorted Number     Code2     Codes
0     ApplE                  1.0     Apple     ApplE
1    Amazon                  2.0    Amazon    Amazon
2  Facebook                  3.0  Facebook  Facebook
3     ApplA                  4.0     Apple     ApplA

现在,我们有来自df_companies的2个新列:Code(通过后缀重命名为Code2)和Codes。让我们继续:

# Keep only the first row for each 'Code2' name
>>> df = df.drop_duplicates('Code2')
Code  Final Sorted Number     Code2     Codes
0     ApplE                  1.0     Apple     ApplE
1    Amazon                  2.0    Amazon    Amazon
2  Facebook                  3.0  Facebook  Facebook
# Remove added columns from the other dataframe
>>> df = df.drop(columns=['Code2', 'Codes'])
Code  Final Sorted Number
0     ApplE                  1.0
1    Amazon                  2.0
2  Facebook                  3.0

最新更新