Python Pandas:从另一个数据帧的字符串注释中删除数据帧类别中的max



我有一个包含数千行的数据集,其中"value"列中有一些异常值。

df_test = pd.DataFrame({
'product': ['Egg', 'Egg', 'Egg', 'Small Egg','Small Egg','Small Egg','Small Egg', 'Wheat','Wheat','Wheat','Wheat','Wheat','Rice','Rice','Rice','Garlic','Garlic','Garlic','Garlic','Garlic','Tomato','Tomato','Tomato', 'Ananas'], 
'value': ['13','5','3','28','5','4','5','28','28','28','1','1.5','7','4','4.3','140','143','149','320','5','400','10','15', '8']                  
})

我从另一个数据集中的评论中知道哪些数据是不正确的,这一个基本上是产品列表(唯一(,其中有一个关于要删除的最大值的评论:

df_test_comment = pd.DataFrame({
'product': ['Egg', 'Small Egg', 'Wheat', 'Rice', 'Garlic','Tomato', 'Ananas'], 
'What to remove': ['1st max','1st and 2nd max','1st, 2nd, and 3rd max', '1st max', '1st, 2nd, 3rd and 4th max','1st and 2nd max', 'NaN']                  
})

因为我只有有限数量的不同评论(‘st max’,‘1st and 2nd max’,’1st,2nd,and 3rd max’,"1st,2nd,3rd and 4th max">(,所以我想在df_test中使用for循环删除产品的最大值,如果df_test_comment的评论是‘1st max’;当"第一和第二最大值"等时的最大值和第二个最大值

示例的理想输出是:

df_result = pd.DataFrame({
'product': ['Egg','Egg','Small Egg','Small Egg','Wheat','Wheat','Rice','Rice','Garlic','Tomato', 'Ananas'], 
'Value': ['5','3','4','5','1','1.5','4','4.3','5','10','8']                  
})

知道如何处理这种清洁吗?

我们需要多个步骤,第一步找到数字,然后从explode开始,第二步从df_test开始,用cumcount创建附加秩键:注意,这假设数据帧已经按值排序

#1st part
df_test['value']=pd.to_numeric(df_test['value'])
df_test=df_test.sort_values('value',ascending=False)
df_test_comment['number']=df_test_comment['What to remove'].str.findall('d+')
df_test_comment=df_test_comment.explode('number')
#2nd part
m1=df_test['product']+(df_test.groupby('product').cumcount()+1).astype(str)
m2=df_test_comment['product']+df_test_comment['number']
dftest=df_test[~m1.isin(m2)].sort_index()

dftest
product  value
1         Egg    5.0
2         Egg    3.0
4   Small Egg    5.0
5   Small Egg    4.0
10      Wheat    1.0
11      Wheat    1.5
13       Rice    4.0
14       Rice    4.3
19     Garlic    5.0
21     Tomato   10.0
23     Ananas    8.0

最新更新