Validate the prices in 'prod_price'. Remove any rows that are clearly wrong



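I'm trying to work with a DataFrame. It's a large dataset and I have to remove the inconsistent rows, but when I try to check for inconsistencies, the data is so large that I always end up with the wrong answer.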

import pandas as pd
import numpy as np
from checker.binder import binder; binder.bind(globals())
from intro_data_analytics.check_scrubbing import *
df = pd.read_csv('data/inu_neko_orderline.csv')
df
trans_id    prod_upc    cust_id trans_timestamp trans_year  trans_month trans_day   trans_hour  trans_quantity  cust_age    cust_state  prod_price  prod_title  prod_category   prod_animal_type    prod_size   total_sales
0   10300097    719638485153    1001019 2021-01-01 07:35:21.439873  2021    1   1   1   1   20  NY  72.99   Cat Cave    bedding cat NaN 0
1   10300093    73201504044 1001015 2021-01-01 09:33:37.499660  2021    1   1   1   1   34  NY  18.95   Purrfect Puree  treat   cat NaN 0
2   10300093    719638485153    1001015 2021-01-01 09:33:37.499660  2021    1   1   1   1   34  NY  72.99   Cat Cave    bedding cat NaN 0
3   10300093    441530839394    1001015 2021-01-01 09:33:37.499660  2021    1   1   1   2   34  NY  28.45   Ball and String toy cat NaN 0
4   10300093    733426809698    1001015 2021-01-01 09:33:37.499660  2021    1   1   1   1   34  NY  18.95   Yum Fish-Dish   food    cat NaN 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
38619   10327860    287663658863    1022098 2021-06-30 15:37:12.821020  2021    6   30  30  1   25  New York    9.95    All Veggie Yummies  treat   dog NaN 0
38620   10327960    140160459467    1022157 2021-06-30 15:45:09.872732  2021    6   30  30  2   31  Pennsylvania    48.95   Snoozer Essentails  bedding dog NaN 0
38621   10328009    425361189561    1022189 2021-06-30 15:57:44.295104  2021    6   30  30  2   53  New Jersey  15.99   Snack-em Fish   treat   cat NaN 0
38622   10328089    733426809698    1022236 2021-06-30 15:59:29.801593  2021    6   30  30  1   23  Tennessee   18.95   Yum Fish-Dish   food    cat NaN 0
38623   10328109    717036112695    1011924 2021-06-30 17:30:52.205912  2021    6   30  30  1   24  Pennsylvania    60.99   Reddy Beddy bedding dog medium  0
38624 rows × 17 columns

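One row in the table is a test row whose price is far too large (six digits), while the largest real price is about $72. You need to remove this test row, and then the data will be clean.

I found this row by downloading the Coursera file and inspecting it in Google Sheets.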
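The same row can probably be found without leaving pandas. Below is a minimal sketch, assuming `prod_price` is already numeric. The tiny DataFrame, the `123456.00` price, and the `PRICE_CEILING` threshold are made up for illustration and are not taken from the course data.

```python
import pandas as pd

# Hypothetical stand-in for the orderline data; the six-digit price is invented.
df = pd.DataFrame({
    "prod_title": ["Cat Cave", "Purrfect Puree", "Test Product", "Yum Fish-Dish"],
    "prod_price": [72.99, 18.95, 123456.00, 18.95],
})

# Look at the largest prices first; an outlier stands out immediately.
print(df["prod_price"].sort_values(ascending=False).head())

# Assumed cut-off: anything far above the ~$72 maximum is treated as a test row.
PRICE_CEILING = 1000

# Rows that look wrong, kept aside for inspection before deleting them.
suspicious = df[df["prod_price"] > PRICE_CEILING]
print(suspicious)

# Keep only plausible prices. Note that this comparison also drops NaN prices.
df_clean = df[df["prod_price"] <= PRICE_CEILING].reset_index(drop=True)
```

On the real file, `df["prod_price"].describe()` is another quick way to see whether the maximum is out of line with the rest.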

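I'd like to add to the answer Jalal just gave to this question.

His answer is correct, but from what I tried just today, if the total number of columns is wrong, the checker still reports an error. Even after I filtered out the values that are not of float type, it was still marked wrong. Only when I removed all the rows with NaN values did it give me a pass.

Strange, I know.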
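The two steps described above (discarding non-float values, then dropping NaN rows) can be sketched roughly as follows. This is only an illustration on made-up values, not the checker's actual expectation. It also assumes that the NaN removal should be limited to `prod_price`, since `prod_size` is NaN in most of the rows shown in the question and dropping on every column would remove nearly everything.

```python
import pandas as pd

# Hypothetical sample; the string and missing prices are invented for illustration.
df = pd.DataFrame({
    "prod_title": ["Cat Cave", "Purrfect Puree", "Ball and String", "Yum Fish-Dish"],
    "prod_price": ["72.99", "not a price", None, "18.95"],
})

# Convert to numbers; anything that cannot be parsed becomes NaN instead of raising.
df["prod_price"] = pd.to_numeric(df["prod_price"], errors="coerce")

# Count how many prices are missing or unparseable before dropping them.
n_missing = int(df["prod_price"].isna().sum())
print(f"rows with no usable price: {n_missing}")

# Drop only the rows whose price is NaN; other columns may still contain NaN.
df_clean = df.dropna(subset=["prod_price"]).reset_index(drop=True)
```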
