Data validation using pandas_schema



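I want to use the Python pandas library to read a CSV data file and create visualizations.
First, I decided to validate the data.
I want to use the pandas_schema module to validate the data in each column.
The original data file has 26 columns.
My code: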

import pandas as pd
from pandas_schema import Column, Schema
from pandas_schema.validation import LeadingWhitespaceValidation, TrailingWhitespaceValidation, CanConvertValidation, MatchesPatternValidation, InRangeValidation, InListValidation

schema = Schema ([
Column('Symboling', [InRangeValidation(-3,3)] ) ,  #integer from -3 to 3 
Column('Normalized Loss', [InRangeValidation(65,256)] )  , # integer from 65 to 256
Column('Make',[LeadingWhitespaceValidation(), TrailingWhitespaceValidation()] )  , # text 
Column('Fuel Type', [InListValidation(['diesel', 'gas'])]), # diesel, gas
Column('Aspiration'), # text 
Column('Num of Doors' , [InListValidation(['two', 'four'])]), # text (two, four)
Column('Body Style' , [InListValidation(['hardtop', 'wagon','sedan','hatchback', 'convertible'])] ), # text: hardtop, wagon, sedan, hatchback, convertible 
Column('Drive Wheels' , [InListValidation(['4wd', 'fwd' , 'rwd'])]), # text: 4wd, fwd, rwd 
Column('Engine Location' , [InListValidation(['front', 'rear'])]), # text: front, rear
Column('Wheel Base' , [InRangeValidation([86.6,120.9])] ) ,  # decimal from 86.6 to 120.9 
Column('Length' , [InRangeValidation(65,256)] )  ,  # decimal from 141.1 to 208.1
Column('Width' , [InRangeValidation(60.3,72.3)] ) ,  # decimal from 60.3 to 72.3 
Column('Height' , [InRangeValidation(47.8,59.8)] ) ,   # decimal from 47.8 to 59.8
Column('Curb Weight' , [InRangeValidation(1488,4066)] ) ,   # integer from 1488 to 4066
Column('Engine Type'),[InListValidation(['ohc', 'dohcv', 'l', 'ohc', 'ohcf', 'ohcv', 'rotor'])] , # text 
Column('Num of Cylinders' , [InListValidation(['two','four','three','five','six','eight','twelve'])]) , # text: eight, five, four, six, three, twelve, two 
Column('Engine Size' , [InRangeValidation(61,326)]) ,  # integer from 61 to 326 
Column('Fuel System' , [InListValidation(['1bbl', '2bbl', '4bbl', 'idi','mfi','mpfi','spdi','spfi'])]), #string: 1bbl, 2bbl, 4bbl, idi,mfi,mpfi,spdi,spfi 
Column('Bore' , [InRangeValidation(2.54,3.94)] ) , # decimal from 2.54 to 3.94 
Column('Stroke', [InRangeValidation(2.07,4.17)] ) , #decimal from 2.07 to 4.17 
Column('Compression Ratio' , [InRangeValidation(7,23)] ), #  integer: from 7 to 23 
Column('Horsepower' , [InRangeValidation(48,288)] ),  # integer:from 48 to 288 
Column('Peak rmp'), [InRangeValidation(4150,6600)]  , # integer: from 4150 to 6600 
Column('City mpg'), [InRangeValidation(13,49)]  , #integer: from 13 to 49 
Column('Highway mpg'), [InRangeValidation(16,54)] ,  # integer: 16 to 54 
Column('Price'), [InRangeValidation(5118,45400)]  # integer from 5118 to 45400 
])
test_file = pd.read_csv('E:_Python_Projects_DataData_VisualizationAutos_Data_SetAutos_Import_1985.csv')
errors = schema.validate(test_file)
for error in errors:
    print(error)

After running the code, I get this message:
Invalid number of columns. The schema specifies 31, but the data frame has 26

I really don't understand how this happens: the schema has 26 columns, and the data file has 26 columns. Any suggestions?
Thanks

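Note: the schema definition contains some typos. In five column definitions (Engine Type, Peak rmp, City mpg, Highway mpg, and Price) the parenthesis closes too early, right after the column name, so the validation list that follows becomes a separate element of the outer list. That produces a list of 26 + 5 = 31 elements, all of which are interpreted as columns.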
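
The effect of the misplaced parenthesis can be reproduced without pandas_schema. The following is only a sketch: `Column` here is a hypothetical stand-in class (not the real `pandas_schema.Column`), and the validations are plain strings, so that the list lengths can be inspected directly.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    """Hypothetical stand-in for pandas_schema.Column: a name plus validations."""
    name: str
    validations: list = field(default_factory=list)


# Parenthesis closed too early: Column('Price') ends before the list, so the
# validation list becomes a second, separate element of the outer list.
broken = [
    Column('Horsepower', ['in_range(48, 288)']),
    Column('Price'), ['in_range(5118, 45400)'],
]

# Parenthesis closed after the list: the list is an argument of Column.
fixed = [
    Column('Horsepower', ['in_range(48, 288)']),
    Column('Price', ['in_range(5118, 45400)']),
]

print(len(broken))  # 3 elements for 2 intended columns
print(len(fixed))   # 2 elements

# Any element that is not a Column is a stray validation list.
stray = [item for item in broken if not isinstance(item, Column)]
print(stray)
```

Running the same kind of `isinstance` check over the list passed to `Schema` would be one way to spot the five stray entries in the question's code.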

The correct definition should be:

schema = Schema([
Column('Symboling', [InRangeValidation(-3,3)]),  #integer from -3 to 3 
Column('Normalized Loss', [InRangeValidation(65,256)]), # integer from 65 to 256
Column('Make', [LeadingWhitespaceValidation(), TrailingWhitespaceValidation()] ), # text 
Column('Fuel Type', [InListValidation(['diesel', 'gas'])]), # diesel, gas
Column('Aspiration'), # text 
Column('Num of Doors', [InListValidation(['two', 'four'])]), # text (two, four)
Column('Body Style', [InListValidation(['hardtop', 'wagon','sedan','hatchback', 'convertible'])]), # text: hardtop, wagon, sedan, hatchback, convertible 
Column('Drive Wheels', [InListValidation(['4wd', 'fwd', 'rwd'])]), # text: 4wd, fwd, rwd 
Column('Engine Location', [InListValidation(['front', 'rear'])]), # text: front, rear
Column('Wheel Base', [InRangeValidation(86.6, 120.9)]),  # decimal from 86.6 to 120.9 (min and max are separate arguments, not a list)
Column('Length', [InRangeValidation(141.1, 208.1)]),  # decimal from 141.1 to 208.1
Column('Width', [InRangeValidation(60.3,72.3)]),  # decimal from 60.3 to 72.3 
Column('Height', [InRangeValidation(47.8,59.8)]),   # decimal from 47.8 to 59.8
Column('Curb Weight', [InRangeValidation(1488,4066)]),   # integer from 1488 to 4066
Column('Engine Type', [InListValidation(['dohc', 'dohcv', 'l', 'ohc', 'ohcf', 'ohcv', 'rotor'])]), # text 
Column('Num of Cylinders', [InListValidation(['two','four','three','five','six','eight','twelve'])]), # text: eight, five, four, six, three, twelve, two 
Column('Engine Size', [InRangeValidation(61,326)]),  # integer from 61 to 326 
Column('Fuel System', [InListValidation(['1bbl', '2bbl', '4bbl', 'idi','mfi','mpfi','spdi','spfi'])]), #string: 1bbl, 2bbl, 4bbl, idi,mfi,mpfi,spdi,spfi 
Column('Bore', [InRangeValidation(2.54,3.94)]), # decimal from 2.54 to 3.94 
Column('Stroke', [InRangeValidation(2.07,4.17)]), #decimal from 2.07 to 4.17 
Column('Compression Ratio', [InRangeValidation(7,23)]), #  integer: from 7 to 23 
Column('Horsepower', [InRangeValidation(48,288)]),  # integer:from 48 to 288 
Column('Peak rmp', [InRangeValidation(4150,6600)]), # integer: from 4150 to 6600 
Column('City mpg', [InRangeValidation(13,49)]), #integer: from 13 to 49 
Column('Highway mpg', [InRangeValidation(16,54)]),  # integer: 16 to 54 
Column('Price', [InRangeValidation(5118,45400)]),  # integer from 5118 to 45400 
])

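The DataFrame must have the same number of columns as the validation schema defines. An alternative is to build a new DataFrame that contains only the columns you want to check and validate that instead. (I'm not sure this is the most efficient approach, but it serves the purpose.)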
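
A minimal sketch of that alternative, under stated assumptions: the inline CSV text, the `Dealer Notes` column, and the `schema_columns` list are made up for illustration, and the final call to `schema.validate` is left as a comment because it depends on the schema defined above.

```python
import io

import pandas as pd

# Hypothetical sample data; the real file has 26 columns.
csv_text = (
    "Make,Fuel Type,Dealer Notes\n"
    "audi,gas,first\n"
    "bmw,diesel,second\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# Column names the schema describes, in schema order (assumed for this sketch).
schema_columns = ['Make', 'Fuel Type']

# Fail early with a clear message if the data lacks a column the schema expects.
missing = [name for name in schema_columns if name not in df.columns]
if missing:
    raise KeyError(f"columns missing from the data: {missing}")

# New DataFrame with exactly the schema's columns, in the schema's order.
subset = df[schema_columns]

print(len(df.columns), len(subset.columns))
# The subset can then be validated: errors = schema.validate(subset)
```

Selecting by name also puts the columns in the schema's order, which matters if the schema pairs its columns with the DataFrame's by position.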
