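I am trying to validate the columns of a DataFrame against a specific regular expression. The limit for numbers is (20,3), i.e. a maximum length of 20 for int values and 23 for float values. However, pandas converts the original numbers into what look like random integers, and my regex validation fails. I have checked that my regex is correct.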
Dataframe:
FirstColumn,SecondColumn,ThirdColumn
111900987654123.123,111900987654123.123,111900987654123.123
111900987654123.12,111900987654123.12,111900987654123.12
111900987654123.1,111900987654123.1,111900987654123.1
111900987654123,111900987654123,111900987654123
111900987654123,-111900987654123,-111900987654123
-111900987654123.123,-111900987654123.123,-111900987654123.1
-111900987654123.12,-111900987654123.12,-111900987654123.12
-111900987654123.1,-111900987654123.1,-111900987654123.1
11119009876541231111,1111900987654123,1111900987654123
Code:
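import pandas as pd

NumberValidationRegexnegative = r"^-?[0-9]{1,20}(?:.[0-9]{1,3})?$"
# Raw string so the backslashes in the Windows path are not read as escape sequences
df_CPCodeDF = pd.read_csv(r"D:\FTPLocalUser\NCCLCOLL\COLLATERALUPLOAD\upld\SplitFiles\AACCR6675H_22102021_07_1 - Copy.csv")
pd.set_option('display.float_format', '{:.3f}'.format)
# Keep only the rows whose first column is not NaN
rslt_df2 = df_CPCodeDF[df_CPCodeDF.iloc[:, 0].notna()]
# Index of the rows whose first column, converted to str, does not match the regex
rslt_df1 = rslt_df2[~rslt_df2.iloc[:, 0].apply(str).str.contains(NumberValidationRegexnegative, regex=True)].index
print("rslt_df1", rslt_df1)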
Output:
rslt_df1 Int64Index([8], dtype='int64')
Expected output:
rslt_df1 Int64Index([], dtype='int64')
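Without an explicit dtype, pandas parses the column as float64, so apply(str) turns the 20-digit value at index 8 into scientific notation, which the regex rejects. Use dtype=str as an argument to pd.read_csv to keep the values as text: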
import pandas as pd

NumberValidationRegexnegative = r"^-?[0-9]{1,20}(?:.[0-9]{1,3})?$"
# dtype=str keeps every cell as the exact text from the CSV (no numeric parsing)
df_CPCodeDF = pd.read_csv("data.csv", dtype=str)
rslt_df2 = df_CPCodeDF[df_CPCodeDF.iloc[:, 0].notna()]
rslt_df1 = rslt_df2[~rslt_df2.iloc[:, 0]
                    .str.contains(NumberValidationRegexnegative, regex=True)].index
Output:
>>> print("rslt_df1", rslt_df1)
rslt_df1 Int64Index([], dtype='int64')
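
The effect can be reproduced without pandas. The following is a minimal sketch, assuming the value was parsed as a 64-bit float before being converted back to a string. It uses only the standard library; STRICT_PATTERN is a hypothetical variant that is not part of the original code. It escapes the dot, because the unescaped "." in the original pattern matches any character, not just a decimal point.

```python
import re

# Same pattern as in the question; the "." is unescaped, so it matches any character.
PATTERN = re.compile(r"^-?[0-9]{1,20}(?:.[0-9]{1,3})?$")
# Hypothetical stricter variant (an assumption, not from the original post):
# the escaped dot only accepts a literal "." before the fractional digits.
STRICT_PATTERN = re.compile(r"^-?[0-9]{1,20}(?:\.[0-9]{1,3})?$")

raw = "11119009876541231111"     # the value at index 8, as text in the CSV
as_float_text = str(float(raw))  # the text after a round trip through a float

print(as_float_text)                          # scientific notation (contains "e+19")
print(bool(PATTERN.search(raw)))              # True: 20 digits pass as text
print(bool(PATTERN.search(as_float_text)))    # False: scientific notation is rejected
print(bool(PATTERN.search("123x45")))         # True: "." matched the "x"
print(bool(STRICT_PATTERN.search("123x45")))  # False
```

Reading the column as text avoids the round trip entirely, which is why dtype=str makes index 8 pass the check.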