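In Python, I compare DataFrame columns containing strings to decide whether a row should pass or fail. How do I stop rows from passing when they should fail?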



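I have more than 20 test cases that check a CSV for data anomalies caused by data entry. This test case (#15) compares the salutation and addressee fields against marital status.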

# Test case  15
# Compares MrtlStat to  PrimAddText and PrimSalText
df = data[data['MrtlStat'].str.contains("Widow|Divorced|Single")]
df = df[df['PrimAddText'].str.contains("AND|&", na=False)]
data_15 = df[df['PrimSalText'].str.contains("AND|&", na=False)]
# Adds row to list of failed data
ids = data_15.index.tolist()
# Keep track of data that failed test case 15 
for i in ids:
    data.at[i, 'Test Case Failed'] += ', 15'

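If MrtlStat contains Widow, Divorced, or Single, and either PrimAddText or PrimSalText contains AND or &, the test should fail. As written, the test only fails when both PrimSalText and PrimAddText contain AND or &.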

Table showing data that passes but should fail:

MrtlStat, Mrs. Elfrank,

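You should not filter the data sequentially, because each successive filter narrows the previous result and so acts as an AND. Instead, combine the conditions into a single boolean expression (using & and |). A good way to do this is numpy.where: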

import pandas as pd
import numpy as np
# construct data
data = pd.DataFrame({
'PrimAddText': ['Mrs. Judith Elfrank', 'Mr. & Mrs.Karl Magnusen', 'Mr. & Mrs. Elfrank'],
'PrimSalText': ['Mr. & Mrs. Elfrank & Michael', 'Mr. Magnusen', 'Mr. & Mrs. Elfrank & Michael'],
'MrtlStat': ['Widowed', 'Widowed', 'Widowed']
})
# Case 15: a row fails when the marital status is Widow/Divorced/Single AND
# either the addressee or the salutation contains "AND" or "&"
data['Status_case15'] = np.where(
    data['MrtlStat'].str.contains("Widow|Divorced|Single", na=False)
    & (data['PrimAddText'].str.contains("AND|&", na=False)
       | data['PrimSalText'].str.contains("AND|&", na=False)),
    False, True)
# show the failing rows
print(data.loc[~data['Status_case15']])
# count the passing rows
print(data['Status_case15'].sum())
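
To connect this back to the bookkeeping in the question, the same combined mask can update the 'Test Case Failed' column directly, without a loop. The following is a minimal sketch, not the asker's actual pipeline: it assumes 'Test Case Failed' already exists and starts as an empty string, and it changes the third sample row to 'Married' (a made-up value) so that one row passes.

```python
import pandas as pd

# Variant of the sample data above. The 'Test Case Failed' column, its
# empty-string starting value, and the 'Married' row are assumptions.
data = pd.DataFrame({
    'PrimAddText': ['Mrs. Judith Elfrank', 'Mr. & Mrs.Karl Magnusen', 'Mr. & Mrs. Elfrank'],
    'PrimSalText': ['Mr. & Mrs. Elfrank & Michael', 'Mr. Magnusen', 'Mr. & Mrs. Elfrank & Michael'],
    'MrtlStat': ['Widowed', 'Widowed', 'Married'],
    'Test Case Failed': ['', '', ''],
})

# One boolean Series per condition, aligned on the original index
is_single = data['MrtlStat'].str.contains("Widow|Divorced|Single", na=False)
joint_addressee = data['PrimAddText'].str.contains("AND|&", na=False)
joint_salutation = data['PrimSalText'].str.contains("AND|&", na=False)

# Fail when the status is single-like AND at least one text field is joint
failed_15 = is_single & (joint_addressee | joint_salutation)

# Append the test-case number only on the failing rows
data.loc[failed_15, 'Test Case Failed'] += ', 15'

print(data[['MrtlStat', 'Test Case Failed']])
```

Because the mask is built once against the full frame, each failing row is tagged exactly once, even when both text fields match.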

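Your second and third filters are effectively joined by an AND. You can separate them, capture the result of each on its own, and then combine the two results at the end: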

# Test case  15
# Compares MrtlStat to  PrimAddText and PrimSalText
df = data[data['MrtlStat'].str.contains("Widow|Divorced|Single")]
data_15_A = df[df['PrimAddText'].str.contains("AND|&", na=False)]
data_15_B = df[df['PrimSalText'].str.contains("AND|&", na=False)]
# Collect the rows that failed either check; the set union avoids
# listing a row twice when both fields match
ids = sorted(set(data_15_A.index) | set(data_15_B.index))
# Keep track of data that failed test case 15
for i in ids:
    data.at[i, 'Test Case Failed'] += ', 15'
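
To see why the original chained filters let rows slip through, it can help to compare them with explicit boolean masks. The sketch below uses the three sample rows from the earlier answer; the variable names are illustrative only.

```python
import pandas as pd

# Sample rows reused from the earlier answer
data = pd.DataFrame({
    'PrimAddText': ['Mrs. Judith Elfrank', 'Mr. & Mrs.Karl Magnusen', 'Mr. & Mrs. Elfrank'],
    'PrimSalText': ['Mr. & Mrs. Elfrank & Michael', 'Mr. Magnusen', 'Mr. & Mrs. Elfrank & Michael'],
    'MrtlStat': ['Widowed', 'Widowed', 'Widowed'],
})

is_single = data['MrtlStat'].str.contains("Widow|Divorced|Single", na=False)
joint_addressee = data['PrimAddText'].str.contains("AND|&", na=False)
joint_salutation = data['PrimSalText'].str.contains("AND|&", na=False)

# Chained filtering, as in the question: each step narrows the previous one
step = data[is_single]
step = step[step['PrimAddText'].str.contains("AND|&", na=False)]
chained = step[step['PrimSalText'].str.contains("AND|&", na=False)]

chained_ids = chained.index.tolist()
and_ids = data.index[is_single & joint_addressee & joint_salutation].tolist()
or_ids = data.index[is_single & (joint_addressee | joint_salutation)].tolist()

print("chained:", chained_ids)
print("AND mask:", and_ids)
print("OR mask:", or_ids)
```

The chained result matches the AND mask, while the OR mask also catches the rows where only one of the two text fields contains AND or &.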
