I have the following sample dataset:
Protocol Number: xx-yzm2
Section Major Task Budget
1 Study Setup 25303.18
2 Study Setup Per-Location 110037.8
3 Site Identified by CRO 29966.25
4 Pre-study Site Visit (PSSV) 130525.92
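I want to search the entire DataFrame with contains, passing the keyword "protocol", and return the value in the cell next to the match.
In principle the layout of the sheet may change, so I cannot filter by column. Is this possible with pandas?
Input keyword: protocol
Output: xx-yzm2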
You can try the following:
import pandas as pd
import numpy as np
data = {0: ['Protocol Number:', np.nan, 'Section Major', '1', '2', '3', '4'],
        1: ['xx-yzm2', np.nan, 'Task', 'Study Setup', 'Study Setup Per-Location',
            'Site Identified by CRO', 'Pre-study Site Visit (PSSV)'],
        2: [np.nan, np.nan, 'Budget', '25303.18', '110037.8', '29966.25', '130525.92']}
df = pd.DataFrame(data)
0 1 2
0 Protocol Number: xx-yzm2 NaN
1 NaN NaN NaN
2 Section Major Task Budget
3 1 Study Setup 25303.18
4 2 Study Setup Per-Location 110037.8
5 3 Site Identified by CRO 29966.25
6 4 Pre-study Site Visit (PSSV) 130525.92
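keyword = 'protocol'
# cast every cell to str and test it for the keyword
# case-insensitive: case=False
mask = df.apply(lambda x: x.astype(str).str.contains(keyword, case=False))
# np.where returns the positions of all matching cells
# row: array([0], dtype=int64), col: array([0], dtype=int64)
row, col = np.where(mask)
# take the cell immediately to the right of the first match
result = df.iat[row[0], col[0] + 1]
print(result)
# xx-yzm2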
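If there are multiple matches, the code above returns only the value for the first one. To get all of them, use a loop. In that case you will probably want to add a few checks to handle edge cases, such as a match in the last column.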
for i in range(len(row)):
    if col[i] + 1 < len(df.columns):
        print(df.iat[row[i], col[i] + 1])
    else:
        # error handling: your keyword was found in the last column,
        # i.e. there is no `next` column
        pass
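The loop and its boundary check can also be wrapped in a small helper. The following is only a sketch under a few assumptions: the function name values_next_to is hypothetical, the keyword is treated as literal text (regex=False) rather than as a regular expression, and empty cells are excluded explicitly because astype(str) can turn NaN into the string 'nan', which would otherwise match keywords such as "nan".

```python
import numpy as np
import pandas as pd


def values_next_to(df, keyword):
    """Return the cells immediately to the right of every cell containing keyword.

    Hypothetical helper, not part of pandas. Matching is case-insensitive
    and literal (no regex).
    """
    text = df.astype(str)
    mask = text.apply(
        lambda s: s.str.contains(keyword, case=False, regex=False, na=False)
    )
    # Drop originally-missing cells so that NaN never counts as a match.
    mask = mask & df.notna()
    rows, cols = np.where(mask.to_numpy(dtype=bool))
    last_col = df.shape[1] - 1
    # Skip matches in the last column: they have no neighbour to the right.
    return [df.iat[r, c + 1] for r, c in zip(rows, cols) if c < last_col]


df = pd.DataFrame({
    0: ['Protocol Number:', np.nan, 'Section Major', '1'],
    1: ['xx-yzm2', np.nan, 'Task', 'Study Setup'],
    2: [np.nan, np.nan, 'Budget', '25303.18'],
})

print(values_next_to(df, 'protocol'))  # ['xx-yzm2']
print(values_next_to(df, 'budget'))    # [] (match is in the last column)
print(values_next_to(df, 'nan'))       # [] (empty cells are ignored)
```

Returning a list keeps the no-match case and the last-column case from raising an IndexError; the caller can check whether the list is empty.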