How to parse column values in Pandas and extract strings as int using a regular expression



I have a Pandas df with two columns:

name       Count_Relationship
0   allicin    DOWNREGULATE: 1
1   allicin    DOWNREGULATE: 2
2   allicin    UPREGULATE: 1 | DOWNREGULATE: 1
3   aspirin    UPREGULATE: 5 | DOWNREGULATE: 1
4   albuterol  DOWNREGULATE: 1
5   albuterol  UPREGULATE: 3

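I want to keep only the rows where, if I group by "name" and add up the counts in the "Count_Relationship" column, the amount of downregulation is greater than the amount of upregulation. In this case, allicin has downregulation 1+2+1=4 and upregulation 1, so num_downregulation > num_upregulation, which is not the case for the others (aspirin, albuterol). I want to return this filtered df: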

name      Count_Relationship
0   allicin   DOWNREGULATE: 1
1   allicin   DOWNREGULATE: 2
2   allicin   UPREGULATE: 1 | DOWNREGULATE: 1

The Count_Relationship column is a string, so I have to parse the numeric part of the string and convert it to int.

I tried:

import pandas as pd
data = {'name': ['allicin', 'allicin', 'allicin', 'aspirin', 'albuterol', 'albuterol'],
        'Count_Relationship': ['DOWNREGULATE: 1', 'DOWNREGULATE: 2', 'UPREGULATE: 1 | DOWNREGULATE: 1', 'UPREGULATE: 5 | DOWNREGULATE: 1', 'DOWNREGULATE: 1' , 'UPREGULATE: 3']
}
df = pd.DataFrame(data)
substances = df["name"].tolist()
substances = list(set(substances)) # to get the unique names
result_substances = []

for substance in (substances):
    try:
        numberOfdownregulate = df[(df["name"] == substance) & (
            (df["Count_Relationship"].str.match(pat = '("DOWNREGULATE:"([0-9]))')).values[0].astype(int)
    except:
        pass
    try:
        numberOfupregulate = df[(df["name"] == substance) & (
            (df["Count_Relationship"].str.match(pat = '("UPREGULATE:"([0-9]))')).values[0].astype(int)
    except:
        pass

    result = numberOfdownregulate - numberOfupregulate

    if result > 0:
        result_substances.append(substance)

df_filtered = df[df["name"].isin(result_substances)]

However, I get a syntax error on the numberOfdownregulate line, where my regular expression is. How can I fix the algorithm? Thanks a lot.

You can extract the information, compare down against up, and build a mask to select the data:

drugs = (df.join(df['Count_Relationship'].str.extractall(r'(?P<down>(?<=DOWNREGULATE: )\d+)|(?P<up>(?<=UPREGULATE: )\d+)')
                 .groupby(level=0).first().fillna(0).astype(int)  # one row of counts per original row
                )
           .groupby('name').agg({'down': 'sum', 'up': 'sum'})  # total counts per name
           .query('down >= up')  # also keeps ties; use 'down > up' for strictly more downregulation
           .index
         )
df[df['name'].isin(drugs)]

Output:

name               Count_Relationship
0  allicin                  DOWNREGULATE: 1
1  allicin                  DOWNREGULATE: 2
2  allicin  UPREGULATE: 1 | DOWNREGULATE: 1
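
To see what the chained expression above does, the intermediate result of str.extractall can be inspected on its own. This is only a minimal sketch, assuming the same df as in the question; the names pattern, matches and per_row are illustrative, and pd.to_numeric is swapped in here for the fillna/astype step as a cautious way to convert the captured strings.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['allicin', 'allicin', 'allicin', 'aspirin', 'albuterol', 'albuterol'],
    'Count_Relationship': ['DOWNREGULATE: 1', 'DOWNREGULATE: 2',
                           'UPREGULATE: 1 | DOWNREGULATE: 1',
                           'UPREGULATE: 5 | DOWNREGULATE: 1',
                           'DOWNREGULATE: 1', 'UPREGULATE: 3'],
})

# Two named groups; each lookbehind anchors the digits to its own label
pattern = r'(?P<down>(?<=DOWNREGULATE: )\d+)|(?P<up>(?<=UPREGULATE: )\d+)'

# One row per regex match, indexed by (original row, match number)
matches = df['Count_Relationship'].str.extractall(pattern)

# Collapse back to one row per original row: first() picks the first
# non-null value in each column, then the strings are converted to int
per_row = (matches.groupby(level=0).first()
                  .apply(pd.to_numeric)
                  .fillna(0)
                  .astype(int))
print(per_row)
```

Joining per_row back onto df and summing per name then gives the totals that the query step compares.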

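I suggest extracting the DOWNREGULATE and UPREGULATE values into separate columns, then summing the values grouped by name and checking which total is larger.

The example below creates a boolean column called UP_gt_DOWN, which literally means UPREGULATE is greater than DOWNREGULATE: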

df['UPREGULATE'] = df['Count_Relationship'].str.extract(r"UPREGULATE: (\d+)").fillna(0).astype(int)
df['DOWNREGULATE'] = df['Count_Relationship'].str.extract(r"DOWNREGULATE: (\d+)").fillna(0).astype(int)
# Sum only the two numeric columns; summing the string column would concatenate it
summed_df = df.groupby('name')[['UPREGULATE', 'DOWNREGULATE']].sum()
summed_df['UP_gt_DOWN'] = summed_df['UPREGULATE'] > summed_df['DOWNREGULATE']
print(summed_df)
# Output
#            UPREGULATE  DOWNREGULATE  UP_gt_DOWN
# name                                           
# albuterol           3             1        True
# allicin             1             4       False
# aspirin             5             1        True
filtered_drugs = summed_df[~summed_df['UP_gt_DOWN']].index
print(df[df['name'].isin(filtered_drugs)])
# Output
#       name               Count_Relationship  UPREGULATE  DOWNREGULATE
# 0  allicin                  DOWNREGULATE: 1           0             1
# 1  allicin                  DOWNREGULATE: 2           0             2
# 2  allicin  UPREGULATE: 1 | DOWNREGULATE: 1           1             1
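
If the helper columns should not stay on the frame, or if ties should be excluded as the question's "greater than" suggests, a row mask can be built with groupby(...).transform('sum') instead. The following is a sketch under those assumptions, not the approach of either answer; the names col, counts and totals are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['allicin', 'allicin', 'allicin', 'aspirin', 'albuterol', 'albuterol'],
    'Count_Relationship': ['DOWNREGULATE: 1', 'DOWNREGULATE: 2',
                           'UPREGULATE: 1 | DOWNREGULATE: 1',
                           'UPREGULATE: 5 | DOWNREGULATE: 1',
                           'DOWNREGULATE: 1', 'UPREGULATE: 3'],
})

col = df['Count_Relationship']

# Per-row counts in a separate frame; rows without a label get 0
counts = pd.DataFrame({
    'up': pd.to_numeric(col.str.extract(r'UPREGULATE: (\d+)', expand=False)).fillna(0),
    'down': pd.to_numeric(col.str.extract(r'DOWNREGULATE: (\d+)', expand=False)).fillna(0),
}).astype(int)

# Broadcast each name's totals back onto every row of that name
totals = counts.groupby(df['name']).transform('sum')

# Strict comparison: keep rows whose name has more down than up
df_filtered = df[totals['down'] > totals['up']]
print(df_filtered)
```

Because totals has the same index as df, the comparison can be used directly as a boolean mask, and df keeps its original two columns.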
