How to use a regular expression to match only runs of 4 to 5 consecutive digits in a pandas DataFrame



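The code below is set up with hard-coded DataFrames that can be copied and pasted into an Anaconda3 Jupyter notebook and viewed as tables. Copy and paste the code through the line that starts with "df2=" and run it in a Jupyter notebook; if you then type df1.head(14) and run it, you will see the output. Then type df2.head(14) to see the second DataFrame. After that you can run the rest of the code until you have the merged DataFrame.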

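I need help adjusting this code so that it only identifies 4- to 5-digit dept_nbrs. In other words, how can I write a regex that matches no number unless it is exactly 4 or 5 consecutive digits long? I don't want to identify any of the city-related department numbers, which are all one, two, or three digits long (but I can't just remove those from the original DataFrame). Any help is much appreciated. The desired result looks like this: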

    Department                           TrueDeparment  num_col  Dept_Nbr  Dept_Desc_HR
0   Merch - 1854                         Empty          1854     1854      Community Relations
1   1925 - WH                            empty          1925     1925      Human Resources
2   Montreal 10                          empty
3   CMI-General Liability | 05-9362      empty          9362     9362      Legal
4   Market 466                           empty
5   7763                                 empty          7763     7763      Merchandising
6   Realty Supply Chain 44-9635          empty          9635     9635      Contractor
7   Merchandising (18-9087)              empty          9087     9087      Security
8   07-7882 | Supply Chain               empty          7882     7882      Produce
9   6101-09-01 00:00:00                  empty          6101     6101      Recruiting
10  Supply Chain Engineering 9826        empty          9826     9826      R&D
11  Dept. 93 - Fresh Meat 01-9245        empty          9245     9245      Real Estate
12  Health & Wellness WBU Reg 54         empty
13  US09027                              empty          9027     9027      Retail Services

Code:

import pandas as pd
import re
df1 = pd.DataFrame({'Department' : ['Merch - 1854', '1925 - WH','Montreal 10','CMI-General Liability | 05-9362', 'Market 466','7763','Realty Supply Chain  44-9635','Merchandising (18-9087)','07-7882 | Supply Chain','6101-09-01 00:00:00','Supply Chain Engineering 9826','Dept. 93 - Fresh Meat 01-9245','Health & Wellness WBU Reg 54','US09027'],'TrueDeparment' : ['Empty','empty','empty','empty','empty','empty','empty','empty','empty','empty','empty','empty','empty','empty']})   
df2 = pd.DataFrame({'Dept_Nbr' : [1,5,7,9,10,18,44,54,93,466,1854,1925,6101,7763,7882,9027,9087,9245,9362,9635,9826], 'Dept_Desc_HR' : ['springfield','new orleans','san diego','new york','cleveland','orlando','san francisco','st louis','kansas city','detroit','Community Relations','Human Resources','Recruiting','Merchandising','Produce','Retail Services','Security','RealEstate','Legal','Contractor','R&D']})
line = 'Merch - 1854 '
match = re.search(r'[0-9]+', line)
if match is None:
    print(0)
else:
    print(int(match[0]))

def extract_number(field):
    match = re.search(r'[0-9]+', field)
    if match is None:
        return 0
    else:
        return int(match[0])

df1['num_col'] = df1[['Department']].apply(lambda row: extract_number(row['Department']), axis=1)
df1.merge(df2, left_on=['num_col'], right_on=['Dept_Nbr'])

As @Jan mentioned, you can use the regex \d{4,5}, which matches at least 4 and at most 5 consecutive digit characters.

line = 'Merch - 1854 '
match = re.search(r'\d{4,5}', line)
if match is None:
    print(0)
else:
    print(int(match[0]))
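To see how this pattern treats the short city numbers, here is a minimal sketch that runs it over a few values copied from the question's Department column. The helper name first_dept_number is made up for illustration, and returning None instead of 0 for "no match" is an assumption of this sketch, not something the question requires.

```python
import re

# A few sample values copied from the question's Department column.
departments = [
    'Merch - 1854',
    'Montreal 10',
    'Market 466',
    'CMI-General Liability | 05-9362',
    'US09027',
]

pattern = re.compile(r'\d{4,5}')

def first_dept_number(text):
    """Hypothetical helper: first run of 4-5 digits as an int, or None."""
    match = pattern.search(text)
    return int(match.group(0)) if match else None

results = {text: first_dept_number(text) for text in departments}
for text, number in results.items():
    print(f'{text!r} -> {number}')
```

The one-, two-, and three-digit runs ("10", "466", "05") are skipped because they are shorter than four digits. In "US09027" the match is "09027", which int() turns into 9027.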

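pandas comes with built-in string operations, and the one we want here is pd.Series.str.extract. We pass expand=False so that, with a single capture group, the result is a Series rather than a one-column DataFrame. Note that str.extract only ever returns the first match in each row. Finally, note the capture group () around the regex: this is what extract writes to the new Series. Rows with no match simply get NaN.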
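A small sketch of what the expand argument changes in practice, using three values from the question. The variable names as_series and as_frame are only for illustration.

```python
import pandas as pd

s = pd.Series(['Merch - 1854', 'Montreal 10', 'Dept. 93 - Fresh Meat 01-9245'])

# One capture group with expand=False gives a Series of strings (NaN where nothing matched).
as_series = s.str.extract(r'(\d{4,5})', expand=False)

# The same pattern with expand=True gives a DataFrame with one column per capture group.
as_frame = s.str.extract(r'(\d{4,5})', expand=True)

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_series.tolist())
```

The extracted values are still strings at this point, which is why a dtype conversion is needed before merging against a numeric column.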

Edit: we change the dtype of both columns to float (integers are upcast to float when a column contains NaN values) so that the merge works.

df1['num_col'] = df1.Department.str.extract(r'(\d{4,5})', expand=False).astype(float)
df2.Dept_Nbr = df2.Dept_Nbr.astype(float)
df1.merge(df2, left_on=['num_col'], right_on=['Dept_Nbr'])
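One thing to watch: merge defaults to an inner join, which drops rows of df1 that have no extracted number. The desired result in the question keeps those rows with empty values, so a left join may be closer to what is wanted. A minimal sketch on a cut-down version of the data, with how='left' as the only change (this is a suggestion, not part of the answer above):

```python
import pandas as pd

# Cut-down versions of the question's frames, for illustration only.
df1 = pd.DataFrame({'Department': ['Merch - 1854', 'Montreal 10', '7763']})
df2 = pd.DataFrame({
    'Dept_Nbr': [10, 1854, 7763],
    'Dept_Desc_HR': ['cleveland', 'Community Relations', 'Merchandising'],
})

# Extract the 4-5 digit number; rows without one become NaN, hence the float dtype.
df1['num_col'] = df1['Department'].str.extract(r'(\d{4,5})', expand=False).astype(float)
df2['Dept_Nbr'] = df2['Dept_Nbr'].astype(float)

# how='left' keeps every row of df1, leaving NaN where nothing matched.
merged = df1.merge(df2, how='left', left_on='num_col', right_on='Dept_Nbr')
print(merged)
```

Here "Montreal 10" stays in the output with NaN in the Dept_Nbr and Dept_Desc_HR columns instead of being matched to the city entry 10.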

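If you only want to match numbers that are exactly 4 or 5 digits long, you can update the regex to use the {} quantifier, which specifies how many times the preceding element must repeat.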

match = re.search(r'[0-9]{4,5}', line)

Also, since you want to match any digit, you can use the \d shorthand, which matches any digit:

match = re.search(r'\d{4,5}', line)

4-5 digits with no other digits around them:

(?<!\d)\d{4,5}(?!\d)

Lookaround assertions are needed.
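The difference only shows up on digit runs longer than five. A quick sketch comparing the two patterns follows; the sample 'Ticket 1234567' is invented for illustration and does not appear in the question's data, and first_match is a hypothetical helper.

```python
import re

plain = re.compile(r'\d{4,5}')
bounded = re.compile(r'(?<!\d)\d{4,5}(?!\d)')

def first_match(pattern, text):
    """Hypothetical helper: the first matched substring, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None

samples = ['Merch - 1854', 'Ticket 1234567', 'US09027', 'Market 466']
comparison = {
    text: (first_match(plain, text), first_match(bounded, text))
    for text in samples
}
for text, (without, with_lookarounds) in comparison.items():
    print(f'{text!r}: plain={without}, bounded={with_lookarounds}')
```

Without the lookarounds, the 7-digit run yields a partial match ("12345"). With them, it is rejected entirely because every candidate window has a digit immediately before or after it.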
