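I am trying to extract a single 12-character word from a string, if one exists.
The first 5 characters must come from a given list, and the last 3 characters must be digits.
Input data (Data.xlsx):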
Description Number
CHQ -AQBCN222Q546 from India Federation Pvt Ltd
CHQN#DJBNK220Q329 from Indiana Basics Software Ltd -BC003
CASH- NJRQC225J987^ from US Fertilizers LLP
CHQ - from India Bulls Pvt Ltd
AQBCN222Q989 from India Bulls Pvt Ltd
CHQ -AQCCN222Q546 from India Federation Pvt Ltd
CASH - AQBCN222Q546289 from India Federation Pvt Ltd
list_Character = ['AQBCN','PUCNQ','DJBNK','ADJBC','NJRQC']
Expected output:
Description Number
CHQ -AQBCN222Q546 from India Federation Pvt Ltd AQBCN222Q546
CHQN#DJBNK220Q329 from Indiana Basics Software Ltd -BC003 DJBNK220Q329
CASH- NJRQC225J987^ from US Fertilizers LLP NJRQC225J987
CHQ - from India Bulls Pvt Ltd
AQBCN222Q989 from India Bulls Pvt Ltd AQBCN222Q989
CHQ -AQCCN222Q546 from India Federation Pvt Ltd
CASH - AQBCN222Q546289 from India Federation Pvt Ltd
Code:
import pandas as pd
import re
df = pd.read_excel(r'D:/Users/Data.xlsx')
list_Character = ['AQBCN','PUCNQ','DJBNK','ADJBC','NJRQC']
regex = r'[#-]((?:' + r'|'.join(list_Character) + r')\w{5})\b'
df["Number"] = df["Description"].str.extract(regex)
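I could not find a solution. I tried to adapt the approach from "Check if a string contains any 10-character word and, if so, extract it", but it did not work.
You can tweak the regex slightly: drop the leading-character match and match 7 additional characters after the prefix: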
list_Character = ['AQBCN','PUCNQ','DJBNK','ADJBC','NJRQC']
regex = r'((?:' + r'|'.join(list_Character) + r')\w{7})\b'
df["Number"] = df["Description"].str.extract(regex)
Output:
Description Number
0 CHQ -AQBCN222Q546 from India Federation Pvt Ltd AQBCN222Q546
1 CHQN#DJBNK220Q329 from Indiana Basics Software... DJBNK220Q329
2 CASH- NJRQC225J987^ from US Fertilizers LLP NJRQC225J987
3 CHQ - from India Bulls Pvt Ltd NaN
4 AQBCN222Q989 from India Bulls Pvt Ltd AQBCN222Q989
5 CHQ -AQCCN222Q546 from India Federation Pvt Ltd NaN
6 CASH - AQBCN222Q546289 from India Federation P... NaN
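Note that \w{7} accepts any word characters after the prefix, so it does not by itself enforce the "last 3 characters are digits" requirement from the question. If that check matters, a stricter variant might look like the sketch below. It uses only the standard re module so it can be run without the Excel file; the names strict_regex and extract_number are made up for illustration, and the leading \b is an extra assumption that the code should not be glued to the end of a longer word.

```python
import re

list_Character = ['AQBCN', 'PUCNQ', 'DJBNK', 'ADJBC', 'NJRQC']

# Prefix from the list (5 chars) + 4 word characters + 3 digits = 12 characters.
# re.escape is a precaution in case a prefix ever contains a regex metacharacter.
prefixes = '|'.join(re.escape(p) for p in list_Character)
strict_regex = re.compile(r'\b((?:' + prefixes + r')\w{4}\d{3})\b')


def extract_number(description):
    """Return the 12-character code, or None when no valid code is found."""
    match = strict_regex.search(description)
    return match.group(1) if match else None


samples = [
    'CHQ -AQBCN222Q546 from India Federation Pvt Ltd',
    'CHQN#DJBNK220Q329 from Indiana Basics Software Ltd -BC003',
    'CASH- NJRQC225J987^ from US Fertilizers LLP',
    'CHQ - from India Bulls Pvt Ltd',
    'AQBCN222Q989 from India Bulls Pvt Ltd',
    'CHQ -AQCCN222Q546 from India Federation Pvt Ltd',
    'CASH - AQBCN222Q546289 from India Federation Pvt Ltd',
]

for text in samples:
    print(repr(text), '->', extract_number(text))
```

The same pattern string (strict_regex.pattern) should also work with df["Description"].str.extract(...) in place of the regex above, since it likewise contains exactly one capture group.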