re.IGNORECASE flag does not work with .str.extract



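I have the DataFrame below and am creating a column that categorises each row based on specific text found in a string.

However, when I pass the re.IGNORECASE flag, the result still appears to be case sensitive?

DataFrame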

import re
import pandas as pd

test_data = {
    "first_name": ['Bruce', 'Clark', 'Bruce', 'James', 'Nanny', 'Dot'],
    "last_name": ['Lee', 'Kent', 'Banner', 'Bond', 'Mc Phee', 'Cotton'],
    "title": ['mr', 'mr', 'mr', 'mr', 'mrs', 'mrs'],
    "text": ["He is a Kung Fu master", "Wears capes and tight Pants", "Cocktails shaken not stirred", "angry Green man", "suspect scottish accent", "East end legend"],
    "age": [32, 33, 28, 30, 42, 80]
}
df = pd.DataFrame(test_data)

Code

category_dict = {
    "Kung Fu": "Martial Art",
    "capes": "Clothing",
    "cocktails": "Drink",
    "green": "Colour",
    "scottish": "Scotland",
    "East": "Direction"
}
df['category'] = (
    df['text'].str.extract(
        fr"\b({'|'.join(category_dict.keys())})\b",
        flags=re.IGNORECASE)[0].map(category_dict))

Expected output

  first_name last_name title                          text  age     category
0      Bruce       Lee    Mr        He is a Kung Fu master   32  Martial Art
1      Clark      Kent    Mr   Wears capes and tight Pants   33     Clothing
2      Bruce    Banner    Mr  Cocktails shaken not stirred   28        Drink
3      James      Bond    Mr               angry Green man   30       Colour
4      Nanny   Mc Phee   Mrs       suspect scottish accent   42     Scotland
5        Dot    Cotton   Mrs               East end legend   80    Direction

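I have searched the documentation but found no pointers, so any help would be much appreciated!

Here is one way to do it.

The problem you are facing is that, although the extraction ignores case, the mapping of the extracted string to the dictionary is still case sensitive. The extracted text keeps the casing it has in the source string (for example 'Cocktails'), so it does not match a dictionary key written in a different case (for example 'cocktails').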
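To see this for yourself, here is a minimal sketch that uses only two rows and two keys from your data (the variable names `text`, `extracted` and `mapped` are just for illustration). It shows that the extraction matches regardless of case, while the dictionary lookup that follows does not.

```python
import re
import pandas as pd

# Two sample rows and two keys taken from the question's data
text = pd.Series(["Cocktails shaken not stirred", "suspect scottish accent"])
category_dict = {"cocktails": "Drink", "scottish": "Scotland"}

pattern = fr"\b({'|'.join(category_dict.keys())})\b"
extracted = text.str.extract(pattern, flags=re.IGNORECASE)[0]

# Both rows match, but the captured text keeps the casing of the source string
print(extracted.tolist())

# Series.map with a dict is an exact-match lookup, so 'Cocktails' finds no key
mapped = extracted.map(category_dict)
print(mapped.tolist())
```

The first row comes back as 'Cocktails', which is not a key in the dictionary, so the mapped value is missing; the second row is already lower case and maps as expected.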

# create a dictionary with lower-case keys
# (a copy, in case you need to keep the original keys;
# alternatively, convert the keys of category_dict itself to lower case)
cd = {k.lower(): v for k, v in category_dict.items()}

# convert the extracted word to lower case, then map with the lower-case dict
df['category'] = (
    df['text'].str.extract(
        fr"\b({'|'.join(category_dict.keys())})\b",
        flags=re.IGNORECASE)[0].str.lower().map(cd))
df
    first_name  last_name   title                       text   age  category
0   Bruce       Lee         mr        He is a Kung Fu master    32  Martial Art
1   Clark       Kent        mr   Wears capes and tight Pants    33  Clothing
2   Bruce       Banner      mr  Cocktails shaken not stirred    28  Drink
3   James       Bond        mr               angry Green man    30  Colour
4   Nanny       Mc Phee     mrs      suspect scottish accent    42  Scotland
5   Dot         Cotton      mrs              East end legend    80  Direction
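If you need this in more than one place, the same idea could be wrapped in a small function. This is only a sketch: `categorize` is a hypothetical helper name, not something pandas provides. It also passes each key through `re.escape`, on the assumption that some keys might one day contain regex metacharacters (none of yours do).

```python
import re
import pandas as pd


def categorize(text: pd.Series, mapping: dict) -> pd.Series:
    """Hypothetical helper: look up the first keyword found, ignoring case."""
    # Lower-case copy of the mapping, so the original keys are left untouched
    lowered = {k.lower(): v for k, v in mapping.items()}
    # re.escape guards against keys that contain regex metacharacters
    pattern = r"\b(" + "|".join(re.escape(k) for k in lowered) + r")\b"
    found = text.str.extract(pattern, flags=re.IGNORECASE)[0]
    # Normalise the captured text before the exact-match dictionary lookup
    return found.str.lower().map(lowered)


text = pd.Series(["He is a Kung Fu master", "angry Green man", "nothing here"])
result = categorize(text, {"Kung Fu": "Martial Art", "green": "Colour"})
print(result)
```

Rows without any keyword stay missing (NaN), the same as in the version above.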
