我有下面的数据框架,并创建了一列,根据字符串中的特定文本进行分类。
然而,当我通过re.IGNORECASE
标志时,它仍然区分大小写?
数据帧
test_data = {
"first_name": ['Bruce', 'Clark', 'Bruce', 'James', 'Nanny', 'Dot'],
"last_name": ['Lee', 'Kent', 'Banner', 'Bond', 'Mc Phee', 'Cotton'],
"title": ['mr', 'mr', 'mr', 'mr', 'mrs', 'mrs'],
"text": ["He is a Kung Fu master", "Wears capes and tight Pants", "Cocktails shaken not stirred", "angry Green man", "suspect scottish accent", "East end legend"],
"age": [32, 33, 28, 30, 42, 80]
}
df = pd.DataFrame(test_data)
代码
category_dict = {
"Kung Fu":"Martial Art",
"capes":"Clothing",
"cocktails": "Drink",
"green": "Colour",
"scottish": "Scotland",
"East": "Direction"
}
df['category'] = (
df['text'].str.extract(
fr"b({'|'.join(category_dict.keys())})b",
flags=re.IGNORECASE)[0].map(category_dict))
预期输出
first_name last_name title text age category
0 Bruce Lee Mr He is a Kung Fu master 32 Martial Art
1 Clark Kent Mr Wears capes and tight Pants 33 Clothing
2 Bruce Banner Mr Cocktails shaken not stirred 28 Drink
3 James Bond Mr angry Green man 30 Colour
4 Nanny Mc Phee Mrs suspect scottish accent 42 Scotland
5 Dot Cotton Mrs East end legend 80 Direction
我已经搜索了文档,但没有找到任何指针,所以任何帮助都将不胜感激!
这里有一种方法可以实现
您面临的问题是,虽然提取忽略了大小写,但提取到dictionary的字符串映射仍然区分大小写。
#create a dictionary with lower case keys
cd= {k.lower(): v for k,v in category_dict.items()}
# alternately, you can convert the category_dict keys to lower case
# I duplicated the dictionary, in case you need to keep the original keys
# convert the extracted word to lowercase and then map with the lowercase dict
df['category'] = (
df['text'].str.extract(
fr"b({'|'.join((category_dict.keys()))})b",
flags=re.IGNORECASE)[0].str.lower().map(cd))
df
first_name last_name title text age category
0 Bruce Lee mr He is a Kung Fu master 32 Martial Art
1 Clark Kent mr Wears capes and tight Pants 33 Clothing
2 Bruce Banner mr Cocktails shaken not stirred 28 Drink
3 James Bond mr angry Green man 30 Colour
4 Nanny Mc Phee mrs suspect scottish accent 42 Scotland
5 Dot Cotton mrs East end legend 80 Direction