I have a list of reference strings that was converted from a dataframe.
List of reference strings:
brand_list = ['scurfa', 'seagull', 'seagull', 'seiko']
Sample input 1 for description_list:
VINTAGE KING SEIKO 44-9990 Gold Medallion,Manual Winding with mod caseback.Serviced 2019.
Sample input 2 for description_list:
Power reserve function at 12; push-pull crown at 4
Seiko NE57 auto movement with power reserve
Multilayered dial with SuperLuminova BG-W9
Expected output:
SEIKO 44-9990 #extract together with model name
Seiko NE57 #extract together with model name
Here is my sample code, but the output is not what I want:
import nltk
import numpy as np
import pandas
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

def clean(doc):
    # Lowercase, tokenize, and drop English stop words
    word_tokens = word_tokenize(doc.lower())
    return [w for w in word_tokens if w not in stop_words]

# soup_content and model_list are defined elsewhere in my script
description_list = clean(soup_content.find('blockquote', {"class": "postcontent restore"}).text)

if pandas.Series(np.array(description_list)).isin(np.array(brand_list)).any():
    brand_result = [i for i in description_list if i in brand_list]
    print(brand_result[0])
    if pandas.Series(np.array(description_list)).isin(np.array(model_list)).any():
        model_result = [i for i in description_list if i in model_list]
        print(model_result[0])
    else:
        print('Unknown')
else:
    # Neither brand nor model found
    print('Unknown')
    print('Unknown')
I would go with a regular expression.
brand_list = ['scurfa', 'seagull', 'seagull', 'seiko']
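regular_expression = rf"({'|'.join(brand_list)}) ([^\s]+)"
A few words about this regular expression:
- We use the string prefix rf"", which makes the string both raw (so that backslashes such as \s reach the re module unchanged) and formattable (so that variables can be included with braces {})
- '|'.join(brand_list) produces something like (scurfa|seagull), which matches any of the desired brands in brand_list
- Adding ([^\s]+) captures the word immediately after the brand (assumed to be the model name)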
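To make the points above concrete, here is a minimal sketch that prints the pattern the f-string builds and shows what each capture group holds. The sample sentence is taken from the question; the variable names `pattern` and `match` are only for illustration.

```python
import re

brand_list = ['scurfa', 'seagull', 'seagull', 'seiko']

# The f-string expands the join() call; the raw prefix keeps \s intact.
pattern = rf"({'|'.join(brand_list)}) ([^\s]+)"
print(pattern)

# Group 1 is the brand, group 2 is the word that follows it.
match = re.search(pattern, "Seiko NE57 auto movement with power reserve", re.IGNORECASE)
print(match.group(1))
print(match.group(2))
```

The duplicated 'seagull' entry is harmless here: it only adds a redundant alternative to the pattern.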
Finally:
import re
description = """
VINTAGE KING SEIKO 44-9990 Gold Medallion,Manual Winding with mod caseback.Serviced 2019.
Power reserve function at 12; push-pull crown at 4
Seiko NE57 auto movement with power reserve
Multilayered dial with SuperLuminova BG-W9
Testing for a ScURFA 42342
"""
print([" ".join(t) for t in re.findall(regular_expression, description, re.IGNORECASE)])
This gives:
['SEIKO 44-9990', 'Seiko NE57', 'ScURFA 42342']
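If the brand list comes from real data, a slightly more defensive variant may be worth considering. The sketch below is only one possible approach, not part of the original answer: the function name `extract_brand_models` is hypothetical, and it assumes you want to deduplicate the brands, escape any regex metacharacters in them with `re.escape`, and require the brand to start at a word boundary.

```python
import re

def extract_brand_models(description, brands):
    """Return 'brand model' strings found in description (hypothetical helper)."""
    # Deduplicate while preserving order, and escape special characters.
    unique_brands = list(dict.fromkeys(brands))
    alternation = "|".join(re.escape(b) for b in unique_brands)
    # \b avoids matching a brand that is only the tail of a longer word;
    # \s+ tolerates more than one whitespace character (including a newline)
    # before the model.
    pattern = re.compile(rf"\b({alternation})\s+(\S+)", re.IGNORECASE)
    return [f"{brand} {model}" for brand, model in pattern.findall(description)]

description = """
VINTAGE KING SEIKO 44-9990 Gold Medallion,Manual Winding with mod caseback.Serviced 2019.
Power reserve function at 12; push-pull crown at 4
Seiko NE57 auto movement with power reserve
Multilayered dial with SuperLuminova BG-W9
Testing for a ScURFA 42342
"""
results = extract_brand_models(description, ['scurfa', 'seagull', 'seagull', 'seiko'])
print(results or 'Unknown')
```

An empty list means no brand was found, which maps onto the 'Unknown' case in the question. Note that the captured model is simply the next run of non-whitespace characters, so trailing punctuation such as a comma would be included.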