How do I check whether a target string matches any entry in a list of reference strings?



I have a list of reference strings that was converted from a dataframe.

Reference string list

brand_list = ['scurfa', 'seagull', 'seagull', 'seiko']

Sample input 1 for description_list

VINTAGE KING SEIKO 44-9990 Gold Medallion,Manual Winding with mod caseback.Serviced 2019.

Sample input 2 for description_list

Power reserve function at 12; push-pull crown at 4
Seiko NE57 auto movement with power reserve
Multilayered dial with SuperLuminova BG-W9

Desired output

SEIKO 44-9990 #extract together with model name
Seiko NE57 #extract together with model name

Here is my sample code, but the output is not what I want

import nltk
import numpy as np
import pandas
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

def clean(doc):
    # Lowercase, tokenize, and drop English stop words.
    word_tokens = word_tokenize(doc.lower())
    return [w for w in word_tokens if w not in stop_words]

# soup_content (presumably a BeautifulSoup object) and model_list are defined elsewhere.
description_list = clean(soup_content.find('blockquote', {"class": "postcontent restore"}).text)

if pandas.Series(np.array(description_list)).isin(np.array(brand_list)).any():
    brand_result = [i for i in description_list if i in brand_list]
    print(brand_result[0])
    if pandas.Series(np.array(description_list)).isin(np.array(model_list)).any():
        model_result = [i for i in description_list if i in model_list]
        print(model_result[0])
    else:
        print('Unknown')
else:
    # No brand found, so both brand and model are unknown.
    print('Unknown')
    print('Unknown')

I would go with a regular expression.

brand_list = ['scurfa', 'seagull', 'seagull', 'seiko']
regular_expression = rf"({'|'.join(brand_list)}) ([^\s]+)"

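A few words about this regular expression:

  • We use the rf"" string prefix, which makes the string both raw (what the re module wants, so that backslashes reach it unchanged) and formattable (variables inside braces {} are interpolated).
  • '|'.join(brand_list), wrapped in parentheses, yields something like (scurfa|seagull), which matches any of the brands in brand_list.
  • Adding ([^\s]+) captures the run of non-whitespace characters immediately after the brand (assumed to be the model name).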
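To make the construction concrete, here is a small sketch that prints the pattern the f-string builds. The use of re.escape, the de-duplication with dict.fromkeys, and the leading \b word boundary are my own additions for robustness, not part of the pattern above.

```python
import re

brand_list = ['scurfa', 'seagull', 'seagull', 'seiko']

# Assumption: dict.fromkeys drops the duplicate 'seagull' while keeping order,
# and re.escape guards against brands that contain regex metacharacters.
alternation = '|'.join(re.escape(b) for b in dict.fromkeys(brand_list))

# Assumption: \b keeps the brand from matching in the middle of a longer word.
pattern = rf"\b({alternation}) ([^\s]+)"
print(pattern)  # \b(scurfa|seagull|seiko) ([^\s]+)

match = re.search(pattern, "VINTAGE KING SEIKO 44-9990 Gold Medallion", re.IGNORECASE)
print(match.groups())  # ('SEIKO', '44-9990')
```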

Finally:

import re
description = """
VINTAGE KING SEIKO 44-9990 Gold Medallion,Manual Winding with mod caseback.Serviced 2019.
Power reserve function at 12; push-pull crown at 4
Seiko NE57 auto movement with power reserve
Multilayered dial with SuperLuminova BG-W9
Testing for a ScURFA 42342
"""
print([" ".join(t) for t in re.findall(regular_expression, description, re.IGNORECASE)])

This gives:

['SEIKO 44-9990', 'Seiko NE57', 'ScURFA 42342']
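If you need one result per description, with the 'Unknown' fallback from the question, the same idea can be wrapped in a function. This is only a sketch: extract_brand_model is a hypothetical helper name, and it returns just the first match in each description.

```python
import re

def extract_brand_model(description, brands):
    """Return 'BRAND MODEL' for the first brand found, or 'Unknown'."""
    # re.escape is a precaution in case a brand contains regex metacharacters.
    alternation = '|'.join(re.escape(b) for b in brands)
    match = re.search(rf"({alternation}) ([^\s]+)", description, re.IGNORECASE)
    return " ".join(match.groups()) if match else "Unknown"

brand_list = ['scurfa', 'seagull', 'seagull', 'seiko']
sample_1 = "VINTAGE KING SEIKO 44-9990 Gold Medallion,Manual Winding with mod caseback.Serviced 2019."
sample_2 = """Power reserve function at 12; push-pull crown at 4
Seiko NE57 auto movement with power reserve
Multilayered dial with SuperLuminova BG-W9"""

print(extract_brand_model(sample_1, brand_list))         # SEIKO 44-9990
print(extract_brand_model(sample_2, brand_list))         # Seiko NE57
print(extract_brand_model("No brand here", brand_list))  # Unknown
```

Because matching runs on the raw text, there is no need to tokenize or remove stop words first, and the original casing of the brand and model is preserved.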
