Python regular expression to exclude strings containing a specific word



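While scraping Wikipedia, I am trying to use a regular expression to exclude disambiguation pages. I have looked around for tips on using a negative lookahead, but I can't seem to make it work. I think I am missing something fundamental about how it is used, and so far I am completely lost. Can someone point me in the right direction? (I don't want to use if "disambiguation" in y; I am trying to understand how negative lookahead works.) Thank you very much. Here is the code: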

import re

list_links = ['/wiki/Oolong_(disambiguation)', '/wiki/File:Mi_Lan_Xiang_Oolong_Tea_cropped.jpg',
              '/wiki/Taiwanese_tea', '/wiki/Tung-ting_tea',
              '/wiki/Nantou_County', '/wiki/Taiwan', '/wiki/Dongfang_Meiren',
              '/wiki/Alishan_National_Scenic_Area', '/wiki/Chiayi_County',
              '/wiki/Dayuling', '/wiki/Baozhong_tea', '/wiki/Pinglin_Township']

def findString(string):
    regex1 = r'(/wiki/)(_($)(!?disambiguation)'
    for x in list_links:
        y = re.findall(regex1, x)
        print(y)

findString(list_links)

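You can use either of the regular expressions below, depending on your needs. I also made a few changes to the function definition so that it follows PEP 8.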

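import re

def remove_disambiguation_link(list_of_links):
    # "!?" matches an optional literal "!", so this is not a lookahead
    # (that would be written "(?!...)"). The pattern matches any link that
    # contains "disambiguation", and the filter drops every matching link.
    regex = "(.*)((!?disambiguation))"
    # regex = "(/wiki/)(.*)((!?disambiguation))"
    # return [links for links in list_of_links if not re.search(regex, links)]
    return list(filter(lambda link: not re.search(regex, link), list_of_links))

list_links = remove_disambiguation_link(list_links)
print(list_links)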
[
"/wiki/File:Mi_Lan_Xiang_Oolong_Tea_cropped.jpg",
"/wiki/Taiwanese_tea",
"/wiki/Tung-ting_tea",
"/wiki/Nantou_County",
"/wiki/Taiwan",
"/wiki/Dongfang_Meiren",
"/wiki/Alishan_National_Scenic_Area",
"/wiki/Chiayi_County",
"/wiki/Dayuling",
"/wiki/Baozhong_tea",
"/wiki/Pinglin_Township",
]
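Since the question is specifically about negative lookahead, here is a minimal sketch of a pattern that uses one. The names NOT_DISAMBIGUATION and keep_regular_links are made up for this illustration, and the shortened links list is only a sample taken from the question.

```python
import re

# Sample links taken from the question (shortened for the example).
links = [
    "/wiki/Oolong_(disambiguation)",
    "/wiki/Taiwanese_tea",
    "/wiki/Taiwan",
]

# (?!.*disambiguation) is a negative lookahead: anchored at the start of the
# string, it asserts that "disambiguation" does not occur anywhere ahead,
# without consuming any characters. Only if that assertion holds does
# /wiki/.+ go on to match the link itself.
# The constant and function names here are hypothetical.
NOT_DISAMBIGUATION = re.compile(r"^(?!.*disambiguation)/wiki/.+$")

def keep_regular_links(candidates):
    """Return the links that the negative-lookahead pattern accepts."""
    return [link for link in candidates if NOT_DISAMBIGUATION.match(link)]

print(keep_regular_links(links))
```

Note the order of the characters: a negative lookahead is written (?!...), whereas (!?...) is an ordinary group that starts with an optional exclamation mark.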

For your case, the simplest solution is not to use a regex at all. Just do something like this:

import re

list_links = ['/wiki/Oolong_(disambiguation)', '/wiki/File:Mi_Lan_Xiang_Oolong_Tea_cropped.jpg',
              '/wiki/Taiwanese_tea', '/wiki/Tung-ting_tea',
              '/wiki/Nantou_County', '/wiki/Taiwan', '/wiki/Dongfang_Meiren',
              '/wiki/Alishan_National_Scenic_Area', '/wiki/Chiayi_County',
              '/wiki/Dayuling', '/wiki/Baozhong_tea', '/wiki/Pinglin_Township']

def findString(string):
    # Stand-in pattern that captures the page title; r'(/wiki/)(_($)' has
    # unbalanced parentheses and does not compile.
    regex1 = r'(/wiki/)(.+)'
    for x in string:
        if 'disambiguation' in x:
            continue  # skip
        y = re.findall(regex1, x)
        print(y)

findString(list_links)

You don't need a regex. You can iterate over list_links and check whether the string you are looking for, "disambiguation", occurs in each item of list_links.

list_links = ['/wiki/Oolong_(disambiguation)', '/wiki/File:Mi_Lan_Xiang_Oolong_Tea_cropped.jpg',
'/wiki/Taiwanese_tea', '/wiki/Tung-ting_tea',
'/wiki/Nantou_County', '/wiki/Taiwan', '/wiki/Dongfang_Meiren',
'/wiki/Alishan_National_Scenic_Area', '/wiki/Chiayi_County',
'/wiki/Dayuling', '/wiki/Baozhong_tea', '/wiki/Pinglin_Township']
to_find = 'disambiguation'
def findString(list_links):
    # iterate over a copy so that removing items does not skip elements
    for link in list_links[:]:
        if to_find in link:
            # remove match from list
            list_links.remove(link)
    # print new list without 'disambiguation' items
    print(list_links)

findString(list_links)
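If you would rather leave the input list untouched, a variant of the same idea builds a new list instead. This is only a sketch; without_substring and its needle parameter are names invented for the example.

```python
def without_substring(links, needle="disambiguation"):
    """Return a new list with every link that contains `needle` left out."""
    # The helper name and the `needle` parameter are hypothetical.
    return [link for link in links if needle not in link]

sample = ["/wiki/Oolong_(disambiguation)", "/wiki/Taiwanese_tea", "/wiki/Taiwan"]
print(without_substring(sample))
```

Because the comprehension never modifies the list it iterates over, no item can be skipped, and the caller still has the original list afterwards.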
