How do I correctly extract URLs from HTML code?

I have saved a website's HTML code in a .txt file on my computer. I want to extract all the URLs from this text file using the following code:

def get_net_target(page):
    start_link = page.find("href=")
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url

my_file = open("test12.txt")
page = my_file.read()
print(get_net_target(page))

However, the script prints only the first URL and none of the other links. Why is that?

You need to implement a loop that goes through all the URLs.

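print(get_net_target(page)) prints only the first URL found in page, so you need to call the function again and again, each time replacing page with the substring page[end_quote+1:], until no more URLs are found.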

To help you get started, next_index stores the position just past the end of the last URL found, and the loop then retrieves the following URLs:

next_index = 0  # position in page from which the next URL search starts

def get_net_target(page):
    global next_index
    start_link = page.find("href=")
    if start_link == -1:  # no more URLs
        return ""
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    next_index = end_quote + 1  # resume just after the closing quote
    url = page[start_quote + 1:end_quote]
    return url

with open("test12.txt") as my_file:
    page = my_file.read()
while True:
    url = get_net_target(page)
    if url == "":  # no more URLs
        break
    print(url)
    page = page[next_index:]  # continue with the rest of the page
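
As a variation on the loop above, here is a minimal sketch that avoids the global variable and the repeated slicing by passing a start offset to str.find. The name iter_hrefs and the sample string are made up for illustration, and like the code above it only handles double-quoted values:

```python
def iter_hrefs(page):
    """Yield each double-quoted href value found in page, in order."""
    pos = 0  # position from which the next search starts
    while True:
        start_link = page.find("href=", pos)
        if start_link == -1:  # no more href attributes
            return
        start_quote = page.find('"', start_link)
        if start_quote == -1:  # no opening quote after href=
            return
        end_quote = page.find('"', start_quote + 1)
        if end_quote == -1:  # unterminated quote
            return
        yield page[start_quote + 1:end_quote]
        pos = end_quote + 1  # resume just after the closing quote


# Hypothetical sample input, used here instead of test12.txt
sample = '<a href="a.html">A</a> <a href="http://example.com/b">B</a>'
for url in iter_hrefs(sample):
    print(url)
```

Because the function is a generator, an empty href="" no longer ends the search early the way the "" sentinel does.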

Also be careful: you only retrieve links enclosed in ", but href values can also be enclosed in ' or not quoted at all.
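
One way to cope with all three quoting styles is to let an HTML parser do the attribute handling instead of searching for quote characters. The following is a rough sketch using html.parser from the Python standard library. It is a different technique from the find-based code above, and the names HrefCollector and extract_hrefs and the sample string are invented for this example:

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every tag that has one."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed;
        # value is None for attributes written without a value
        for name, value in attrs:
            if name == "href" and value:
                self.urls.append(value)


def extract_hrefs(page):
    parser = HrefCollector()
    parser.feed(page)
    parser.close()
    return parser.urls


# Hypothetical sample with double-quoted, single-quoted and unquoted values
sample = """<a href="a.html">A</a> <a href='b.html'>B</a> <a href=c.html>C</a>"""
print(extract_hrefs(sample))
```

To run it on the saved page, you would pass the contents of test12.txt to extract_hrefs instead of sample.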
