With the forum's help, I made a script that grabs all the links to the threads on this page: https://www.inforge.net/xi/forums/liste-proxy.1118/. Those threads contain proxy lists. The script is:
import urllib.request, re
from bs4 import BeautifulSoup

url = "https://www.inforge.net/xi/forums/liste-proxy.1118/"
soup = BeautifulSoup(urllib.request.urlopen(url), "lxml")
base = "https://www.inforge.net/xi/"

for tag in soup.find_all("a", {"class":"PreviewTooltip"}):
    links = tag.get("href")
    final = [base + links]
    final2 = urllib.request.urlopen(final)
    for line in final2:
        ip = re.findall("(?:\d{1,3})\.(?:\d{1,3})\.(?:\d{1,3})\.(?:\d{1,3}):(?:\d{1,5})", line)
        ip = ip[3:-1]
        for addr in ip:
            print(addr)
The output is:
Traceback (most recent call last):
  File "proxygen5.0.py", line 13, in <module>
    sourcecode = urllib.request.urlopen(final)
  File "/usr/lib/python3.5/urllib/request.py", line 162, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.5/urllib/request.py", line 456, in open
    req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
I know the problem is in the final2 = urllib.request.urlopen(final) part, but I don't know how to fix it. What can I do to print the IPs?
This code should do what you want. It is commented so that you can follow every step:
import urllib.request, re
from bs4 import BeautifulSoup

url = "https://www.inforge.net/xi/forums/liste-proxy.1118/"
soup = BeautifulSoup(urllib.request.urlopen(url), "lxml")
base = "https://www.inforge.net/xi/"

# Iterate over all the <a> tags
for tag in soup.find_all("a", {"class":"PreviewTooltip"}):
    # Get the link from the tag
    link = tag.get("href")
    # Compose the new link
    final = base + link
    print('Request to {}'.format(final))  # To know what we are doing
    # Download the 'final' link content
    result = urllib.request.urlopen(final)
    # For every line in the downloaded content
    for line in result:
        # Find one or more IP(s); the line is converted to a string because `bytes` objects are given
        ip = re.findall(r"(?:\d{1,3})\.(?:\d{1,3})\.(?:\d{1,3})\.(?:\d{1,3}):(?:\d{1,5})", str(line))
        # If one or more IP(s) are found
        if ip:
            # Print them on separate lines
            print('\n'.join(ip))