Capturing links and IPs with Python 3



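With help from the forum, I wrote a script that collects the links to all threads on the page https://www.inforge.net/xi/forums/liste-proxy.1118/ (the threads contain proxy lists). The script looks like this: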

import urllib.request, re
from bs4 import BeautifulSoup
url = "https://www.inforge.net/xi/forums/liste-proxy.1118/"
soup = BeautifulSoup(urllib.request.urlopen(url), "lxml")
base = "https://www.inforge.net/xi/"
for tag in soup.find_all("a", {"class":"PreviewTooltip"}):
    links = tag.get("href")
    final = [base + links]
final2 = urllib.request.urlopen(final)
for line in final2:
    ip = re.findall("(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3}):(?:[\d]{1,5})", line)
    ip = ip[3:-1]
for addr in ip:
    print(addr)

The output is:

Traceback (most recent call last):
  File "proxygen5.0.py", line 13, in <module>
    sourcecode = urllib.request.urlopen(final)
  File "/usr/lib/python3.5/urllib/request.py", line 162, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.5/urllib/request.py", line 456, in open
    req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'

I know the problem is in the line final2 = urllib.request.urlopen(final), but I don't know how to fix it.

What can I do to print the IPs?

This code should do what you want. It is commented so you can follow every step:

import urllib.request, re
from bs4 import BeautifulSoup
url = "https://www.inforge.net/xi/forums/liste-proxy.1118/"
soup = BeautifulSoup(urllib.request.urlopen(url), "lxml")
base = "https://www.inforge.net/xi/"
# Iterate over all the <a> tags
for tag in soup.find_all("a", {"class":"PreviewTooltip"}):
    # Get the link from the tag
    link = tag.get("href")
    # Compose the new link
    final = base + link
    print('Request to {}'.format(final))    # To know what we are doing
    # Download the 'final' link content
    result = urllib.request.urlopen(final)
    # For every line in the downloaded content
    for line in result:
        # Find one or more IP(s); urlopen yields `bytes` lines, so decode each one to `str` first
        ip = re.findall(r"(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3}):(?:[\d]{1,5})", line.decode("utf-8", errors="replace"))
        # If one or more IP(s) are found
        if ip:
            # Print them on separate lines
            print('\n'.join(ip))
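
If you want to check the two key steps without hitting the site, something like the sketch below may help. It is only an illustration: the helper names build_urls and extract_proxies, the sample href and the sample lines are made up here, and UTF-8 is an assumed page encoding. The point it shows is that urlopen needs one URL string at a time (not a list, which is what caused the AttributeError), and that the ip:port pattern should be a raw string applied to decoded text.

```python
import re
from urllib.parse import urljoin

# ip:port pattern; the raw string keeps the backslashes in \d and \. intact
PROXY_RE = re.compile(r"(?:\d{1,3}\.){3}\d{1,3}:\d{1,5}")


def build_urls(base, hrefs):
    """Turn relative hrefs into absolute URLs, one plain string per link."""
    return [urljoin(base, href) for href in hrefs]


def extract_proxies(lines):
    """Return every ip:port match found in an iterable of bytes lines."""
    found = []
    for line in lines:
        # Lines read from urlopen are bytes; UTF-8 is an assumption here
        text = line.decode("utf-8", errors="replace")
        found.extend(PROXY_RE.findall(text))
    return found


# Hypothetical sample data standing in for a downloaded thread page
sample_lines = [
    b"<li>10.0.0.1:8080</li>\n",
    b"no proxy on this line\n",
    b"192.168.1.5:3128 and 172.16.0.9:80\n",
]
for addr in extract_proxies(sample_lines):
    print(addr)
```

In the real script you would loop over build_urls(base, hrefs) and pass each URL string to urllib.request.urlopen, then feed the response to extract_proxies.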
