Check whether one URL is relative to another (i.e. they are on the same host)



I have a base HTTP URL and a list of other HTTP URLs. I'm writing a simple crawler/link checker as a learning exercise (so no need to suggest pre-written tools) that checks whether the base URL has any broken links and recursively crawls all the other "internal" pages (i.e. pages linked from the base URL and within the same site) with the same intent. At the end I have to output the list of links with their status (external/internal) and issue a warning for every link that is actually internal but presented as an absolute URL.

So far I've managed to check all the links and crawl them using the requests and BeautifulSoup libraries, but I couldn't find a ready-made way to check whether two absolute URLs point to the same website (other than splitting the URLs on slashes, which seems ugly to me). Is there a well-known library for this?
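For reference, the slash-splitting approach mentioned above might look something like the hypothetical sketch below (the helper name same_site_naive is made up for illustration); it only works for well-formed absolute URLs and breaks on relative hrefs, missing schemes, ports and so on, which is why a proper parser felt preferable.

def same_site_naive(url_a, url_b):
    # Naive comparison: take the part between '//' and the next '/'
    # e.g. 'http://example.com/page' -> 'example.com'
    host_a = url_a.split('//')[1].split('/')[0]
    host_b = url_b.split('//')[1].split('/')[0]
    return host_a == host_b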

In the end I went with urlparse (thanks to @padraic-cunningham for pointing me to it). At the beginning of my code I parse the "base URL" (i.e. the one I start crawling from):

base_parts = urlparse.urlparse(base_url)

Then, for each link I find (e.g. inside a for a in soup.find_all('a'): loop):

link_parts = urlparse.urlparse(a.get('href'))
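For reference, urlparse splits a URL into named components, so the comparison below only needs the pieces it cares about. For example (the URL is just an illustration):

parts = urlparse.urlparse('https://example.com/docs/index.html?x=1')
# parts.scheme == 'https'
# parts.netloc == 'example.com'
# parts.path   == '/docs/index.html'
# parts.query  == 'x=1'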

At this point I have to compare the URL schemes (I consider links to the same site but with a different URL scheme, http vs. https, to be different; later I may make this comparison optional):

internal = (base_parts.scheme == link_parts.scheme
            and base_parts.netloc == link_parts.netloc)

At this point internal will be True if the link points to the same server (with the same scheme) as my base URL. You can check out the final result here.
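Putting the snippets together, a minimal sketch of the whole check could look like the following (the function name is_internal is mine, and resolving relative hrefs with urljoin before comparing is an extra assumption on top of the code above):

import urlparse

def is_internal(base_url, href):
    # True if href points to the same scheme and host as base_url
    base_parts = urlparse.urlparse(base_url)
    # Resolve relative hrefs against the base URL first (assumption:
    # relative links count as internal)
    link_parts = urlparse.urlparse(urlparse.urljoin(base_url, href))
    return (base_parts.scheme == link_parts.scheme
            and base_parts.netloc == link_parts.netloc)

# is_internal('http://example.com/', 'http://example.com/about')   -> True
# is_internal('http://example.com/', '/about')                     -> True (relative href)
# is_internal('http://example.com/', 'https://example.com/about')  -> False (different scheme)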

I wrote a crawler myself; I hope it helps you. Basically, what I do is join a path like /2/2/3/index.php to the site, which turns it into http://www.website.com/2/2/3/index.php. Then I insert all the URLs into a list which is checked to see whether I have visited that page before; if I have, the crawler won't go there again. Also, if a page links to some unrelated site, such as a link to a YouTube video, it won't crawl YouTube or any other site that isn't "site-related".
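The joining step described above is what urlparse.urljoin does (the domain below is only an example):

import urlparse

absolute = urlparse.urljoin('http://www.website.com/', '/2/2/3/index.php')
# absolute == 'http://www.website.com/2/2/3/index.php'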

For your problem, I suggest you put all the visited sites into a list and check that list with a for loop. If a URL is already in the list, print it.
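A minimal sketch of that bookkeeping, assuming found_urls holds the links collected from a page and visited is a plain list of URL strings:

visited = []
for url in found_urls:
    if url in visited:
        print("Already visited: " + url)
    else:
        visited.append(url)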

I'm not sure this is exactly what you want, but at least I tried. I don't use BeautifulSoup and it still works, so consider leaving that module aside.

My script (or rather, part of it; I also have exception checks, so don't be alarmed):

__author__ = "Sploit"

# This part imports the default Python modules and the modules that the user has to download
# If a module is not installed, the script asks the user to install that specific module
import os  # This module provides a portable way of using operating system dependent functionality
import urllib  # The urllib module provides a simple interface for network resource access
import urllib2  # The urllib2 module provides a simple interface for network resource access
import time  # This module provides various time-related functions
import urlparse  # This module defines a standard interface to break URL strings up in components
                 # to combine the components back into a URL string, and to convert a relative URL to an absolute URL given a base URL.
import mechanize
print ("Which website would you like to crawl?")
website_url = raw_input("--> ")
# Adds http:// to the given URL because it is the only way to check for server response
# If the user adds a path to the URL, it will be stripped
# Example: 'https://moz.com/learn/seo/external-link' will turn into 'https://moz.com/'
if website_url.split('//')[0] != 'http:' and website_url.split('//')[0] != 'https:':
    website_url = 'http://' + website_url
website_url = website_url.split('/')[0] + '//' + website_url.split('/')[2]
# The user is kept in a loop until a valid website is given, checked over HTTP (the application layer of the OSI model)
while True:
    try:
        if urllib2.urlopen(website_url).getcode() != 200:
            print ("Invalid URL given. Which website would you like to crawl?")
            website_url = raw_input("--> ")
        else:
            break
    except:
        print ("Invalid URL given. Which website would you like to crawl?")
        website_url = raw_input("--> ")
# This part is the actual Web Crawler
# What it does is search for links
# All the URLs that are not the website's URLs are written to a txt file named "Non website links"

fake_browser = mechanize.Browser()  # Set the starting point for the spider and initialize a mechanize Browser object
urls = [website_url]  # List of the URLs that the script still has to go through
visited = [website_url]  # List of the URLs we have already visited, to avoid duplicates
text_file = open("Non website links.txt", "w")  # We create a txt file for all the URLs that are not the website's URLs
text_file_url = open("Website links.txt", "w")  # We create a txt file for all the URLs that are the website's URLs
print ("Crawling : " + website_url)
print ("The crawler started at " + time.asctime(time.localtime()) + ". This may take a couple of minutes")  # To let the user know when the crawler started to work
# Since the number of URLs in the list is dynamic, we just let the spider run until the list runs out of new URLs
while len(urls) > 0:
    try:
        fake_browser.open(urls[0])
        urls.pop(0)
        for link in fake_browser.links():  # A loop over all the links that mechanize found on the page
            new_website_url = urlparse.urljoin(link.base_url, link.url)  # Build an absolute URL from the page's base URL and the link
            if new_website_url not in visited and website_url in new_website_url:  # Not seen before and belongs to the website: remember it and queue it
                visited.append(new_website_url)
                urls.append(new_website_url)
                print ("Found: " + new_website_url)  # Print all the links that the crawler found
                text_file_url.write(new_website_url + '\n')  # Write the website URL to the txt file
            elif new_website_url not in visited and website_url not in new_website_url:
                visited.append(new_website_url)
                text_file.write(new_website_url + '\n')  # Write the non-website URL to the txt file
    except:
        print ("Link couldn't be opened")
        urls.pop(0)
text_file.close()  # Close the txt file, to prevent any more writing to it
text_file_url.close()  # Close the txt file, to prevent any more writing to it
print ("A txt file with all the website links has been created in your folder")
print ("Finished!!")
