I have a base HTTP URL and a list of other HTTP URLs. I'm writing a simple crawler/link checker as an exercise (so no need to suggest pre-written tools) that checks the base URL for broken links and recursively crawls, with the same intent, all the other "internal" pages (i.e., pages within the same site that are linked from the base URL). At the end I have to output the list of links with their status (external/internal), plus a warning for each link that is actually internal but given as an absolute URL.
So far I've managed to check all the links and crawl using the requests and BeautifulSoup libraries, but I can't find a ready-made way to check whether two absolute URLs point to the same website (other than splitting the URLs on slashes, which seems ugly to me). Is there a well-known library for this?
In the end I went with urlparse (thanks to @padraic-cunningham for pointing me to it). At the beginning of the code I parse the "base URL" (i.e., the one I start crawling from):
base_parts = urlparse.urlparse(base_url)
Then, for each link I find (e.g. with for a in soup.find_all('a'):):
link_parts = urlparse.urlparse(a.get('href'))
At this point I also compare the URL schemes (I consider links to the same site with a different scheme, http vs. https, to be different; later I may make this comparison optional):
internal = (base_parts.scheme == link_parts.scheme
            and base_parts.netloc == link_parts.netloc)
At this point, internal will be True if the link points to the same server (with the same scheme) as my base URL. You can see the final result here.
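The comparison described above can be wrapped in a small helper. A minimal sketch, written for Python 3 (where Python 2's `urlparse.urlparse` becomes `urllib.parse.urlparse`); the function name `is_internal` is just an illustration, not part of the original answer:

```python
from urllib.parse import urlparse

def is_internal(base_url, link_url):
    """Return True if link_url points to the same scheme and host as base_url."""
    base_parts = urlparse(base_url)
    link_parts = urlparse(link_url)
    # Same scheme (http vs. https counts as different, as in the answer)
    # and same network location (host, plus port if present).
    return (base_parts.scheme == link_parts.scheme
            and base_parts.netloc == link_parts.netloc)

print(is_internal("http://example.com/index.html", "http://example.com/about"))  # same site
print(is_internal("http://example.com/", "https://example.com/"))                # scheme differs
```

Note that `netloc` also includes an explicit port, so `http://example.com` and `http://example.com:8080` compare as different hosts with this check.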
I wrote a crawler myself; I hope this helps you. Basically, what I do is join the link path, like /2/2/3/index.php, onto the site, which turns it into http://www.website.com/2/2/3/index.php. Then I insert all the URLs into an array and check that array to see whether I've visited a URL before; if I have, the crawler won't go there again. Also, if a page links to unrelated sites, such as a YouTube video, it won't crawl YouTube or any other site that isn't "site related".
For your problem, I suggest you put all visited URLs in an array and check the array with a for loop. If the URL is already in the array, print it.
I'm not sure this is what you want, but at least I tried. I don't use BeautifulSoup and it still works, so consider setting that module aside.
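The visited-array idea described above (scan the array before crawling a URL) can also be sketched with a set, which makes the membership check O(1) instead of a linear scan; the names below are illustrative, not taken from the script:

```python
visited = set()                       # URLs we have already crawled
queue = ["http://www.website.com/"]   # URLs still waiting to be crawled

while queue:
    url = queue.pop(0)
    if url in visited:    # same check as the for loop over the array, but O(1)
        continue
    visited.add(url)
    # ... fetch the page here and append any newly found links to `queue` ...
```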
My script (or rather part of it; I also have exception checking, so don't panic):
__author__ = "Sploit"
# This part imports the default Python modules and the modules the user has to download
# If a module is missing, the script asks the user to install it
import os        # This module provides a portable way of using operating system dependent functionality
import urllib    # The urllib module provides a simple interface for network resource access
import urllib2   # The urllib2 module provides an extensible interface for opening URLs
import time      # This module provides various time-related functions
import urlparse  # This module defines a standard interface to break URL strings up into components,
                 # combine the components back into a URL string, and convert a relative URL
                 # to an absolute URL given a base URL
import mechanize
print ("Which website would you like to crawl?")
website_url = raw_input("--> ")
# Adds http:// to the given URL because it is the only way to check for server response
# If the user includes a path in the URL it will be stripped
# Example: 'https://moz.com/learn/seo/external-link' will turn into 'https://moz.com/'
if website_url.split('//')[0] != 'http:' and website_url.split('//')[0] != 'https:':
    website_url = 'http://' + website_url
website_url = website_url.split('/')[0] + '//' + website_url.split('/')[2]
# The user stays in this loop until a URL that answers over HTTP is given
while True:
    try:
        if urllib2.urlopen(website_url).getcode() != 200:
            print ("Invalid URL given. Which website would you like to crawl?")
            website_url = raw_input("--> ")
        else:
            break
    except Exception:
        print ("Invalid URL given. Which website would you like to crawl?")
        website_url = raw_input("--> ")
# This part is the actual Web Crawler
# What it does is search for links
# All the URLs that are not the website's own URLs are written to a txt file named "Non website links"
fake_browser = mechanize.Browser()  # Set the starting point for the spider and initialize a mechanize browser object
urls = [website_url]     # List of the URLs that the script still has to go through
visited = [website_url]  # List of the URLs we have already visited, to avoid duplicates
text_file = open("Non website links.txt", "w")  # A txt file for all the URLs that are not the website's URLs
text_file_url = open("Website links.txt", "w")  # A txt file for all the URLs that are the website's URLs
print ("Crawling : " + website_url)
print ("The crawler started at " + time.asctime(time.localtime()) + ". This may take a couple of minutes")  # Let the user know when the crawler started working
# Since the amount of URLs in the list is dynamic, we just let the spider go until the last URL yields no new links
while len(urls) > 0:
    try:
        fake_browser.open(urls[0])
        urls.pop(0)
        for link in fake_browser.links():  # A loop over all the links on the page
            new_website_url = urlparse.urljoin(link.base_url, link.url)  # Build an absolute URL from the page's link
            if new_website_url not in visited and website_url in new_website_url:  # If we have been there before, don't add the URL again, to avoid duplicates
                visited.append(new_website_url)
                urls.append(new_website_url)
                print ("Found: " + new_website_url)  # Print every link that the crawler found
                text_file_url.write(new_website_url + '\n')  # Write the website URL to the txt file
            elif new_website_url not in visited and website_url not in new_website_url:
                visited.append(new_website_url)
                text_file.write(new_website_url + '\n')  # Write the non-website URL to the txt file
    except Exception:
        print ("Link couldn't be opened")
        urls.pop(0)
text_file.close()      # Close the txt files, to prevent any more writing to them
text_file_url.close()
print ("A txt file with all the website links has been created in your folder")
print ("Finished!!")
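The script above is Python 2 only (`raw_input`, `urllib2`, `urlparse`, `mechanize`). As a rough Python 3 equivalent of the same crawl loop, here is a sketch using only the standard library; `crawl`, `LinkParser`, and the injected `fetch` callable are my own illustrative names, not from the original script:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collect href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(base_url, fetch):
    """Breadth-first crawl of pages on base_url's host.

    `fetch` is a callable mapping a URL to an HTML string (e.g. a thin
    urllib.request wrapper); it is injected so the loop itself stays
    network-free. Returns (internal, external) sets of absolute URLs found.
    """
    base_netloc = urlparse(base_url).netloc
    queue, visited = [base_url], {base_url}
    internal, external = set(), set()
    while queue:
        url = queue.pop(0)
        try:
            html = fetch(url)
        except Exception:
            continue  # mirrors the script's "Link couldn't be opened" branch
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links, as urlparse.urljoin did
            if urlparse(absolute).netloc == base_netloc:
                internal.add(absolute)
                if absolute not in visited:   # avoid crawling duplicates
                    visited.add(absolute)
                    queue.append(absolute)
            else:
                external.add(absolute)
    return internal, external
```

To run it against a real site, you could pass something like `lambda url: urllib.request.urlopen(url).read().decode()` as `fetch`; comparing `netloc` here replaces the script's substring test `website_url in new_website_url`, which would wrongly match external URLs that merely contain the site's address.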