Python - web scraping multiple depth levels from one page with the requests module



I have a Python 3 script that performs web scraping based on URLs provided in a CSV file. I am trying to achieve the following:

1. Fetch the page from a URL provided in the CSV file.

2. Scrape it and search for email addresses with regex + BeautifulSoup; if an email is found, save it to a results.csv file.

3. Search the page for all other links.

4. Go to / fetch every link found on the first page (first level of scraping) and do the same there.

5. Repeat this according to a user-defined depth level (e.g. if the user says to go 3 levels deep: fetch the first-level page (the URL from the CSV file) and do what is needed on it -> fetch all second-level pages (links scraped from the first level) and do what is needed -> fetch all third-level pages (links scraped from the second level) and do what is needed -> and so on...).

How can I build a loop that handles the scraping level by level? I have tried many variations of for and while loops, but I cannot come up with a working solution.
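To make the structure I am after a bit more concrete, this is roughly the kind of loop I imagine (scrape_page is only a stand-in for the per-page work, not something I have written yet, and the sample values are placeholders):

#Sketch only - scrape_page() is a stand-in for the per-page work
#(fetch the url, save any emails found, return the links found on that page)
def scrape_page(url):
    ...
    return []

urls_from_csv = ['https://example.com']   #would come from urls.csv
max_level_of_depth = 3                    #would come from input()

current_level_urls = urls_from_csv
for level in range(max_level_of_depth):
    next_level_urls = []
    for url in current_level_urls:
        #Every link found on this level becomes a page to fetch on the next level
        next_level_urls.extend(scrape_page(url))
    current_level_urls = next_level_urls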

Here is the code I have so far (at the moment it only handles the first level of scraping):

from bs4 import BeautifulSoup
import requests
import csv
import re
import time
import sys, os

#Type the max level of depth for this instance of the script
while True:
    try:
        max_level_of_depth = int(input('Max level of depth for webscraping (must be a number - integer): '))
        print('Do not open the input nor the output CSV files before the script finishes!')
        break
    except ValueError:
        print('You must type a number (integer)! Try again...\n')

#Read the csv file with urls
with open('urls.csv', mode='r') as urls:
    #Loop through each url from the csv file
    for url in urls:
        #Strip the trailing newline from the url
        url_from_csv_to_scrape = url.rstrip('\n')
        print('[FROM CSV] Going to ' + url_from_csv_to_scrape)
        #time.sleep(3)
        i = 1
        #Get the content of the webpage
        page = requests.get(url_from_csv_to_scrape)
        page_content = page.text
        soup = BeautifulSoup(page_content, 'lxml')
        #Find all <p> tags on the page
        paragraphs_on_page = soup.find_all('p')
        for paragraph in paragraphs_on_page:
            #Search for email addresses on the 1st level of the page
            #(the dash must sit at the end of the character class and the dot before the TLD must be escaped)
            emails = re.findall(r'[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z]{2,5}', str(paragraph))
            #If some emails are found on the webpage, save them to csv
            if emails:
                with open('results.csv', mode='a') as results:
                    for email in emails:
                        print(email)
                        #Skip matches that are actually image file names
                        if email.endswith(('.jpg', '.jpeg', '.png', '.JPG', '.JPEG', '.PNG')):
                            continue
                        results.write(url_from_csv_to_scrape + ', ' + email + '\n')
                        print('Found an email. Saved it to the output file.\n')
        #Find all <a> tags on the page
        links_on_page = soup.find_all('a')
        #Initiate a list which will later be populated with all found urls to be crawled
        found_links_with_href = []
        #Loop through all the <a> tags on the page
        for link in links_on_page:
            try:
                #If the <a> tag has an href attribute
                if link['href']:
                    link_with_href = link['href']
                    #If the link does not have a domain and protocol in it, prepend them to it
                    if re.match(r'https://', link_with_href) is None and re.match(r'http://', link_with_href) is None:
                        #If the link starts with a slash, strip it so it is not duplicated after prepending the base url
                        link_with_href = re.sub(r'^/', '', link_with_href)
                        #Prepend the domain and protocol in front of the link
                        link_with_href = url_from_csv_to_scrape + link_with_href
                    #print(link_with_href)
                    found_links_with_href.append(link_with_href)
                    found_links_with_href_backup = found_links_with_href
            except KeyError:
                #If the <a> tag does not have an href attribute, continue to the next one
                print('No href attribute found, going to next <a> tag...')
                continue
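One idea I had was to wrap the per-page work above into a function that returns the links it finds, so the same function could be called again for every link of the next level. A simplified, untested sketch of that (the email handling is reduced to a print here, and the real script would also filter and normalize links as above):

import re
import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    #Fetch one page, print any emails found on it and return the links it contains
    soup = BeautifulSoup(requests.get(url).text, 'lxml')
    for paragraph in soup.find_all('p'):
        for email in re.findall(r'[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z]{2,5}', str(paragraph)):
            print(url + ', ' + email)
    #Collect the href of every <a> tag so the caller can scrape those pages next
    return [link['href'] for link in soup.find_all('a', href=True)]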

Any help is greatly appreciated.

Thanks.

Here is some pseudocode:

def find_page(page):
    #Find the links on this page and add them to the pool of pages still to visit
    new = re.findall('regex', page.text)
    new_pages.extend(new)
    return len(new)

check = True
new_pages = [page]
used_pages = []
while check:
    for item in new_pages:
        if item not in used_pages:
            found = find_page(item)
            if found == 0:
                check = False
            else:
                'find emails'
            used_pages.append(item)
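Fleshing that out a little, a depth-limited version of the same idea could look roughly like this (requests + BeautifulSoup as in your code; the start URL, names and depth limit are only illustrative, and your email extraction would go where the comment is):

from collections import deque
import requests
from bs4 import BeautifulSoup

def get_links(url):
    #Return the absolute links found on a page (relative links skipped for brevity)
    soup = BeautifulSoup(requests.get(url).text, 'lxml')
    return [a['href'] for a in soup.find_all('a', href=True) if a['href'].startswith('http')]

max_level_of_depth = 3
start_urls = ['https://example.com']   #would come from urls.csv

#Queue of (url, depth) pairs plus a set of pages that were already visited
to_visit = deque((url, 0) for url in start_urls)
used_pages = set()

while to_visit:
    url, depth = to_visit.popleft()
    if url in used_pages or depth > max_level_of_depth:
        continue
    used_pages.add(url)
    #'find emails' would happen here, exactly as in the question's code
    for link in get_links(url):
        to_visit.append((link, depth + 1))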
