Can't scrape all links from a Google search results page with web scraping



I'm a beginner at web scraping. Recently, I tried to scrape domain names from the search results on a Google SERP.

To do this, I used Requests, Beautiful Soup, and regex: Requests to fetch the page, Beautiful Soup to parse the markup and read the hrefs, and a regex match to extract the domain names.

While doing this, some links are missing from the output. The problem seems to be that requests does not fetch the page completely: when I compare the fetched text with the page source in Chrome, the tags for the missing links are present in Chrome's source but absent from the response. I'd like to know what causes this!

import requests
from bs4 import BeautifulSoup
import re

url = "https://www.google.com/search?q=glass+beads+india"
r = requests.get(url)
page = r.text
soup = BeautifulSoup(page, 'lxml')

i = 0
link_list = []
for tag in soup.find_all('a'):
    i += 1
    href = tag['href']
    if re.search('http', href):
        try:
            link = re.search('https://.+.com', href).group(0)
            link_list.append(link)
        except:
            pass

# deduplicate, then drop Google's own links
link_list = list(set(link_list))
link_list2 = []
for link in link_list:
    if not re.search('google.com', link):
        link_list2.append(link)

print(link_list2)

This is most likely because you didn't specify a user-agent in the request headers, so Google blocks the request and you get back a page with an error message or something similar. Check what your user-agent is.
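As a quick check (a minimal sketch, not part of the original answer), you can print the user-agent that requests sends when you don't set one yourself:

import requests

# by default, requests identifies itself as python-requests/<version>,
# which Google easily recognizes as a non-browser client
r = requests.get('https://www.google.com/search?q=glass+beads+india')
print(r.request.headers['User-Agent'])  # e.g. python-requests/2.x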

Pass a user-agent:

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('YOUR URL', headers=headers)
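With a browser-like user-agent, Google should serve the regular results page instead of the blocked one, and the links that were missing from the plain requests response should appear in the HTML.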

Use the SelectorGadget Chrome extension to find CSS selectors for all the links (CSS selectors reference):

# container with all needed data
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    displayed_link = result.select_one('.TbwUpd.NJjxre').text

Match the domain and subdomain, excluding the "www." part:

>>> re.findall(r'^(?:https?://)?(?:[^@/\n]+@)?(?:www\.)?([^:/?\n]+)', link)
['etsy.com']
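If you'd rather avoid the regex, the standard library's urllib.parse gives the same result for typical http(s) links (a minimal sketch, not from the original answer; str.removeprefix requires Python 3.9+):

from urllib.parse import urlparse

link = 'https://www.etsy.com/market/india_glass_beads'
netloc = urlparse(link).netloc        # 'www.etsy.com'
domain = netloc.removeprefix('www.')  # 'etsy.com' (Python 3.9+)
print(domain)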

Code and full example in the online IDE:

import requests, lxml, re
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    'q': 'glass beads india',  # search query
    'hl': 'en',                # language
    'num': '100'               # number of results
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

# container with all needed data
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    displayed_link = result.select_one('.TbwUpd.NJjxre').text

    # https://stackoverflow.com/a/25703406/15164646
    domain_name = ''.join(re.findall(r'^(?:https?://)?(?:[^@/\n]+@)?(?:www\.)?([^:/?\n]+)', link))

    print(link)
    print(displayed_link)
    print(domain_name)
    print('---------------')

'''
https://www.etsy.com/market/india_glass_beads
https://www.etsy.com › market › india_glass_beads
etsy.com
---------------
https://www.etsy.com/market/indian_glass_beads
https://www.etsy.com › market › indian_glass_beads
etsy.com
---------------
https://www.amazon.com/glass-indian-beads/s?k=glass+indian+beads
https://www.amazon.com › glass-indian-beads › k=glass...
amazon.com
---------------
'''

Alternatively, you can achieve the same thing using Google Organic Results API from SerpApi. It's a paid API with a free plan.

The main difference is that you only need to iterate over structured JSON and extract the data, rather than figuring everything out from the HTML.

Code to integrate:

from serpapi import GoogleSearch
import os, re

params = {
    "api_key": os.getenv("API_KEY"),  # environment variable
    "engine": "google",
    "q": "glass beads india",
    "hl": "en",
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    link = result['link']
    displayed_link = result['displayed_link']
    domain_name = ''.join(re.findall(r'^(?:https?://)?(?:[^@/\n]+@)?(?:www\.)?([^:/?\n]+)', link))

    print(link)
    print(displayed_link)
    print(domain_name)
    print('---------------')

'''
https://www.etsy.com/market/india_glass_beads
https://www.etsy.com › market › india_glass_beads
etsy.com
---------------
https://www.etsy.com/market/indian_glass_beads
https://www.etsy.com › market › indian_glass_beads
etsy.com
---------------
https://www.amazon.com/glass-indian-beads/s?k=glass+indian+beads
https://www.amazon.com › glass-indian-beads › k=glass...
amazon.com
---------------
'''

Disclaimer: I work for SerpApi.
