I'm a beginner at web scraping. Recently I've been trying to scrape domain names from the search results on a Google SERP.
To do this, I used Requests, Beautiful Soup, and a regex: fetch the page, parse the markup, look at each href, and pull out the domain with a regex match.
When I do this, some links are missing from the output. The problem seems to be that requests does not fetch the full page: when I compare the extracted text with the page source in Chrome, part of the markup is missing, and the missing tags are in that missing part. I'd like to know why!
import requests
from bs4 import BeautifulSoup
import re

url = "https://www.google.com/search?q=glass+beads+india"
r = requests.get(url)
page = r.text
soup = BeautifulSoup(page, 'lxml')

i = 0
link_list = []
for tag in soup.find_all('a'):
    i += 1
    href = tag['href']
    if re.search('http', href):
        try:
            link = re.search('https://.+.com', href).group(0)
            link_list.append(link)
        except:
            pass

link_list = list(set(link_list))
link_list2 = []
for link in link_list:
    if not re.search('google.com', link):
        link_list2.append(link)
print(link_list2)
This is probably because you didn't specify a user-agent in the request headers, so Google blocks the request and you get back a page with an error message or something similar. Check what your user-agent is.
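A quick way to see what requests identifies itself as (it's this default value, not a browser string, that trips Google's bot detection) is to print the library's default user-agent; the exact version number depends on your installed release:

```python
import requests

# The default User-Agent header that requests sends when you don't
# set one yourself, e.g. 'python-requests/2.28.1'.
print(requests.utils.default_user_agent())
```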
Pass a user-agent:
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('YOUR URL', headers=headers)
Find all the links using the SelectorGadget Chrome extension to get their CSS selectors (see a CSS selectors reference):
# container with all needed data
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    displayed_link = result.select_one('.TbwUpd.NJjxre').text
Match the domain and subdomain, excluding the "www." part:
>>> re.findall(r'^(?:https?://)?(?:[^@/\n]+@)?(?:www\.)?([^:/?\n]+)', link)
['etsy.com']
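As an alternative to the regex (not part of the original answer, just an option), the standard library's urllib.parse can do the same job: parse out the host and strip a leading "www.":

```python
from urllib.parse import urlparse

def domain_of(url: str) -> str:
    # netloc is the host portion of the URL, e.g. 'www.etsy.com'
    host = urlparse(url).netloc
    # drop a leading 'www.' so the result matches what the regex captures
    return host[4:] if host.startswith('www.') else host

print(domain_of('https://www.etsy.com/market/india_glass_beads'))  # etsy.com
```

This avoids regex edge cases and keeps other subdomains (e.g. music.amazon.com) intact.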
Code and full example in the online IDE:
import requests, lxml, re
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    'q': 'glass beads india',  # search query
    'hl': 'en',                # language
    'num': '100'               # number of results
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
# container with all needed data
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    displayed_link = result.select_one('.TbwUpd.NJjxre').text

    # https://stackoverflow.com/a/25703406/15164646
    domain_name = ''.join(re.findall(r'^(?:https?://)?(?:[^@/\n]+@)?(?:www\.)?([^:/?\n]+)', link))

    print(link)
    print(displayed_link)
    print(domain_name)
    print('---------------')
'''
https://www.etsy.com/market/india_glass_beads
https://www.etsy.com › market › india_glass_beads
etsy.com
---------------
https://www.etsy.com/market/indian_glass_beads
https://www.etsy.com › market › indian_glass_beads
etsy.com
---------------
https://www.amazon.com/glass-indian-beads/s?k=glass+indian+beads
https://www.amazon.com › glass-indian-beads › k=glass...
amazon.com
---------------
'''
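To get back to what the question was after (a deduplicated list of domains with google.com filtered out), the same regex can be applied over the extracted links. The links are hardcoded below for illustration; in the real script they come from the soup.select() loop above:

```python
import re

# Links as they might come out of the SERP scrape (hardcoded for illustration).
links = [
    'https://www.etsy.com/market/india_glass_beads',
    'https://www.etsy.com/market/indian_glass_beads',
    'https://www.amazon.com/glass-indian-beads/s?k=glass+indian+beads',
    'https://www.google.com/preferences',
]

pattern = re.compile(r'^(?:https?://)?(?:[^@/\n]+@)?(?:www\.)?([^:/?\n]+)')

# A set dedupes repeated domains; google.com links are skipped.
domains = sorted({
    m.group(1)
    for link in links
    if 'google.com' not in link and (m := pattern.search(link))
})
print(domains)  # ['amazon.com', 'etsy.com']
```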
Alternatively, you can achieve the same thing with SerpApi's Google Organic Results API. It's a paid API with a free plan.
The main difference is that you only need to iterate over structured JSON and extract the data, rather than figuring out how to scrape things.
Code to integrate:
from serpapi import GoogleSearch
import os, re
params = {
    "api_key": os.getenv("API_KEY"),  # environment variable
    "engine": "google",
    "q": "glass beads india",
    "hl": "en",
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results['organic_results']:
    link = result['link']
    displayed_link = result['displayed_link']
    domain_name = ''.join(re.findall(r'^(?:https?://)?(?:[^@/\n]+@)?(?:www\.)?([^:/?\n]+)', link))

    print(link)
    print(displayed_link)
    print(domain_name)
    print('---------------')
'''
https://www.etsy.com/market/india_glass_beads
https://www.etsy.com › market › india_glass_beads
etsy.com
---------------
https://www.etsy.com/market/indian_glass_beads
https://www.etsy.com › market › indian_glass_beads
etsy.com
---------------
https://www.amazon.com/glass-indian-beads/s?k=glass+indian+beads
https://www.amazon.com › glass-indian-beads › k=glass...
amazon.com
---------------
'''
Disclaimer: I work for SerpApi.