Not getting the correct links from Google search results using mechanize and BeautifulSoup



I am using the following snippet to get the result links from a Google search for the "keyword" I supply.

import mechanize
from bs4 import BeautifulSoup

def googlesearch():
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.set_handle_equiv(False)
    br.addheaders = [('User-agent', 'Mozilla/5.0')] 
    br.open('http://www.google.com/')   
    # do the query
    br.select_form(name='f')   
    br.form['q'] = 'scrapy' # query
    data = br.submit()
    soup = BeautifulSoup(data.read())
    for a in soup.find_all('a', href=True):
        print "Found the URL:", a['href']
googlesearch()

Since I am parsing the search-results HTML page to get the links, it picks up all the "a" tags. But what I need are only the links of the actual results. The other thing is, when you look at the output of the href attribute, it gives something like this:

Found the URL: /search?q=scrapy&hl=en-IN&gbv=1&prmd=ivns&source=lnt&tbs=li:1&sa=X&ei=DT8HU9SlG8bskgWvqIHQAQ&ved=0CBgQpwUoAQ

But the actual link present in the href attribute is http://scrapy.org/

Can anyone point me to a solution for the two problems above?

Thanks in advance

Getting only the result links

The links you are interested in are inside h3 tags (with the r class):

<li class="g">
  <h3 class="r">
    <a href="/url?q=http://scrapy.org/&amp;sa=U&amp;ei=XdIUU8DOHo-ElAXuvIHQDQ&amp;ved=0CBwQFjAA&amp;usg=AFQjCNHVtUrLoWJ8XWAROG-a4G8npQWXfQ">
      <b>Scrapy</b> | An open source web scraping framework for Python
    </a>
  </h3>
  ..

You can find the links using a CSS selector:

soup.select('.r a')
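As a side sketch, the same "anchor inside an `h3` with class `r`" extraction can be done with the standard library's html.parser, with no bs4 dependency; the sample HTML below is the snippet shown above (this is a hedged stdlib alternative, not part of the original answer):

```python
from html.parser import HTMLParser

class ResultLinkParser(HTMLParser):
    """Collect hrefs of <a> tags nested inside <h3 class="r">."""
    def __init__(self):
        super().__init__()
        self.h3_depth = 0  # > 0 while inside an <h3 class="r">
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'h3' and 'r' in attrs.get('class', '').split():
            self.h3_depth += 1
        elif tag == 'a' and self.h3_depth and 'href' in attrs:
            # attribute values arrive with entities like &amp; already unescaped
            self.links.append(attrs['href'])

    def handle_endtag(self, tag):
        if tag == 'h3' and self.h3_depth:
            self.h3_depth -= 1

html = ('<li class="g"><h3 class="r">'
        '<a href="/url?q=http://scrapy.org/&amp;sa=U">Scrapy</a>'
        '</h3></li>')
parser = ResultLinkParser()
parser.feed(html)
print(parser.links)  # ['/url?q=http://scrapy.org/&sa=U']
```

This only follows the nesting structure; the bs4 selector above is the more concise option.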

Getting the actual link

The URLs are in the following format:

/url?q=http://scrapy.org/&sa=U&ei=s9YUU9TZH8zTkQWps4BY&ved=0CBwQFjAA&usg=AFQjCNE-2uiVSl60B9cirnlWz2TMv8KMyQ
     ^^^^^^^^^^^^^^^^^^^^

The actual URL is in the q parameter.

To get the whole query string, use urlparse.urlparse:

>>> url = '/url?q=http://scrapy.org/&sa=U&ei=s9YUU9TZH8zTkQWps4BY&ved=0CBwQFjAA&usg=AFQjCNE-2uiVSl60B9cirnlWz2TMv8KMyQ'
>>> urlparse.urlparse(url).query
'q=http://scrapy.org/&sa=U&ei=s9YUU9TZH8zTkQWps4BY&ved=0CBwQFjAA&usg=AFQjCNE-2uiVSl60B9cirnlWz2TMv8KMyQ'

Then, use urlparse.parse_qs to parse the query string and extract the q parameter value:

>>> urlparse.parse_qs(urlparse.urlparse(url).query)['q']
['http://scrapy.org/']
>>> urlparse.parse_qs(urlparse.urlparse(url).query)['q'][0]
'http://scrapy.org/'
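(A side note, since the snippets above are Python 2: in Python 3 these functions live in urllib.parse, so the same extraction looks like this.)

```python
from urllib.parse import urlparse, parse_qs

# Same q-parameter extraction as above, Python 3 style.
url = ('/url?q=http://scrapy.org/&sa=U&ei=s9YUU9TZH8zTkQWps4BY'
       '&ved=0CBwQFjAA&usg=AFQjCNE-2uiVSl60B9cirnlWz2TMv8KMyQ')
real_url = parse_qs(urlparse(url).query)['q'][0]
print(real_url)  # http://scrapy.org/
```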

The final result

for a in soup.select('.r a'):
    print urlparse.parse_qs(urlparse.urlparse(a['href']).query)['q'][0]

Output:

http://scrapy.org/
http://doc.scrapy.org/en/latest/intro/tutorial.html
http://doc.scrapy.org/
http://scrapy.org/download/
http://doc.scrapy.org/en/latest/intro/overview.html
http://scrapy.org/doc/
http://scrapy.org/companies/
https://github.com/scrapy/scrapy
http://en.wikipedia.org/wiki/Scrapy
http://www.youtube.com/watch?v=1EFnX1UkXVU
https://pypi.python.org/pypi/Scrapy
http://pypix.com/python/build-website-crawler-based-upon-scrapy/
http://scrapinghub.com/scrapy-cloud
Alternatively,

you can use https://code.google.com/p/pygoogle/, which does essentially the same thing.

You can also get the links to the results.

"stackoverflow"示例查询的输出片段:

*Found 3940000 results*
[Stack Overflow]
Stack Overflow is a question and answer site for professional and enthusiast 
programmers. It's 100% free, no registration required. Take the 2-minute tour
http://stackoverflow.com/

In your code example, you extracted all <a> tags from the HTML, not only the ones related to organic results:

for a in soup.find_all('a', href=True):
    print "Found the URL:", a['href']

What you are looking for is to scrape links from the organic search results only:

# container with needed data: title, link, etc.
for result in soup.select('.tF2Cxc'):
  link = result.select_one('.yuRUbf a')['href']

Code and a full example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
params = {
  'q': 'minecraft',
  'gl': 'us',
  'hl': 'en',
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc'):
  link = result.select_one('.yuRUbf a')['href']
  print(link)
---------
'''
https://www.minecraft.net/en-us/
https://classic.minecraft.net/
https://play.google.com/store/apps/details?id=com.mojang.minecraftpe&hl=en_US&gl=US
https://en.wikipedia.org/wiki/Minecraft
'''

Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference is that you don't have to build everything from scratch, bypass blocks, and maintain the parser over time.

Code to integrate:

import os
from serpapi import GoogleSearch
params = {
  "engine": "google",
  "q": "minecraft",
  "hl": "en",
  "gl": "us",
  "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
  print(result['link'])
---------
'''
https://www.minecraft.net/en-us/
https://classic.minecraft.net/
https://play.google.com/store/apps/details?id=com.mojang.minecraftpe&hl=en_US&gl=US
https://en.wikipedia.org/wiki/Minecraft
'''

Disclaimer: I work for SerpApi.
