Using Python to search Google and store the websites in a variable



I am looking for a way to search Google with Python and store each website in a slot in a data list. I am looking for something like the following example code.

search=input('->')
results=google.search((search),(10))
print results

In this case, I want it to search Google for whatever is in the variable "search", 10 is the number of results I want to store in the variable, and finally "print results" puts them on the screen.

Any help, or anything similar to what I want, would be appreciated. Thanks.

As mentioned above, Google does provide an API for completing searches (https://developers.google.com/custom-search/json-api/v1/overview), and, as previously stated, it can get quite expensive depending on what you are trying to accomplish. The other option is to scrape the Google page. Below is an example I created using Beautiful Soup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#) to scrape the Google results.
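
For reference, a minimal sketch of calling that JSON API with the requests library might look like this (API_KEY and SEARCH_ENGINE_ID are placeholders you would need to obtain from Google; the endpoint and parameter names come from the documentation linked above):

import requests

API_KEY = "YOUR_API_KEY"          # placeholder: get a real key from the Google developer console
SEARCH_ENGINE_ID = "YOUR_CX_ID"   # placeholder: the custom search engine id ("cx")

def googleApiSearch(query, numResults=10):
    # The JSON API returns at most 10 results per request
    response = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": SEARCH_ENGINE_ID, "q": query, "num": numResults},
        timeout=30,
    )
    response.raise_for_status()
    # Each entry in "items" describes one result; "link" holds the result url
    return [item["link"] for item in response.json().get("items", [])]

print(googleApiSearch("example text"))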

import urllib2
import xml.etree.ElementTree
from bs4 import BeautifulSoup #install using 'pip install beautifulsoup4'

'''
Since spaces will not work in url parameters, the spaces have to be converted into '+'
ex) "example text" -> "example+text"
'''
def spacesToPluses(string):
    words = string.split(" ")
    convertedString = ""
    for i in range(0, len(words)):
        convertedString += words[i] + "+"
    return convertedString[0:len(convertedString)-1]
'''
Opens the url with the parameter included and reads it as a string
'''
def getRawGoogleResponse(url):
    user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
    headers = {'User-Agent': user_agent,} #Required for google to allow url request
    request = urllib2.Request(url, None, headers)
    response = urllib2.urlopen(request)
    rawResponse = response.read()
    return rawResponse
'''
Takes in the raw string representation and converts it into an easier to navigate object (Beautiful Soup)
'''
def getParsedGoogleResponse(url):
    rawResponse = getRawGoogleResponse(url)
    fullPage = BeautifulSoup(rawResponse, 'html.parser')
    return fullPage
'''
Finds all of the urls on a single page
'''
def getGoogleResultsOnPage(fullPage):
    searchResultContainers = fullPage.find_all("h3", {"class": "r"}) #the results are contained in an h3 element with the class 'r'
    pageUrls = []
    for container in searchResultContainers: #get each link in the container
        fullUrl = container.find('a')['href']
        beginningOfUrl = fullUrl.index('http')
        pageUrls.append(fullUrl[beginningOfUrl:]) #Chops off the extra bits google adds to the url
    return pageUrls
'''
Returns number of pages (max of 10)
'''
def getNumPages(basePage):
    navTable = basePage.find("table", {"id": "nav"}) #The nav table contains the number of pages (up to 10)
    pageNumbers = navTable.find_all("a", {"class": "fl"})
    lastPageNumber = int(pageNumbers[len(pageNumbers)-2].text)
    return lastPageNumber
'''
Loops through pages gathering url from each page
'''
def getAllGoogleSearchResults(search, numResults):
    baseUrl = "https://www.google.com/search?q=" + spacesToPluses(search)
    basePage = getParsedGoogleResponse(baseUrl)
    numPages = getNumPages(basePage)
    allUrls = []
    for i in range(0, numPages):
        completeUrl = baseUrl + "&start=" + str(i * 10) #google uses the parameter 'start' to represent the result to start at (10 urls per page)
        page = getParsedGoogleResponse(completeUrl)
        for url in getGoogleResultsOnPage(page):
            allUrls.append(url)
    return allUrls[0:numResults] #return just the requested number of results

def main():
    print(getAllGoogleSearchResults("even another test", 1))

main()

This solution works for the first 10 pages of Google results (or however many pages are available, if fewer). The URLs are returned in a list of string objects. The information is scraped by using urllib2 to fetch the responses. Hope this helps.
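
Note that urllib2 only exists on Python 2; if you are on Python 3, the same fetch can be written with urllib.request (a minimal sketch, with the rest of the code unchanged):

import urllib.request

def getRawGoogleResponse(url):
    user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
    headers = {'User-Agent': user_agent} #Required for google to allow url request
    request = urllib.request.Request(url, None, headers)
    response = urllib.request.urlopen(request)
    # read() returns bytes on Python 3, so decode to get a string for BeautifulSoup
    return response.read().decode('utf-8', errors='replace')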

A Google search page returns a maximum of 10 results (by default); the num parameter in the params dict is responsible for this:

params = {
    "q": query,          # query
    "hl": "en",          # language
    "gl": "us",          # country of the search, US -> USA
    "start": 0,          # results offset, 0 is the first page
    #"num": 100          # parameter defines the maximum number of results to return.
}

To get more data, you can paginate through all pages with an infinite while loop. Pagination is possible as long as the next-page button exists (determined by the presence of the button's selector on the page, in our case the CSS selector .d6cvqb a[id=pnnext]): you need to increase the value of params["start"] by 10 to access the next page if it is present; otherwise, we exit the while loop:

if soup.select_one('.d6cvqb a[id=pnnext]'):
    params["start"] += 10
else:
    break

You also need to keep in mind that most websites, including Google, do not like being scraped, and requests may be blocked if you use requests as your library, because the default user-agent in requests is python-requests. A further step could be to rotate the user-agent, for example switching between PC, mobile, and tablet, as well as between browsers, e.g. Chrome, Firefox, Safari, Edge, and so on.
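
A minimal sketch of such a rotation, assuming a hand-picked list of user-agent strings (the strings below are only illustrative examples; a library such as fake-useragent can also generate them for you):

import random
import requests

# Illustrative desktop, mobile, and tablet user-agents; any realistic set works
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 15_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Mobile/15E148 Safari/604.1",
]

# Pick a different user-agent for each request
headers = {"User-Agent": random.choice(user_agents)}
html = requests.get("https://www.google.com/search", params={"q": "auto"}, headers=headers, timeout=30)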

Check the code in the online IDE.

from bs4 import BeautifulSoup
import requests, json, lxml

query = input("Input your query: ")    # for example: "auto"

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": query,          # query
    "hl": "en",          # language
    "gl": "us",          # country of the search, US -> USA
    "start": 0,          # results offset, 0 is the first page
    #"num": 100          # parameter defines the maximum number of results to return.
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

page_num = 0
website_data = []

while True:
    page_num += 1
    print(f"page: {page_num}")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')

    for result in soup.select(".tF2Cxc"):
        title = f'Title: {result.select_one("h3").text}'
        link = f'Link: {result.select_one("a")["href"]}'

        website_data.append({
            "title": title,
            "link": link,
        })

    if soup.select_one('.d6cvqb a[id=pnnext]'):
        params["start"] += 10
    else:
        break

print(json.dumps(website_data, indent=2, ensure_ascii=False))

Example output:

[
  {
    "title": "Title: Show Your Auto",
    "link": "Link: http://www.showyourauto.com/vehicles/388/2002-ford-saleen-mustang-s281-extreme-speedster"
  },
  {
    "title": "Title: Global Competition in the Auto Parts Industry: Hearings ...",
    "link": "Link: https://books.google.com/books?id=dm7bjDjkrRQC&pg=PA2&lpg=PA2&dq=auto&source=bl&ots=sIf4ELozPN&sig=ACfU3U3xea1-cJl9hiQe8cpac2KLrIF20g&hl=en&sa=X&ved=2ahUKEwjWn7ukv6P7AhU3nGoFHSRxABY4jgIQ6AF6BAgEEAM"
  },
  {
    "title": "Title: Issues relating to the domestic auto industry: hearings ...",
    "link": "Link: https://books.google.com/books?id=fHX_MJobx3EC&pg=PA79&lpg=PA79&dq=auto&source=bl&ots=jcrwR-jwck&sig=ACfU3U0p0Wn6f-RU11U8Z0GtqMjTKd44ww&hl=en&sa=X&ved=2ahUKEwjWn7ukv6P7AhU3nGoFHSRxABY4jgIQ6AF6BAgaEAM"
  },
  # ...
]

Alternatively, you can use the Google Search Engine Results API from SerpApi. It is a paid API with a free plan. The difference is that it bypasses blocks from Google (including CAPTCHA), so there is no need to create a parser and maintain it.

Code example:

from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os

query = input("Input your query: ")  # for example: "auto"

params = {
    "api_key": os.getenv("API_KEY"),   # serpapi key
    "engine": "google",                # serpapi parser engine
    "q": query,                        # search query
    "num": "100"                       # number of results per page (100 per page in this case)
    # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)      # where data extraction happens

organic_results_data = []
page_num = 0

while True:
    results = search.get_dict()    # JSON -> Python dictionary

    page_num += 1

    for result in results["organic_results"]:
        organic_results_data.append({
            "page_num": page_num,
            "title": result.get("title"),
            "link": result.get("link"),
            "displayed_link": result.get("displayed_link"),
        })

    if "next_link" in results.get("serpapi_pagination", []):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break

print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))

Output:

[
  {
    "page_num": 4,
    "title": "San Francisco's JFK Drive to Remain Closed to Cars",
    "link": "https://www.nbcbayarea.com/decision-2022/jfk-drive-san-francisco-election/3073470/",
    "displayed_link": "https://www.nbcbayarea.com › decision-2022 › jfk-driv..."
  },
  {
    "page_num": 4,
    "title": "Self-Driving Cruise Cars Are Expanding to Most of SF, Says ...",
    "link": "https://sfstandard.com/business/self-driving-cruise-cars-are-expanding-to-most-of-sf-says-ceo/",
    "displayed_link": "https://sfstandard.com › business › self-driving-cruise-c..."
  },
  # ...
]
