I'm interested in scraping web pages automatically, for example https://www.hltv.org/team/7532/big. More precisely, I want to extract the date and the #ranking from the box that is displayed when you hover the mouse over the graph (see the screenshot below).
I tried using Python with Selenium, but I don't really know how to proceed even though I have gone through several tutorials. I have the feeling I need to change the top and left values of the style attribute, but I don't know how to do that, nor whether I should use XPath, CSS selectors, or something else. Here is a snippet of my code that (presumably) returns the WebElement I'm interested in, but I haven't managed to extract anything from it:
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains

# Run Chrome headless, in incognito mode, ignoring certificate errors
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')

executable_path = r'C:/Users/fabbe/Documents/Python Scripts/hltv/chromedriver/chromedriver.exe'
driver = webdriver.Chrome(executable_path, chrome_options=options)
driver.get("https://www.hltv.org/team/7532/big")

# The FusionCharts tooltip that appears when hovering over the graph
elements = driver.find_elements_by_xpath("//*[@id='fusioncharts-tooltip-element']")
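For reference, the hover itself would normally be triggered with ActionChains (already imported above) rather than by editing the style attribute. A minimal sketch, assuming the chart can be located via its .graph CSS class:

graph = driver.find_element_by_css_selector('.graph')
ActionChains(driver).move_to_element(graph).perform()  # hover over the chart

# The tooltip element should now be rendered and readable
tooltip = driver.find_element_by_id('fusioncharts-tooltip-element')
print(tooltip.text)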
[Screenshot of the hover tooltip showing the date and ranking]
I would take a different approach to getting the graph data, so that you don't have to hover over every part of the graph.
You have to add the following imports:
import json
from lxml import html
Code:
url = "https://www.hltv.org/team/7532/BIG"
driver.get(url)

# The chart configuration, including every data point, is embedded as JSON
# in the 'data-fusionchart-config' attribute of the graph element
graph_data = driver.find_element_by_css_selector('.chart-container.core-chart-container .border-box .graph').get_attribute('data-fusionchart-config')
graph_text = json.loads(graph_data)['dataSource']['dataset'][0]['data']

for graph_item in graph_text:
    # Each data point carries its tooltip markup; parse it as HTML
    tree = html.fromstring(graph_item['tooltext'])
    print("Date:" + tree.xpath("//div[@class='subtitle']//text()")[0])
    print("Rank:" + tree.xpath("(//div[@class='ranking-development-top-info']//div[@class='title'])[2]//text()")[0])

driver.close()
This fetches the graph configuration, parses it, extracts only the data we are interested in, and iterates over all graph items.
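To see what those XPath expressions operate on: each data point's tooltext value is a small HTML fragment. The markup below is a hypothetical reconstruction, shaped only to match the XPaths above, parsed standalone with lxml:

from lxml import html

# Hypothetical tooltext fragment (assumed structure, not taken from the site)
tooltext = """
<div class="ranking-development-top-info">
    <div class="title">BIG</div>
    <div class="title">#11</div>
</div>
<div class="subtitle">24th December 2018</div>
"""

tree = html.fromstring(tooltext)
print("Date:" + tree.xpath("//div[@class='subtitle']//text()")[0])
print("Rank:" + tree.xpath("(//div[@class='ranking-development-top-info']//div[@class='title'])[2]//text()")[0])
# Date:24th December 2018
# Rank:#11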
Below is the output:
Date:24th December 2018
Rank:#11
Date:31st December 2018
Rank:#11
Date:7th January 2019
Rank:#11
Date:14th January 2019
Rank:#12
Date:21st January 2019
Rank:#13
Date:28th January 2019
Rank:#13
Date:4th February 2019
Rank:#15
Date:11th February 2019
Rank:#12
Date:18th February 2019
Rank:#14
Date:25th February 2019
Rank:#15
Date:4th March 2019
Rank:#18
Date:11th March 2019
Rank:#16
Date:18th March 2019
Rank:#18
Date:25th March 2019
Rank:#18
Date:1st April 2019
Rank:#18
Date:8th April 2019
Rank:#18
Date:15th April 2019
Rank:#18
Date:22nd April 2019
Rank:#19
Date:29th April 2019
Rank:#19
Date:6th May 2019
Rank:#18
Date:13th May 2019
Rank:#18
Date:20th May 2019
Rank:#20
Date:27th May 2019
Rank:#22
Date:3rd June 2019
Rank:#22
Date:10th June 2019
Rank:#22
Date:17th June 2019
Rank:#26
Date:24th June 2019
Rank:#30
Date:1st July 2019
Rank:#34
Date:8th July 2019
Rank:#23
Date:15th July 2019
Rank:#27
Date:22nd July 2019
Rank:#22
Date:29th July 2019
Rank:#23
Date:5th August 2019
Rank:#28
Date:12th August 2019
Rank:#25
Date:19th August 2019
Rank:#24
Date:26th August 2019
Rank:#26
Date:2nd September 2019
Rank:#28
Date:9th September 2019
Rank:#24
Date:16th September 2019
Rank:#22
Date:23rd September 2019
Rank:#22
Date:30th September 2019
Rank:#21
Date:7th October 2019
Rank:#27
Date:14th October 2019
Rank:#24
Date:21st October 2019
Rank:#26
Date:28th October 2019
Rank:#24
Date:4th November 2019
Rank:#24
Date:11th November 2019
Rank:#24
Date:18th November 2019
Rank:#28
Date:25th November 2019
Rank:#26
Date:2nd December 2019
Rank:#26
Date:9th December 2019
Rank:#29
Date:16th December 2019
Rank:#33
Date:23rd December 2019
Rank:#40
Date:30th December 2019
Rank:#39
Date:6th January 2020
Rank:#46
Date:13th January 2020
Rank:#46
Date:20th January 2020
Rank:#46
Date:27th January 2020
Rank:#22
Date:3rd February 2020
Rank:#22
Date:10th February 2020
Rank:#23
Date:17th February 2020
Rank:#25
Date:24th February 2020
Rank:#26
Date:2nd March 2020
Rank:#21
Date:9th March 2020
Rank:#20
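If you want the data in a structured form rather than printed, the same loop can collect the pairs and write them to a CSV file. A minimal sketch using only the standard library (the file name and column headers are my own choice):

import csv

rows = []
for graph_item in graph_text:
    tree = html.fromstring(graph_item['tooltext'])
    date = tree.xpath("//div[@class='subtitle']//text()")[0]
    rank = tree.xpath("(//div[@class='ranking-development-top-info']//div[@class='title'])[2]//text()")[0]
    rows.append((date, rank))

with open('big_ranking.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['date', 'rank'])  # header row
    writer.writerows(rows)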