How to extract data from th and td tags using Selenium in Python



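I am trying to collect data from the DOL website for a project using Selenium and Python, and I want to combine the column data into a DataFrame. The problem is that the first two columns of each row are coded as <th> tags rather than <td> tags, so my XPath does not pick them up when I try to extract the data. I have searched everywhere and cannot find a solution. Any help would be appreciated.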

<tr>
<th id="Alabama" align="left">Alabama</th>
<th id="01/04/2020" align="right">01/04/2020</th>
<td headers="Alabama 01/04/2020 initial_claims" align="right">4,578</td>
<td headers="Alabama 01/04/2020 reflecting_week_ended" align="right">12/28/2019</td>
<td headers="Alabama 01/04/2020 continued_claims" align="right">18,523</td>
<td headers="Alabama 01/04/2020 covered_employment" align="right">1,923,741</td>
<td headers="Alabama 01/04/2020 insured_unemployment" align="right">0.96</td>
</tr>

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.select import Select
from selenium.webdriver.common.action_chains import ActionChains

url = 'https://oui.doleta.gov/unemploy/claims.asp'
driver = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\chromedriver.exe")

driver.implicitly_wait(10)
driver.get(url)
driver.find_element_by_css_selector('input[name="level"][value="state"]').click()
Select(driver.find_element_by_name('strtdate')).select_by_value('2020')
Select(driver.find_element_by_name('enddate')).select_by_value('2022')
driver.find_element_by_css_selector('input[name="filetype"][value="html"]').click()
select = Select(driver.find_element_by_id('states'))
# Iterate through and select all states
for opt in select.options:
    opt.click()
input('Press ENTER to submit the form')
driver.find_element_by_css_selector('input[name="submit"][value="Submit"]').click()
headers = []
heads = driver.find_elements_by_xpath('//*[@id="content"]/table/tbody/tr[2]/th')
#Collect headers
for h in heads:
    headers.append(h.text)
rows = driver.find_elements_by_xpath('//*[@id="content"]/table/tbody/tr')

# Get row count
row_count = len(rows) 
cols = driver.find_elements_by_xpath('//*[@id="content"]/table/tbody/tr[3]/th/td')
# Get column count
col_count = len(cols)

I tried this code

cols = driver.find_elements_by_xpath('//*[@id="content"]/table/tbody/tr[3]/th' and '//* [@id="content"]/table/tbody/tr[3]/td')

as suggested. However, it still only pulls 5 columns, whereas the HTML above shows that each row has 7. I need all of them. Can anyone help?

You can use * or name() in the XPath to extract data from all 7 columns. The XPath looks like this:

rows = driver.find_elements_by_xpath("//table/tbody/tr")
# Here row is one element taken from rows.
# "./*" gets every child element of the row; the leading dot makes the XPath relative to that element.
cols = row.find_elements_by_xpath("./*")

or

# Gets only the child elements of the row whose tag name is "th" or "td".
cols = row.find_elements_by_xpath("./*[name()='th' or name()='td']")
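The effect of the relative `./*` path can be checked without a browser. The following is only a sketch: it parses the single row quoted in the question with the standard library's `xml.etree.ElementTree` as a stand-in for Selenium. ElementTree supports a limited XPath subset (it has no `name()` function), and the names `ROW_HTML`, `all_cells` and `td_only` are illustrative.

```python
import xml.etree.ElementTree as ET

# The sample row from the question, copied verbatim.
ROW_HTML = """
<tr>
<th id="Alabama" align="left">Alabama</th>
<th id="01/04/2020" align="right">01/04/2020</th>
<td headers="Alabama 01/04/2020 initial_claims" align="right">4,578</td>
<td headers="Alabama 01/04/2020 reflecting_week_ended" align="right">12/28/2019</td>
<td headers="Alabama 01/04/2020 continued_claims" align="right">18,523</td>
<td headers="Alabama 01/04/2020 covered_employment" align="right">1,923,741</td>
<td headers="Alabama 01/04/2020 insured_unemployment" align="right">0.96</td>
</tr>
"""

row = ET.fromstring(ROW_HTML.strip())

# "./*" matches every direct child of the row, whatever its tag name.
all_cells = [cell.text for cell in row.findall("./*")]

# "./td" matches only the td children, so the two th cells are missed.
td_only = [cell.text for cell in row.findall("./td")]

print(len(all_cells), all_cells)
print(len(td_only), td_only)
```

The wildcard version yields 7 cells and the td-only version yields 5, which matches the column counts described in the question.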

Try the following:

# Get the rows
rows = driver.find_elements_by_xpath("//table/tbody/tr")
# Iterate over the rows
for row in rows:
    # Get all the columns for each row.
    # cols = row.find_elements_by_xpath("./*")
    cols = row.find_elements_by_xpath("./*[name()='th' or name()='td']")
    temp = []  # Temporary list holding the cell texts of this row
    for col in cols:
        temp.append(col.text)
    print(temp)

['']
['State', 'Filed week ended', 'Initial Claims', 'Reflecting Week Ended', 'Continued Claims', 'Covered Employment', 'Insured Unemployment Rate']
['Alabama', '01/04/2020', '4,578', '12/28/2019', '18,523', '1,923,741', '0.96']
['Alabama', '01/11/2020', '3,629', '01/04/2020', '21,143', '1,923,741', '1.10']
['Alabama', '01/18/2020', '2,483', '01/11/2020', '17,402', '1,923,741', '0.90']
...
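As for the attempt in the question that joined two XPath strings with `and`: Python evaluates that expression before Selenium is called, and for two non-empty strings `and` returns the right-hand operand. Only the td path was ever searched, which is consistent with the 5 columns reported. The sketch below shows this without a browser. `TH_PATH`, `TD_PATH`, `passed_to_selenium` and `union_path` are illustrative names, and the `|` form is the standard XPath union operator, which should select both tag types when passed as a single string.

```python
# Stand-in paths based on the ones in the question; no browser is needed here.
TH_PATH = '//*[@id="content"]/table/tbody/tr[3]/th'
TD_PATH = '//*[@id="content"]/table/tbody/tr[3]/td'

# Python resolves `and` itself: with two non-empty (truthy) strings,
# the result is simply the second string.
passed_to_selenium = TH_PATH and TD_PATH

# The XPath union operator lives inside one string instead.
union_path = f"{TH_PATH} | {TD_PATH}"

print(passed_to_selenium == TD_PATH)
print(union_path)
```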

To scrape the data from the <th> and <td> tags you can use a list comprehension together with the following locator strategy:

  • Code block:

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait, Select
    from selenium.webdriver.support import expected_conditions as EC

    # Assumes driver is an already created WebDriver instance, as in the question
    driver.get("https://oui.doleta.gov/unemploy/claims.asp")
    WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR,"input[value='state']"))).click()
    Select(driver.find_element_by_name('strtdate')).select_by_value('2020')
    Select(driver.find_element_by_name('enddate')).select_by_value('2022')
    Select(driver.find_element_by_id('states')).select_by_visible_text('Alabama')
    driver.find_element_by_css_selector("input[value='Submit']").click()
    # To print all the texts from the first data row (the third tr of the table)
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table[summary*='Report Table'] tbody tr:nth-child(3)"))).text)
    print("*****")
    # To create a list with the texts from the first data row using a list comprehension
    # Note: [align='right'] skips the state cell, which has align="left", so 6 values are returned
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table[summary*='Report Table'] tbody tr:nth-child(3) [align='right']")))])
    driver.quit()
    
  • Console output:

    Alabama 01/04/2020 4,578 12/28/2019 18,523 1,923,741 0.96
    *****
    ['01/04/2020', '4,578', '12/28/2019', '18,523', '1,923,741', '0.96']
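Since the stated goal is a DataFrame, the collected rows still need a header and numeric values. The following is a minimal sketch, assuming the rows were gathered as lists of cell text shaped like the printed output earlier in this thread. `rows_to_records` and `to_number` are hypothetical helper names, not part of Selenium or pandas.

```python
def rows_to_records(rows):
    """Turn scraped rows (lists of cell text) into dicts keyed by the header row."""
    # Drop rows without any visible text, such as a leading [''].
    non_empty = [r for r in rows if any(cell.strip() for cell in r)]
    if not non_empty:
        return []
    # The first remaining row is treated as the header (an assumption).
    header, *data = non_empty
    # Keep only rows whose width matches the header.
    return [dict(zip(header, r)) for r in data if len(r) == len(header)]


def to_number(text):
    """Convert '4,578' to 4578 and '0.96' to 0.96; use on numeric columns only."""
    cleaned = text.replace(",", "")
    return float(cleaned) if "." in cleaned else int(cleaned)


# Sample rows copied from the output shown in the thread.
scraped = [
    [''],
    ['State', 'Filed week ended', 'Initial Claims', 'Reflecting Week Ended',
     'Continued Claims', 'Covered Employment', 'Insured Unemployment Rate'],
    ['Alabama', '01/04/2020', '4,578', '12/28/2019', '18,523', '1,923,741', '0.96'],
    ['Alabama', '01/11/2020', '3,629', '01/04/2020', '21,143', '1,923,741', '1.10'],
    ['Alabama', '01/18/2020', '2,483', '01/11/2020', '17,402', '1,923,741', '0.90'],
]

records = rows_to_records(scraped)
print(records[0]['State'], to_number(records[0]['Initial Claims']))
```

A list of dicts like `records` can be passed directly to `pandas.DataFrame(records)` if pandas is installed.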
    
