Scraping a multipage website with BeautifulSoup and Selenium returns a list of empty strings



I want to scrape text from a website page by page. Every page of the site has the same HTML structure, and I use Selenium to navigate to the next page each time after collecting the following strings: text_i_want1, text_i_wantA, text_i_wantB, text_i_wantC. The relevant HTML looks like this:

[<div class="col-12">
<a href="/url" target="_blank" title="ad i">
text_i_want1
</a>
</div>, 
<div class="col-12">
<div class="row">
<div>
date: text_i_wantA
</div>
</div> 

<div class="row">
<div>
source: text_i_wantB
</div>
</div>


<div class="row">
<div>
number: text_i_wantC

<span class="processlink">
<a href="url" title="text_i_dont_want">
text_i_dont_want
</a>
</span>

</div>

</div>


</div>,
<div class="col-12">
<a href="/url" target="_blank" title="ad i">
text_i_want2
</a>
</div>, 
<div class="col-12">
<div class="row">
<div>
date: text_i_wantAA
</div>
</div> 

<div class="row">
<div>
source: text_i_wantBB
</div>
</div>


<div class="row">
<div>
number: text_i_wantCC

<span class="processlink">
<a href="/url" title="text_i_dont_want">
text_i_dont_want
</a>
</span>

</div>

</div>


</div>,
<div class="col-12">
<a href="/url" target="_blank" title="ad i">
text_i_want3
</a>
</div>, 
<div class="col-12">
<div class="row">
<div>
date: text_i_wantAAA
</div>
</div> 

<div class="row">
<div>
source: text_i_wantBBB
</div>
</div>


<div class="row">
<div>
number: text_i_wantCCC

<span class="processlink">
<a href="/url" title="text_i_dont_want">
text_i_dont_want
</a>
</span>

</div>

</div>


</div>, 
<div class="col-12">
.  
. 
. 
. 
</div>]

Because text_i_want1 and text_i_wantA, text_i_wantB, text_i_wantC are not in the same div, I used BeautifulSoup to find all `<div class="col-12">` elements and then sliced the result with [1::2], so that I iterate over only every second `<div class="col-12">` to extract text_i_wantA, text_i_wantB, text_i_wantC. For readability, the snippet above contains only 3 of the otherwise identically structured `<div class="col-12">` blocks per page.
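As an aside, the alternation described above (title in one `col-12` div, details in the next) can also be expressed by zipping the two slices, so each title block is paired with its detail block. A minimal sketch against a reduced version of the snippet above, not the live page:

```python
from bs4 import BeautifulSoup

# Reduced stand-in for the page: title blocks and detail blocks alternate.
html = """
<div class="col-12"><a href="/url" title="ad i">text_i_want1</a></div>
<div class="col-12"><div class="row"><div>date: text_i_wantA</div></div></div>
<div class="col-12"><a href="/url" title="ad i">text_i_want2</a></div>
<div class="col-12"><div class="row"><div>date: text_i_wantAA</div></div></div>
"""
soup = BeautifulSoup(html, 'html.parser')
divs = soup.find_all('div', attrs={'class': 'col-12'})
# divs[0::2] holds the title blocks, divs[1::2] the detail blocks
pairs = [(t.get_text(strip=True), d.get_text(strip=True))
         for t, d in zip(divs[0::2], divs[1::2])]
print(pairs)
# [('text_i_want1', 'date: text_i_wantA'), ('text_i_want2', 'date: text_i_wantAA')]
```

Zipping the slices keeps each title attached to its own details even if a page ends with an odd trailing block.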

title, date, name, number = [], [], [], []
while True:
    soup = bs(driver.page_source, 'html5lib')
    for div in soup.find_all('a', attrs={'title': 'ad i'}):
        titl = div.get_text(strip=True)
        title.append(titl)
    else:
        break
    for col in soup.find_all('div', attrs={'class': 'col-12'})[1::2]:
        row = []
        for entry in col.select('div.row div'):
            target = entry.find_all(text=True, recursive=False)
            row.append(target[0].strip())
        name.append(row[0])
        date.append(row[1])
        number.append(row[2])
    next_btn = driver.find_elements_by_css_selector(".page-next button")
    if next_btn:
        actions = ActionChains(driver)
        actions.move_to_element(next_btn[0]).click().perform()
        time.sleep(4)
    else:
        break
driver.close()
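One way to narrow this down (a sketch, not a diagnosis of the live page, whose markup may differ from the simplified snippet): if the wanted text sits after a nested element such as the `processlink` span, the first direct text node of the div is pure whitespace, so `target[0].strip()` returns ''. Iterating `stripped_strings` instead skips whitespace-only nodes:

```python
from bs4 import BeautifulSoup

# Hypothetical layout where the nested span comes first; in that case the
# div's first direct text node is only whitespace.
html = """<div>
<span class="processlink"><a href="url">text_i_dont_want</a></span>
number: text_i_wantC
</div>"""
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div')

direct = div.find_all(text=True, recursive=False)
print(repr(direct[0].strip()))  # '' -- the empty string seen in the output

# stripped_strings skips whitespace-only nodes (but includes nested text,
# so unwanted strings still need filtering)
wanted = [s for s in div.stripped_strings if s != 'text_i_dont_want']
print(wanted)  # ['number: text_i_wantC']
```

Printing `repr(target)` for one problematic div on the real page would confirm whether this whitespace-first ordering is actually what happens there.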

Expected output:

title = ["text_i_want1", "text_i_want2", ...]
date = ["text_i_wantA", "text_i_wantAA", ...]
name = ["text_i_wantB", "text_i_wantBB", ...]
number = ["text_i_wantC", "text_i_wantCC", ...]
The problem: actual output

title = ["text_i_want1", "text_i_want2", ...]
date = ['text_i_wantA', 'text_i_wantAA', ...]
name = ['', '', '', '', '', '', '', '', '', '']
number = ['', '', '', '', '', '', '', '', '', '']

Why are name and number empty, even though the character values are present in the HTML? Is it a problem with the CSS selectors or with the loop itself?

.........................................................................................................................................................................................................

Updated question: integration

DRIVER_PATH = 'chromedriver.exe'
options = webdriver.ChromeOptions()
options.add_argument("--no-sandbox")
prefs = {"profile.default_content_settings.popups": 0,
         "download.default_directory": r"C:\Users\aaa",
         "directory_upgrade": True,
         "plugins.always_open_pdf_externally": True}
options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(executable_path=DRIVER_PATH, options=options)
driver.get('https://parldok.thueringen.de/ParlDok/formalkriterien')
driver.maximize_window()
try:
    selenium.webdriver.support.ui.WebDriverWait(driver, 5).until(
        EC.visibility_of_element_located((By.ID, 'LegislaturperiodenList-button')))
    driver.execute_script("document.getElementById('LegislaturperiodenList').style.display='inline-block';")
    element = selenium.webdriver.support.ui.WebDriverWait(driver, 5).until(
        EC.visibility_of_element_located((By.ID, 'LegislaturperiodenList')))
    selenium.webdriver.support.ui.Select(element).select_by_value('7')
except Exception as ex:
    print(ex)
try:
    selenium.webdriver.support.ui.WebDriverWait(driver, 5).until(
        EC.visibility_of_element_located((By.ID, 'LegislaturperiodenList-button')))
    driver.execute_script("document.getElementById('DokumententypId').style.display='inline-block';")
    element = selenium.webdriver.support.ui.WebDriverWait(driver, 5).until(
        EC.visibility_of_element_located((By.ID, 'DokumententypId')))
    selenium.webdriver.support.ui.Select(element).select_by_value('10')
except Exception as ex:
    print(ex)
driver.find_element_by_css_selector('button[class="btn btn-primary"][type="submit"]').click()

This is how I set up Selenium so that I can navigate to the next page. Could you help me put the pieces together? I don't know how to combine your approach with Selenium.
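For reference, one hedged way to wire the two together, assuming the Selenium setup above has already landed on the first result page: re-parse `driver.page_source` after every click instead of issuing HTTP requests. `parse_page` mirrors the extraction from the answer below; the `.row.tlt_search_result` and `.page-next button` selectors are assumptions that must still match the live page:

```python
import time
from bs4 import BeautifulSoup

def parse_page(soup, rows):
    # Same extraction as the requests-based answer; selectors are assumptions.
    for x in soup.select('.row.tlt_search_result'):
        rows.append((
            x.select('a')[1].get_text(strip=True),
            *list(x.select('.col-12')[1].stripped_strings)[:-1],
        ))

def scrape_all(driver, pause=2.0):
    # Re-parse the rendered page after each 'next' click.
    rows = []
    while True:
        parse_page(BeautifulSoup(driver.page_source, 'html.parser'), rows)
        next_btn = driver.find_elements_by_css_selector('.page-next button')
        if not next_btn:
            break
        next_btn[0].click()
        time.sleep(pause)  # better: an explicit wait on the new page content
    return rows
```

`scrape_all(driver)` would then replace the original while loop, and the returned rows can be fed to `pd.DataFrame` as in the answer.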

Updated answer:

import requests
from bs4 import BeautifulSoup
import pandas as pd
from math import ceil

allin = []

def parser(soup):
    goal = (
        (
            x.select('a')[1].get_text(strip=True),
            *list(x.select('.col-12')[1].stripped_strings)[:-1]
        )
        for x in soup.select('.row.tlt_search_result'))
    allin.append(pd.DataFrame(goal))

def main(url):
    with requests.Session() as req:
        data = {
            "LegislaturPeriodenNummer": "7",
            "UrheberPersonenId": "",
            "UrheberSonstigeId": "",
            "DokumententypId": "10",
            "BeratungsstandId": "",
            "Datum": "",
            "DatumVon": "",
            "DatumBis": ""
        }
        r = req.post(url, data=data)
        soup = BeautifulSoup(r.text, 'lxml')
        print("Extracting Page# 1")
        parser(soup)
        try:
            nextpage = int(soup.select_one(
                '.pd_resultcount').contents[0].split()[-1]) / 10
            for page in range(2, ceil(nextpage) + 1):
                print(f"Extracting Page# {page}")
                r = req.get(f"{url}/{page}")
                soup = BeautifulSoup(r.text, 'lxml')
                parser(soup)
        except AttributeError:
            print('No More Result Found!')

if __name__ == "__main__":
    main('https://parldok.thueringer-landtag.de/ParlDok/formalkriterien')
    final = pd.concat(allin, ignore_index=True)
    print(final)
    final.to_csv('data.csv', index=False)

Output:

0  ...                       3
0     GRW-Fördermittelanträge eines Fertigteil-Herst...  ...  Dokumentnummer: 7/2303
1     Vertretung der Menschen mit Behinderungen in T...  ...  Dokumentnummer: 7/2307
2     Rassistische und rechtsextremistische Aktivitä...  ...  Dokumentnummer: 7/2306
3     Antisemitische Überfälle, Leugnung des Holocau...  ...  Dokumentnummer: 7/2302
4     Finanzierung von Kindertagesstätten in Thüring...  ...  Dokumentnummer: 7/2301
...                                                 ...  ...                     ...
2299               NaturFreunde Thüringen e.V. - Teil I  ...     Dokumentnummer: 7/6
2300  Aktuelle Sicherheitslage für Thüringer Kunst- ...  ...     Dokumentnummer: 7/5
2301  Stand der Planungen zur Ortsumgehung der Stadt...  ...     Dokumentnummer: 7/3
2302  Übergangsbestimmungen zur Neuordnung der Organ...  ...     Dokumentnummer: 7/2
2303  Baustellen entlang der Autobahn 71 zwischen de...  ...     Dokumentnummer: 7/1
[2304 rows x 4 columns]
And a single-page version without pagination:

import requests
from bs4 import BeautifulSoup
import pandas as pd

def main(url):
    data = {
        "LegislaturPeriodenNummer": "7",
        "UrheberPersonenId": "",
        "UrheberSonstigeId": "",
        "DokumententypId": "10",
        "BeratungsstandId": "",
        "Datum": "",
        "DatumVon": "",
        "DatumBis": ""
    }
    r = requests.post(url, data=data)
    soup = BeautifulSoup(r.text, 'lxml')
    goal = (
        (
            x.select('a')[1].get_text(strip=True),
            *list(x.select('.col-12')[1].stripped_strings)[:-1]
        )
        for x in soup.select('.row.tlt_search_result'))
    df = pd.DataFrame(goal)
    print(df)

main('https://parldok.thueringer-landtag.de/ParlDok/formalkriterien')

Output:

0  ...                       3
0  GRW-Fördermittelanträge eines Fertigteil-Herst...  ...  Dokumentnummer: 7/2303
1  Vertretung der Menschen mit Behinderungen in T...  ...  Dokumentnummer: 7/2307
2  Rassistische und rechtsextremistische Aktivitä...  ...  Dokumentnummer: 7/2306
3  Antisemitische Überfälle, Leugnung des Holocau...  ...  Dokumentnummer: 7/2302
4  Finanzierung von Kindertagesstätten in Thüring...  ...  Dokumentnummer: 7/2301
5        Ausstattung der unteren Naturschutzbehörden  ...  Dokumentnummer: 7/2300
6  Antifa-Szene, insbesondere das Arnstädter "Akt...  ...  Dokumentnummer: 7/2291
7  Finanzierung der Beschaffung von Ausrüstung, A...  ...  Dokumentnummer: 7/2309
8                       Statistik der Kfz-Diebstähle  ...  Dokumentnummer: 7/2308
9  Unterstützung des Freistaats Thüringen für Sta...  ...  Dokumentnummer: 7/2299
[10 rows x 4 columns]

Latest update