How to skip elements that are already in a list



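I'm trying to figure out how to add an image id to a list and skip that image on the next search. This is my code so far, and I've tried a lot… The bot should add the most recently copied image to a "used" blacklist and not copy it again next time.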

search = True
used = []
driver = webdriver.Chrome()
driver.get('https://9gag.com/funny')
time.sleep(2)
driver.find_element(By.XPATH,value='//*[@id="qc-cmp2-ui"]/div[2]/div/button[1]/span').click()
time.sleep(2)
while True:
    while search:
        post = driver.find_element(By.CSS_SELECTOR,value='.post-container a img')
        if post.id in used:
            search = True
        else:
            search = False

    post_url = post.get_attribute('src')
    post_title = post.get_attribute('alt')
    used.append(post.id)
    print(post_url)
    print(post_title)
    print('......')
    print(used)
    print(post.id)
    time.sleep(20)

Problem: it adds the used image to the list, but it still finds and copies it again…

https://img-9gag-fun.9cache.com/photo/aWgNx36_460s.jpg
Fat acceptance activist on the news was so fat they had to put her in landscape mode
......
['4d17ee3f-213a-4d54-a18f-425f8f4dea4b']
4d17ee3f-213a-4d54-a18f-425f8f4dea4b
https://img-9gag-fun.9cache.com/photo/aWgNx36_460s.jpg
Fat acceptance activist on the news was so fat they had to put her in landscape mode
......
['4d17ee3f-213a-4d54-a18f-425f8f4dea4b', '4d17ee3f-213a-4d54-a18f-425f8f4dea4b']
4d17ee3f-213a-4d54-a18f-425f8f4dea4b
https://img-9gag-fun.9cache.com/photo/aWgNx36_460s.jpg
Fat acceptance activist on the news was so fat they had to put her in landscape mode
......
['4d17ee3f-213a-4d54-a18f-425f8f4dea4b', '4d17ee3f-213a-4d54-a18f-425f8f4dea4b', '4d17ee3f-213a-4d54-a18f-425f8f4dea4b']
4d17ee3f-213a-4d54-a18f-425f8f4dea4b

Edit: code:

while True:
    driver.switch_to.window(gag_tab)
    posts = driver.find_elements(By.CSS_SELECTOR,value='.post-container a img')

    for post in posts:
        post_url = post.get_attribute('src')
        post_title = post.get_attribute('alt')
        # paste the url and title into another site
        time.sleep(20)

Error:

Traceback (most recent call last):
File "main.py", line 86, in <module>
post_url = post.get_attribute('src')
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
(Session info: chrome=101.0.4951.67)
Stacktrace:
Backtrace:
Ordinal0 [0x009CB8F3+2406643]
Ordinal0 [0x0095AF31+1945393]
Ordinal0 [0x0084C748+837448]
Ordinal0 [0x0084F154+848212]
Ordinal0 [0x0084F012+847890]
Ordinal0 [0x0084F98A+850314]
Ordinal0 [0x008A50C9+1200329]
Ordinal0 [0x0089427C+1131132]
Ordinal0 [0x008A4682+1197698]
Ordinal0 [0x00894096+1130646]
Ordinal0 [0x0086E636+976438]
Ordinal0 [0x0086F546+980294]
GetHandleVerifier [0x00C39612+2498066]
GetHandleVerifier [0x00C2C920+2445600]
GetHandleVerifier [0x00A64F2A+579370]
GetHandleVerifier [0x00A63D36+574774]
Ordinal0 [0x00961C0B+1973259]
Ordinal0 [0x00966688+1992328]
Ordinal0 [0x00966775+1992565]
Ordinal0 [0x0096F8D1+2029777]
BaseThreadInitThunk [0x75B9FA29+25]
RtlGetAppContainerNamedObjectPath [0x77C77A7E+286]
RtlGetAppContainerNamedObjectPath [0x77C77A4E+238]

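First: you forgot to set search = True again after printing the last post, so the inner loop is skipped every time and the first post is printed again. Even with that fixed you are not done, because driver.find_element() always returns the first element that matches your arguments. The first post is already in the used list, so search is set back to True on every pass and you end up in an endless loop.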
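
The difference can be seen without a browser. The following is only a sketch: page_posts, find_first and find_all are hypothetical stand-ins for the page and for find_element() / find_elements(), not Selenium calls.

```python
# Hypothetical stand-in for the page: each post is just a dict here.
page_posts = [
    {"id": "a1", "title": "first post"},
    {"id": "b2", "title": "second post"},
    {"id": "c3", "title": "third post"},
]


def find_first(posts):
    """Mimics find_element(): always returns the first match."""
    return posts[0]


def find_all(posts):
    """Mimics find_elements(): returns every match as a list."""
    return list(posts)


used = ["a1"]  # the first post has already been copied

# find_first() never gets past the used post, however often it is called.
attempts = [find_first(page_posts)["id"] for _ in range(3)]
print(attempts)

# Iterating over all matches lets the used ones be skipped.
fresh = [post["id"] for post in find_all(page_posts) if post["id"] not in used]
print(fresh)
```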

Try using driver.find_elements() instead. It returns a list of all matching posts, so you can loop over that list and print each post, like this:

posts = driver.find_elements(by=By.CSS_SELECTOR, value='.post-container a img')
for post in posts:
    post_url = post.get_attribute('src')
    post_title = post.get_attribute('alt')
    used.append(post.id)
    print(post_url)
    print(post_title)
    print('......')
    print(used)
    print(post.id)
    time.sleep(2)

Edit:

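Since driver.find_elements() only picks up the posts that have been loaded on the site so far, you need to call it again whenever the page is scrolled down. That's why I put it in a while loop and ignore posts that have already been printed. As for the StaleElementReferenceException, I added a try-except block to skip elements that can no longer be referenced. This happens when you scroll down the site too quickly. You can import these exceptions like this: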

from selenium.common.exceptions import StaleElementReferenceException
from selenium.common.exceptions import WebDriverException

Make sure there are no naming conflicts.

Here is my current solution:

used = []
while True:
    posts = driver.find_elements(by=By.CSS_SELECTOR, value='.post-container a img')
    for post in posts:
        if post.id not in used:
            try:
                post_url = post.get_attribute('src')
                post_title = post.get_attribute('alt')
            # a tuple is needed to catch more than one exception type
            except (StaleElementReferenceException, WebDriverException):
                continue
            used.append(post.id)
            print(post_title)
            print(post_url)
            print('__________')
    time.sleep(2)
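
A note on catching several exception types in one handler: Python expects a tuple after except. Writing `except A or B` evaluates `A or B` first, which yields just `A`, so `B` is never caught. A minimal sketch with two hypothetical exception classes (FirstError and SecondError are made up for illustration and are not part of Selenium):

```python
class FirstError(Exception):
    """Hypothetical exception, for illustration only."""


class SecondError(Exception):
    """Hypothetical exception, unrelated to FirstError."""


def caught_with_or(exc):
    """Return True if `except FirstError or SecondError` catches exc."""
    try:
        try:
            raise exc
        except FirstError or SecondError:  # evaluates to FirstError only
            return True
    except Exception:
        # The inner handler did not match, so the exception escaped to here.
        return False


def caught_with_tuple(exc):
    """Return True if the tuple form catches exc."""
    try:
        raise exc
    except (FirstError, SecondError):
        return True


print(caught_with_or(FirstError()))
print(caught_with_or(SecondError()))
print(caught_with_tuple(SecondError()))
```

In Selenium, StaleElementReferenceException is itself a subclass of WebDriverException, so a handler for WebDriverException alone would also cover the stale-element case, at the cost of hiding other driver errors.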

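You need to scroll the site, either manually or automatically, to load more posts that can be printed. For automatic scrolling, the Selenium driver has an execute_script() method that you can use to run scroll commands step by step.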
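
One way the scrolling could be automated is sketched below. The helper names (scroll_to_bottom, collect_new_posts, crawl) and the rounds and pause parameters are hypothetical. The sketch also swaps in a different key for the blacklist: it tracks the image src URL in a set rather than post.id, on the assumption that the URL is a more stable identifier if the page re-renders. Stale-element handling is left out to keep it short.

```python
import time


def scroll_to_bottom(driver):
    """Ask the browser to scroll to the current bottom of the page."""
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")


def collect_new_posts(driver, seen_urls, selector=".post-container a img"):
    """Return (url, title) pairs for images whose src has not been seen yet."""
    fresh = []
    # "css selector" is the string value behind By.CSS_SELECTOR.
    for img in driver.find_elements("css selector", selector):
        url = img.get_attribute("src")
        if url and url not in seen_urls:
            seen_urls.add(url)
            fresh.append((url, img.get_attribute("alt")))
    return fresh


def crawl(driver, rounds=3, pause=2.0):
    """Alternate between collecting posts and scrolling for more."""
    seen_urls = set()
    collected = []
    for _ in range(rounds):
        collected.extend(collect_new_posts(driver, seen_urls))
        scroll_to_bottom(driver)
        time.sleep(pause)  # give the site time to load more posts
    return collected
```

With a real browser this would be called as crawl(driver) after driver.get(...) and after the consent dialog has been dismissed. A fixed pause is the simplest choice here; the right value depends on how quickly the site loads new posts.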

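The variable "post" has no relative context (the value starts with a period). Since there is no description of the actual web page structure, it is hard to determine the correct code you need.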

I found these two YouTube videos very instructive:

  • How to automate the web in Python using Selenium, Pt. 1: https://www.youtube.com/watch?v=pUUhvJvs-R4
  • How to scrape dynamic websites with Selenium: https://www.youtube.com/watch?v=lTypMlVBFM4
