I have this code and I am trying to download these papers, but the loop only prints the first element.
import scrapy
from urllib.parse import urljoin

class SimpleSpider(scrapy.Spider):
    name = 'simple'
    start_urls = ['https://jmedicalcasereports.biomedcentral.com/articles?query=COVID-19&searchType=journalSearch&tab=keyword']

    def parse(self, response):
        for book in response.xpath('//*[@id="main-content"]/div/main/div[2]/ol'):
            title= response.xpath('/li[3]/article/h3/a/text()').get()
            link = urljoin(
                'https://jmedicalcasereports.biomedcentral.com/',response.xpath('/li[3]/article/ul/li[2]/a/@href').get()
            )
            yield {
                'Title':title,
                'file_urls':[link]
            }
I tried CSS selectors and then XPath; the problem is in the loop.
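First, in the third line of your parse method (the title line), response should be changed to book, so that the query runs relative to the current loop item instead of the whole page:
    title = book.xpath('.//a/text()').get()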
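Second, the XPath in the second line (the for loop) is wrong: it selects the single ol list rather than its items, so the loop runs only once and the result is incorrect. Here is my code. I hope it helps.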
def parse(self, response):
    # Iterate over each listing item, not over the enclosing <ol>.
    for book in response.xpath('//li[@class = "c-listing__item"]'):
        # The leading './/' keeps the query relative to the current item.
        title = book.xpath('.//a/text()').get()
        link = urljoin(
            'https://jmedicalcasereports.biomedcentral.com/',
            book.xpath('.//a/@href').get()
        )
        yield {
            'Title': title,
            'file_urls': [link]
        }
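To see why querying response inside the loop repeats the same element, here is a minimal sketch. It is an assumption-laden stand-in, not the Scrapy API: it uses the standard library's xml.etree.ElementTree in place of Scrapy selectors, and the markup in HTML is hypothetical, loosely modelled on the listing page. The point it illustrates (document-wide query versus item-relative query) is the same one the fix above relies on.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for the article listing page.
HTML = """
<ol>
  <li class="c-listing__item"><h3><a href="/articles/1">First paper</a></h3></li>
  <li class="c-listing__item"><h3><a href="/articles/2">Second paper</a></h3></li>
  <li class="c-listing__item"><h3><a href="/articles/3">Third paper</a></h3></li>
</ol>
""".strip()

root = ET.fromstring(HTML)

# Select every listing item (the analogue of the corrected for-loop XPath).
items = root.findall(".//li[@class='c-listing__item']")

# Buggy pattern: the query starts from the document root on every iteration,
# so each pass finds the same first link.
global_titles = [root.find(".//a").text for _ in items]

# Fixed pattern: the query starts from the current item.
relative_titles = [item.find(".//a").text for item in items]

print(global_titles)
print(relative_titles)
```

In Scrapy the equivalent is calling book.xpath('.//a/text()') rather than response.xpath(...) inside the loop.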
The output is:
{'Title': 'Presentation of COVID-19 infection with bizarre behavior and encephalopathy: a case report', 'file_urls': ['https://jmedicalcasereports.biomedcentral.com/articles/10.1186/s13256-021-02851-0']}
2022-04-17 21:54:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://jmedicalcasereports.biomedcentral.com/articles?query=COVID-19&searchType=journalSearch&tab=keyword>
{'Title': 'Dysentery as the only presentation of COVID-19 in a child: a\xa0case report', 'file_urls': ['https://jmedicalcasereports.biomedcentral.com/articles/10.1186/s13256-021-02672-1']}
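The file_urls values in this output come from urljoin, which resolves each href against the site root. The sketch below shows that step in isolation. The helper absolute_url is hypothetical (it is not part of the original code or of Scrapy), and the example.org link is made up for illustration. The guard for a missing href is an added assumption: when a selector's .get() finds nothing it returns None, and it is probably better to skip such an item than to pass None on to urljoin.

```python
from urllib.parse import urljoin

BASE = "https://jmedicalcasereports.biomedcentral.com/"


def absolute_url(href):
    """Hypothetical helper: resolve href against BASE, or return None if missing."""
    if not href:
        return None
    return urljoin(BASE, href)


# A site-relative href is resolved against the base URL.
relative = absolute_url("/articles/10.1186/s13256-021-02851-0")

# An href that is already absolute is returned unchanged.
external = absolute_url("https://example.org/paper.pdf")

# A missing href (what .get() returns when nothing matches) yields None.
missing = absolute_url(None)

print(relative)
print(external)
print(missing)
```

Inside parse, one could call such a helper on book.xpath('.//a/@href').get() and yield the item only when the result is not None.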