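I'm a beginner. I wrote a script that scrapes data from a website, and it works well: I get the results as a JSON file, and it looks perfect. Now, when I use the script to scrape multiple URLs (from the same website), it works and I get the data for every URL in the JSON file, but there is a bug. My output is structured like this (as coded in the script):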
[
{Title:,,,Description:,,,Brochure:}, #URL1
{titleDesc:,,,Content:}, #URL1
{attribute:} #URL1
]
When I scrape 2 URLs, I get this:
[
{Title:,,,Description:,,,Brochure:}, #URL1
{titleDesc:,,,Content:}, #URL1
{attribute:},#URL1
{Title:,,,Description:,,,Brochure:}, #URL2
{titleDesc:,,,Content:}, #URL2
{attribute:} #URL2
]
That is still fine, but when I add more URLs, the structure gets messed up and becomes this:
[
{Title:,,,Description:,,,Brochure:}, #URL1
{titleDesc:,,,Content:}, #URL1
{attribute:}, #URL1
{Title:,,,Description:,,,Brochure:}, #URL2
{Title:,,,Description:,,,Brochure:}, #URL3
{titleDesc:,,,Content:}, #URL2
{attribute:}, #URL2
{titleDesc:,,,Content:}, #URL3
{attribute:} #URL3
]
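If you look closely, you'll notice that the title of the third URL appears directly below the title of the second URL, before the rest of the second URL's items. Can anyone help?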
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "attributes"
    start_urls = [
        "https://product.sanyglobal.com/concrete_machinery/truck_mixer/119/161/",
        "https://product.sanyglobal.com/concrete_machinery/truck_mixer/119/162/",
    ]

    def parse(self, response):
        yield {
            "title": response.css("div.sku-top-title::text").get(),
            "desc": response.css("div.sku-top-desc::text").get(),
            "brochure": "brochure",
        }
        for post in response.css(".el-collapse"):
            for i in range(len(post.css(".el-collapse-item__header"))):
                res = ""
                lst = post.css(".value-el-desc")
                x = lst[i].css(".value-el-desc p::text").extract()
                for y in x:
                    res += y.strip() + "&&"
                try:
                    yield {
                        "descTitle": post.css(".el-collapse-item__header::text")[i].get().strip(),
                        "desc": res,
                    }
                except:
                    continue
                res = ""
        for post in response.css(".lie-one-canshu"):
            try:
                dicti = {"attribute": post.css(".lie-one-canshu::text")[0].get().strip()}
                yield dicti
            except:
                continue
Update: I noticed that the bug is not permanent; sometimes I run the script and the result is fine.
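Scrapy is asynchronous, so the order in which items are output or processed is not guaranteed, at least not out of the box. If you want all of the output for a single URL to stay together, I suggest yielding only one item per call to the parse method.
For example: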
def parse(self, response):
    # Collect everything scraped from this response in a single item,
    # so the parts belonging to one URL cannot be interleaved with another.
    results = {
        "items": [{
            "title": response.css("div.sku-top-title::text").get(),
            "desc": response.css("div.sku-top-desc::text").get(),
            "brochure": "brochure",
        }]
    }
    for post in response.css(".el-collapse"):
        for i in range(len(post.css(".el-collapse-item__header"))):
            res = ""
            lst = post.css(".value-el-desc")
            x = lst[i].css(".value-el-desc p::text").extract()
            for y in x:
                res += y.strip() + "&&"
            try:
                results["items"].append({
                    "descTitle": post.css(".el-collapse-item__header::text")[i].get().strip(),
                    "desc": res,
                })
            except Exception:
                # Skip headers whose text node is missing.
                continue
    for post in response.css(".lie-one-canshu"):
        try:
            results["items"].append({
                "attribute": post.css(".lie-one-canshu::text")[0].get().strip()
            })
        except Exception:
            # Skip attribute blocks that have no text.
            continue
    # One item per response: yielded once, after everything is collected.
    yield results
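
Note that this only keeps the data for each URL together; the order of the URLs themselves in the output file can still vary between runs. If that order matters too, one option is to post-process the exported list. The sketch below is not part of the spider above. It assumes you also store the page address in each item under a hypothetical "url" key (for example by adding "url": response.url to results), and order_by_urls is an illustrative helper name, not a Scrapy API.

```python
def order_by_urls(items, urls):
    """Return items sorted to match the order of urls.

    Assumes every item has a "url" key (a hypothetical field, not present
    in the original spider). Items whose URL is not listed in urls are
    placed at the end, keeping their relative order.
    """
    position = {url: index for index, url in enumerate(urls)}
    # sorted() is stable, so items with the same rank keep their order.
    return sorted(items, key=lambda item: position.get(item["url"], len(urls)))


start_urls = [
    "https://product.sanyglobal.com/concrete_machinery/truck_mixer/119/161/",
    "https://product.sanyglobal.com/concrete_machinery/truck_mixer/119/162/",
]

# Stand-in for the list loaded from the exported JSON file,
# with the second URL having finished first.
scraped = [
    {"url": start_urls[1], "items": [{"title": "second page"}]},
    {"url": start_urls[0], "items": [{"title": "first page"}]},
]

ordered = order_by_urls(scraped, start_urls)
for record in ordered:
    print(record["url"])
```

If the site redirects, response.url may differ from the entry in start_urls, so in that case the lookup key would need adjusting.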