Too many rows returned from a table



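I feel like I'm missing some concept here, or trying to fly before I can crawl (pun intended).

There are indeed 5 tables on the page, and the one I'm interested in is the 3rd. But executing this: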

#!/usr/bin/python
# python 3.x
import sys
import os
import re
import requests
import scrapy


class iso3166_spider(scrapy.Spider):
    name = "countries"

    def start_requests(self):
        urls = ["https://en.wikipedia.org/wiki/ISO_3166-1"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        title = response.xpath('//title/text()').get()
        print("-- title -- {0}".format(title))
        list_table_selector = response.xpath('//table')   # gets all tables on page
        print("-- table count -- {0}".format(len(list_table_selector)))
        table_selector = response.xpath('//table[2]')     # inspect to figure out which one you want
        table_selector_text = table_selector.getall()     # got the right table, starts with Afghanistan
        #   print(table_selector_text)
        #
        #   here is where things go wrong
        list_row_selector = table_selector.xpath('//tr')
        print("number of rows in table: {0}".format(len(list_row_selector)))  # gives 302, should be close to 247
        for i in range(0, 20):
            row_selector = list_row_selector[i]
            row_selector_text = row_selector.getall()
            print("i={0}, getall:{1}".format(i, row_selector_text))

Printing getall() for each row, I see that the row for Afghanistan is row 8 rather than row 2.

Changing

list_row_selector = table_selector.xpath('//tr')

to

list_row_selector = table_selector.xpath('/tr')

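results in zero rows found where I expect about 247.

Ultimately, I want the name and the three codes for each country, which should be simple.

What am I doing wrong?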

TIA,

克尔楚克

# Try this XPath. I checked the source of the web page: the header row (the "th" elements) is also under tbody, hence tr[position()>1].
tbl = response.xpath("//th[starts-with(text(),'English short name')]/ancestor::table/tbody/tr[position()>1]")
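
The idea behind that XPath is to find the table by its header text instead of by its position on the page, then skip the header row. A minimal sketch of the same idea follows. It uses the standard library's xml.etree.ElementTree as a stand-in, since ElementTree does not support ancestor:: or starts-with(). The PAGE markup and the find_table_by_header helper are hypothetical and only illustrate the approach; they are not the real Wikipedia page or part of Scrapy.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a page with several tables.
PAGE = """<html><body>
<table><tr><th>Key</th></tr><tr><td>A</td></tr></table>
<table><tbody>
<tr><th>English short name</th><th>Alpha-2 code</th></tr>
<tr><td>Afghanistan</td><td>AF</td></tr>
</tbody></table>
</body></html>"""


def find_table_by_header(root, prefix):
    """Return the first <table> containing a <th> whose text starts with prefix."""
    for table in root.iter("table"):
        for th in table.iter("th"):
            if (th.text or "").strip().startswith(prefix):
                return table
    return None


root = ET.fromstring(PAGE)
table = find_table_by_header(root, "English short name")

# Skip the first row (the header), mirroring tr[position()>1] in the XPath.
data_rows = table.findall(".//tr")[1:]
names = [row.find("td").text for row in data_rows]
print(names)
```

Selecting by header text keeps working if tables are added or reordered on the page, which a positional index such as table[2] does not.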

You can also try replacing "//tr" with ".//tr".
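
The reason is that an XPath starting with "//" or "/" is absolute: it is evaluated from the document root even when called on a sub-selector, so '//tr' returns the rows of every table on the page. '/tr' looks for a tr that is the root element, which never matches. The leading dot makes the path relative to the selected table. Below is a minimal sketch of the difference, again using the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors; the HTML snippet is a hypothetical, simplified two-table page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page: a small legend table followed by the country table.
HTML = """<html><body>
<table>
  <tr><th>Key</th><th>Meaning</th></tr>
  <tr><td>A</td><td>first</td></tr>
</table>
<table>
  <tr><th>English short name</th><th>Alpha-2</th><th>Alpha-3</th><th>Numeric</th></tr>
  <tr><td>Afghanistan</td><td>AF</td><td>AFG</td><td>004</td></tr>
  <tr><td>Albania</td><td>AL</td><td>ALB</td><td>008</td></tr>
</table>
</body></html>"""

root = ET.fromstring(HTML)
target = root.findall(".//table")[1]   # the second table in the document

# Searching from the document root finds the rows of ALL tables.
# This is what '//tr' does in Scrapy, even when called on a table selector.
all_rows = root.findall(".//tr")

# Searching relative to the chosen table finds only that table's rows.
table_rows = target.findall(".//tr")

# Skip the header row, then collect the name and the three codes per country.
countries = []
for row in table_rows[1:]:
    cells = [(td.text or "").strip() for td in row.findall("td")]
    if cells:
        countries.append(tuple(cells))

print(len(all_rows), len(table_rows))
print(countries)
```

In the spider from the question, the equivalent change would be table_selector.xpath('.//tr'), followed by a relative query such as row.xpath('./td//text()') on each row to read the cells.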
