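I feel like I'm not grasping some concepts here, or I'm trying to fly before I can crawl (pun intended).
There are indeed 5 tables on the page, and the one I'm interested in is the 3rd. But executing this: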
#!/usr/bin/python
# python 3.x
import sys
import os
import re
import requests
import scrapy

class iso3166_spider(scrapy.Spider):
    name = "countries"

    def start_requests(self):
        urls = ["https://en.wikipedia.org/wiki/ISO_3166-1"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        title = response.xpath('//title/text()').get()
        print("-- title -- {0}".format(title))
        list_table_selector = response.xpath('//table')  # gets all tables on page
        print("-- table count -- {0}".format(len(list_table_selector)))
        table_selector = response.xpath('//table[2]')  # inspect to figure out which one you want
        table_selector_text = table_selector.getall()  # got the right table, starts with Afghanistan
        # print(table_selector_text)
        #
        # here is where things go wrong
        list_row_selector = table_selector.xpath('//tr')
        print("number of rows in table: {0}".format(len(list_row_selector)))  # gives 302, should be close to 247
        for i in range(0, 20):
            row_selector = list_row_selector[i]
            row_selector_text = row_selector.getall()
            print("i={0}, getall:{1}".format(i, row_selector_text))
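prints the getall() of every row in every table - I see that the row for Afghanistan is the 8th row rather than the 2nd.
Changing
list_row_selector = table_selector.xpath('//tr')
to
list_row_selector = table_selector.xpath('/tr')
results in zero rows being found where I expect approximately 247.
Ultimately, I want the name and the three codes for each country, which should be simple.
What am I doing wrong?
TIA,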
克尔楚克
Try this XPath. I checked the source of the web page, and the header row (the "th" elements) is also under tbody, which is why the first row is skipped:
tbl = response.xpath("//th[starts-with(text(),'English short name')]/ancestor::table/tbody/tr[position()>1]")
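You can also try replacing "//tr" with ".//tr", so that the search starts from the selected table instead of from the document root.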