How do I scrape the subheadings from this link?



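I made a web scraper that scrapes data from pages that look like this one (it scrapes the tables): https://www.techpowerup.com/gpudb/2/

The problem is that, for some reason, my program only scrapes the values and not the subheadings. For example (click the link), it only scrapes "R420", "130nm", "160 million", etc., and not "GPU Name", "Process Size", "Transistors", etc.

What should I add to my code so that it scrapes the subheadings as well? Here is my code: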

import csv
import requests
import bs4
url = "https://www.techpowerup.com/gpudb/2"

#obtain HTML and parse through it
response = requests.get(url)
html = response.content
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
soup = bs4.BeautifulSoup(html, "lxml")
tables = soup.findAll("table")
#reading every value in every row in each table and making a matrix
tableMatrix = []
for table in tables:
    list_of_rows = []
    for row in table.findAll('tr'):
        list_of_cells = []
        for cell in row.findAll('td'):
            text = cell.text.replace(' ', '')
            list_of_cells.append(text)
        list_of_rows.append(list_of_cells)
    tableMatrix.append((list_of_rows, list_of_cells))
#(YOU CAN PROBABLY IGNORE THIS)placeHolder used to avoid duplicate data from appearing in list
placeHolder = 0
excelTable = []
for table in tableMatrix:
    for row in table:
        if placeHolder == 0:
            for entry in row:
                excelTable.append(entry)
            placeHolder = 1
        else:
            placeHolder = 0
            excelTable.append('\n')
for value in excelTable:
    print value
    print '\n'

#create excel file and write the values into a csv
fl = open(str(count) + '.csv', 'w')
writer = csv.writer(fl)
for values in excelTable:
    writer.writerow(values)
fl.close()

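If you inspect the page source, you will see that those cells are header cells, so they use the TH tag rather than the TD tag. You will probably need to update your loop so that it includes the TH cells along with the TD cells.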
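
As a rough sketch of that change (not tested against the live page), Beautiful Soup's `find_all` accepts a list of tag names, so one call can collect both `th` and `td` cells in document order. The `SAMPLE_HTML` string and the `table_rows` helper below are made up for illustration, and the real page's markup may well differ. The sketch also uses the built-in `html.parser` instead of `lxml` so that it runs without extra dependencies.

```python
import bs4

# Hypothetical sample markup standing in for one table on the page;
# the structure of the real page is an assumption.
SAMPLE_HTML = """
<table>
  <tr><th>GPU Name</th><td>R420</td></tr>
  <tr><th>Process Size</th><td>130 nm</td></tr>
  <tr><th>Transistors</th><td>160 million</td></tr>
</table>
"""


def table_rows(html):
    """Return each table row as a list of cell texts, header cells included."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    rows = []
    for row in soup.find_all("tr"):
        # Passing a list of names matches either tag, in document order.
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        if cells:  # skip rows that contain no th/td cells
            rows.append(cells)
    return rows


rows = table_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

In the code from the question, the equivalent change would be to replace `row.findAll('td')` with `row.findAll(['th', 'td'])`; the rest of the loop can stay as it is.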
