How do I scrape a website when all of its content is inside a single tag (using Python 3)?



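I'm trying to scrape a website with election results. All of the content is inside a single <pre> tag, and parsing it into JSON with Python 3 is turning out to be very hard.

Last year, when I had to scrape this site, I only needed the results for two races, so I did something like this: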

    import urllib.request

    from bs4 import BeautifulSoup


    def scrape_kendall():
        COUNTY_NAME = "Kendall"
        # sets URLs
        KENDALL_RACE_URL = 'https://results.co.kendall.il.us/'

        # gets data
        html = urllib.request.urlopen(KENDALL_RACE_URL).read()
        soup = BeautifulSoup(html, 'html.parser')
        # creates empty list for results info
        kendall_county_results = []
        data = soup.find('pre').text
        precincts_total = 87
        rows = data.splitlines()
        for index, row in enumerate(rows):
            if row.startswith(" PRECINCTS"):
                precincts_reporting = int(row[-2:])
            if row == "COUNTY BOARD MEMBER-DIST.1":
                dist1_race_name = row
                dist1_race_obj = initialize_race_obj(dist1_race_name, precincts_reporting, precincts_total, COUNTY_NAME)
            if index >= 115 and index <= 119:  # hard-coded
                cand_index = int(str(index)[-1:]) - 2
                cand_info, full_name, party = get_candidate_info(row)

                first_name, middle_name, last_name = parse_name(full_name)
                votes = get_vote_count(cand_info)

                formatted_candidate_info = get_candidates_in_race_obj(
                    first_name, middle_name, last_name,
                    votes, party, cand_index)
                dist1_race_obj["reporting_units"][0]['candidates'].append(formatted_candidate_info)
        ...etc

This produced data that looks like this:

    [
      {
        "name": "County Board Member-Dist.1",
        "description": "",
        "election_date": "2020-11-03",
        "market": "chinews",
        "uncontested": false,
        "amendment": false,
        "state_postal": "IL",
        "recount": false,
        "reporting_units": [
          {
            "name": "Kendall",
            "level": "county",
            "district_type": "",
            "state_postal": "IL",
            "geo_id": "",
            "electoral_vote_total": 0,
            "precincts_reporting": 0,
            "total_precincts": 87,
            "data_source_update_time": "2020-11-20T20:10:15+0000",
            "candidates": [
              {
                "first_name": "Scott",
                "middle_name": "",
                "last_name": "Gengler",
                "vote_count": 14696,
                "party": "REP",
                "ballot_order": 3
              },
              {
                "first_name": "Brian",
                "middle_name": "E.",
                "last_name": "Debolt",
                "vote_count": 12867,
                "party": "REP",
                "ballot_order": 4
              },
    ...etc.

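initialize_race_obj, get_candidate_info, parse_name, and get_vote_count are all util functions, and some of them also involve hard-coding. Because I only needed the results for two races, I hard-coded some things and used if statements (as shown above). In the future I may need results for 10 or 20 races, and I don't want to hard-code values or chain if statements in that case. Any ideas on how to scrape this site in a more programmatic way with Python 3?

I don't think there is one specific answer that always works. In your case, the different sections are clearly separated by enters (blank lines), so I would write a manual parser that focuses on extracting those sections.

Below is some example code I came up with, but first let me mention the steps I took.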

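  1. Get the data set from the website and store it in a local file (saves some energy).

  2. Manually find the point that splits the summary data (the header at the top) from the body with the vote counts.

  3. Manually parse the header line by line. This breaks if anything changes, but hey, you probably only need to do it once (fingers crossed).

  4. Parse the body. I split the body into sections, where each section is enclosed between two enters (blank lines). For example: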

    AURORA MAYOR
    VOTE FOR  1
    (WITH 3 OF 3 PRECINCTS COUNTED)
    RICHARD C. IRVIN .  .  .  .  .  .  .  .        237   62.20           207            30             0
    JUDD LOFCHIE  .  .  .  .  .  .  .  .  .         59   15.49            56             3             0
    JOHN LAESCH.  .  .  .  .  .  .  .  .  .         85   22.31            63            22             0
    
  5. Then parse each section manually. I left room there for parsing each candidate.
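As a rough illustration of steps 4 and 5, here is a minimal sketch of how the lines above the candidate rows could be parsed. It assumes every section starts with the race name, followed by a `VOTE FOR  n` line and a `(WITH x OF y PRECINCTS COUNTED)` line, as in the example section. The helper name `parse_section_header` and the dictionary keys are hypothetical and not part of the code further down.

```python
import re

# Assumed line formats, taken from the example section in step 4.
VOTE_FOR_RE = re.compile(r"VOTE FOR\s+(\d+)")
PRECINCTS_RE = re.compile(r"\(WITH\s+(\d+)\s+OF\s+(\d+)\s+PRECINCTS?\s+COUNTED\)")


def parse_section_header(section):
    """Hypothetical helper: read race name, seats and precinct counts."""
    header = {
        "race": section[0].strip(),
        "vote_for": None,
        "precincts_reporting": None,
        "total_precincts": None,
    }
    for line in section[1:]:
        line = line.strip()
        vote_match = VOTE_FOR_RE.match(line)
        precinct_match = PRECINCTS_RE.match(line)
        if vote_match:
            header["vote_for"] = int(vote_match.group(1))
        elif precinct_match:
            header["precincts_reporting"] = int(precinct_match.group(1))
            header["total_precincts"] = int(precinct_match.group(2))
    return header


section = [
    "AURORA MAYOR",
    "VOTE FOR  1",
    "(WITH 3 OF 3 PRECINCTS COUNTED)",
]
header = parse_section_header(section)
print(header)
```

Reading the precinct counts from each section this way would avoid hard-coding the total number of precincts or slicing the last two characters of a row.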

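I haven't fully finished the scraping, but that's part of the fun for you. This should give you a framework for handling an arbitrarily large number of sections and candidates.

Code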

    import itertools
    import urllib.request
    from argparse import Namespace
    from pprint import pprint

    from bs4 import BeautifulSoup

    def get_data(url, file='data.txt'):
        """ Retrieve the bare bone data from a weblink and store it in the provided file.  """
        with urllib.request.urlopen(url) as page:
            soup = BeautifulSoup(page.read(), 'html.parser')
        # splitlines() also copes with '\r\n' line endings.
        data = soup.find('pre').text.splitlines()
        with open(file, 'w') as handle:
            # Write one row per line so readlines() can split the file again later.
            handle.write('\n'.join(data))

    def clean_data(file='data.txt', header=15, ignore=False):
        """
        Clean the data, where the first n lines are for the header or ignored.
        :param file: (str) Name of the file to load.
        :param header: (int) Number of lines used for header or skipped when ignore is True.
        :param ignore: (bool) If True, skips the lines indicated by header (not used yet).
        :return: None, the parsed body is pretty-printed.
        """
        with open(file, 'r') as handle:
            data = handle.readlines()
        # Split the file into the summary header and the body with the vote counts.
        header, body = data[:header], data[header:]
        data_header = generate_header(header)
        data_body = generate_body(body, columns=data_header.columns)
        # pprint(vars(data_header))
        pprint(vars(data_body))

    def parse_numbers(line: str, columns, missing: list = None, fill_value='-') -> dict:
        """ Map the numbers of one candidate row onto the column names.  """
        # Split on double spaces and drop the empty or whitespace-only fragments.
        values = list(filter(str.strip, line.split('  ')))
        if len(values) == len(columns):
            return dict(zip(columns, values))
        if all(int(value) == 0 for value in values):
            return dict(zip(columns, ['0'] * len(columns)))
        raise ValueError(f"Unknown handling of missing values."
                         f"\nColumns: {columns}\nLine: {line}\nValues: {values}")

    def generate_header(header: list[str]):
        """ Manually parse the header (hopefully only once).  """
        # Drop empty lines; the positions below depend on the exact header layout.
        lines = list(filter(bool, ''.join(header).split('\n')))
        name, description, status = list(filter(str.strip, lines[0].split('  ')))
        date = lines[1].strip()
        country, state = list(map(str.strip, lines[2].split(',')))
        election_date = lines[3].strip()
        columns = list(filter(str.strip, lines[4].split('  ')))
        summary = {}
        for row in lines[5:11]:
            # Placeholder: the summary rows are not parsed yet.
            pass
        return Namespace(
            name=name,
            description=description,
            status=status,
            date=date,
            country=country,
            state=state,
            election_date=election_date,
            columns=columns,
            summary=summary
        )

    def generate_body(body: list[str], columns=None):
        """ Split the body into blank-line-separated sections and parse each one.  """
        clean_body = list(map(str.strip, ''.join(body).split('\n')))
        # Group consecutive non-empty lines into sections.
        # https://stackoverflow.com/a/52943710/10961342
        sections = [list(group) for key, group in itertools.groupby(clean_body, key=bool) if key]
        metadata = []
        for section in sections:
            function = section[0]
            vote = [row.startswith('VOTE FOR') for row in section].index(True)  # locate where `VOTE FOR`
            info = ' '.join(map(str.strip, section[1:vote + 2]))
            candidates = []
            for candidate in section[vote + 2:]:
                # Note: this cuts the name at the first period ('RICHARD C. IRVIN' becomes 'RICHARD C').
                name = candidate.split('.')[0].strip()
                numbers = candidate.rsplit('.  .')[-1]
                data = parse_numbers(numbers, columns)
                candidates.append({"name": name, "data": data})
            metadata.append({"function": function, "info": info, "candidates": candidates})
        pprint(metadata, sort_dicts=False)
        return Namespace(body=metadata)

    if __name__ == '__main__':
        # Retrieve the original data set (uncomment on the first run).
        # get_data('https://results.co.kendall.il.us/')
        clean_data()

Output

    [{'function': 'AURORA MAYOR',
      'info': 'VOTE FOR  1 (WITH 3 OF 3 PRECINCTS COUNTED)',
      'candidates': [{'name': 'RICHARD C',
                      'data': {'TOTAL VOTES': '237',
                               ' %': ' 62.20',
                               'ELECTION DAY': ' 207',
                               ' EV, VBM': '30',
                               'PROV, POST': ' 0'}},
                     {'name': 'JUDD LOFCHIE',
                      'data': {'TOTAL VOTES': ' 59',
                               ' %': ' 15.49',
                               'ELECTION DAY': '56',
                               ' EV, VBM': ' 3',
                               'PROV, POST': ' 0'}},
                     {'name': 'JOHN LAESCH',
                      'data': {'TOTAL VOTES': ' 85',
                               ' %': ' 22.31',
                               'ELECTION DAY': '63',
                               ' EV, VBM': '22',
                               'PROV, POST': ' 0'}}]},
     {'function': 'AURORA ALDERMAN AT LARGE',
      'info': 'VOTE FOR  1 (WITH 3 OF 3 PRECINCTS COUNTED)',
      'candidates': [{'name': 'RON WOERMAN',
                      'data': {'TOTAL VOTES': '117',
                               ' %': ' 34.01',
                               'ELECTION DAY': ' 106',
                               ' EV, VBM': '11',
                               'PROV, POST': ' 0'}},
                     {'name': 'BROOKE SHANLEY',
                      'data': {'TOTAL VOTES': '168',
                               ' %': ' 48.84',
                               'ELECTION DAY': ' 136',
                               ' EV, VBM': '32',
                               'PROV, POST': ' 0'}},
                     {'name': 'RAYMOND HULL',
                      'data': {'TOTAL VOTES': ' 59',
                               ' %': ' 17.15',
                               'ELECTION DAY': '52',
                               ' EV, VBM': ' 7',
                               'PROV, POST': ' 0'}}]},
     {'function': 'AURORA ALDERMAN WARD 9',
      'info': 'VOTE FOR  1 (WITH 3 OF 3 PRECINCTS COUNTED)',
      'candidates': [{'name': 'EDWARD J',
                      'data': {'TOTAL VOTES': '339',
                               ' %': '100.00',
                               'ELECTION DAY': ' 285',
                               ' EV, VBM': '54',
                               'PROV, POST': ' 0'}}]},
     {'function': 'JOLIET COUNCILMAN AT LARGE',
      'info': 'VOTE FOR  3 (WITH 7 OF 7 PRECINCTS COUNTED)',
      'candidates': [{'name': 'GLENDA WRIGHT-McCULLUM',
                      'data': {'TOTAL VOTES': ' 96',
                               ' %': '7.81',
                               'ELECTION DAY': '91',
                               ' EV, VBM': ' 5',
                               'PROV, POST': ' 0'}},
                     {'name': 'NICOLE LURRY',
                      'data': {'TOTAL VOTES': ' 77',
                               ' %': '6.27',
                               'ELECTION DAY': '70',
                               ' EV, VBM': ' 7',
                               'PROV, POST': ' 0'}},
                     {'name': 'JEREMY BRZYCKI',
                      'data': {'TOTAL VOTES': ' 90',
                               ' %': '7.32',
                               'ELECTION DAY': '78',
                               ' EV, VBM': '12',
                               'PROV, POST': ' 0'}},
                     {'name': 'CESAR GUERRERO',
                      'data': {'TOTAL VOTES': '106',
                               ' %': '8.62',
                               'ELECTION DAY': '95',
                               ' EV, VBM': '11',
                               'PROV, POST': ' 0'}},
                     {'name': 'ISIAH WILLIAMS JR',
                      'data': {'TOTAL VOTES': ' 47',
                               ' %': '3.82',
                               'ELECTION DAY': '45',
                               ' EV, VBM': ' 2',
                               'PROV, POST': ' 0'}},
                     {'name': 'HUDSON HOLLISTER',
                      'data': {'TOTAL VOTES': ' 84',
                               ' %': '6.83',
                               'ELECTION DAY': '72',
                               ' EV, VBM': '12',
                               'PROV, POST': ' 0'}},
                     {'name': 'JAMES LANHAM',
                      'data': {'TOTAL VOTES': ' 32',
                               ' %': '2.60',
                               'ELECTION DAY': '29',
                               ' EV, VBM': ' 3',
                               'PROV, POST': ' 0'}},
                     {'name': 'ROGER POWELL',
                      'data': {'TOTAL VOTES': ' 56',
                               ' %': '4.56',
                               'ELECTION DAY': '55',
                               ' EV, VBM': ' 1',
                               'PROV, POST': ' 0'}},
                     {'name': 'WARREN C',
                      'data': {'TOTAL VOTES': ' 76',
                               ' %': '6.18',
                               'ELECTION DAY': '66',
                               ' EV, VBM': '10',
                               'PROV, POST': ' 0'}},
                     {'name': 'ROBERT WUNDERLICH',
                      'data': {'TOTAL VOTES': '166',
                               ' %': ' 13.51',
                               'ELECTION DAY': ' 149',
                               ' EV, VBM': '17',
                               'PROV, POST': ' 0'}},
                     {'name': 'JOE CLEMENT',
                      'data': {'TOTAL VOTES': '203',
                               ' %': ' 16.52',
                               'ELECTION DAY': ' 190',
                               ' EV, VBM': '13',
                               'PROV, POST': ' 0'}},
                     {'name': 'JAN QUILLMAN',
                      'data': {'TOTAL VOTES': '196',
                               ' %': ' 15.95',
                               'ELECTION DAY': ' 184',
                               ' EV, VBM': '12',
                               'PROV, POST': ' 0'}}]},
     {'function': 'PLANO MAYOR',
      'info': 'VOTE FOR  1 (WITH 11 OF 11 PRECINCTS COUNTED)',
      'candidates': [{'name': 'ROBERT "BOB" HAUSLER (IND)',
                      'data': {'TOTAL VOTES': '388',
                               ' %': ' 48.50',
                               'ELECTION DAY': ' 336',
                               ' EV, VBM': '52',
                               'PROV, POST': ' 0'}},
                     {'name': 'MIKE RENNELS (IND)',
                      'data': {'TOTAL VOTES': '412',
                               ' %': ' 51.50',
                               'ELECTION DAY': ' 352',
                               ' EV, VBM': '60',
                               'PROV, POST': ' 0'}}]},
    ...
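
Note that in this output, names containing a period are cut at the first one ('RICHARD C' rather than 'RICHARD C. IRVIN'), and every number is still a string. One possible refinement, as a minimal sketch: peel the numeric columns off the right-hand side of the row and treat whatever is left as the name plus leader dots. The helper `parse_candidate_row` and the `COLUMNS` list are hypothetical; the column names are copied from the output with the surrounding whitespace stripped, and the sketch assumes names never end in a bare number.

```python
import re

# Column names as they appear in the output, with whitespace stripped (assumption).
COLUMNS = ["TOTAL VOTES", "%", "ELECTION DAY", "EV, VBM", "PROV, POST"]
NUMBER_RE = re.compile(r"^\d[\d,]*(?:\.\d+)?$")


def to_number(text):
    """Convert '1,234' to int and '62.20' to float."""
    text = text.replace(",", "")
    return float(text) if "." in text else int(text)


def parse_candidate_row(row, columns=COLUMNS):
    """Hypothetical helper: split one candidate row into name and numbers."""
    tokens = row.split()
    numbers = []
    # Take numeric tokens from the end of the row, one per column at most.
    while tokens and len(numbers) < len(columns) and NUMBER_RE.match(tokens[-1]):
        numbers.append(tokens.pop())
    numbers.reverse()
    if len(numbers) != len(columns):
        raise ValueError(f"Expected {len(columns)} numbers, got {numbers!r} in {row!r}")
    # The remaining tokens are the name followed by leader dots.
    # rstrip also removes a period that ends the name itself (e.g. 'JR.').
    name = " ".join(tokens).rstrip(". ")
    data = {column: to_number(value) for column, value in zip(columns, numbers)}
    return {"name": name, "data": data}


row = "RICHARD C. IRVIN .  .  .  .  .  .  .  .        237   62.20           207            30             0"
result = parse_candidate_row(row)
print(result)
```

A function like this could stand in for the `name`/`numbers` lines inside the candidate loop of generate_body, at the cost of dropping a trailing period from names such as 'JR.'.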