How to handle HTML tables with merged (colspan=2) columns in Python (preferably with BeautifulSoup)



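I am trying to extract the preference distribution data from the second table on this page. For background, the plan is to identify each candidate's party and see how many votes my preferred party had before it was eliminated. This is my first attempt at web scraping, so after considerable personal pain I managed to parse the page and get the data out of the relevant table.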

from urllib.request import urlopen

from bs4 import BeautifulSoup

# Download the page (the built-in open() only reads local files, not URLs)
url = "https://results.ecq.qld.gov.au/elections/state/State2017/results/booth1.html"
with urlopen(url) as f:
    contents = f.read()

# Parse the html data and then get to the preference distribution table
soup = BeautifulSoup(contents, 'html.parser')
useful_data = soup.find_all(class_="resultTableBorder")[2].find_all("tr")[1:]

# Extract the results of the preference distribution:
# one list of cell texts per table row
data = []
for row in useful_data:
    sub_data = []
    for cell in row.find_all("td"):
        sub_data.append(cell.get_text(strip=True))
    data.append(sub_data)

However, when I checked whether I had a well-formed list of lists, I did not.

# Check if I have a nicely formed table of data. I do not.
for index, row in enumerate(data, start=1):
    print("Row " + str(index) + " contains " + str(len(row)) + " elements.")

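This produced the following output, which suggests that matching column headers to their data, ignoring the horizontal rules, and handling varying numbers of candidates (there are 93 electorates, and this is only the first) is going to be a pain.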

Row 1 contains 8 elements.
Row 2 contains 10 elements.
Row 3 contains 1 elements.
Row 4 contains 13 elements.
Row 5 contains 13 elements.
Row 6 contains 13 elements.
Row 7 contains 13 elements.
Row 8 contains 1 elements.
Row 9 contains 13 elements.
Row 10 contains 1 elements.
Row 11 contains 5 elements.
Row 12 contains 2 elements.
Row 13 contains 2 elements.
Row 14 contains 1 elements.

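Is there an easy way to do this, either with a clever trick when extracting the preference distribution or by post-processing the data I have already extracted?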
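One possible way to get rows of equal length with BeautifulSoup is to repeat each cell's text as many times as its `colspan` attribute says, and to drop rows that contain nothing but a horizontal rule. The sketch below runs on a small made-up table (`SAMPLE_HTML`), not on the actual ECQ markup, and `expand_row` and `table_to_rows` are hypothetical helper names introduced only for illustration. It also assumes the table has no nested tables and no `rowspan` cells.

```python
from bs4 import BeautifulSoup

# Made-up sample table (assumption): the real ECQ page markup may differ.
SAMPLE_HTML = """
<table>
  <tr><td>Candidate</td><td colspan="2">Count 1</td><td colspan="2">Count 2</td></tr>
  <tr><td>A</td><td>100</td><td>40%</td><td>120</td><td>48%</td></tr>
  <tr><td colspan="5"><hr></td></tr>
  <tr><td>B</td><td>150</td><td>60%</td><td>130</td><td>52%</td></tr>
</table>
"""


def expand_row(row):
    """Return the row's cell texts, repeating each text `colspan` times."""
    cells = []
    for cell in row.find_all(["td", "th"]):
        span = int(cell.get("colspan", 1))  # a missing attribute means one column
        cells.extend([cell.get_text(strip=True)] * span)
    return cells


def table_to_rows(table):
    """Expand every row and drop rows whose cells are all empty (e.g. <hr> rows)."""
    rows = [expand_row(tr) for tr in table.find_all("tr")]
    return [r for r in rows if any(r)]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = table_to_rows(soup.find("table"))
for r in rows:
    print(len(r), r)
```

With the header cells repeated this way, every remaining row has the same number of elements, so a header can be matched to its data by position.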

In this case, it is easier to do it this way:

import pandas as pd
tables = pd.read_html('https://results.ecq.qld.gov.au/elections/state/State2017/results/booth1.html')
target_df = tables[5] #this is the Summary of Distribution of Preferences table
target_df = target_df.drop(target_df.tail(3).index).iloc[1:].dropna(how='all')  # a little clean up

This should give you the target table. If necessary, you can do further cleanup and formatting, or convert it to a list using standard pandas methods.
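As a rough illustration of what that cleanup chain does and how the result might be turned into a list of lists, here is a sketch on a tiny made-up DataFrame. The frame `raw` only stands in for `tables[5]`; its columns and values are invented, and the real table will look different.

```python
import pandas as pd

# Made-up stand-in for tables[5] (assumption): the real columns and values differ.
raw = pd.DataFrame(
    [
        ["Candidate", "Count 1", "Count 2"],  # first row, skipped by .iloc[1:]
        ["A", "100", "120"],
        [None, None, None],                   # empty row, removed by dropna(how="all")
        ["B", "150", "130"],
        ["Total", "250", "250"],              # last three rows, removed via tail(3)
        ["Note 1", None, None],
        ["Note 2", None, None],
    ]
)

# Same chain as above: drop the last three rows, skip the first row,
# then drop rows in which every value is missing.
cleaned = raw.drop(raw.tail(3).index).iloc[1:].dropna(how="all")

# Convert the remaining rows to a plain list of lists.
rows = cleaned.values.tolist()
print(rows)
```

`dropna(how="all")` only removes rows where every cell is missing, so rows with some empty cells are kept.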
