Extract TD from TBODY



How can I extract Commission I, Legislation Committee, and Ad-Hoc Committee from the tr?


<tr>
<td class="text-center">1</td>
<td class="hidden-xs"><a href="/en/anggota/detail/id/1319"><img class="img-responsive" src="/doksigota/photo/1319.jpg"/></a></td>
<td><a href="/en/anggota/detail/id/1319">PROF. DR. BACHTIAR ALY, MA</a><br/>National Democrat Party Faction<br/>ACEH I</td>
<td>Commission I<br/>Legislation Committee<br/>Ad-Hoc Committee</td> </tr>

import requests
from bs4 import BeautifulSoup

webpage_response = requests.get('http://www.dpr.go.id/en/anggota')
webpage = webpage_response.content
soup = BeautifulSoup(webpage, "html.parser")
tbody = soup.find("tbody")
for i in tbody:
    print(i)

Here is a complete solution.

Solution

Step 1

Get the table body.

# Show exported info as a table
import pandas as pd 
# Progressbar
from tqdm import tqdm, tqdm_notebook, tnrange 
# Read HTML Page
from bs4 import BeautifulSoup
# Access Web URL
import requests
base_url = 'http://www.dpr.go.id/en/anggota'
webpage_response = requests.get(base_url)
webpage = webpage_response.content
soup = BeautifulSoup(webpage, "html.parser")
tbody = soup.find("tbody")

Step 2

Loop over the table row elements: <tr>...</tr>. For each row, extract the columns and store them in the dict rows_dict, keyed by the row's ID.

rows_dict = dict()
trs = tbody.find_all('tr')
# Plain tqdm works both inside and outside a notebook
for tr in tqdm(trs, desc='Progress'):
    row_dict = extract_columns(tr)
    rows_dict[row_dict['ID']] = row_dict

Step 3

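Display the extracted data in tabular form with a pandas DataFrame. Note that the columns you are interested in are ['COL_2', 'COL_3', 'COL_4']. If you need any other data from the same HTML table, it is also available in the DataFrame df.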

df = pd.DataFrame(rows_dict).T
headers = ['ID', 'CANDIDATE', 'AFFILIATION', 'COL_1', 'COL_2', 'COL_3', 'COL_4', 'IMAGE_URL', 'URL']
df = df[headers]
df.head()
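To keep only the committee columns, selecting them by name should be enough. The following is a minimal sketch, with rows_dict filled in by hand from the two rows shown on this page instead of a live request; the variable names committees and first_row are chosen here for illustration only.

```python
import pandas as pd

# Hand-built stand-in for the scraped rows (values copied from the rows shown on this page)
rows_dict = {
    '1': {'ID': '1', 'CANDIDATE': 'PROF. DR. BACHTIAR ALY, MA',
          'COL_2': 'Commission I', 'COL_3': 'Legislation Committee',
          'COL_4': 'Ad-Hoc Committee'},
    '3': {'ID': '3', 'CANDIDATE': 'PRANANDA SURYA PALOH',
          'COL_2': 'Commission I',
          'COL_3': 'Committee for Inter-Parliamentary Cooperation',
          'COL_4': None},
}

# Outer keys become columns, so transpose to get one row per member
df = pd.DataFrame(rows_dict).T

# Keep only the three committee columns
committees = df[['COL_2', 'COL_3', 'COL_4']]

# Committee names of the first member as a plain Python list
first_row = committees.iloc[0].tolist()
print(first_row)
```

A member with fewer than three committees keeps None in the unused columns, so the selection has the same shape for every row.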

Necessary custom functions:

def extract_columns(tr, debug_flag=False):
    """Extract the columns of one table row (<tr>) into a dict."""
    # headers = ['ID', 'CANDIDATE', 'AFFILIATION', 'COL_1', 'COL_2', 'COL_3', 'COL_4', 'IMAGE_URL', 'URL']
    tds = tr.find_all('td')
    urls = tds[1].find('a', href=True)
    person = tds[2].find('a', href=True)
    # Third cell: name link, party faction, electoral district
    party = drop_html_br_tags(tds[2].contents.copy())
    party_len = len(party)
    # Fourth cell: up to three committee names separated by <br/>
    committees = drop_html_br_tags(tds[3].contents.copy())
    committees_len = len(committees)
    if debug_flag:
        print(party)
        print(committees)
    # NOTE: base_url + href repeats the '/en/anggota' path because the href is
    # already site-absolute; urllib.parse.urljoin(base_url, href) would avoid that.
    row_dict = {'ID': tds[0].text,
                'URL': base_url + str(urls['href']),
                'IMAGE_URL': base_url + str(urls.find('img').get_attribute_list('src')[0]),
                'CANDIDATE': person.text.strip(),
                'AFFILIATION': party[1].strip() if (party_len > 1) else None,
                'COL_1': party[2].strip() if (party_len > 2) else None,
                'COL_2': committees[0].strip() if (committees_len > 0) else None,
                'COL_3': committees[1].strip() if (committees_len > 1) else None,
                'COL_4': committees[2].strip() if (committees_len > 2) else None,
                }
    return row_dict


def drop_html_br_tags(lines):
    """Return the items of lines with the <br/> tags left out."""
    # Build a new list: removing items from a list while iterating over it can skip elements
    return [line for line in lines if str(line) != '<br/>']
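If only the text fragments are needed, BeautifulSoup's stripped_strings generator may be a shorter route than filtering <br/> tags by hand. The following is a minimal sketch, using the row from the question as a hard-coded string so that it runs without network access; the html variable and the wrapping <table><tbody> are assumptions added for illustration.

```python
from bs4 import BeautifulSoup

# The row from the question, wrapped in a table so it parses as a standalone snippet
html = """
<table><tbody><tr>
<td class="text-center">1</td>
<td class="hidden-xs"><a href="/en/anggota/detail/id/1319"><img class="img-responsive" src="/doksigota/photo/1319.jpg"/></a></td>
<td><a href="/en/anggota/detail/id/1319">PROF. DR. BACHTIAR ALY, MA</a><br/>National Democrat Party Faction<br/>ACEH I</td>
<td>Commission I<br/>Legislation Committee<br/>Ad-Hoc Committee</td>
</tr></tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
tds = soup.find("tr").find_all("td")

# stripped_strings yields each text node with surrounding whitespace removed,
# so the <br/> separators never show up in the result
committees = list(tds[3].stripped_strings)
member_info = list(tds[2].stripped_strings)

print(committees)
print(member_info)
```

This avoids copying and filtering the contents list, at the cost of losing access to the tag objects themselves, so the link in the third cell still has to be read with find('a') when its href is needed.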

Test the function:

row_dict = extract_columns(trs[2])
row_dict 

Output

{'AFFILIATION': 'National Democrat Party Faction',
'CANDIDATE': 'PRANANDA SURYA PALOH',
'COL_1': 'SUMATERA UTARA I',
'COL_2': 'Commission I',
'COL_3': 'Committee for Inter-Parliamentary Cooperation',
'COL_4': None,
'ID': '3',
'IMAGE_URL': 'http://www.dpr.go.id/en/anggota/doksigota/photo/1329.jpg',
'URL': 'http://www.dpr.go.id/en/anggota/en/anggota/detail/id/1329'}
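Note that URL and IMAGE_URL in this output contain the /en/anggota path twice, because the href and src values on the page already start at the site root and are appended to base_url as plain strings. A minimal sketch of one possible fix with the standard library's urljoin follows; the href and src values are inferred from the output above, and whether the resolved addresses are the ones the site actually serves is an assumption.

```python
from urllib.parse import urljoin

base_url = 'http://www.dpr.go.id/en/anggota'
href = '/en/anggota/detail/id/1329'   # inferred from the URL value in the output
src = '/doksigota/photo/1329.jpg'     # inferred from the IMAGE_URL value in the output

# Plain concatenation keeps the base path and then repeats it
concatenated = base_url + href

# urljoin treats a leading "/" as site-absolute and replaces the base path
resolved = urljoin(base_url, href)
image_url = urljoin(base_url, src)

print(concatenated)
print(resolved)
print(image_url)
```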
