Retrieve all data from HTML tables and put it into a CSV



I am trying to write a Python script that retrieves all of my data from HTML tables on several pages (I have an array of links) and puts the table data into a CSV file. How should I proceed? I did something similar, but the data was not laid out in rows and columns; each table was printed and then immediately replaced by the next one. What is the cleanest way to do this? Here is the table:

<div class="table-responsive">
<table class="table table-striped product-page-specifications">
<tbody><tr>
<td class="col-xs-4 text-muted">Product type</td>
<td class="col-xs-8">1</td>
</tr><tr>
<td class="col-xs-4 text-muted">Tip2</td>
<td class="col-xs-8">MMA
TIG/WIG
</td>
</tr><tr>
<td class="col-xs-4 text-muted">Material</td>
<td class="col-xs-8">Metal </td>
</tr><tr>
<td class="col-xs-4 text-muted">Size</td>
<td class="col-xs-8">Universal </td>
</tr><tr>
<td class="col-xs-4 text-muted">Color</td>
<td class="col-xs-8">Black</td>
</tr><tr>
<td class="col-xs-4 text-muted">Content</td>
<td class="col-xs-8">Material made of a material as resistant as possible</td>
</tr></tbody>
</table>
</div>

Here is the code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

for a_link in all_links:
    res = requests.get(a_link).text
    soup = BeautifulSoup(res, 'html.parser')
    table = soup.select_one("table")
    output_rows = []
    for table_row in table.findAll('tr'):
        columns = table_row.findAll('td')
        output_row = []
        for column in columns:
            output_row.append(column.text)
        output_rows.append(output_row)
    # the DataFrame is rebuilt and printed on every iteration,
    # so each table replaces the previous one instead of accumulating
    df = pd.DataFrame(output_rows)
    print(df)

It looks like pd.read_html works just fine on that table (here s holds the HTML snippet shown above), although you may need to do some massaging/merging afterwards depending on the rest of the page and how you want the final output to look:

In [13]: pd.read_html(StringIO(s))
Out[13]:
[              0                                                  1
0  Product type                                                  1
1          Tip2                                        MMA TIG/WIG
2      Material                                              Metal
3          Size                                          Universal
4         Color                                              Black
5       Content  Material made of a material as resistant as po...]

In particular, you will probably want to set the first column as the index and transpose, so that you get nicely named columns out of it:

In [15]: pd.read_html(StringIO(s))[0].set_index(0).T
Out[15]:
0 Product type         Tip2 Material       Size  Color                                            Content
1            1  MMA TIG/WIG    Metal  Universal  Black  Material made of a material as resistant as po...
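
Putting those two steps together with the loop over all_links, a minimal sketch might look like this (assuming every page exposes the same two-column specification table as its first table; the names frames and out.csv are just placeholders):

import requests
import pandas as pd
from io import StringIO

frames = []
for a_link in all_links:
    html = requests.get(a_link).text
    # read_html returns a list of DataFrames, one per <table> on the page
    tables = pd.read_html(StringIO(html))
    # take the first table, make the label column the index and transpose,
    # so every specification ("Product type", "Tip2", ...) becomes a named column
    row = tables[0].set_index(0).T
    frames.append(row)

# stack one row per page and write everything to a single CSV
df = pd.concat(frames, ignore_index=True)
df.to_csv("out.csv", index=False)

This gives one row per page, with the specification labels as the CSV column headers.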
This is the complete code, but it does not display correctly in the CSV @Randy:

import csv
import requests
import pandas as pd
from bs4 import BeautifulSoup

for a_link in all_links:
    res = requests.get(a_link).text
    soup = BeautifulSoup(res, 'html.parser')
    table = soup.select_one("table")
    rows = table.findAll('tr')
    headers = rows[0]
    header_text = []
    # this table has no <th> cells, so header_text stays empty
    for th in headers.findAll('th'):
        header_text.append(th.text)
    row_text_array = []
    for row in rows[1:]:
        row_text = []
        # loop through the elements
        for row_element in row.findAll(['th', 'td']):
            # append the array with the element's inner text
            row_text.append(row_element.text.replace('\n', ' ').strip())
        # append the text array to the row text array
        row_text_array.append(row_text)
    # opening with "w" inside the loop overwrites the file for every link
    with open("out.csv", "w") as f:
        wr = csv.writer(f)
        wr.writerow(header_text)
        # loop through each row array
        for row_text_single in row_text_array:
            wr.writerow(row_text_single)
    df = pd.DataFrame(row_text_array)
    print(df)
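
The layout problem comes from the table itself: each page stores its specifications as label/value pairs in two <td> cells, with no <th> header row, so writing the scraped rows straight out produces one CSV row per specification instead of one row per product, and opening the file in "w" mode inside the loop keeps only the last page. A sketch of one way to restructure this with csv.DictWriter (assuming the pages share the same specification labels; the file name is a placeholder):

import csv
import requests
from bs4 import BeautifulSoup

pages = []
for a_link in all_links:
    res = requests.get(a_link).text
    soup = BeautifulSoup(res, 'html.parser')
    table = soup.select_one("table")
    # build one dict per page: {"Product type": "1", "Tip2": "MMA TIG/WIG", ...}
    spec = {}
    for row in table.findAll('tr'):
        cells = row.findAll('td')
        if len(cells) == 2:
            label = cells[0].get_text(strip=True)
            value = ' '.join(cells[1].get_text().split())
            spec[label] = value
    pages.append(spec)

# collect every label seen across all pages so each one becomes a column
fieldnames = []
for spec in pages:
    for label in spec:
        if label not in fieldnames:
            fieldnames.append(label)

# open the file once, outside the loop, so earlier pages are not overwritten
with open("out.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(pages)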