How do I scrape nested tables with Beautiful Soup?



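Hi everyone, and thanks for your help. I'm stuck on scraping nested tables. I can scrape the main table, but when I reach a table row that contains other tables, I don't know how to proceed. The HTML looks like this: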

<tr class="table">
<td class="table" valign="top">
<p class="tbl-cod">0403</p>
</td>
<td class="table" valign="top">
<p class="tbl-txt">Buttermilk, curdled milk and&nbsp;cream, yoghurt, kephir and other fermented or acidified milk and&nbsp;cream, whether or not concentrated or&nbsp;containing added sugar or other sweetening matter or flavoured or&nbsp;containing added fruit, nuts or&nbsp;cocoa</p>
</td>
<td class="table" valign="top">
<p class="tbl-txt">Manufacture in which:</p>
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<colgroup><col width="4%">
<col width="96%">
</colgroup><tbody>
<tr>
<td valign="top">
<p class="normal">—</p>
</td>
<td valign="top">
<p class="normal">all the materials of Chapter&nbsp;4 used are wholly obtained,</p>
</td>
</tr>
</tbody>
</table>
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<colgroup><col width="4%">
<col width="96%">
</colgroup><tbody>
<tr>
<td valign="top">
<p class="normal">—</p>
</td>
<td valign="top">
<p class="normal">all the fruit juice (except that of pineapple, lime or&nbsp;grapefruit) of heading&nbsp;2009 used is originating,</p>
<p class="normal">and</p>
</td>
</tr>
</tbody>
</table>
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<colgroup><col width="4%">
<col width="96%">
</colgroup><tbody>
<tr>
<td valign="top">
<p class="normal">—</p>
</td>
<td valign="top">
<p class="normal">the value of all the materials of Chapter&nbsp;17 used does not exceed 30&nbsp;% of the ex-works price of the product</p>
</td>
</tr>
</tbody>
</table>
</td>
<td class="table" valign="top">
<p class="normal">&nbsp;</p>
</td>
</tr>

I scraped the main table with the following code:

with open('algeriaroo.txt', 'w') as algroo:
    for row in RoOtbody.find_all('tr'):
        for cell in row.find_all('td'):
            algroo.write(cell.text.strip())
        algroo.write('\n')

So far, this is the output I get:

0403Buttermilk, curdled milk and cream, yoghurt, kephir and other fermented or acidified milk and cream, whether or not concentrated or containing added sugar or other sweetening matter or flavoured or containing added fruit, nuts or cocoaManufacture in which:



—

all the materials of Chapter 4 used are wholly obtained,





—

all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating,
and





—

the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product—all the materials of Chapter 4 used are wholly obtained,—all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating,
and—the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product
—all the materials of Chapter 4 used are wholly obtained,
—all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating,
and
—the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product

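I would like to get something like this instead:

0403 Buttermilk, curdled milk and cream, yoghurt, kephir and other fermented or acidified milk and cream, whether or not concentrated or containing added sugar or other sweetening matter or flavoured or containing added fruit, nuts or cocoa Manufacture in which: — all the materials of Chapter 4 used are wholly obtained, — all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating, and — the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product

Thanks in advance for your help!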

You are probably looking for the .get_text() method with the separator= parameter.

For example (html_code contains the HTML from your question):

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_code, 'html.parser')
print(soup.select_one('tr.table').get_text(strip=True, separator=' '))

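Prints:

0403 Buttermilk, curdled milk and cream, yoghurt, kephir and other fermented or acidified milk and cream, whether or not concentrated or containing added sugar or other sweetening matter or flavoured or containing added fruit, nuts or cocoa Manufacture in which: — all the materials of Chapter 4 used are wholly obtained, — all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating, and — the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product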
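
To write one line per row of the main table, the same call could be applied to each outer row rather than looping over every td (which also visits the cells of the nested tables and so repeats their text). This is only a sketch: it assumes, going by the sample alone, that the outer rows are the ones with class="table", and rows_as_lines and the shortened html_code are made-up names and data for illustration.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the HTML in the question (illustrative data only).
html_code = """
<table><tbody>
<tr class="table">
  <td class="table"><p class="tbl-cod">0403</p></td>
  <td class="table"><p class="tbl-txt">Buttermilk and&nbsp;cream</p></td>
  <td class="table">
    <p class="tbl-txt">Manufacture in which:</p>
    <table><tbody><tr>
      <td><p class="normal">—</p></td>
      <td><p class="normal">all the materials of Chapter&nbsp;4 used are wholly obtained</p></td>
    </tr></tbody></table>
  </td>
  <td class="table"><p class="normal">&nbsp;</p></td>
</tr>
</tbody></table>
"""


def rows_as_lines(html):
    """Return one flattened text line per outer row (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    # Assumption: only outer rows carry class="table"; nested rows do not.
    for row in soup.select("tr.table"):
        text = row.get_text(separator=" ", strip=True)
        # Collapse runs of whitespace, including the non-breaking spaces
        # that &nbsp; turns into.
        lines.append(" ".join(text.split()))
    return lines


lines = rows_as_lines(html_code)
for line in lines:
    print(line)
```

Each string in lines can then be written to the file followed by '\n'.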

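Just a suggestion: you could put the logic that extracts data from a table into a function. For each td, check whether it contains a table tag and, if it does, call the same function on it. The only open question is the return value; one option is to build a dictionary, return it to the calling function, and process it there. This approach would work for any number of nested tables.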
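
A minimal sketch of that recursive idea, under some assumptions: parse_table and flatten are hypothetical helper names, the sample HTML is a shortened stand-in for the question's markup, and the return shape (a list of rows, each a list of dictionaries with "text" and "tables" keys) is just one possible choice.

```python
from bs4 import BeautifulSoup


def parse_table(table):
    """Recursively convert a <table> Tag into rows of cell dictionaries."""
    rows = []
    for tr in table.find_all("tr"):
        if tr.find_parent("table") is not table:
            continue  # row of a nested table; the recursion handles it
        cells = []
        for cell in tr.find_all(["td", "th"]):
            if cell.find_parent("tr") is not tr:
                continue  # cell of a nested table
            # Text that belongs to this cell itself, not to nested tables.
            text = " ".join(
                " ".join(s.split())
                for s in cell.find_all(string=True)
                if s.strip() and s.find_parent("table") is table
            )
            # Tables nested exactly one level below the current table.
            nested = [
                parse_table(t)
                for t in cell.find_all("table")
                if t.find_parent("table") is table
            ]
            cells.append({"text": text, "tables": nested})
        rows.append(cells)
    return rows


def flatten(rows):
    """Join the text of all cells and nested tables, in document order."""
    parts = []
    for cells in rows:
        for cell in cells:
            parts.append(cell["text"])
            parts.extend(flatten(nested) for nested in cell["tables"])
    return " ".join(p for p in parts if p)


# Shortened stand-in for the HTML in the question (illustrative data only).
html_code = """
<table>
<tr class="table">
<td class="table"><p>0403</p></td>
<td class="table"><p>Manufacture in which:</p>
<table><tbody><tr><td><p>—</p></td>
<td><p>all the materials of Chapter&nbsp;4 used are wholly obtained,</p></td></tr></tbody></table>
<table><tbody><tr><td><p>—</p></td>
<td><p>the value of all the materials of Chapter&nbsp;17 used does not exceed 30&nbsp;%</p></td></tr></tbody></table>
</td>
</tr>
</table>
"""

soup = BeautifulSoup(html_code, "html.parser")
data = parse_table(soup.find("table"))
print(flatten(data))
```

Keeping the nested structure in data and flattening it only at the end leaves room to treat nested tables differently later, for example by writing each nested rule on its own line.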