Beautiful Soup: parsing a table column and stripping newlines

I am using the following code to loop over every row and column of an HTML table:

data = []
table = page.find('table', attrs={'class':'table table-no-border table-hover table-striped keyword_result_table'})
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])  # Get rid of empty values

One of the table columns is giving me trouble:

<td class="keyword">
<span class="is_in_saved_list" id="is_in_saved_list_81864060">
</span>
<a href="javascript:void(0);">
<b>
what
</b>
<b>
is
</b>
<b>
in
</b>
<b>
house
</b>
<b>
paint
</b>
</a>
</td>

The output is:

['what\n \n\n \n in\n \n\n \n house\n \n\n paint \n\n ', '5756', '979', '2', 'Great', '89', '.com\n \n\n .net\n \n\n \n .org']

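In the console, and in the editing box here, there appear to be tabs and spaces between the newlines, but they do not show up in the post. I tried calling .rstrip() after strip(), but nothing changed: both only trim the ends of the string, not the whitespace between the words. Is there a way to grab only the text content that the link wraps?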

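You can use .stripped_strings to get the text without any of the surrounding spaces, tabs, or newlines. It yields each text fragment inside the tag with leading and trailing whitespace removed, and skips fragments that are whitespace only.

The code looks like this: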

import bs4 as bs
s = """
<td class="keyword">
<span class="is_in_saved_list" id="is_in_saved_list_81864060">
</span>
<a href="javascript:void(0);">
<b>
what
</b>
<b>
is
</b>
<b>
in
</b>
<b>
house
</b>
<b>
paint
</b>
</a>
</td>
"""
soup = bs.BeautifulSoup(s, 'lxml')
t = soup.find('td')
print(list(t.stripped_strings))

['what', 'is', 'in', 'house', 'paint']
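
To apply the same idea to the loop from the question, something along these lines might work. This is only a sketch: `SAMPLE_HTML` is made-up markup, `extract_rows` is a hypothetical helper name, and `html.parser` is used in place of `lxml` purely so that the snippet needs nothing beyond bs4 itself.

```python
from bs4 import BeautifulSoup

# Made-up sample markup for illustration; the real page will differ.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr>
      <td class="keyword">
        <a href="javascript:void(0);">
          <b>
            what
          </b>
          <b>
            is
          </b>
        </a>
      </td>
      <td> 5756 </td>
      <td></td>
    </tr>
  </tbody>
</table>
"""


def extract_rows(table):
    """Hypothetical helper: return one list of cleaned cell texts per row."""
    data = []
    for row in table.find_all("tr"):
        cells = []
        for td in row.find_all("td"):
            # stripped_strings yields each text fragment with surrounding
            # whitespace removed and skips whitespace-only fragments.
            text = " ".join(td.stripped_strings)
            if text:  # drop empty cells, as the original loop did
                cells.append(text)
        data.append(cells)
    return data


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = extract_rows(soup.find("table"))
print(rows)
```

For the sample above this prints [['what is', '5756']]. Joining the fragments with a single space keeps each cell as one string, which matches the shape of the `data` list in the question.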

Have you tried removing '\n' from the string?

s = 'what\n \n\n is\n \n\n in\n \n\n house\n \n\n paint'
s.replace('\n', '')
'what  is  in  house  paint'
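
Note that removing only the newlines leaves two spaces between the words. If a single clean string is the goal, one common alternative (not part of the answer above) is to split on any whitespace and rejoin with single spaces. A minimal sketch, reusing the same sample string:

```python
s = 'what\n \n\n is\n \n\n in\n \n\n house\n \n\n paint'

# str.split() with no arguments splits on any run of whitespace
# (spaces, tabs, newlines) and discards empty pieces.
cleaned = " ".join(s.split())
print(cleaned)
```

This prints what is in house paint, with exactly one space between words.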
