Problem Replacing <br> Tags with Newline Using bs4

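Problem: I can't replace <br> tags with newline characters using Beautiful Soup 4.

Code: The relevant part of my program currently looks like this:

    for br in board.select('br'):
        br.replace_with('\n')

but I have also tried using board.find_all() in place of board.select().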

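Result: When I use br.replace_with('\n'), all <br> tags are replaced with the string literal \n. For example, Hello<br>world ends up becoming Hello\nworld. Using br.replace_with(\n) causes the error:

    File "<ipython-input-27-cdfade950fdf>", line 10
        br.replace_with(\n)
                          ^
    SyntaxError: unexpected character after line continuation character

Other information: I'm using Jupyter Notebook, if that is of any relevance. Here is my full program, in case there is an issue somewhere else that I've overlooked.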

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    page = requests.get("https://boards.4chan.org/g/")
    soup = BeautifulSoup(page.content, 'html.parser')
    board = soup.find('div', class_='board')

    # Replace every <br> tag with a real newline character
    for br in board.select('br'):
        br.replace_with('\n')

    # Collect the fields of each opening post
    message = [obj.get_text() for obj in board.select('.opContainer .postMessage')]
    image = [obj['href'] for obj in board.select('.opContainer .fileThumb')]
    pid = [obj.get_text() for obj in board.select('.opContainer .postInfo .postNum a[title="Reply to this post"]')]
    time = [obj.get_text() for obj in board.select('.opContainer .postInfo .dateTime')]

    # The thumbnail links are protocol-relative, so prepend the scheme
    for x in range(len(image)):
        image[x] = "https:" + image[x]

    post = pd.DataFrame({
        "ID": pid,
        "Time": time,
        "Image": image,
        "Message": message,
    })
    post
    pd.options.display.max_rows
    # None disables column truncation (older pandas versions used -1 for this)
    pd.set_option('display.max_colwidth', None)
    display(post)

Any suggestions would be greatly appreciated. Thanks for reading.

I just tried it and it works for me. My bs4 version is 4.8.0 and I'm using Python 3.5.3. Example:

    from bs4 import BeautifulSoup

    # The expected markup below matches the lxml parser, so it is named explicitly
    soup = BeautifulSoup('hello<br>world', 'lxml')
    for br in soup('br'):
        br.replace_with('\n')

    # <br> was replaced with \n successfully
    assert str(soup) == '<html><body><p>hello\nworld</p></body></html>'
    # get_text() also works as expected
    assert soup.get_text() == 'hello\nworld'
    # it is a \n, not a \\n
    assert soup.get_text() != 'hello\\nworld'

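I'm not used to Jupyter Notebook, but it seems your problem is that whatever you are using to visualize the data is showing you the string representation instead of actually printing the string. Hope this helps. Regards.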
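To illustrate the difference between a string's representation and its printed form, here is a minimal sketch using only the standard library. The variable names and the sample value are made up for the example. The point is that echoing a value (as a notebook cell does with its last expression) shows its repr(), where a newline appears as the two characters \ and n.

```python
# A string containing one real newline character (illustrative value)
text = "Hello\nworld"

# repr() is what a bare expression at the end of a notebook cell shows:
# the newline is displayed as an escape sequence, not as a line break.
shown = repr(text)
print(shown)

# print() writes the actual characters, so the output spans two lines.
print(text)

# The string holds a single newline character, not a backslash followed by "n".
assert len(text) == 11
assert "\\" not in text
```

So seeing \n in an echoed value does not by itself mean the replacement failed; checking with print() or len() is a quick way to tell the two cases apart.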

Instead of replacing after converting to soup, try replacing the <br> tags before converting. Like:

    # page.text is the decoded HTML; str(page.content) would give the bytes repr in Python 3
    soup = BeautifulSoup(page.text.replace('<br>', '\n'), 'html.parser')

Hope this helps! Cheers!

P.S.: I can't see any logical reason why this wouldn't work after converting to soup.

After trying

    page = requests.get("https://boards.4chan.org/g/")
    str_page = page.content.decode()
    str_split = '\n<'.join(str_page.split('<'))
    str_split = '>\n'.join(str_split.split('>'))
    str_split = str_split.replace('\n', '')
    str_split = str_split.replace('<br>', ' ')
    soup = BeautifulSoup(str_split.encode(), 'html.parser')

for the better part of two hours, I have determined that the pandas DataFrame prints newline characters as string literals. Everything else indicates that the program is working as intended, so I believe this was the problem all along.
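The behaviour described above can be reproduced in isolation. The following is a sketch, assuming a small made-up DataFrame with an object-dtype column (the exact rendering may differ between pandas versions): the cell keeps the real newline, while the text rendering escapes it. One commonly used workaround in a notebook is to render the frame as HTML and turn the escaped sequence into <br> tags.

```python
import pandas as pd

# Hypothetical one-row frame standing in for the scraped posts
post = pd.DataFrame({"Message": pd.Series(["Hello\nworld"], dtype=object)})

# The stored value still contains a real newline character
assert post.loc[0, "Message"] == "Hello\nworld"

# The text rendering escapes the newline, which is why it looks like a literal \n
rendered = post.to_string()
print(rendered)

# Workaround sketch: build the HTML table, then swap the escaped sequence for <br>
html = post.to_html().replace("\\n", "<br>")
# In Jupyter this could be shown with:
#   from IPython.display import HTML; HTML(html)
```

Note that the replace also affects any cell that legitimately contains a backslash followed by n, so this is only suitable when the data is known not to contain that sequence.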

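For some reason, replacing directly with a newline doesn't work with BS4. You have to first replace the tag with some other unique character (preferably a sequence of characters), and then replace that sequence in the text with a newline.

Try this.

    for br in soup.find_all('br'):
        br.replace_with('+++')
    text = soup.get_text().replace('+++', '\n')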
