Remove the outer div tag from BeautifulSoup output



I am trying to scrape text from a website, but I can't figure out how to remove the outer div tag. The code looks like this:

import requests
from bs4 import BeautifulSoup

team_urls = [
    'http://www.lyricsfreak.com/e/ed+sheeran/shape+of+you_21113143.html/',
    'http://www.lyricsfreak.com/e/ed+sheeran/thinking+out+loud_21083784.html',
    'http://www.lyricsfreak.com/e/ed+sheeran/photograph_21058341.html',
]
for url in team_urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    # Replace every <br> with a real newline character
    for e in soup.find_all('br'):
        e.replace_with('\n')
    lyrics = soup.find(class_='dn')
    print(lyrics)

This gives me the following output:

<div class="dn" id="content_h">The club isn't the best place...

I want to remove the div tag.

Full code:

import requests
from bs4 import BeautifulSoup
urls = ['http://www.lyricsfreak.com/e/ed+sheeran/shape+of+you_21113143.html/',
        'http://www.lyricsfreak.com/e/ed+sheeran/thinking+out+loud_21083784.html',
        'http://www.lyricsfreak.com/e/ed+sheeran/photograph_21058341.html']
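for url in urls:
    page = requests.get(url)
    page.encoding = 'utf-8'  # override the encoding guessed from the HTTP headers
    soup = BeautifulSoup(page.text, 'html.parser')
    div = soup.select_one('#content_h')
    # Replace every <br> inside the div with a real newline character
    for e in div.find_all('br'):
        e.replace_with('\n')
    lyrics = div.text  # text content only, without the surrounding div tag
    print(lyrics)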
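
The same idea can be tried offline on a small HTML string, without requesting the site. This is only a minimal sketch: the sample markup is made up for illustration (only the class `dn` and the id `content_h` come from the question). It shows two ways to drop the outer tag: `get_text()` when only the text is wanted, and `decode_contents()` when the inner markup should be kept.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; only class "dn" and id "content_h" come from the question.
html = '<div class="dn" id="content_h">First line<br/>Second line<br/><b>Third</b> line</div>'

soup = BeautifulSoup(html, 'html.parser')
div = soup.select_one('#content_h')

# Inner HTML: the children of the div, serialized without the div tag itself.
inner_html = div.decode_contents()

# Plain text: turn each <br> into a real newline, then read the text.
for br in div.find_all('br'):
    br.replace_with('\n')
lyrics = div.get_text()

print(inner_html)
print(lyrics)
```

Which of the two fits depends on whether the tags inside the div (such as `<b>` here) matter for later processing.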

Note that sometimes the wrong encoding is used:

I may be crazy don't mind me

That is why I set it manually: page.encoding = 'utf-8'. A snippet from the requests documentation mentions this situation:

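The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set r.encoding appropriately before accessing this property.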
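
The garbling can be reproduced without any HTTP request. The sketch below rests on an assumption: that the page is served as UTF-8 bytes while the client decodes them as ISO-8859-1, which is a typical fallback when the headers declare no charset. The sample string is made up for illustration.

```python
# Hypothetical sample text containing a typographic apostrophe (U+2019).
original = 'don\u2019t mind me'

raw = original.encode('utf-8')      # bytes as a UTF-8 server would send them
wrong = raw.decode('iso-8859-1')    # decoded with the wrong codec: garbled text
right = raw.decode('utf-8')         # decoded with the correct codec

print(repr(wrong))
print(right)
```

The apostrophe is three bytes in UTF-8, so the wrong codec turns it into three separate characters. Setting `page.encoding = 'utf-8'` before reading `page.text` avoids this mismatch.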

You can use a regular expression:

import requests
import re
from bs4 import BeautifulSoup
team_urls = ['http://www.lyricsfreak.com/e/ed+sheeran/shape+of+you_21113143.html/',
             'http://www.lyricsfreak.com/e/ed+sheeran/thinking+out+loud_21083784.html',
             'http://www.lyricsfreak.com/e/ed+sheeran/photograph_21058341.html']
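for url in team_urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    # Replace every <br> with a real newline character
    for e in soup.find_all('br'):
        e.replace_with('\n')
    lyrics = soup.find(class_='dn')
    cleanr = re.compile('<.*?>')
    # str(lyrics) keeps the markup, so the regex has tags to strip
    # (lyrics.text would already be tag-free)
    cleantext = re.sub(cleanr, '', str(lyrics))
    print(cleantext)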

This removes everything between < and >, including the brackets themselves.

It uses the special characters described in the Python documentation:

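"

. (Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

* Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match 'a', 'ab', or 'a' followed by any number of 'b's.

? Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either 'a' or 'ab'.

"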

From https://docs.python.org/3/library/re.html
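
One detail worth adding: in `<.*?>` the `?` follows `*`, and `*?` is the non-greedy form of `*` described on the same documentation page. The sketch below, run on a made-up HTML string, contrasts it with the greedy `<.*>`.

```python
import re

# Hypothetical sample markup for illustration.
html = '<div class="dn" id="content_h">The club <b>isn\'t</b> the best place</div>'

non_greedy = re.sub(r'<.*?>', '', html)  # each tag is matched separately
greedy = re.sub(r'<.*>', '', html)       # one match from the first '<' to the last '>'

print(repr(non_greedy))
print(repr(greedy))
```

Because `.` does not match a newline by default, a tag that spans several lines would be left in place unless `re.DOTALL` is passed. For anything beyond simple markup, the parser-based approach is probably the safer choice.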
