Problem extracting strings from an HTML page with BS4



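I am writing a program that looks up song lyrics. It is nearly finished, but I am having some trouble with the bs4 data types. My question is: how do I extract plain text from the lyric variable at the end of the code?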

import re
import requests
import bs4
from urllib.parse import unquote  # Python 3 location of unquote

def getLink(fileName):
    webFileName = unquote(fileName)
    page = requests.get("http://songmeanings.com/query/?query="+str(webFileName)+"&type=songtitles")
    # page.text is the decoded body, so a str pattern can be matched against it
    match = re.search('songmeanings.com/[^image].*?/"', page.text)
    if match:
        Mached = str("http://"+match.group())
        return(Mached[:-1:]) # this line used to remove a " at the end of line
    else:
        return(1)

def getText(link):
    page = requests.get(str(link))
    soup = bs4.BeautifulSoup(page.content, "lxml")
    return(soup)

Soup = getText(getLink("paranoid android"))
lyric = Soup.findAll(attrs={"lyric-box"})
print(lyric)

The result is:

[\n\t\t\t\t\t\tPlease could you stop the noise,
\nI'm trying to get some rest
\nFrom all the unborn chicken voices in my head
\nWhat's that?
...
\nEdit Lyrics\nEdit Wiki\nAdd Video\n
]

Add the following line of code:

lyric = ''.join([tag.text for tag in lyric])

right after

lyric = Soup.findAll(attrs={"lyric-box"})

and you will get output like this:

                        Please could you stop the noise,
I'm trying to get some rest
From all the unborn chicken voices in my head
What's that?
What's that?
When I am king, you will be first against the wall
With your opinion which is of no consequence at all
What's that?
What's that?
...
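
The same idea can be tried without touching the site. The following is a minimal sketch on a hand-written snippet: the markup, including the inner `editbutton` class, is an assumption made for illustration, and only the class name `lyric-box` comes from the question. It uses the built-in `html.parser` instead of `lxml`, and `get_text()` with a separator as an alternative to joining `tag.text` by hand.

```python
import bs4

# Hypothetical markup, loosely modelled on the output shown in the question.
html = """
<div class="holder lyric-box">
\t\t\tFirst line,<br/>
second line<br/>
<br/>
third line
<div class="editbutton"><a href="#">Edit Lyrics</a></div>
</div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
box = soup.find(class_="lyric-box")

# .text concatenates every string inside the tag, nested tags included,
# so the "Edit Lyrics" link text and the stray whitespace are still there.
raw = box.text

# Remove nested <div> blocks (the assumed edit links) before extracting text.
for extra in box.find_all("div"):
    extra.decompose()

# strip=True trims each text fragment and skips the empty ones;
# the fragments that remain are joined with the separator.
clean = box.get_text("\n", strip=True)
print(clean)
```

If the real page nests the edit links differently, the `decompose()` loop would need a different selector.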

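First trim the leading and trailing [] with stringvar[1:-1], then call linevar.strip() on each line, which removes the leading and trailing whitespace from that line.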
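
As a rough sketch of the per-line cleanup, assuming the text has already been extracted into a plain string (the sample string below is made up for illustration):

```python
# Sample text with the kind of stray whitespace shown in the question.
raw = "\n\t\t\t\t\t\tFirst line,\r\nsecond line\n\n  third line \n"

# splitlines() handles both \n and \r\n line endings;
# strip() trims whitespace at both ends of each line.
lines = [line.strip() for line in raw.splitlines()]

# Drop the lines that are empty after stripping.
cleaned = "\n".join(line for line in lines if line)
print(cleaned)
```

Note that strip() leaves whitespace inside a line untouched, so spacing between words is preserved.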

For anyone who likes the idea: after a few small modifications, my code ended up looking like this :)

import re
import pycurl
import bs4
from io import BytesIO
from urllib.parse import quote, unquote

def getLink(fileName):
    # decode any existing escapes, then percent-encode for the query string
    fileName = quote(unquote(fileName))
    baseAddres = "https://songmeanings.com/query/?query="
    linkToPage = baseAddres + fileName + "&type=songtitles"

    # pycurl writes the raw response bytes into this buffer
    buffer = BytesIO()
    page = pycurl.Curl()
    page.setopt(page.URL, linkToPage)
    page.setopt(page.WRITEDATA, buffer)
    page.perform()
    page.close()

    pageSTR = buffer.getvalue()

    soup = bs4.BeautifulSoup(pageSTR, "lxml")

    tab_content = str(soup.find_all(attrs={"tab-content"}))
    pattern = r'"//songmeanings.com/.+?"'
    links = re.findall(pattern, tab_content)

    # return the first matched item without the double quotes
    # at the beginning and at the end of the string
    return("http:"+links[0][1:-1:])

def getText(linkToSong):
    buffer = BytesIO()
    page = pycurl.Curl()
    page.setopt(page.URL, linkToSong)
    page.setopt(page.WRITEDATA, buffer)
    page.perform()
    page.close()

    pageSTR = buffer.getvalue()

    soup = bs4.BeautifulSoup(pageSTR, "lxml")

    # join the text of every element that has the lyric-box class
    lyric_box = soup.find_all(attrs={"lyric-box"})
    lyric_boxSTR = ''.join([tag.text for tag in lyric_box])
    return(lyric_boxSTR)


link = getLink("Anarchy In The U.K")
text = getText(link)
print(text)
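
As a possible variation, the link could be read from the href attribute directly instead of running a regular expression over the stringified tags. This is only a sketch under assumed markup: the snippet and the song paths in it are hypothetical, the helper name `first_song_link` is made up for illustration, and only the `tab-content` class and the protocol-relative `//songmeanings.com/` prefix are taken from the code above.

```python
import bs4

# Hypothetical search-results markup; the song paths are invented.
html = """
<div class="tab-content">
  <a href="//songmeanings.com/songs/view/111/">First hit</a>
  <a href="//songmeanings.com/songs/view/222/">Second hit</a>
</div>
"""

def first_song_link(page_html):
    """Return the first result link as an absolute URL, or None if there is none."""
    soup = bs4.BeautifulSoup(page_html, "html.parser")
    # CSS selector: the first <a> with an href nested inside .tab-content
    anchor = soup.select_one(".tab-content a[href]")
    if anchor is None:
        return None
    # The href is protocol-relative, so prepend the scheme as the code above does
    return "http:" + anchor["href"]

print(first_song_link(html))
```

Returning None for an empty result avoids the IndexError that links[0] would raise when the search finds nothing.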
