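I'm writing a program that looks up song lyrics. It is nearly finished, but I'm having trouble with a bs4 data type. My question: how do I extract plain text from the lyric variable at the end of the script?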
import re
import requests
import bs4
from urllib import unquote

def getLink(fileName):
    webFileName = unquote(fileName)
    page = requests.get("http://songmeanings.com/query/?query="+str(webFileName)+"&type=songtitles")
    match = re.search('songmeanings.com/[^image].*?/"',page.content)
    if match:
        Mached = str("http://"+match.group())
        return(Mached[:-1:]) # this line used to remove a " at the end of line
    else:
        return(1)

def getText(link):
    page = requests.get(str(link))
    soup = bs4.BeautifulSoup(page.content ,"lxml")
    return(soup)

Soup = getText(getLink("paranoid android"))
lyric = Soup.findAll(attrs={"lyric-box"})
print (lyric)
The result is:
[\n\t\t\t\t\t\tPlease could you stop the noise,
\nI'm trying to get some rest
\nFrom all the unborn chicken voices in my head
\nWhat's that?
\n
...
\nEdit Lyrics\nEdit Wiki \nAdd Video\n
]
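findAll returns a list of Tag objects, so printing it shows the raw markup. Add the following line:
lyric = ''.join([tag.text for tag in lyric])
after
lyric = Soup.findAll(attrs={"lyric-box"})
and you will get output like this: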
Please could you stop the noise,
I'm trying to get some rest
From all the unborn chicken voices in my head
What's that?
What's that?
When I am king, you will be first against the wall
With your opinion which is of no consequence at all
What's that?
What's that?
...
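To make the idea above concrete, here is a minimal sketch on a small stand-in document. The HTML string is made up for illustration (the real page's markup will differ), and it uses get_text with a separator and strip=True as one way to also drop the leading tabs and blank lines; the built-in "html.parser" is used only so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the lyric box markup; the real page differs.
html = (
    '<div class="lyric-box">\n\t\tfirst line<br/>\n'
    'second line<br/>\n<a href="#">Edit Lyrics</a></div>'
)

soup = BeautifulSoup(html, "html.parser")
boxes = soup.find_all(class_="lyric-box")  # a list-like ResultSet of Tag objects

# get_text() collects the text nodes of each tag; with strip=True every piece
# is stripped and empty pieces are skipped, then joined with the separator.
text = "\n".join(box.get_text("\n", strip=True) for box in boxes)
print(text)
```

Note that link text inside the box (such as "Edit Lyrics") is still part of the tag's text, so it would need to be filtered out separately if it is not wanted.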
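First trim the leading and trailing [] with stringvar[1:-1], then call linevar.strip() on each line, which removes the leading and trailing whitespace (spaces, tabs and newlines) from that line.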
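A rough sketch of that approach, assuming the list has already been turned into a string. The sample value is invented for illustration, and the names stringvar and linevar simply follow the answer above. It only cleans up brackets and whitespace; any HTML tags left in the string would remain.

```python
# Hypothetical sample: the string form of a scraped list, with stray whitespace.
stringvar = "[\n\t\t\tfirst line\n second line \n\n\tthird line\n]"

inner = stringvar[1:-1]  # drop the leading "[" and the trailing "]"

# strip() removes leading and trailing whitespace from each line
lines = [linevar.strip() for linevar in inner.splitlines()]

# skip the lines that are empty after stripping
plain = "\n".join(line for line in lines if line)
print(plain)
```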
For anyone who likes this idea: after a few small modifications, my code ended up looking like this :)
import re
import pycurl
import bs4
from urllib import unquote
from StringIO import StringIO

def getLink(fileName):
    fileName = unquote(fileName)
    baseAddres = "https://songmeanings.com/query/?query="
    linkToPage = str(baseAddres)+str(fileName)+str("&type=songtitles")
    buffer = StringIO()
    page = pycurl.Curl()
    page.setopt(page.URL,linkToPage)
    page.setopt(page.WRITEDATA,buffer)
    page.perform()
    page.close()
    pageSTR = buffer.getvalue()
    soup = bs4.BeautifulSoup(pageSTR,"lxml")
    tab_content = str(soup.find_all(attrs={"tab-content"}))
    pattern = r'"//songmeanings.com/.+?"'
    links = re.findall(pattern,tab_content)
    # returns first matched item without double quote
    # at the beginning and at the end of the string
    return("http:"+links[0][1:-1:])

def getText(linkToSong):
    buffer = StringIO()
    page = pycurl.Curl()
    page.setopt(page.URL,linkToSong)
    page.setopt(page.WRITEDATA,buffer)
    page.perform()
    page.close()
    pageSTR = buffer.getvalue()
    soup = bs4.BeautifulSoup(pageSTR,"lxml")
    lyric_box = soup.find_all(attrs={"lyric-box"})
    lyric_boxSTR = ''.join([tag.text for tag in lyric_box])
    return(lyric_boxSTR)

link = getLink("Anarchy In The U.K")
text = getText(link)
print(text)
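One thing that may be worth tightening: the search URL is built by plain string concatenation, so a title containing spaces or "&" goes into the query string unescaped. Below is a minimal Python 3 sketch of an alternative, assuming the same search endpoint and parameters as in the code above. build_search_url is a hypothetical helper name, not part of the original program.

```python
from urllib.parse import urlencode

# Search endpoint taken from the code above.
BASE_ADDRESS = "https://songmeanings.com/query/"

def build_search_url(title):
    """Return the search URL for a song title, with the title percent-encoded."""
    # urlencode escapes spaces, "&" and other reserved characters in the values
    params = urlencode({"query": title, "type": "songtitles"})
    return BASE_ADDRESS + "?" + params

print(build_search_url("paranoid android"))
```

The returned string could then be passed to whatever performs the request (requests or pycurl in the versions above).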