Beautiful Soup: excluding unwanted parts



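I realize this may be a very specific problem, but I'm struggling to get rid of some of the text I get with the code below. I need the plain article text, which I locate by finding the "p" tags with the class 'mol-para-with-font'. Somehow I also get a lot of other stuff, such as the author's byline, the date stamp and, most importantly, text from the ads on the page. Inspecting the HTML, I can't see that those elements carry the same class 'mol-para-with-font', so I'm puzzled (or maybe I've just been staring at it for too long...). I know there are lots of HTML gurus here, so I'd appreciate your help.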

My code:

import codecs

import requests
import translitcodec  # importing it registers the 'translit/one' codec used below
from bs4 import BeautifulSoup

def get_text(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")
    # delete unwanted tags:
    for s in soup(['figure', 'script', 'style', 'table']):
        s.decompose()
    article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all(['p', {'class': 'mol-para-with-font'}])]
    article = '\n'.join(article_soup)
    # transliterate to ASCII, replacing any character that cannot be represented
    text = codecs.encode(article, 'translit/one').encode('ascii', 'replace').decode('ascii')
    print(text)

url = 'http://www.dailymail.co.uk/femail/article-4703718/How-Alexander-McQueen-Kate-s-royal-tours.html'
get_text(url)

Only the p tags with class="mol-para-with-font"? This will give you just those:

import requests
from bs4 import BeautifulSoup as BS
url = 'http://www.dailymail.co.uk/femail/article-4703718/How-Alexander-McQueen-Kate-s-royal-tours.html'
r = requests.get(url)
soup = BS(r.content, "lxml")
for i in soup.find_all('p', class_='mol-para-with-font'):
    print(i.text)
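
As far as I can tell, the extra text appears because the question passes a single list, ['p', {'class': 'mol-para-with-font'}], as the first argument of find_all. Beautiful Soup reads a list there as a set of alternative tag names, so the class dictionary never acts as an attribute filter and every p tag matches (recent versions may also emit a warning about the nested dictionary). Here is a minimal sketch of the difference; the markup in SAMPLE_HTML is a made-up stand-in for the real page, and html.parser is used only so the example does not depend on lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the article page: two real article
# paragraphs plus a byline and an ad that do not carry the target class.
SAMPLE_HTML = """
<div>
  <p class="author-byline">By A. Reporter</p>
  <p class="mol-para-with-font">First paragraph.</p>
  <p>Advertisement text</p>
  <p class="mol-para-with-font">Second paragraph.</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The call from the question: the list is taken as tag-name alternatives,
# so the class is not applied as a filter and all <p> tags come back.
loose = soup.find_all(["p", {"class": "mol-para-with-font"}])

# Tag name and class passed as separate arguments: only matching <p> tags.
strict = soup.find_all("p", class_="mol-para-with-font")

# Equivalent spelling using an explicit attrs dictionary.
by_attrs = soup.find_all("p", attrs={"class": "mol-para-with-font"})

article = "\n".join(p.get_text(strip=True) for p in strict)
print(len(loose), len(strict))
print(article)
```

In the question's get_text function, swapping the find_all argument for either of the two stricter forms should be enough; the decompose loop and the transliteration step can stay as they are.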
