Beautiful Soup: excluding unwanted parts

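I realize this may be a very specific question, but I'm struggling to get rid of some parts of the text I get with the code below. I need the plain article text, which I locate by finding the 'p' tags with the class 'mol-para-with-font'. Somehow I also get a lot of other stuff: the author's byline, the date stamp and, most importantly, text from the ads on the page. Inspecting the HTML, I can't see that those elements carry the same class 'mol-para-with-font', so I'm puzzled (or maybe I've been staring at it for too long...). I know there are plenty of HTML gurus here, so I'd appreciate your help.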

My code:

import codecs

import requests
import translitcodec  # importing it registers the 'translit/one' codec
from bs4 import BeautifulSoup


def get_text(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")
    # delete unwanted tags:
    for s in soup(['figure', 'script', 'style', 'table']):
        s.decompose()
    article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all(['p', {'class': 'mol-para-with-font'}])]
    article = '\n'.join(article_soup)
    text = codecs.encode(article, 'translit/one').encode('ascii', 'replace')  # transliterate, then replace anything still non-ASCII
    text = u"{}".format(text)  # convert back to unicode
    print text


url = 'http://www.dailymail.co.uk/femail/article-4703718/How-Alexander-McQueen-Kate-s-royal-tours.html'
get_text(url)

You want only the 'p' tags with class="mol-para-with-font"? This will give you those:

import requests
from bs4 import BeautifulSoup as BS

url = 'http://www.dailymail.co.uk/femail/article-4703718/How-Alexander-McQueen-Kate-s-royal-tours.html'
r = requests.get(url)
soup = BS(r.content, "lxml")
for i in soup.find_all('p', class_='mol-para-with-font'):
    print(i.text)
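The difference from the code in the question seems to be how the filters are passed. There, `find_all` receives a single list, so the whole list is taken as the tag-name filter and the dict is never used as an attribute filter, which appears to be why every 'p' on the page came back. Here is a minimal offline sketch of the same idea. The `SAMPLE_HTML` markup, the `byline` and `ad-text` classes and the helper names are made up for illustration (this is not the real Daily Mail page), and it uses the built-in "html.parser" only to avoid needing lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the article page, NOT the real markup.
SAMPLE_HTML = """
<div>
  <p class="byline">By A. Reporter</p>
  <p class="mol-para-with-font">First paragraph.</p>
  <p class="ad-text">Buy now!</p>
  <p class="mol-para-with-font">Second paragraph.</p>
</div>
"""


def article_paragraphs(html):
    """Return the text of the <p> tags that carry the article class."""
    soup = BeautifulSoup(html, "html.parser")
    # Tag name is the first argument; the class goes in the separate
    # class_ keyword, so both conditions have to match.
    tags = soup.find_all("p", class_="mol-para-with-font")
    return [p.get_text(strip=True) for p in tags]


def article_paragraphs_css(html):
    """Same filter, written as a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(strip=True) for p in soup.select("p.mol-para-with-font")]


if __name__ == "__main__":
    print("\n".join(article_paragraphs(SAMPLE_HTML)))
```

Either helper should skip the byline and the ad paragraph in this sample. `find_all('p', {'class': 'mol-para-with-font'})`, with the dict as a second argument rather than inside a list, is another way to write the same filter.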
