The idea is to check the last three pages of the German medical news site. Each of these pages contains five links to individual articles. The program checks whether each "href" already exists in data.csv. If not, it adds the "href" to data.csv, follows the link, and saves the article content to an .html file.
The content of each article page looks like this:
<html>
..
..
<div class="newstext">
<p> article-piece 1</p>
<p> article-piece 2</p>
<p> article-piece 3</p>
<div class="URLkastenWrapper">
<div class="newsKasten URLkasten newsKastenLinks">
<p> not wanted stuff</p>
</div>
</div>
<p> article-piece 4</p>
<p> article-piece 5</p>
</div>
I want to save the "article-pieces" to HTML and exclude the "not wanted stuff". I tried using recursive=False, as shown in my code; as far as my research goes, that should be the way to achieve this, right? But for some reason it doesn't work :(
import requests
from bs4 import BeautifulSoup
import mechanicalsoup

# This requests the first 3 news pages; each of them contains 5 articles.
scan_med_news = ['https://www.aerzteblatt.de/nachrichten/Medizin?page=1', 'https://www.aerzteblatt.de/nachrichten/Medizin?page=2', 'https://www.aerzteblatt.de/nachrichten/Medizin?page=3']

# This function is meant to create an html file with the article pieces of the website.
def article_html_create(title, url):
    with open(title + '.html', 'a+') as article:
        article.write('<h1>' + title + '</h1>\n\n')
        subpage = BeautifulSoup(requests.get(url).text, 'html5lib')
        for line in subpage.select('.newstext p', recursive=False):
            # this recursive=False is not working as I wish
            article.write(line.text + '<br><br>')

# This piece of code takes the URLs of already saved articles from the .csv and puts them in a list.
contentlist = []
with open('data.csv', "r") as file:
    for line in file:
        for item in line.strip().split(','):
            contentlist.append(item)

# For every article on these pages, it checks whether the URL is in the contentlist created from data.csv.
with open('data.csv', 'a') as file:
    for page in scan_med_news:
        doc = requests.get(page)
        doc.encoding = 'utf-8'
        soup = BeautifulSoup(doc.text, 'html5lib')
        for h2 in soup.find_all('h2'):
            for a in h2.find_all('a'):
                if a['href'] in contentlist:
                    # if the URL is already in the list, it prints "Already existing"
                    print('Already existing')
                else:
                    # if the URL is not in the list, it adds the URL to data.csv and calls article_html_create to save the article's content
                    file.write(a['href'] + ',')
                    article_html_create(a.text, 'https://www.aerzteblatt.de' + a['href'])
                    print('Added to the file!')
You can select the parent div node of the unwanted p node and set its string attribute to an empty string; this replaces the parent's children, so they are removed from the soup. After that you can do your regular select.
Example:
In [17]: soup = BeautifulSoup(html, 'lxml')
In [18]: soup
Out[18]:
<html><body><div class="newstext">
<p> article-piece 1</p>
<p> article-piece 2</p>
<p> article-piece 3</p>
<div class="URLkastenWrapper">
<div class="newsKasten URLkasten newsKastenLinks">
<p> not wanted stuff</p>
</div>
</div>
<p> article-piece 4</p>
<p> article-piece 5</p>
</div></body></html>
In [19]: soup.select_one('.URLkastenWrapper').string = ''
In [20]: soup
Out[20]:
<html><body><div class="newstext">
<p> article-piece 1</p>
<p> article-piece 2</p>
<p> article-piece 3</p>
<div class="URLkastenWrapper"></div>
<p> article-piece 4</p>
<p> article-piece 5</p>
</div></body></html>
In [21]: soup.select('.newstext p')
Out[21]:
[<p> article-piece 1</p>,
<p> article-piece 2</p>,
<p> article-piece 3</p>,
<p> article-piece 4</p>,
<p> article-piece 5</p>]
Give it a try and see if it works. Just change:
for line in subpage.select('.newstext p', recursive=False):
    # this recursive=False is not working as I wish
    article.write(line.text + '<br><br>')
to
for line in subpage.select('.newstext > p'):
    article.write(line.text + '<br><br>')
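The reason this works: '.newstext > p' uses the CSS child combinator, which matches only <p> tags that are direct children of .newstext, so the <p> nested inside div.URLkastenWrapper is skipped, whereas '.newstext p' matches every descendant <p>. A quick sketch against the snippet from your question:
from bs4 import BeautifulSoup

html = '''
<div class="newstext">
<p> article-piece 1</p>
<div class="URLkastenWrapper">
<div class="newsKasten URLkasten newsKastenLinks">
<p> not wanted stuff</p>
</div>
</div>
<p> article-piece 2</p>
</div>
'''
soup = BeautifulSoup(html, 'html5lib')

print(len(soup.select('.newstext p')))    # 3 -- every descendant <p>, including the unwanted one
print(len(soup.select('.newstext > p')))  # 2 -- only the direct children of .newstext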
My output (using your HTML snippet from above, and print instead of article.write) is:
article-piece 1
article-piece 2
article-piece 3
article-piece 4
article-piece 5
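For reference, a minimal sketch of article_html_create with only that selector change applied (the rest of your script stays the same):
import requests
from bs4 import BeautifulSoup

def article_html_create(title, url):
    # Parse the article page and keep only the direct <p> children of .newstext,
    # which skips the nested div.URLkastenWrapper block.
    subpage = BeautifulSoup(requests.get(url).text, 'html5lib')
    with open(title + '.html', 'a+') as article:
        article.write('<h1>' + title + '</h1>\n\n')
        for line in subpage.select('.newstext > p'):
            article.write(line.text + '<br><br>')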