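My project aims to crawl the article information from every page of a news site using BeautifulSoup. The information I need is each article's title, time, and body. However, as you can see, the article time text comes right after an <i> tag. I have been trying all day and still cannot solve this. How can I extract the time text?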
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
import pandas as pd
import requests

i = input('Start page? : ')
k = input('End page? : ')
pagenum = int(i)
lastpage = int(k)
count = int(i)
news_info = pd.DataFrame(columns=('Title', 'Datetime', 'Article'))
idx = 0
while pagenum < lastpage + 1:
    url = f'http://www.koscaj.com/news/articleList.html?page={pagenum}&total=72698&box_idxno=&sc_section_code=S1N2&view_type=sm'
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.find_all(class_='list-titles')
    print(f'-----{count}page result-----')
    for link in links:
        news_url = "http://www.koscaj.com" + link.find('a')['href']
        news_link = urllib.request.urlopen(news_url).read()
        soup2 = BeautifulSoup(news_link, 'html.parser')
        title = soup2.find('div', {'class': 'article-head-title'})
        date = soup2.find('div', {'class': 'info-text'})
        # My attempt at picking out the time; this does not give me the time text.
        datetime = date[1]
        article = soup2.find('div', {'id': 'article-view-content-div'})
        news_info.loc[idx] = [title, datetime, article]
        idx += 1
    pagenum += 1
    count += 1
print('Complete')
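It is not entirely clear what your question is, but I think this is what you are after. Also note that you need to extract the text of the title and the article as well, since your code does not do that :)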
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd

i = input('Start page? : ')
k = input('End page? : ')
pagenum = int(i)
lastpage = int(k)
count = int(i)
news_info = pd.DataFrame(columns=('Title', 'Datetime', 'Article'))
idx = 0
while pagenum < lastpage + 1:
    url = f'http://www.koscaj.com/news/articleList.html?page={pagenum}&total=72698&box_idxno=&sc_section_code=S1N2&view_type=sm'
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.find_all(class_='list-titles')
    print(f'-----{count}page result-----')
    for link in links:
        news_url = "http://www.koscaj.com" + link.find('a')['href']
        news_link = urllib.request.urlopen(news_url).read()
        soup2 = BeautifulSoup(news_link, 'html.parser')
        # Store the title text, or '' when the element is missing.
        title = soup2.find('div', {'class': 'article-head-title'})
        title = title.text if title else ''
        date = soup2.find('div', {'class': 'info-text'})
        try:
            # The time is the text of the element that wraps the clock icon.
            datetime = date.find('i', {'class': 'fa fa-clock-o fa-fw'}).parent.text.strip()
        except AttributeError:
            # Either the info block or the clock icon was not found.
            datetime = ''
        # Store the article text, or '' when the element is missing.
        article = soup2.find('div', {'id': 'article-view-content-div'})
        article = article.text if article else ''
        news_info.loc[idx] = [title, datetime, article]
        idx += 1
    pagenum += 1
    count += 1
print('Complete')
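The value stored in the Datetime column is still plain text that carries the 승인 label in front of the timestamp. If a real datetime object is more useful, something along these lines might work. This is only a sketch: the helper name parse_article_time is made up for illustration, and the "YYYY.MM.DD HH:MM" layout is assumed from the single sample value shown in this thread.

```python
import re
from datetime import datetime

# Assumed layout (taken from one sample value): "YYYY.MM.DD HH:MM",
# possibly preceded by a label such as "승인".
TIMESTAMP_PATTERN = re.compile(r"\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}")


def parse_article_time(text):
    """Return a datetime for the first timestamp found in text, or None."""
    match = TIMESTAMP_PATTERN.search(text)
    if match is None:
        return None
    return datetime.strptime(match.group(0), "%Y.%m.%d %H:%M")


print(parse_article_time("승인 2020.11.25 18:24"))  # 2020-11-25 18:24:00
```

Note that the scraper above uses datetime as a variable name, so that variable would need renaming before combining it with this import.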
You have to access the inner children of this tag. Suppose the variable date contains:
<div class="info-text">
<ul class="...">
<li><i class="fa fa-user-o fa-fw"></i> 전문건설신문</li>
<li><i class="fa fa-clock-o fa-fw"></i> 승인 2020.11.25 18:24</li>
...
You can then access the date with:
date.find_all('li')[1].text
Apart from leading whitespace, which .strip() removes, the result will be:
승인 2020.11.25 18:24
You can read more about accessing child elements in the documentation.
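To try this without hitting the site, the lookup can be run against an inline copy of the markup. The sketch below is not taken from the site itself: the sample HTML is modelled on the snippet above with the closing tags filled in, and the helper name extract_date_text is hypothetical. It looks up the clock icon rather than relying on the position of the li, which may hold up better if the order of the list items changes.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the snippet above; closing tags are added here
# so that the fragment is complete.
SAMPLE_HTML = """
<div class="info-text">
  <ul>
    <li><i class="fa fa-user-o fa-fw"></i> 전문건설신문</li>
    <li><i class="fa fa-clock-o fa-fw"></i> 승인 2020.11.25 18:24</li>
  </ul>
</div>
"""


def extract_date_text(html):
    """Return the text of the element wrapping the clock icon, or ''."""
    soup = BeautifulSoup(html, "html.parser")
    info = soup.find("div", class_="info-text")
    if info is None:
        return ""
    # class_ matches a single CSS class even when the tag has several.
    icon = info.find("i", class_="fa-clock-o")
    if icon is None:
        return ""
    return icon.parent.get_text(strip=True)


print(extract_date_text(SAMPLE_HTML))  # 승인 2020.11.25 18:24
```

Returning an empty string when either element is missing keeps a crawl loop running on pages that lack the info block.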