Scraping article times from a website with BeautifulSoup doesn't work



My project is to crawl all of the article information on a site's pages with BeautifulSoup: the article title, time, and body. However, as you can see, the article time text sits inside an <li> tag, behind an <i> icon tag. I have tried my best all day, but I could not solve this. How can I solve this problem?
    import urllib.request
    import urllib.parse
    from bs4 import BeautifulSoup
    import pandas as pd
    import requests

    i = input('Start page? : ')
    k = input('End page? : ')
    pagenum = int(i)
    lastpage = int(k)
    count = int(i)
    news_info = pd.DataFrame(columns=('Title', 'Datetime', 'Article'))
    idx = 0
    while pagenum < lastpage + 1:
        url = f'http://www.koscaj.com/news/articleList.html?page={pagenum}&total=72698&box_idxno=&sc_section_code=S1N2&view_type=sm'
        html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(html, 'html.parser')
        links = soup.find_all(class_='list-titles')
        print(f'-----{count}page result-----')
        for link in links:
            news_url = "http://www.koscaj.com" + link.find('a')['href']
            news_link = urllib.request.urlopen(news_url).read()
            soup2 = BeautifulSoup(news_link, 'html.parser')
            title = soup2.find('div', {'class': 'article-head-title'})
            date = soup2.find('div', {'class': 'info-text'})
            datetime = date[1]
            article = soup2.find('div', {'id': 'article-view-content-div'})
            news_info.loc[idx] = [title, datetime, article]
            idx += 1

        pagenum += 1
        count += 1
    print('Complete')
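
For what it's worth, the line that fails is datetime = date[1]: date is a bs4 Tag, and indexing a Tag looks up an HTML attribute by name rather than a child element, so date[1] raises KeyError: 1. A minimal sketch of the failure, using markup modeled on the info-text block quoted in the answers below:

    from bs4 import BeautifulSoup

    # Markup modeled on the info-text block quoted in the answers below.
    html = '''<div class="info-text"><ul>
    <li><i class="fa fa-user-o fa-fw"></i> 전문건설신문</li>
    <li><i class="fa fa-clock-o fa-fw"></i> 승인 2020.11.25 18:24</li>
    </ul></div>'''

    date = BeautifulSoup(html, 'html.parser').find('div', {'class': 'info-text'})

    print(date['class'])    # attribute lookup -> ['info-text']
    try:
        date[1]             # there is no attribute named 1
    except KeyError as err:
        print('KeyError:', err)   # KeyError: 1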
    
Not really clear what your issue is; I think this is what you are after. Note also that you need to take the text of the title and the article, since your code does not do that:

    import urllib.request
    import urllib.parse
    from bs4 import BeautifulSoup
    import pandas as pd
    import requests

    i = input('Start page? : ')
    k = input('End page? : ')
    pagenum = int(i)
    lastpage = int(k)
    count = int(i)
    news_info = pd.DataFrame(columns=('Title', 'Datetime', 'Article'))
    idx = 0
    while pagenum < lastpage + 1:
        url = f'http://www.koscaj.com/news/articleList.html?page={pagenum}&total=72698&box_idxno=&sc_section_code=S1N2&view_type=sm'
        html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(html, 'html.parser')
        links = soup.find_all(class_='list-titles')
        print(f'-----{count}page result-----')
        for link in links:
            news_url = "http://www.koscaj.com" + link.find('a')['href']
            news_link = urllib.request.urlopen(news_url).read()
            soup2 = BeautifulSoup(news_link, 'html.parser')

            title = soup2.find('div', {'class': 'article-head-title'})
            if title:
                title = title.text
            else:
                title = ''

            date = soup2.find('div', {'class': 'info-text'})
            try:
                # find the clock icon, then take the text of its parent <li>
                datetime = date.find('i', {'class': 'fa fa-clock-o fa-fw'}).parent.text.strip()
            except:
                datetime = ''

            article = soup2.find('div', {'id': 'article-view-content-div'})
            if article:
                article = article.text
            else:
                article = ''

            news_info.loc[idx] = [title, datetime, article]
            idx += 1

        pagenum += 1
        count += 1
    print('Complete')
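
One side note on the pattern above, offered as a judgment call rather than a requirement: the bare except will also hide unrelated bugs. Since the only expected failure here is date or the <i> lookup coming back None, catching AttributeError specifically would keep the empty-string fallback while letting anything else surface.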
    

You have to access the inner children of this tag.

Let's say the variable date contains:

    <div class="info-text">
    <ul class="...">
    <li><i class="fa fa-user-o fa-fw"></i> 전문건설신문</li>
    <li><i class="fa fa-clock-o fa-fw"></i> 승인 2020.11.25 18:24</li>
    ...
    

You can access the date with:

    date.find_all('li')[1].text

which will be:

    승인 2020.11.25 18:24

You can read more about navigating the child elements in the BeautifulSoup documentation.
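
To make that snippet runnable end to end, here is a minimal sketch; the markup is the fragment quoted above with closing tags added (the elided class on the <ul> is simply omitted here):

    from bs4 import BeautifulSoup

    # The fragment quoted above, with closing tags added.
    html = '''<div class="info-text">
    <ul>
    <li><i class="fa fa-user-o fa-fw"></i> 전문건설신문</li>
    <li><i class="fa fa-clock-o fa-fw"></i> 승인 2020.11.25 18:24</li>
    </ul>
    </div>'''

    date = BeautifulSoup(html, 'html.parser').find('div', {'class': 'info-text'})

    # Position-based access: the date is the second <li> child.
    print(date.find_all('li')[1].text.strip())   # 승인 2020.11.25 18:24

    # Equivalent lookup that does not depend on the order of the <li> items:
    print(date.find('i', class_='fa fa-clock-o fa-fw').parent.text.strip())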
