Python - web scraping Pubmed abstract - want to create 2 line breaks between sections (separated by all caps and ":")

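I want to web scrape abstracts from pubmed.gov and put a line break (a paragraph break) between the sections so that they do not all run together. The section headings are usually in all caps followed by a colon. Examples: INTRODUCTION: or ABSTRACT: or METHODS:.

I want to parse out each section and put 2 line breaks between consecutive sections.

What I get now: INTRODUCTION: blah blah blah. METHODS: We conducted an experiment to X. CONCLUSION: This was a great experiment.

Desired output:

INTRODUCTION: blah blah blah.

METHODS: We conducted an experiment to X.

CONCLUSION: This was a great experiment.

Important note: the headings are not always the same, but they are always in all caps followed by a colon. So I think I need to figure out how to use a regex that looks for words in all caps followed by a colon and inserts 2 line breaks before them.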
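The regex idea described above could be sketched roughly as follows. This is only a minimal illustration, not a tested solution for real PubMed pages: the helper name `split_sections` is hypothetical, and the pattern assumes a heading is one or more all-caps words (optionally containing spaces, "&", "/" or "-") immediately followed by a colon. An all-caps abbreviation followed by a colon inside a sentence would also be treated as a heading.

```python
import re

# Assumption: a section heading is at least two characters of capital letters
# (plus spaces, "&", "/" or "-"), starting at a word boundary and ending in ":".
HEADING = re.compile(r"\s*\b([A-Z][A-Z &/-]+:)")


def split_sections(text):
    """Insert two line breaks before every all-caps heading (hypothetical helper)."""
    # Swallow the whitespace before the heading and replace it with a blank line.
    separated = HEADING.sub(r"\n\n\1", text)
    # The first heading also receives leading newlines; strip them off.
    return separated.strip()


if __name__ == "__main__":
    sample = (
        "INTRODUCTION: blah blah blah. "
        "METHODS: We conducted an experiment to X. "
        "CONCLUSION: This was a great experiment."
    )
    print(split_sections(sample))
```

Applied to the scraped string `x` from the loop in the question, this would be `print(split_sections(x))`. Single capital letters such as the "X" in the sample are not matched because the pattern requires at least two characters before the colon.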

import pandas as pd
import requests
import re
from bs4 import BeautifulSoup
from urllib.request import urlopen
import datetime
import csv
import time
listofa_urls = ['https://www.ncbi.nlm.nih.gov/pubmed/30470520', 
'https://www.ncbi.nlm.nih.gov/pubmed/31063262','https://www.ncbi.nlm.nih.gov/pubmed/31067303']
for l in listofa_urls:
    response = requests.get(l)
    soup = BeautifulSoup(response.content, 'html.parser')
    x = soup.find(class_='abstr').get_text()
    #print(x.encode("utf-8"))
    x = re.sub(r"\babstract(.*?)", r"\1", x, flags=re.I)
    print(x.encode("utf-8"))
    print()

I improved your code; it has only been tested on these 3 URLs.

import requests
from bs4 import BeautifulSoup

listofa_urls = ['https://www.ncbi.nlm.nih.gov/pubmed/30470520',
'https://www.ncbi.nlm.nih.gov/pubmed/31063262','https://www.ncbi.nlm.nih.gov/pubmed/31067303']
for url in listofa_urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # The abstract sits inside the element with class "abstr"
    div_ = soup.find(class_='abstr').find('div')
    if div_.find('h4'):
        # Structured abstract: each section heading is an <h4> with a matching <p>
        h4_ = div_.find_all('h4')
        p_ = div_.find_all('p')
    else:
        # No <h4> headings: fall back to the <h3> heading(s) of the abstract block
        h4_ = soup.find(class_='abstr').find_all('h3')
        p_ = soup.find(class_='abstr').find_all('p')
    # Pair each heading with the text of its paragraph
    mp = [[h.get_text(), p.get_text()] for h, p in zip(h4_, p_)]
    print(mp)
    print()
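
The code above prints a list of [heading, text] pairs rather than the layout asked for in the question. One possible way to turn such pairs into sections separated by two line breaks is sketched below. The function name `format_sections` and the sample data are made up for illustration, and the sketch assumes the heading text may or may not already end in a colon.

```python
def format_sections(pairs):
    """Join [heading, text] pairs into 'HEADING: text' blocks (hypothetical helper)."""
    blocks = []
    for heading, body in pairs:
        # Normalize the heading so that exactly one colon follows it.
        heading = heading.strip().rstrip(":")
        body = body.strip()
        blocks.append(f"{heading}: {body}" if heading else body)
    # A blank line (two line breaks) separates consecutive sections.
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Made-up sample in the same shape as the `mp` list built in the loop.
    sample_pairs = [
        ["INTRODUCTION:", "blah blah blah."],
        ["METHODS:", "We conducted an experiment to X."],
        ["CONCLUSION:", "This was a great experiment."],
    ]
    print(format_sections(sample_pairs))
```

Inside the loop, `print(format_sections(mp))` would replace `print(mp)`.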
