Error when scraping a list of links from a website with BeautifulSoup: TypeError: must be str, not NoneType



I want to scrape this website: https://ens.dk/en/our-services/oil-and-gas-related-data/monthly-and-yearly-production. It has two sets of links: SI units and Oil Field units.

I am trying to scrape the list of links under SI units, and created a function named get_gas_links for that:

import io
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs, SoupStrainer
import re

url = "https://ens.dk/en/our-services/oil-and-gas-related-data/monthly-and-yearly-production"
first_page = requests.get(url)
soup = bs(first_page.content)

def pasrse_page(link):
    print(link)
    df = pd.read_html(link, skiprows=1, headers=1)
    return df

def get_gas_links():
    glinks=[]
    gas_links = soup.find_all("a", href = re.compile("si.htm"))
    for i in gas_links:
        glinks.append("https://ens.dk/" + i.get("herf"))
    return glinks

get_gas_links()

My main goal is to scrape 3 tables from every link. Before scraping the tables, however, I am trying to scrape the list of links.

But it fails with the error: TypeError: must be str, not NoneType

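You are using the wrong regular expression in the wrong way, which is why soup cannot find any links that match the criteria. Check the following code, and validate extracted_link however you need.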

def get_gas_links():
    glinks = []
    gas_links = soup.find('table').find_all('a')
    for i in gas_links:
        extracted_link = i['href']
        # You can validate the extracted link however you want
        glinks.append("https://ens.dk/" + extracted_link)
    return glinks
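
One more detail in the question's code is worth checking: it calls i.get("herf"), which misspells href. Tag.get returns None for an attribute that does not exist, and "https://ens.dk/" + None is what raises TypeError: must be str, not NoneType. Below is a minimal sketch of a variant that keeps the regex filter and avoids that concatenation. SAMPLE_HTML and its file paths are hypothetical stand-ins so the example runs offline; the real markup on ens.dk may differ, so the pattern may need adjusting.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://ens.dk/"

# Hypothetical stand-in for the real page; the actual ens.dk markup may differ.
SAMPLE_HTML = """
<table>
  <tr>
    <td><a href="/files/2019_si.htm">2019 SI units</a></td>
    <td><a href="/files/2019_of.htm">2019 Oil Field units</a></td>
  </tr>
  <tr><td><a name="no-href-here">not a link</a></td></tr>
</table>
"""


def get_gas_links(soup, base_url=BASE_URL):
    """Return absolute URLs for anchors whose href ends in 'si.htm'."""
    # Escape the dot so it matches a literal '.', and anchor at the end.
    pattern = re.compile(r"si\.htm$")
    links = []
    # Filtering on href skips <a> tags that have no href attribute at all.
    for a in soup.find_all("a", href=pattern):
        # urljoin handles leading slashes, avoiding "https://ens.dk//files/...".
        links.append(urljoin(base_url, a["href"]))
    return links


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A misspelled attribute name yields None, which cannot be added to a str.
misspelled = soup.find("a").get("herf")

gas_links = get_gas_links(soup)
print(misspelled)
print(gas_links)
```

To run this against the live page, build soup from requests.get(url).content as in the question, and print a few raw href values first to confirm what the pattern should match.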
