Trouble merging scraped table headers and body into a dictionary I can use in an API



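I'm trying to scrape the header and body of a table from a wiki page. I'm building an API, but at the moment the data comes back as a list. I want to turn it into a dictionary. Basically, the API returns something like this: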

[
    [
        "Bards College",
        "Bards College",
        "Viarmo",
        "Tending the Flames"
    ]
]

I want it to look like this:

[
    [
        Faction: "Bards College",
        HeadQuarters: "Bards College",
        Leader: "Viarmo",
        Joining condition: "Tending the Flames",
        inhibition condition: ""
    ]
]

Here is my scraping script, which uses BS4:

scrape.py:

from bs4 import BeautifulSoup
import requests
import json

def getLinkData(link):
    return requests.get(link).content

endpoint = "Factions_(Skyrim)"
#endpoint = "Holds"
content = getLinkData(f"https://elderscrolls.fandom.com/wiki/{endpoint}")
soup = BeautifulSoup(content, 'html.parser')
table = soup.find_all('table', attrs={'class': 'wikitable'})
thead = soup.find_all("th", {"class": "headerSort"})
data = []
headData = []
skyrim_data = []
for wikiTable in table:
    table_body = wikiTable.find('tbody')
    rows = table_body.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        # Get rid of empty values
        data.append([ele for ele in cols if ele])
for tableHead in thead:  # This set of for loops is what doesn't work.
    table_head = tableHead.find('thead')
    head_rows = tableHead.find('tr')
    headings = head_rows.find_all('th')
    for heading in headings:
        #cols = heading.find_all('th')
        #cols = [ele.text.strip() for ele in cols]
        # Get rid of empty values
        headData.append(heading)
more_data = list(filter(lambda x: x != [], headData))
skyrim_data = list(filter(lambda x: x != [], data))

skyrim_data works fine; it stores the scraped data. more_data does not. It comes back empty.

Here is the app.py code:

from fastapi import FastAPI
from fastapi.responses import HTMLResponse
from skyscrape import skyrim_data
from skyscrape import more_data
app = FastAPI()

@app.get("/", response_class=HTMLResponse)
def home():
    return("""
<html>
<head>
<title>SkyPI</title>
<link rel="stylesheet" href="style.css">
</head>
<body>
<h1>A Skyrim API</h1>
<h2>Available Endpoints:</h2>
<ul>
<a href="/factions"><li>/factions</li></a>
<a href="/factions"><li>/holds</li></a>
<a href="/factions"><li>/shouts</li></a>

</ul>
</body>
</html>
""")

@app.get("/factions")
def factions():
    return skyrim_data
    #return more_data

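So the end goal here is to scrape the headers into the more_data array, then somehow combine it with the skyrim_data array, so that the API ends up returning dictionaries like this: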

[
    [
        Faction: "Bards College",
        HeadQuarters: "Bards College",
        Leader: "Viarmo",
        Joining condition: "Tending the Flames",
        inhibition condition: ""
    ]
]

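In the scraping script, I tried to model a new set of for loops for the header data on the first set of for loops, which collects the body data. Then, for each list, I want to combine in the header data. This is where I'm completely lost. Please help!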
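One way to pair headers with cells while staying in BeautifulSoup is to read the header names once and zip them with each row. The following is only a minimal sketch: it assumes the header cells are the `th` elements of the table's first row, it runs on a made-up inline snippet (`SAMPLE_HTML`) rather than the live page, and `table_to_records` is a hypothetical helper name. Empty cells are kept on purpose, because dropping them (as `if ele` does in the script above) would shift later values under the wrong header.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for one "wikitable" on the page.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Faction</th><th>Headquarters</th><th>Leader</th>
      <th>Joining condition</th><th>Inhibition condition</th></tr>
  <tr><td>Bards College</td><td>Bards College</td><td>Viarmo</td>
      <td>Tending the Flames</td><td></td></tr>
  <tr><td>Greybeards</td><td>High Hrothgar</td><td>Paarthurnax</td>
      <td>The Horn of Jurgen Windcaller</td><td>Killing Paarthurnax</td></tr>
</table>
"""


def table_to_records(table):
    """Return one dict per data row, keyed by the table's header cells."""
    rows = table.find_all("tr")
    if not rows:
        return []
    # Assumption: the first row holds the column headers as <th> cells.
    headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = []
    for row in rows[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if not cells:  # skip rows that contain no data cells
            continue
        # Keep empty cells so every value stays aligned with its header.
        records.append(dict(zip(headers, cells)))
    return records


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = table_to_records(soup.find("table", class_="wikitable"))
print(records[0])
```

On the real page, the same helper could be called for each table returned by `soup.find_all('table', attrs={'class': 'wikitable'})`, provided those tables follow the layout assumed here.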

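I recommend using Pandas. It is easier and faster. I'll give an example for the first table; you can do the same for the second table by changing the index.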

import pandas as pd

url = 'https://elderscrolls.fandom.com/wiki/Factions_(Skyrim)'
df = pd.read_html(url)
print(df[0].to_dict(orient='records'))

Output:

[
{
"Faction":"Bards College",
"Headquarters":"Bards College",
"Leader":"Viarmo",
"Joining condition":"Tending the Flames",
"Inhibition condition":"nan"
},
{
"Faction":"Blades",
"Headquarters":"Sky Haven Temple",
"Leader":"Delphine",
"Joining condition":"A Blade in the Dark",
"Inhibition condition":"nan"
},
{
"Faction":"Greybeards",
"Headquarters":"High Hrothgar",
"Leader":"Paarthurnax",
"Joining condition":"The Horn of Jurgen Windcaller",
"Inhibition condition":"Killing Paarthurnax"
},
{
"Faction":"College of Winterhold",
"Headquarters":"The College of Winterhold",
"Leader":"Savos Aren",
"Joining condition":"First Lessons",
"Inhibition condition":"nan"
},
{
"Faction":"The Companions",
"Headquarters":"Jorrvaskr",
"Leader":"Kodlak Whitemane/The Circle",
"Joining condition":"Take Up Arms",
"Inhibition condition":"nan"
},
{
"Faction":"The Coven of Namira",
"Headquarters":"Reachcliff Cave",
"Leader":"Eola",
"Joining condition":"The Taste of Death",
"Inhibition condition":"nan"
},
{
"Faction":"House Telvanni",
"Headquarters":"Tel Mithryn",
"Leader":"Neloth",
"Joining condition":"Old FriendsDR",
"Inhibition condition":"nan"
},
{
"Faction":"Dark Brotherhood",
"Headquarters":"Falkreath Sanctuary, Dawnstar Sanctuary",
"Leader":"Astrid",
"Joining condition":"Innocence Lost",
"Inhibition condition":"Destroy the Dark Brotherhood!"
},
{
"Faction":"Imperial Legion",
"Headquarters":"Castle Dour",
"Leader":"General Tullius",
"Joining condition":"Joining the Legion",
"Inhibition condition":"Joining the Stormcloaks"
},
{
"Faction":"Nightingales",
"Headquarters":"Nightingale Hall",
"Leader":"Nocturnal",
"Joining condition":"Trinity Restored",
"Inhibition condition":"nan"
},
{
"Faction":"Stormcloaks",
"Headquarters":"Palace of the Kings",
"Leader":"Ulfric Stormcloak",
"Joining condition":"Joining the Stormcloaks",
"Inhibition condition":"Joining the Legion"
},
{
"Faction":"Thieves Guild",
"Headquarters":"The Ragged Flagon",
"Leader":"Mercer Frey",
"Joining condition":"A Chance Arrangement",
"Inhibition condition":"nan"
},
{
"Faction":"Tribal Orcs",
"Headquarters":"Dushnikh Yal, Mor Khazgur, Narzulbur, Largashbur.",
"Leader":"Chiefs Burguk, Yamarz, Larak & Mauhulakh.",
"Joining condition":"By doing quests for one of them, thus becoming Blood-Kin.",
"Inhibition condition":"nan"
},
{
"Faction":"Dawnguard",
"Headquarters":"Fort Dawnguard",
"Leader":"Isran",
"Joining condition":"DawnguardDG",
"Inhibition condition":"Becoming a vampire in \"Bloodline\"DG"
},
{
"Faction":"Volkihar Clan",
"Headquarters":"Castle Volkihar",
"Leader":"Harkon",
"Joining condition":"Becoming a vampire in \"Bloodline\"DG",
"Inhibition condition":"Refusing to become a vampire in \"Bloodline\"DG"
}
]
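The `nan` entries in the output are cells that were empty in the table. If empty strings are preferred, as in the question's desired output, the missing values can be replaced before converting to records. This is a small sketch on a hand-built frame (a hypothetical stand-in for `df[0]`, with only three of the columns), not on the live page:

```python
import pandas as pd

# Hypothetical two-row frame standing in for df[0] from pd.read_html(url).
table = pd.DataFrame(
    {
        "Faction": ["Bards College", "Greybeards"],
        "Leader": ["Viarmo", "Paarthurnax"],
        "Inhibition condition": [float("nan"), "Killing Paarthurnax"],
    }
)

# Replace missing cells with empty strings, then build one dict per row.
records = table.fillna("").to_dict(orient="records")
print(records)
```

Replacing NaN is also worth considering because NaN is not valid in strict JSON, so it may cause trouble when the records are returned from an API endpoint.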
