I have this code, but it extracts too much text. I am trying to extract only the title from the top-level content.
from bs4 import BeautifulSoup
import requests
r = requests.get("https://education.maharashtra.gov.in/saral/27230500360")
data = r.text
soup = BeautifulSoup(data)
soup.find("div", {"class": "top-content"})
How can I extract the school name, which is not part of an inner div? Expected output:
BHARATI VIDYAMANDIR HINDI NIGHT HIGH SCHOOL AND JR COLLEGE (27230500360)
Update:
Is it possible to save the text as a dictionary?
{27230500360 : "BHARATI VIDYAMANDIR HINDI NIGHT HIGH SCHOOL AND JR COLLEGE"}
Try this. It should get you there:
from bs4 import BeautifulSoup
import requests
req = requests.get("https://education.maharashtra.gov.in/saral/27230500360")
soup = BeautifulSoup(req.text,"lxml")
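for item in soup.select("#logo"):
    # Collapse runs of whitespace into single spaces
    data = ' '.join(item.text.split())
    # The last token is the ID, everything before it is the name
    item_dict = {data.split(" ")[-1]:' '.join(data.split(" ")[:-1])}
    print(item_dict)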
Output:
{'(27230500360)': 'BHARATI VIDYAMANDIR HINDI NIGHT HIGH SCHOOL AND JR COLLEGE'}
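Note that the key above is the string '(27230500360)', parentheses included, whereas the question asked for the bare number. A minimal sketch of one way to get an integer key, assuming the text always has the shape "NAME (digits)"; the helper name parse_school and the sample string are hypothetical and stand in for the text scraped from the page:

```python
import re

def parse_school(text):
    """Turn 'NAME   (ID)' into {ID: 'NAME'} with an integer key."""
    # Collapse every run of whitespace into a single space first
    cleaned = " ".join(text.split())
    # Name is everything before the final parenthesised run of digits
    match = re.fullmatch(r"(.+?)\s*\((\d+)\)", cleaned)
    if match is None:
        raise ValueError(f"unexpected format: {cleaned!r}")
    name, school_id = match.groups()
    return {int(school_id): name}

# Sample text standing in for soup.select("#logo")[0].text
sample = "BHARATI VIDYAMANDIR HINDI NIGHT HIGH SCHOOL AND JR COLLEGE      (27230500360)"
school = parse_school(sample)
print(school)
```

Raising on a mismatch is a design choice for the sketch; returning an empty dict may suit a scraper that visits many pages better.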
The text you want is in the div with the ID logo:
text = soup.select('#logo')[0].text
print(text.strip())
Output:
BHARATI VIDYAMANDIR HINDI NIGHT HIGH SCHOOL AND JR COLLEGE
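The question also asks how to skip text that belongs to an inner div. If the selected element did contain nested tags, one option is to keep only its direct text nodes. The following is a sketch under that assumption; the HTML snippet, including the inner div and its class name, is made up for illustration and is not the real page markup:

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup: the real page structure may differ
html = """
<div id="logo">
  BHARATI VIDYAMANDIR HINDI NIGHT HIGH SCHOOL AND JR COLLEGE
  <div class="inner">Some nested text</div>
  (27230500360)
</div>
"""

soup = BeautifulSoup(html, "html.parser")
logo = soup.find("div", id="logo")

# Keep only strings that are direct children of #logo, skipping nested tags
direct = [str(node).strip() for node in logo.children
          if isinstance(node, NavigableString)]
own_text = " ".join(part for part in direct if part)
print(own_text)
```

By contrast, logo.text would also include "Some nested text", because it gathers strings from all descendants.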
To get the school name, you can do this:
>>> text = soup.find('div', {'id': 'logo'}).text.strip()
>>> text
'BHARATI VIDYAMANDIR HINDI NIGHT HIGH SCHOOL AND JR COLLEGE (27230500360)'
As you can see, there are many spaces between BHARATI VIDYAMANDIR HINDI NIGHT HIGH SCHOOL AND JR COLLEGE and (27230500360). To remove them, you can use a regular expression.
>>> import re
>>> text = re.sub(' +', ' ', text)
>>> text
'BHARATI VIDYAMANDIR HINDI NIGHT HIGH SCHOOL AND JR COLLEGE (27230500360)'
In short:
>>> re.sub(' +', ' ', soup.find('div', {'id': 'logo'}).text.strip())
'BHARATI VIDYAMANDIR HINDI NIGHT HIGH SCHOOL AND JR COLLEGE (27230500360)'
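One caveat: the pattern ' +' collapses only space characters, so tabs or newlines between the name and the ID would survive. A small sketch comparing two alternatives that handle any whitespace; the raw string below is an invented sample, not the actual page text:

```python
import re

# Invented sample with mixed whitespace between the name and the ID
raw = "BHARATI VIDYAMANDIR HINDI NIGHT HIGH SCHOOL AND JR COLLEGE \n\t   (27230500360)"

# \s+ matches spaces, tabs and newlines alike
with_regex = re.sub(r"\s+", " ", raw).strip()

# str.split() with no arguments splits on any whitespace run
with_split = " ".join(raw.split())

print(with_split)
```

Both give the same result here; the split/join form needs no import and is the one used in the first answer.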