How do I split the data in one cell into separate cells in Python?



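I am trying to parse this web page into a pandas DataFrame for analysis, but the page is laid out so that the table has only two usable columns: one holds the name, and the other holds all the remaining information in a single cell.

For example, with the code below: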

import bs4
from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd
url = "https://education.scripps.edu/alumni/graduate-alumni-list/index.html"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
table = soup.find('tbody')
td = table.find_all('td')
data = []
for element in td:
    sub_data = []
    for sub_element in element:
        try:
            sub_data.append(sub_element.get_text())
        except:
            continue
    data.append(sub_data)
dataFrame = pd.DataFrame(data = data)
df = dataFrame[[1,3]]
df = df.dropna()

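So df.iat[0,1] contains the program, defense year, advisor, dissertation title, and undergraduate institution. The HTML only uses <br> and <strong> tags to separate these values. Is there a way to split this text into separate columns, such as "Name", "Program", "Defense Year", and so on, instead of having one cell hold all the information?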

Thanks a lot!

After try: and before the sub_data.append(...) line in your code, add a step that splits the sub_element text at each <br>. You can try something like this:

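# get_text() drops the tags, so turn each <br> into a newline before splitting
for br in sub_element.find_all("br"):
    br.replace_with("\n")
# Keep only the "Label: value" pieces
sub_data_splitted = [s for s in sub_element.get_text().split("\n") if ":" in s]
# After that you can use each field of the data, e.g.
# (split on the first colon only, in case a value contains one)
program = sub_data_splitted[0].split(":", 1)[1].strip()
defense_year = sub_data_splitted[1].split(":", 1)[1].strip()
advisor = sub_data_splitted[2].split(":", 1)[1].strip()
dissertation_title = sub_data_splitted[3].split(":", 1)[1].strip()
ug_institution = sub_data_splitted[4].split(":", 1)[1].strip()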
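
To see the idea on its own, here is a minimal sketch that splits one cell at its <br> tags and then splits each piece on the first colon. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without any installs. BrSplitter, split_cell and the sample markup are made up for illustration; the real page's markup may differ.

```python
from html.parser import HTMLParser


class BrSplitter(HTMLParser):
    """Collect the text of a fragment, starting a new line at each <br>."""

    def __init__(self):
        super().__init__()
        self.lines = [""]

    def handle_starttag(self, tag, attrs):
        # <br> and <br/> both end up here; begin a new line of text
        if tag == "br":
            self.lines.append("")

    def handle_data(self, data):
        # Text inside and outside <strong> is appended to the current line
        self.lines[-1] += data


def split_cell(cell_html):
    """Return {label: value} for a cell made of 'Label: value<br>' runs."""
    parser = BrSplitter()
    parser.feed(cell_html)
    parser.close()
    fields = {}
    for line in parser.lines:
        if ":" in line:
            label, value = line.split(":", 1)  # split on the first colon only
            fields[label.strip()] = value.strip()
    return fields


# Hypothetical cell markup, modeled on the description in the question
sample_cell = (
    "<strong>Program:</strong> Chemistry<br/>"
    "<strong>Defense Year:</strong> 2010<br/>"
    "<strong>Dissertation Title:</strong> Catalysis: A Case Study<br/>"
)
fields = split_cell(sample_cell)
print(fields)
```

Splitting with a maximum of one split keeps a title such as "Catalysis: A Case Study" in one piece, which a plain split(":")[1] would cut short.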

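You can do it this way.

  • You can use .stripped_strings to get a list of the strings in each <tr> of the table.
  • Since you only need the values and not the labels (such as Name, Defense Year, and so on), use a list comprehension to pick out the values you need.
  • Append each list to the data that goes into the DataFrame.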
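
The selection in the second bullet can be tried on a plain list first. This is only a sketch: the strings below are hypothetical and assume each row yields the name followed by alternating label and value strings.

```python
# Hypothetical strings, in the order .stripped_strings might yield them for one row
row_strings = [
    "Doe, PhD, Jane",
    "Program:", "Chemistry",
    "Defense Year:", "2010",
    "Advisor:", "John Smith",
]

# Keep the even positions: the name (index 0) and each value that follows a label
values = [text for index, text in enumerate(row_strings) if index % 2 == 0]

# The same selection written as a slice
same_values = row_strings[::2]
print(values)
```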

Here is how:

import requests
from bs4 import BeautifulSoup
import pandas as pd
URL = "https://education.scripps.edu/alumni/graduate-alumni-list/index.html"
page = requests.get(URL)
soup = BeautifulSoup(page.text, "lxml")
t = soup.find('table').find('tbody')
trs = t.find_all('tr')
data = []
for tr in trs:
    # Keep the even-indexed strings: the name and each value, skipping the labels
    values = [text for idx, text in enumerate(tr.stripped_strings) if idx % 2 == 0]
    data.append(values)
df = pd.DataFrame(data=data)
print(df)
                                0  ...     6
0              Abbott, PhD, Jason  ...  None
1      Adam, PhD, Gregory Charles  ...  None
2         Adhikari, PhD, Pramisha  ...  None
3    Al-Bassam, PhD, Jawdat M. H.  ...  None
4        Albertshofer, PhD, Klaus  ...  None
..                            ...  ...   ...
682           Zhou, PhD, Jiacheng  ...  None
683    Zhou, PhD, Zhaohui (Sunny)  ...  None
684                Zhu, PhD, Ruyi  ...  None
685                 Zhu, PhD, Yan  ...  None
686          Zuhl, PhD, Andrea M.  ...  None
[687 rows x 7 columns]
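
The columns above are positional, so if a row lacks one of the fields, its later values shift left. One possible way around that, sketched here under the assumption that every label string is immediately followed by its value, is to pair labels with values and build one dict per row. row_to_record and the sample rows are hypothetical.

```python
def row_to_record(strings):
    """Turn [name, label, value, label, value, ...] into a dict keyed by label."""
    name, rest = strings[0], strings[1:]
    record = {"Name": name}
    # Pair each label (even offset in rest) with the value right after it
    for label, value in zip(rest[0::2], rest[1::2]):
        record[label.rstrip(":").strip()] = value
    return record


# Hypothetical rows; the second one has no "Advisor" entry
rows = [
    ["Doe, PhD, Jane", "Program:", "Chemistry", "Defense Year:", "2010",
     "Advisor:", "John Smith"],
    ["Roe, PhD, Richard", "Program:", "Biology", "Defense Year:", "2012"],
]
records = [row_to_record(r) for r in rows]
print(records)
```

Passing a list of dicts like records to pd.DataFrame gives columns named after the keys, with NaN where a row has no value for a key.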

Is this what you are trying to do?

from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd
url = "https://education.scripps.edu/alumni/graduate-alumni-list/index.html"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
table = soup.find('tbody')
td = table.find_all('td')
data = {}
prev_name = None
for element in td:
    for sub_element in element:
        try:
            # An element with an alt attribute supplies the name for the row
            data[sub_element['alt']] = {}
            prev_name = sub_element['alt']
        except Exception:
            # No alt attribute: treat it as the details cell. Drop the closing
            # </strong> tags and let each <br/> close the <strong> instead, so
            # every <strong> wraps one whole "Label: value" pair.
            sub_element = str(sub_element).replace('</strong>', '').replace('<br/>', '</strong>')
            temp = BeautifulSoup(sub_element, "html.parser")
            if len(temp.find_all('strong')) > 0:
                temp = [str(i.string) for i in temp.find_all('strong') if i.string is not None]
                temp = {i.split(':', 1)[0] : i.split(':', 1)[1] for i in temp if ':' in i}
                data[prev_name] = temp

# One column per name; transpose so that each person becomes a row
df = pd.DataFrame(data = data)
df = df.T.reset_index()
df.rename(columns={'index' : 'Name'}, inplace=True)
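
The two replace() calls are the core of this approach, so here is a small standalone sketch of what they do to a fragment. The fragment is hypothetical, and a regular expression stands in for the BeautifulSoup lookup so the sketch needs only the standard library. Unlike the code above, it also strips the whitespace around labels and values.

```python
import re

# Hypothetical fragment in the "<strong>Label:</strong> value<br/>" shape
fragment = (
    "<strong>Program:</strong> Chemistry<br/>"
    "<strong>Defense Year:</strong> 2010<br/>"
)

# Drop the original closing tags, then let each <br/> close the <strong> instead
rewrapped = fragment.replace("</strong>", "").replace("<br/>", "</strong>")
print(rewrapped)

# Each <strong>...</strong> now spans one whole "Label: value" pair
pairs = re.findall(r"<strong>(.*?)</strong>", rewrapped)

# Split each pair on the first colon only
fields = {
    pair.split(":", 1)[0].strip(): pair.split(":", 1)[1].strip()
    for pair in pairs
    if ":" in pair
}
print(fields)
```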
