Fixing a broken HTML table extracted with BS4 in Python



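I am parsing HTML tables from administrative filings. This is tricky because the HTML is often broken, which leaves the tables badly structured. Here is an example of a table as I loaded it into a pandas DataFrame: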

                0   1    2     3   4         5
0             NaN NaN  NaN   NaN NaN       NaN   
1            Name NaN  Age   NaN NaN  Position   
2    Aylwin Lewis NaN  NaN  59.0 NaN       NaN   
3    John Morlock NaN  NaN  58.0 NaN       NaN   
4  Matthew Revord NaN  NaN  50.0 NaN       NaN   
5  Charles Talbot NaN  NaN  48.0 NaN       NaN   
6      Nancy Turk NaN  NaN  49.0 NaN       NaN   
7      Anne Ewing NaN  NaN  49.0 NaN       NaN   
                                                   6
0                                                NaN  
1                                                NaN  
2    Chairman, Chief Executive Officer and President  
3    Senior Vice President, Chief Operations Officer  
4  Senior Vice President, Chief Legal Officer, Ge...  
5  Senior Vice President and Chief Financial Officer  
6  Senior Vice President, Chief People Officer an...  
7        Senior Vice President, New Shop Development 

I wrote the following Python code to try to fix the table:

# drop rows that are entirely empty
df = df.dropna(how='all', axis=0)
# drop columns that have fewer than 2 non-empty values
df = df.dropna(thresh=2, axis=1)
# reset the dataframe index
df = df.reset_index(drop=True)
# flag used to stop searching once the header row has been found
found_name = 0
# loop through the rows to find the first one that contains the word "Name"
for row in df.itertuples():
    # only keep searching if we have not found a header row yet
    if found_name == 0:
        # convert the row to a string
        text_row = str(row)
        # check whether the word "Name" appears in that row
        if "Name" in text_row:
            print("Name found in text of rows. Investigating row", row.Index, " as header.")
            # use this row as the column names
            df.columns = df.iloc[row.Index]
            # drop the header row and everything above it
            df = df.iloc[row.Index + 1:]
            # mark the header as found
            found_name = 1
            # reindex
            df = df.reset_index(drop=True)
print("Attempted to clean dataframe:")
print(df)

This is the table I get:

0            Name   NaN                                                NaN
0    Aylwin Lewis  59.0    Chairman, Chief Executive Officer and President
1    John Morlock  58.0    Senior Vice President, Chief Operations Officer
2  Matthew Revord  50.0  Senior Vice President, Chief Legal Officer, Ge...
3  Charles Talbot  48.0  Senior Vice President and Chief Financial Officer
4      Nancy Turk  49.0  Senior Vice President, Chief People Officer an...
5      Anne Ewing  49.0        Senior Vice President, New Shop Development

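My main problem is that the headers "Age" and "Position" have disappeared because they are not aligned with their columns. I use this script to parse many tables, so I cannot fix them by hand. What can I do at this point to repair the data?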

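Don't drop the nearly empty columns at the start; we need them later. Once the header row containing "Name" is found, collect all of its non-empty elements. After the empty columns have been dropped from the remaining data, set those elements as the column headers.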

import pandas as pd

# drop rows that are entirely empty
df = df.dropna(how='all', axis=0)
# reset the dataframe index
df = df.reset_index(drop=True)
# flag used to stop searching once the header row has been found
found_name = 0
# loop through the rows to find the first one that contains the word "Name"
for row in df.itertuples():
    # only keep searching if we have not found a header row yet
    if found_name == 0:
        # convert the row to a string
        text_row = str(row)
        # check whether the word "Name" appears in that row
        if "Name" in text_row:
            print("Name found in text of rows. Investigating row", row.Index, " as header.")
            # collect the non-empty cells as column names ([1:] skips the Index field)
            headers = [c for c in row if not pd.isnull(c)][1:]
            # drop the header row and everything above it
            df = df.iloc[row.Index + 1:]
            # drop columns that are entirely empty in the remaining data
            df = df.dropna(how='all', axis=1)
            # set the column names, padding with 'col' or truncating so the lengths match
            df.columns = (headers + ['col'] * (len(df.columns) - len(headers)))[:len(df.columns)]
            # mark the header as found
            found_name = 1
            # reindex
            df = df.reset_index(drop=True)
print("Attempted to clean dataframe:")
print(df)

Result:

             Name   Age                                           Position
0    Aylwin Lewis  59.0    Chairman, Chief Executive Officer and President
1    John Morlock  58.0    Senior Vice President, Chief Operations Officer
2  Matthew Revord  50.0  Senior Vice President, Chief Legal Officer, Ge...
3  Charles Talbot  48.0  Senior Vice President and Chief Financial Officer
4      Nancy Turk  49.0  Senior Vice President, Chief People Officer an...
5      Anne Ewing  49.0        Senior Vice President, New Shop Development
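
Since the script has to run over many tables, the same idea could be wrapped in a function. The following is only a sketch under some assumptions: the name `promote_header_row` and its `keyword` and `filler` parameters are made up for illustration, and the sample frame is a shortened copy of the table from the question. It also assumes the header cells appear in the same left-to-right order as the data columns they belong to, because headers are matched to columns by position.

```python
import numpy as np
import pandas as pd


def promote_header_row(df, keyword="Name", filler="col"):
    """Hypothetical helper: use the first row containing `keyword` as the header."""
    # Drop rows that are entirely empty and renumber the rest.
    df = df.dropna(how="all", axis=0).reset_index(drop=True)
    for pos in range(len(df)):
        cells = df.iloc[pos].dropna()
        # Check whether any non-empty cell in this row mentions the keyword.
        if any(keyword in str(c) for c in cells):
            headers = [str(c) for c in cells]
            # Keep only the rows below the header, then drop all-empty columns.
            body = df.iloc[pos + 1:].dropna(how="all", axis=1)
            # Pad with the filler name or truncate so the lengths always match.
            n = len(body.columns)
            body.columns = (headers + [filler] * n)[:n]
            return body.reset_index(drop=True)
    # No header row found: return the frame with only the empty rows removed.
    return df


# Shortened copy of the misaligned table from the question.
nan = np.nan
raw = pd.DataFrame([
    [nan, nan, nan, nan, nan, nan, nan],
    ["Name", nan, "Age", nan, nan, "Position", nan],
    ["Aylwin Lewis", nan, nan, 59.0, nan, nan,
     "Chairman, Chief Executive Officer and President"],
    ["John Morlock", nan, nan, 58.0, nan, nan,
     "Senior Vice President, Chief Operations Officer"],
])

clean = promote_header_row(raw)
print(clean)
```

If a table has more data columns than header cells, the extra columns are named with the filler; if it has fewer, the surplus headers are silently discarded, so it may be worth logging those cases when processing many files.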
