How do I remove the '\r\n\r\n' characters from a list of strings while web scraping with BeautifulSoup in Python?



I am trying to scrape data from the web, and unusual characters (namely '\r\n\r\n') are showing up in my data. The goal is to get a DataFrame containing the site's data.

Here is my code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url = "https://www.hubertiming.com/results/2018MLK"
html = urlopen(url)
soup = BeautifulSoup(html, "lxml")

title = soup.title
print(title)
print(title.text)

links = soup.find_all('a', href=True)
for link in links:
    print(link['href'])

data = []
allrows = soup.find_all("tr")
for row in allrows:
    row_list = row.find_all("td")
    dataRow = []
    for cell in row_list:
        dataRow.append(cell.text)
    data.append(dataRow)

print(data)

The output I get looks like this:

[[], ['Finishers:', '191'], ['Male:', '78'], ['Female:', '113'], [], ['1', '1191', '\r\n\r\n                    MAX RANDOLPH\r\n\r\n                ', 'M', '29', 'WASHINGTON', 'DC', '5:25', '16:48', '\r\n\r\n                    1 of 78\r\n\r\n                ', 'M 21-39', '\r\n\r\n                    1 of 33\r\n\r\n                ', '0:08', '16:56'], ['2', '1080', '\r\n\r\n                    NEED NAME KAISER RUNNER\r\n\r\n                ', 'M', '25', 'PORTLAND', 'OR', '5:39', '17:31', '\r\n\r\n                    2 of 78\r\n\r\n                ', 'M 21-39', '\r\n\r\n                    2 of 33\r\n\r\n                ', '0:09', '17:40'], ['3', '1275', '\r\n\r\n                    DAN FRANEK\r\n\r\n                ', 'M', '52', 'PORTLAND', 'OR', '5:53', '18:15', '\r\n\r\n                    3 of 78\r\n\r\n                ', 'M 40-54', '\r\n\r\n                    1 of 27\r\n\r\n                ', '0:07', '18:22'], ['4', '1223', '\r\n\r\n                    PAUL TAYLOR\r\n\r\n                ', 'M', '54', 'PORTLAND', 'OR', '5:58', '18:31', '\r\n\r\n                    4 of 78\r\n\r\n                ', 'M 40-54', '\r\n\r\n                    2 of 27\r\n\r\n                ', '0:07', '18:38'], ['5', '1245', '\r\n\r\n                    THEO KINMAN\r\n\r\n                ', 'M', '22', '', '', '6:17', '19:31', '\r\n\r\n                    5 of 78\r\n\r\n                ', 'M 21-39', '\r\n\r\n                    3 of 33\r\n\r\n                ', '0:09', '19:40'], ['6', '1185', '\r\n\r\n                    MELISSA GIRGIS\r\n\r\n                ', 'F', '27', 'PORTLAND', 'OR', '6:20', '19:39', '\r\n\r\n                    1 of 113\r\n\r\n                ', 'F 21-39', '\r\n\r\n                    1 of 53\r\n\r\n                ', '0:07', '19:46'],...
df = pd.DataFrame(data)
print(df)

The resulting DataFrame looks like this:

0     1                                                  2  
0          None  None                                               None   
1    Finishers:   191                                               None   
2         Male:    78                                               None   
3       Female:   113                                               None   
4          None  None                                               None   
..          ...   ...                                                ...   
191         187  1254  \r\n\r\n                    CYNTHIA HARRIS\r\n...
192         188  1085  \r\n\r\n                    EBONY LAWRENCE\r\n...
193         189  1170  \r\n\r\n                    ANTHONY WILLIAMS\r...
194         190  2087  \r\n\r\n                    LEESHA POSEY\r\n\r...
195         191  1216  \r\n\r\n                    ZULMA OCHOA\r\n\r...
3     4         5     6      7        8  
0    None  None      None  None   None     None   
1    None  None      None  None   None     None   
2    None  None      None  None   None     None   
3    None  None      None  None   None     None   
4    None  None      None  None   None     None   
..    ...   ...       ...   ...    ...      ...   
191     F    64  PORTLAND    OR  21:53  1:07:51   
192     F    30  PORTLAND    OR  22:00  1:08:12   
193     M    39  PORTLAND    OR  22:19  1:09:11   
194     F    43  PORTLAND    OR  30:17  1:33:53   
195     F    40   GRESHAM    OR  33:22  1:43:27   
9       10  
0                                                 None     None   
1                                                 None     None   
2                                                 None     None   
3                                                 None     None   
4                                                 None     None   
..                                                 ...      ...   
191  \r\n\r\n                    110 of 113\r\n\r\n...    F 55+
192  \r\n\r\n                    111 of 113\r\n\r\n...  F 21-39
193  \r\n\r\n                    78 of 78\r\n\r\n  ...  M 21-39
194  \r\n\r\n                    112 of 113\r\n\r\n...  F 40-54
195  \r\n\r\n                    113 of 113\r\n\r\n...  F 40-54
11    12       13  
0                                                 None  None     None  
1                                                 None  None     None  
2                                                 None  None     None  
3                                                 None  None     None  
4                                                 None  None     None  
..                                                 ...   ...      ...  
191  \r\n\r\n                    14 of 14\r\n\r\n  ...  1:19  1:09:10
192  \r\n\r\n                    53 of 53\r\n\r\n  ...  0:58  1:09:10
193  \r\n\r\n                    33 of 33\r\n\r\n  ...  0:08  1:09:19
194  \r\n\r\n                    36 of 37\r\n\r\n  ...  0:00  1:33:53
195  \r\n\r\n                    37 of 37\r\n\r\n  ...  0:00  1:43:27
[196 rows x 14 columns]

I can't figure out how to remove these extra characters from my data. Please suggest a way to do this.

As @SergeyK also mentioned, I would suggest using pandas; that is the common approach, it works in most cases (bs4 under the hood), and you get your result in one line:

df = pd.read_html(url)[1]
print(df)
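For reference, here is a minimal, self-contained sketch of that pandas-only approach (assuming the individual-results table is the second <table> on the page, i.e. index 1, which it is at the time of writing):

import pandas as pd

url = "https://www.hubertiming.com/results/2018MLK"

# read_html parses every <table> on the page into a list of DataFrames
# and normalizes whitespace in the cell text, so the '\r\n\r\n' padding
# does not survive.
tables = pd.read_html(url)
df = tables[1]   # index 0 is the Finishers/Male/Female summary, index 1 the individual results
print(df.head())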

If you prefer to do it your own way, select more specifically and strip() the text, as mentioned:

for row in soup.select('#individualResults tr:has(td)'):
    row_list = row.find_all("td")
    dataRow = []
    for cell in row_list:
        dataRow.append(cell.text.strip())
    data.append(dataRow)
Full example:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup(urlopen('https://www.hubertiming.com/results/2018MLK'), 'lxml')

data = []
for row in soup.select('#individualResults tr:has(td)'):
    row_list = row.find_all("td")
    dataRow = []
    for cell in row_list:
        dataRow.append(cell.text.strip())
    data.append(dataRow)

df = pd.DataFrame(data, columns=[h.text for h in soup.select('#individualResults th')])
print(df)
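If you would rather keep your row-collecting loop exactly as you wrote it, you can also clean the whitespace after the fact. A small sketch, assuming df is the DataFrame you already built with pd.DataFrame(data):

# Strip leading/trailing whitespace (the '\r\n\r\n    ...' padding) from every
# string cell; non-string values such as None are left untouched.
df = df.apply(lambda col: col.map(lambda v: v.strip() if isinstance(v, str) else v))
print(df)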