I'm trying to scrape data from the web, and unusual characters (i.e. '\r\n\r\n') keep appearing in my data. The goal is to get a DataFrame containing the site's data.
Here is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
url = "https://www.hubertiming.com/results/2018MLK"
html = urlopen(url)
soup = BeautifulSoup(html, "lxml")
title = soup.title
print(title)
print(title.text)
links = soup.find_all('a', href=True)
for link in links:
    print(link['href'])
data = []
allrows = soup.find_all("tr")
for row in allrows:
    row_list = row.find_all("td")
    dataRow = []
    for cell in row_list:
        dataRow.append(cell.text)
    data.append(dataRow)
print(data)
The output I get looks like this:
[[], ['Finishers:', '191'], ['Male:', '78'], ['Female:', '113'], [], ['1', '1191', '\r\n\r\n MAX RANDOLPH\r\n\r\n ', 'M', '29', 'WASHINGTON', 'DC', '5:25', '16:48', '\r\n\r\n 1 of 78\r\n\r\n ', 'M 21-39', '\r\n\r\n 1 of 33\r\n\r\n ', '0:08', '16:56'], ['2', '1080', '\r\n\r\n NEED NAME KAISER RUNNER\r\n\r\n ', 'M', '25', 'PORTLAND', 'OR', '5:39', '17:31', '\r\n\r\n 2 of 78\r\n\r\n ', 'M 21-39', '\r\n\r\n 2 of 33\r\n\r\n ', '0:09', '17:40'], ['3', '1275', '\r\n\r\n DAN FRANEK\r\n\r\n ', 'M', '52', 'PORTLAND', 'OR', '5:53', '18:15', '\r\n\r\n 3 of 78\r\n\r\n ', 'M 40-54', '\r\n\r\n 1 of 27\r\n\r\n ', '0:07', '18:22'], ['4', '1223', '\r\n\r\n PAUL TAYLOR\r\n\r\n ', 'M', '54', 'PORTLAND', 'OR', '5:58', '18:31', '\r\n\r\n 4 of 78\r\n\r\n ', 'M 40-54', '\r\n\r\n 2 of 27\r\n\r\n ', '0:07', '18:38'], ['5', '1245', '\r\n\r\n THEO KINMAN\r\n\r\n ', 'M', '22', '', '', '6:17', '19:31', '\r\n\r\n 5 of 78\r\n\r\n ', 'M 21-39', '\r\n\r\n 3 of 33\r\n\r\n ', '0:09', '19:40'], ['6', '1185', '\r\n\r\n MELISSA GIRGIS\r\n\r\n ', 'F', '27', 'PORTLAND', 'OR', '6:20', '19:39', '\r\n\r\n 1 of 113\r\n\r\n ', 'F 21-39', '\r\n\r\n 1 of 53\r\n\r\n ', '0:07', '19:46'],...
df = pd.DataFrame(data)
print(df)
The DataFrame looks like this:
0 1 2
0 None None None
1 Finishers: 191 None
2 Male: 78 None
3 Female: 113 None
4 None None None
.. ... ... ...
191 187 1254 \r\n\r\n CYNTHIA HARRIS\r\n...
192 188 1085 \r\n\r\n EBONY LAWRENCE\r\n...
193 189 1170 \r\n\r\n ANTHONY WILLIAMS\r...
194 190 2087 \r\n\r\n LEESHA POSEY\r\n\r...
195 191 1216 \r\n\r\n ZULMA OCHOA\r\n\r...
3 4 5 6 7 8
0 None None None None None None
1 None None None None None None
2 None None None None None None
3 None None None None None None
4 None None None None None None
.. ... ... ... ... ... ...
191 F 64 PORTLAND OR 21:53 1:07:51
192 F 30 PORTLAND OR 22:00 1:08:12
193 M 39 PORTLAND OR 22:19 1:09:11
194 F 43 PORTLAND OR 30:17 1:33:53
195 F 40 GRESHAM OR 33:22 1:43:27
9 10
0 None None
1 None None
2 None None
3 None None
4 None None
.. ... ...
191 \r\n\r\n 110 of 113\r\n\r\n... F 55+
192 \r\n\r\n 111 of 113\r\n\r\n... F 21-39
193 \r\n\r\n 78 of 78\r\n\r\n ... M 21-39
194 \r\n\r\n 112 of 113\r\n\r\n... F 40-54
195 \r\n\r\n 113 of 113\r\n\r\n... F 40-54
11 12 13
0 None None None
1 None None None
2 None None None
3 None None None
4 None None None
.. ... ... ...
191 \r\n\r\n 14 of 14\r\n\r\n ... 1:19 1:09:10
192 \r\n\r\n 53 of 53\r\n\r\n ... 0:58 1:09:10
193 \r\n\r\n 33 of 33\r\n\r\n ... 0:08 1:09:19
194 \r\n\r\n 36 of 37\r\n\r\n ... 0:00 1:33:53
195 \r\n\r\n 37 of 37\r\n\r\n ... 0:00 1:43:27
[196 rows x 14 columns]
I can't figure out how to remove the extra characters from my data. Please suggest a way to do this.
As @SergeyK also mentioned, I would suggest using pandas here; that is common practice, works in most cases (it uses bs4 under the hood), and you get your result in one line:
df = pd.read_html(url)[1]
print(df)
If you prefer to do it your own way, select more specifically and strip() the text as mentioned:
for row in soup.select('#individualResults tr:has(td)'):
    row_list = row.find_all("td")
    dataRow = []
    for cell in row_list:
        dataRow.append(cell.text.strip())
    data.append(dataRow)
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup(urlopen('https://www.hubertiming.com/results/2018MLK'), 'lxml')
data = []
for row in soup.select('#individualResults tr:has(td)'):
    row_list = row.find_all("td")
    dataRow = []
    for cell in row_list:
        dataRow.append(cell.text.strip())
    data.append(dataRow)
pd.DataFrame(data, columns=[h.text for h in soup.select('#individualResults th')])
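If the '\r\n' noise has already landed in a DataFrame, it can also be cleaned up after the fact with pandas string methods. A minimal sketch, using a small hypothetical sample frame in place of the scraped data:

```python
import pandas as pd

# Hypothetical sample mimicking the scraped cells with embedded \r\n noise
df = pd.DataFrame({
    "Name": ["\r\n\r\n MAX RANDOLPH\r\n\r\n ", "\r\n\r\n DAN FRANEK\r\n\r\n "],
    "Place": ["1", "3"],
})

# Strip leading/trailing whitespace (\r, \n, spaces) from every string column
cleaned = df.apply(lambda col: col.str.strip() if col.dtype == "object" else col)
print(cleaned["Name"].tolist())  # ['MAX RANDOLPH', 'DAN FRANEK']
```

str.strip() with no argument removes all leading and trailing whitespace, including carriage returns and newlines, so it covers this case without a regex.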