Body data shows "None" when scraping?



The last time I ran into this, adding the right headers fixed it; that doesn't seem to be the case here. I've tried a few different approaches, but ultimately my goal is to scrape the information from all of the tables on each of the links listed below.

The data shows up as tbody content, specifically under the class table-responsive.xs (I think).

I tried pulling all of the tbody data, and also just that class, but all I get back is a list of "None" values.

Is there another way to do this? I was hoping that adding headers would be the fix, but it doesn't appear to be.

from requests_html import HTMLSession
from bs4 import BeautifulSoup

profiles = []
session = HTMLSession()
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
urls = [
    'https://magicseaweed.com/New-Jersey-Monmouth-County-Surfing/277/',
    'https://magicseaweed.com/New-Jersey-Ocean-City-Surfing/279/'
]

for url in urls:
    r = session.get(url, headers=headers)
    # wait 3s so the page can finish rendering
    r.html.render(sleep=3, timeout=20)
    soup = BeautifulSoup(r.html.raw_html, "html.parser")
    for profile in soup.find_all('div', attrs={"class": "table-responsive.xs"}):
        profiles.append(profile)

for p in profiles:
    print(p)
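
One thing worth noting about that selector: attrs={"class": "table-responsive.xs"} asks BeautifulSoup for a literal class token containing a dot, which will usually match nothing. If the element actually carries two separate classes (table-responsive and xs, which I haven't confirmed against the rendered HTML), a CSS selector or a single class token would be the way to target it; a rough sketch:

tables = soup.select("div.table-responsive.xs")        # div carrying both classes
# or match on a single class token:
tables = soup.find_all("div", class_="table-responsive")
for t in tables:
    profiles.append(t)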

I also tried:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

profiles = []
session = HTMLSession()
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
urls = [
    'https://magicseaweed.com/New-Jersey-Monmouth-County-Surfing/277/',
    'https://magicseaweed.com/New-Jersey-Ocean-City-Surfing/279/'
]

for url in urls:
    r = session.get(url, headers=headers)
    # wait 3s so the page can finish rendering
    r.html.render(sleep=3, timeout=20)
    soup = BeautifulSoup(r.html.raw_html, "html.parser")
    for profile in soup.find_all('a'):
        profile = profile.get('tbody')
        profiles.append(profile)

for p in profiles:
    print(p)
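
In that second attempt, profile.get('tbody') reads an HTML attribute named "tbody" off each anchor tag; Tag.get() looks up attributes, not child elements, so it returns None for every link, which is exactly the list of "None" values I'm seeing. Grabbing the tbody elements themselves after rendering would look roughly like this (untested sketch):

for tbody in soup.find_all('tbody'):
    profiles.append(tbody.get_text(' ', strip=True))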

Finally:

With some guidance from folks here, I can pull the full JSON data on its own with the following script:

import requests
import pandas as pd

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

r = requests.get('https://magicseaweed.com/api/mdkey/spot?&limit=-1')
df = pd.DataFrame(r.json())
df.to_csv('out.csv', index=False)

print(df)
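
Before trying to filter that, it helps to check which columns the spot endpoint actually returns, since any filter has to key off one of them; a quick look:

import requests
import pandas as pd

r = requests.get('https://magicseaweed.com/api/mdkey/spot?&limit=-1')
df = pd.DataFrame(r.json())
print(df.columns.tolist())   # see which fields are available to filter on
print(df.head())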

However, since I live in New Jersey, I really only care about New Jersey surf. I used href scraping to grab the URLs whose data I want to look at. Ideally I'd collect a week's worth of information, but if a single day is the only option I can live with that.

I tried adding an if statement to focus only on a specific URL (it's in the JSON data), but had no luck. Ultimately I'd like to add an OR to include all of the links listed, unless someone has a better idea?

I know I could match them easily enough once everything is pulled, but I don't want to pull 9,000 rows every time when I only need a select few.

import requests
import pandas as pd

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

r = requests.get('https://magicseaweed.com/api/mdkey/spot?&limit=-1')
df = pd.DataFrame(r.json())
df.to_csv('out.csv', index=False)

for d in df:
    # iterating a DataFrame yields its column names, and 'in df' also checks
    # column names, so this condition never matches a URL path
    if d and '/Belmar-Surf-Report/3683' in df:
        print(d)

# '/Belmar-Surf-Report/3683'
# '/Manasquan-Surf-Report/386/'
# '/Ocean-Grove-Surf-Report/7945/'
# '/Asbury-Park-Surf-Report/857/'
# '/Avon-Surf-Report/4050/'
# '/Bay-Head-Surf-Report/4951/'
# '/Belmar-Surf-Report/3683/'
# '/Boardwalk-Surf-Report/9183/'
# '/Bradley-Beach-Surf-Report/7944/'
# '/Casino-Surf-Report/9175/'
# '/Deal-Surf-Report/822/'
# '/Dog-Park-Surf-Report/9174/'
# '/Jenkinsons-Surf-Report/4053/'
# '/Long-Branch-Surf-Report/7946/'
# '/Long-Branch-Surf-Report/7947/'
# '/Manasquan-Surf-Report/386/'
# '/Monmouth-Beach-Surf-Report/4055/'
# '/Ocean-Grove-Surf-Report/7945/'
# '/Point-Pleasant-Surf-Report/7942/'
# '/Sea-Girt-Surf-Report/7943/'
# '/Spring-Lake-Surf-Report/7941/'
# '/The-Cove-Surf-Report/385/'
# '/Belmar-Surf-Report/3683/'
# '/Avon-Surf-Report/4050/'
# '/Deal-Surf-Report/822/'
# '/North-Street-Surf-Report/4946/'
# '/Margate-Pier-Surf-Report/4054/'
# '/Ocean-City-NJ-Surf-Report/391/'
# '/7th-St-Surf-Report/7918/'
# '/Brigantine-Surf-Report/4747/'
# '/Brigantine-Seawall-Surf-Report/4942/'
# '/Crystals-Surf-Report/4943/'
# '/Longport-32nd-St-Surf-Report/1158/'
# '/Margate-Pier-Surf-Report/4054/'
# '/North-Street-Surf-Report/4946/'
# '/Ocean-City-NJ-Surf-Report/391/'
# '/South-Carolina-Ave-Surf-Report/4944/'
# '/St-James-Surf-Report/7917/'
# '/States-Avenue-Surf-Report/390/'
# '/Ventnor-Pier-Surf-Report/4945/'
# '/14th-Street-Surf-Report/9055/'
# '/18th-St-Surf-Report/9056/'
# '/30th-St-Surf-Report/9057/'
# '/56th-St-Surf-Report/9059/'
# '/Diamond-Beach-Surf-Report/9061/'
# '/Strathmere-Surf-Report/7919/'
# '/The-Cove-Surf-Report/7921/'
# '/14th-Street-Surf-Report/9055/'
# '/18th-St-Surf-Report/9056/'
# '/30th-St-Surf-Report/9057/'
# '/56th-St-Surf-Report/9059/'
# '/Avalon-Surf-Report/821/'
# '/Diamond-Beach-Surf-Report/9061/'
# '/Nuns-Beach-Surf-Report/7948/'
# '/Poverty-Beach-Surf-Report/4056/'
# '/Sea-Isle-City-Surf-Report/1281/'
# '/Stockton-Surf-Report/393/'
# '/Stone-Harbor-Surf-Report/7920/'
# '/Strathmere-Surf-Report/7919/'
# '/The-Cove-Surf-Report/7921/'
# '/Wildwood-Surf-Report/392/'
# or can use the SurfIDs:
3683
386
7945
857
4050
4951
3683
9183
7944
9175
822
9174
4053
7946
7947
386
4055
7945
7942
7943
7941
385
3683
4050
822
4946
4054
391
7918
4747
4942
4943
1158
4054
4946
391
4944
7917
390
4945
9055
9056
9057
9059
9061
7919
7921
9055
9056
9057
9059
821
9061
7948
4056
1281
393
7920
7919
7921
392
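
Instead of looping over the written-out CSV, the DataFrame itself can be filtered down to just these spots before saving. A rough sketch, assuming the spot records include a numeric id column (the exact column name, e.g. '_id', is a guess and needs to be confirmed against df.columns from the call above):

import requests
import pandas as pd

nj_ids = [3683, 386, 7945, 857, 4050, 4951, 9183, 7944, 9175, 822,
          9174, 4053, 7946, 7947, 4055, 7942, 7943, 7941, 385, 4946,
          4054, 391, 7918, 4747, 4942, 4943, 1158, 4944, 7917, 390,
          4945, 9055, 9056, 9057, 9059, 9061, 7919, 7921, 821, 7948,
          4056, 1281, 393, 7920, 392]

r = requests.get('https://magicseaweed.com/api/mdkey/spot?&limit=-1')
df = pd.DataFrame(r.json())

# '_id' is an assumption -- swap in whichever id column the API actually returns
nj = df[df['_id'].isin(nj_ids)]
nj.to_csv('nj_spots.csv', index=False)
print(nj)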

EDIT: If you've confirmed your list of links (and they stay the same), you can check them all each day like this:

import requests
import pandas as pd
from bs4 import BeautifulSoup

id_list = [
'/Belmar-Surf-Report/3683',
'/Manasquan-Surf-Report/386/',
'/Ocean-Grove-Surf-Report/7945/',
'/Asbury-Park-Surf-Report/857/',
'/Avon-Surf-Report/4050/',
'/Bay-Head-Surf-Report/4951/',
'/Belmar-Surf-Report/3683/',
'/Boardwalk-Surf-Report/9183/',
'/Bradley-Beach-Surf-Report/7944/',
'/Casino-Surf-Report/9175/',
'/Deal-Surf-Report/822/',
'/Dog-Park-Surf-Report/9174/',
'/Jenkinsons-Surf-Report/4053/',
'/Long-Branch-Surf-Report/7946/',
'/Long-Branch-Surf-Report/7947/',
'/Manasquan-Surf-Report/386/',
'/Monmouth-Beach-Surf-Report/4055/',
'/Ocean-Grove-Surf-Report/7945/',
'/Point-Pleasant-Surf-Report/7942/',
'/Sea-Girt-Surf-Report/7943/',
'/Spring-Lake-Surf-Report/7941/',
'/The-Cove-Surf-Report/385/',
'/Belmar-Surf-Report/3683/',
'/Avon-Surf-Report/4050/',
'/Deal-Surf-Report/822/',
'/North-Street-Surf-Report/4946/',
'/Margate-Pier-Surf-Report/4054/',
'/Ocean-City-NJ-Surf-Report/391/',
'/7th-St-Surf-Report/7918/',
'/Brigantine-Surf-Report/4747/',
'/Brigantine-Seawall-Surf-Report/4942/',
'/Crystals-Surf-Report/4943/',
'/Longport-32nd-St-Surf-Report/1158/',
'/Margate-Pier-Surf-Report/4054/',
'/North-Street-Surf-Report/4946/',
'/Ocean-City-NJ-Surf-Report/391/',
'/South-Carolina-Ave-Surf-Report/4944/',
'/St-James-Surf-Report/7917/',
'/States-Avenue-Surf-Report/390/',
'/Ventnor-Pier-Surf-Report/4945/',
'/14th-Street-Surf-Report/9055/',
'/18th-St-Surf-Report/9056/',
'/30th-St-Surf-Report/9057/',
'/56th-St-Surf-Report/9059/',
'/Diamond-Beach-Surf-Report/9061/',
'/Strathmere-Surf-Report/7919/',
'/The-Cove-Surf-Report/7921/',
'/14th-Street-Surf-Report/9055/',
'/18th-St-Surf-Report/9056/',
'/30th-St-Surf-Report/9057/',
'/56th-St-Surf-Report/9059/',
'/Avalon-Surf-Report/821/',
'/Diamond-Beach-Surf-Report/9061/',
'/Nuns-Beach-Surf-Report/7948/',
'/Poverty-Beach-Surf-Report/4056/',
'/Sea-Isle-City-Surf-Report/1281/',
'/Stockton-Surf-Report/393/',
'/Stone-Harbor-Surf-Report/7920/',
'/Strathmere-Surf-Report/7919/',
'/The-Cove-Surf-Report/7921/',
'/Wildwood-Surf-Report/392/'
]
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

for x in id_list:
    url = 'https://magicseaweed.com' + x
    r = requests.get(url, headers=headers)
    try:
        soup = BeautifulSoup(r.text, 'html.parser')
        dfs = pd.read_html(str(soup))
        for df in dfs:
            print(df)
            if df.shape[0] > 50:
                df.to_csv(f"{x.replace('/', '_').replace('-', '_')}.csv")
            print('____________')
    except Exception as e:
        print(x, e)

This returns several dataframes per page, some with more data and some with less, and saves any dataframe with more than 50 rows:

0   1   2
0   Low     12:24AM     -0.05m
1   High    6:25AM  1.28m
2   Low     12:28PM     -0.01m
3   High    6:49PM  1.66m
____________
0   1
0   First Light     5:36AM
1   Sunrise     6:05AM
2   Sunset  8:00PM
3   Last Light  8:30PM
____________
Unnamed: 0  Surf    Swell Rating    Primary Swell   Primary Swell.1     Primary Swell.2     Secondary Swell     Secondary Swell.1   Secondary Swell.2   Secondary Swell.3   ...     Wind    Wind.1  Weather     Weather.1   Prob.   Unnamed: 17     Unnamed: 18     Unnamed: 19     Unnamed: 20     Unnamed: 21
0   Wednesday 10/08     Wednesday 10/08     Wednesday 10/08     Wednesday 10/08     Wednesday 10/08     Wednesday 10/08     Wednesday 10/08     Wednesday 10/08     Wednesday 10/08     Wednesday 10/08     ...     Wednesday 10/08     Wednesday 10/08     Wednesday 10/08     Wednesday 10/08     Wednesday 10/08     Wednesday 10/08     Wednesday 10/08     Wednesday 10/08     Wednesday 10/08     Wednesday 10/08
1   12am    0.5-0.8m    NaN     0.9m    6s  NaN     0.5m    9s  NaN     NaN     ...     11 11 kph   NaN     NaN     26°c    NaN     NaN     NaN     NaN     NaN     NaN
2   3am     0.3-0.5m    NaN     0.5m    9s  NaN     0.8m    6s  NaN     NaN     ...     13 17 kph   NaN     NaN     24°c    NaN     NaN     NaN     NaN     NaN     NaN
3   6am     0.2-0.3m    NaN     0.5m    9s  NaN     0.7m    6s  NaN     NaN     ...     12 16 kph   NaN     NaN     24°c    NaN     NaN     NaN     NaN     NaN     NaN
4   9am     0.3-0.6m    NaN     0.5m    9s  NaN     0.7m    6s  NaN     NaN     ...     13 16 kph   NaN     NaN     25°c    NaN     NaN     NaN     NaN     NaN     NaN
...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
121     High    11:57PM     1.34m   NaN     NaN     NaN     NaN     NaN     NaN     NaN     ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
122     First Light     5:42AM  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
123     Sunrise     6:10AM  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
124     Sunset  7:53PM  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
125     Last Light  8:21PM  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
126 rows × 22 columns
____________
0   1   2
0   Low     12:24AM     -0.05m
1   High    6:25AM  1.28m
2   Low     12:28PM     -0.01m
3   High    6:49PM  1.66m
____________
0   1
0   First Light     5:36AM
1   Sunrise     6:05AM
2   Sunset  8:00PM
3   Last Light  8:30PM
____________
0   1   2
0   Low     1:19AM  -0.13m
1   High    7:21AM  1.37m
2   Low     1:26PM  -0.06m
3   High    7:43PM  1.7m
____________
0   1
0   First Light     5:37AM
1   Sunrise     6:06AM
2   Sunset  7:59PM
3   Last Light  8:28PM
____________
0   1   2
0   Low     2:11AM  -0.18m
1   High    8:14AM  1.43m
2   Low     2:21PM  -0.09m
3   High    8:34PM  1.69m
____________
0   1
0   First Light     5:38AM
1   Sunrise     6:07AM
2   Sunset  7:58PM
3   Last Light  8:27PM
____________
0   1   2
0   Low     2:59AM  -0.21m
1   High    9:05AM  1.47m
2   Low     3:13PM  -0.09m
3   High    9:24PM  1.64m
____________
0   1
0   First Light     5:39AM
1   Sunrise     6:08AM
2   Sunset  7:57PM
3   Last Light  8:25PM
____________
0   1   2
0   Low     3:46AM  -0.2m
1   High    9:57AM  1.47m
2   Low     4:03PM  -0.06m
3   High    10:14PM     1.56m
____________
0   1
0   First Light     5:40AM
1   Sunrise     6:09AM
2   Sunset  7:55PM
3   Last Light  8:24PM
____________
0   1   2
0   Low     4:29AM  -0.15m
1   High    10:48AM     1.46m
2   Low     4:52PM  0.01m
3   High    11:05PM     1.46m
____________
0   1
0   First Light     5:41AM
1   Sunrise     6:10AM
2   Sunset  7:54PM
3   Last Light  8:23PM
____________
0   1   2
0   Low     5:12AM  -0.07m
1   High    11:39AM     1.43m
2   Low     5:42PM  0.1m
3   High    11:57PM     1.34m
____________
0   1
0   First Light     5:42AM
1   Sunrise     6:10AM
2   Sunset  7:53PM
3   Last Light  8:21PM
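
If one combined file is more useful than one CSV per spot, the same loop can tag each long forecast table with the spot it came from and concatenate everything at the end. A sketch that builds on the script above (same id_list and headers, with a short pause between requests; it assumes the forecast table is the only one longer than 50 rows):

import time
import requests
import pandas as pd

frames = []
for x in id_list:
    r = requests.get('https://magicseaweed.com' + x, headers=headers)
    try:
        dfs = pd.read_html(r.text)
        for df in dfs:
            if df.shape[0] > 50:       # keep only the long forecast table(s)
                df['spot'] = x         # remember which spot each row came from
                frames.append(df)
    except Exception as e:
        print(x, e)
    time.sleep(1)                      # be polite between requests

if frames:
    pd.concat(frames, ignore_index=True).to_csv('nj_forecasts.csv', index=False)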

LATEST UPDATE: