Regex match for each value in a DataFrame
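I only want to return the prices shown on a grocery retailer's website.
I have scraped the table from the site, but I only want the delivery price in each cell of the data frame. My idea is to filter each cell and return a regex match for the price within the cell's string. I'm not sure whether there is a simpler way to do this, perhaps using pd.read_html?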
import requests
import pandas as pd
from bs4 import BeautifulSoup
postcode = 'l4 0th'
payload = {'postcode': postcode}
putUrl = 'https://www.sainsburys.co.uk/gol-api/v1/customer/postcode'
Sains_url = 'https://www.sainsburys.co.uk/shop/PostCodeCheckSuccessView'
Sains_url2 = 'https://www.sainsburys.co.uk/shop/BookingDeliverySlotDisplayView'
client = requests.Session()
PutReq = client.put(putUrl, data=payload)
rget = client.get(Sains_url)
r2 = client.get(Sains_url2)
soup = BeautifulSoup(r2.content,'lxml')
table = soup.find_all('table')[0]
df = pd.read_html(str(table), skiprows=([1]))[0]
df = df[~df.Time.str.contains("Afternoon delivery")]
df = df[~df.Time.str.contains("Evening delivery")]
My data frame should look like this:
+-------------+----------------+-------------+-------------+
| Time | Today | Wed 26 June | Thu 27 June |
+-------------+----------------+-------------+-------------+
| 7.30-8:30am | Not Available | £3 | £5 |
+-------------+----------------+-------------+-------------+
IIUC, you can do some post-processing with regex and applymap:
import re
pat = re.compile(r'£\S+')
# This regex matches '£' and every following character
# up to the next whitespace
# (in pandas 2.1+, applymap is deprecated in favour of DataFrame.map)
df.applymap(lambda x: pat.findall(str(x))[0] if '£' in str(x) else x)
[out]
Time Today Wed 26 Jun Thu 27 Jun Fri 28 Jun
0 7:30am - 8:30am Not Available Not Available £4.50 £7
1 8:00am - 9:00am Not Available £3 £5.50 £6
2 8:30am - 9:30am Not Available £3 £5.50 £6
3 9:00am - 10:00am Not Available £3 £4.50 £6
4 9:30am - 10:30am Not Available £3 £4.50 £6
5 10:00am - 11:00am Not Available £2.50 £3.50 £5
6 11:00am - 12:00pm Not Available £1.50 £2.50 £4
8 12:00pm - 1:00pm Not Available £1 £2 £3
9 1:00pm - 2:00pm Not Available £0.50 £2 £2.50
10 2:00pm - 3:00pm Not Available £0.50 £3 £2.50
11 3:00pm - 4:00pm Not Available £0.50 £3 £3.50
12 4:00pm - 5:00pm Not Available £1 £3 £4.50
13 4:30pm - 5:30pm Not Available £1 £3 £4.50
15 5:00pm - 6:00pm Not Available £1 £3.50 £4.50
16 5:30pm - 6:30pm Not Available £1 £3.50 £4.50
17 6:00pm - 7:00pm Not Available Not Available £2.50 £4
18 6:30pm - 7:30pm Not Available Not Available £2.50 £4
19 7:00pm - 8:00pm Not Available Not Available £2.50 £4
20 7:30pm - 8:30pm Not Available Not Available £2.50 £4
21 8:00pm - 9:00pm Not Available Not Available £1.50 £2
22 9:00pm - 10:00pm Not Available £1.50 £1 £1.50
23 10:00pm - 11:00pm Not Available £1 £0.50 £1.50
Sat 29 Jun Sun 30 Jun Mon 1 Jul
0 £6.50 Not Available £5.50
1 £7 £7 £5.50
2 £7 £7 £5.50
3 £7 £7 £5
4 £7 £7 £5
5 £5.50 £5.50 £4.50
6 £5.50 £5 £2.50
8 £3.50 £3.50 £2
9 £3 £3.50 £1.50
10 £3 £2.50 £3
11 £3.50 £3 £2.50
12 £3.50 £3.50 £4
13 £3.50 £3.50 £4
15 £3 £2.50 £4
16 £3 £2.50 £4
17 £3 £3 £3
18 £3 £3 £3
19 £3 £3 £3
20 £3 £3 £3
21 £2 £2 £1
22 £2 £2 £1
23 Not Available Not Available £0.50
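To see what the pattern does on a single cell, here is a minimal sketch. The sample string is made up for illustration; the actual cell text scraped from the site may be worded differently.

```python
import re

pat = re.compile(r'£\S+')

# Hypothetical cell text; the real wording on the site may differ.
cell = 'Book delivery for £4.50 between 7:30am - 8:30am'

# findall returns every '£...' token; keep the first one,
# or fall back to the original text when there is no price.
prices = pat.findall(cell)
first_price = prices[0] if prices else cell
```

Cells such as 'Not Available' contain no '£', so they produce no match and are left untouched.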
If lambdas aren't your thing, this is equivalent to the more explicit:
def extract_cost(string):
    # Return the price token if the cell contains one, otherwise the cell unchanged
    if '£' in str(string):
        return re.findall(r'£\S+', str(string))[0]
    else:
        return string
df.applymap(extract_cost)
Here, applymap simply applies the function extract_cost to every element of the DataFrame.
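Since the scraping code above needs a live session with the site, the following is a small self-contained sketch of the same element-wise idea on made-up data. The sample rows and the 'Book slot for ...' wording are assumptions for illustration, not the real page content, and the `hasattr` check is only there because newer pandas versions (2.1+) provide `DataFrame.map` as the replacement for `applymap`.

```python
import re
import pandas as pd

PRICE = re.compile(r'£\S+')

def extract_cost(value):
    """Return the first '£...' token in a cell, or the cell unchanged."""
    match = PRICE.search(str(value))
    return match.group(0) if match else value

# Made-up sample rows standing in for the scraped table.
sample = pd.DataFrame({
    'Time': ['7:30am - 8:30am', '8:00am - 9:00am'],
    'Today': ['Not Available', 'Not Available'],
    'Wed 26 Jun': ['Not Available', 'Book slot for £3'],
    'Thu 27 Jun': ['Book slot for £4.50', 'Book slot for £5.50'],
})

# Use DataFrame.map where it exists (pandas 2.1+), otherwise applymap.
elementwise = sample.map if hasattr(pd.DataFrame, 'map') else sample.applymap
prices = elementwise(extract_cost)
print(prices)
```

Using `search` instead of `findall(...)[0]` avoids an IndexError if a cell contains '£' that is not directly followed by a non-whitespace character.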