Using BeautifulSoup to return only the value displayed in a table on a web page (pandas read_html)



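I only want to return the prices displayed on a grocery retailer's website.

I have already scraped the table from the website, but I want each cell of the DataFrame to contain only the delivery price. My idea was to filter each cell and return the regex match for the price within the cell's string. I'm not sure whether there is a simpler way to do this, perhaps using pd.read_html?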

import requests
import pandas as pd
from bs4 import BeautifulSoup
postcode = 'l4 0th'
payload = {'postcode': postcode}
putUrl = 'https://www.sainsburys.co.uk/gol-api/v1/customer/postcode'
Sains_url = 'https://www.sainsburys.co.uk/shop/PostCodeCheckSuccessView'
Sains_url2 = 'https://www.sainsburys.co.uk/shop/BookingDeliverySlotDisplayView'
client = requests.Session()
PutReq = client.put(putUrl, data=payload)
rget = client.get(Sains_url)
r2 = client.get(Sains_url2)
soup = BeautifulSoup(r2.content,'lxml')
table = soup.find_all('table')[0]
df = pd.read_html(str(table), skiprows=([1]))[0]
df = df[~df.Time.str.contains("Afternoon delivery")]
df = df[~df.Time.str.contains("Evening delivery")]

My DataFrame should look like this:

+-------------+----------------+-------------+-------------+
|    Time     |     Today      | Wed 26 June | Thu 27 June |
+-------------+----------------+-------------+-------------+
| 7.30-8:30am | Not Available  | £3          | £5          |
+-------------+----------------+-------------+-------------+

IIUC, you can do some post-processing with a regex and applymap:

import re
pat = re.compile(r'£\S+')
# This regex matches '£' and every following character
# up to the next whitespace
df.applymap(lambda x: re.findall(pat, str(x))[0] if '£' in str(x) else x)

[out]

                 Time          Today    Wed  26 Jun Thu  27 Jun Fri  28 Jun  
0     7:30am - 8:30am  Not Available  Not Available       £4.50          £7   
1     8:00am - 9:00am  Not Available             £3       £5.50          £6   
2     8:30am - 9:30am  Not Available             £3       £5.50          £6   
3    9:00am - 10:00am  Not Available             £3       £4.50          £6   
4    9:30am - 10:30am  Not Available             £3       £4.50          £6   
5   10:00am - 11:00am  Not Available          £2.50       £3.50          £5   
6   11:00am - 12:00pm  Not Available          £1.50       £2.50          £4   
8    12:00pm - 1:00pm  Not Available             £1          £2          £3   
9     1:00pm - 2:00pm  Not Available          £0.50          £2       £2.50   
10    2:00pm - 3:00pm  Not Available          £0.50          £3       £2.50   
11    3:00pm - 4:00pm  Not Available          £0.50          £3       £3.50   
12    4:00pm - 5:00pm  Not Available             £1          £3       £4.50   
13    4:30pm - 5:30pm  Not Available             £1          £3       £4.50   
15    5:00pm - 6:00pm  Not Available             £1       £3.50       £4.50   
16    5:30pm - 6:30pm  Not Available             £1       £3.50       £4.50   
17    6:00pm - 7:00pm  Not Available  Not Available       £2.50          £4   
18    6:30pm - 7:30pm  Not Available  Not Available       £2.50          £4   
19    7:00pm - 8:00pm  Not Available  Not Available       £2.50          £4   
20    7:30pm - 8:30pm  Not Available  Not Available       £2.50          £4   
21    8:00pm - 9:00pm  Not Available  Not Available       £1.50          £2   
22   9:00pm - 10:00pm  Not Available          £1.50          £1       £1.50   
23  10:00pm - 11:00pm  Not Available             £1       £0.50       £1.50   
      Sat  29 Jun    Sun  30 Jun Mon  1 Jul  
0           £6.50  Not Available      £5.50  
1              £7             £7      £5.50  
2              £7             £7      £5.50  
3              £7             £7         £5  
4              £7             £7         £5  
5           £5.50          £5.50      £4.50  
6           £5.50             £5      £2.50  
8           £3.50          £3.50         £2  
9              £3          £3.50      £1.50  
10             £3          £2.50         £3  
11          £3.50             £3      £2.50  
12          £3.50          £3.50         £4  
13          £3.50          £3.50         £4  
15             £3          £2.50         £4  
16             £3          £2.50         £4  
17             £3             £3         £3  
18             £3             £3         £3  
19             £3             £3         £3  
20             £3             £3         £3  
21             £2             £2         £1  
22             £2             £2         £1  
23  Not Available  Not Available      £0.50  

If lambdas aren't your thing, this is equivalent to the more explicit:

def extract_cost(string):
    # str() guards against non-string cells such as NaN
    if '£' in str(string):
        return re.findall(r'£\S+', str(string))[0]
    else:
        return string
df.applymap(extract_cost)

where applymap here simply applies the function extract_cost to every value in the DataFrame.
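For anyone who wants to try the extraction without hitting the website, here is a minimal, self-contained sketch of the same idea. The sample cell texts are made up (the exact strings the page puts in each cell aren't shown above), and it maps the function over each column with Series.map, which should behave like the element-wise applymap call.

```python
import re

import pandas as pd

# Hypothetical cell contents; the real page's text may differ.
sample = pd.DataFrame({
    "Time": ["7:30am - 8:30am", "8:00am - 9:00am"],
    "Today": ["Not Available", "Not Available"],
    "Wed 26 Jun": ["Not Available", "Book slot for £3 delivery"],
    "Thu 27 Jun": ["Book slot for £4.50 delivery", "Book slot for £5.50 delivery"],
})

# '£' followed by every non-whitespace character up to the next space
PRICE = re.compile(r"£\S+")


def extract_cost(value):
    """Return the first price found in the cell, or the cell unchanged."""
    match = PRICE.search(str(value))
    return match.group(0) if match else value


# Apply the function to every cell, one column at a time.
prices = sample.apply(lambda col: col.map(extract_cost))
print(prices)
```

Cells without a '£' (the time ranges and "Not Available") pass through untouched, so the result keeps the same shape as the input.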
