Unable to scrape a description correctly with Beautiful Soup and Python



I am scraping this link with bs4 and Python: https://www.americanexpress.com/in/credit-cards/smart-earn-credit-card/?linknav=in-amex-cardshop-allcards-learn-SmartEarnCreditCard-carousel

Basically, I am scraping the key benefits from the site with the following code.

from urllib.request import urlopen
from bs4 import BeautifulSoup

link = 'https://www.americanexpress.com/in/credit-cards/smart-earn-credit-card/?linknav=in-amex-cardshop-allcards-learn-SmartEarnCreditCard-carousel'
html = urlopen(link)
soup = BeautifulSoup(html, 'lxml')
details = []
for span in soup.select(".why-amex__subtitle span"):
    details.append(f'{span.get_text(strip=True)}: {span.find_next("span").get_text(strip=True)}')

print(details)

['Accelerated Earn Rate: Earn 10X Membership Rewards® Points2on your spending on Flipkart and Uber and earn 5X Membership Rewards Points2on Amazon, Swiggy, BookMyShow and more.', 'Welcome Bonus: Rs. 500 cashback as Welcome Gift on eligible spends1of Rs. 10,000 in the first 90 days of Cardmembership', 'Renewal Fee Waiver: Get a renewal fee waiver on eligible spends3of Rs.40,000 and above in the previous year of Cardmembership', 'AMERICAN EXPRESS EMI: Convert purchases into']

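The last item in this list is not scraped properly: it is incomplete, because there is a hyperlink in the middle of its text.

Here is the HTML that corresponds to the problem: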
<div class="why-amex__col"><span class="icons  why-amex__lrgIcon icon-Amex-Icons-2016-85"></span><h4 class="why-amex__subtitle"><div><span>AMERICAN EXPRESS EMI</span></div></h4><div class="why-amex__copy"><div class="description_text"><div><span>Convert purchases into </span><a href="https://www.americanexpress.com/india/membershiprewards/cardmember_offers/viewmore.html" target="_blank">EMI</a><span> at the point of sale with an interest rate as low as 12% p.a. and zero foreclosure charges</span></div></div></div></div>

I want the complete description of the last item, without losing any of the body text.

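Just append the description element itself (rather than its extracted text) to details, then loop over its child tags to build your text.

Something like: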


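texts = []
for i, detail in enumerate(details):
    texts.append('')
    # detail must be a bs4 Tag here, not a string
    for tag in detail.findChildren(recursive=False):
        # the " " separator keeps a space on each side of the link text
        texts[i] += tag.get_text(" ", strip=True)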
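
For reference, here is a minimal self-contained sketch of the same idea, run against the HTML fragment from the question instead of the live page. The function name `extract_benefits` and the choice to pair each `.why-amex__subtitle` with the `.why-amex__copy` in the same `.why-amex__col` are my own assumptions, not something taken from the site's documentation, and it uses the built-in `html.parser` so that lxml is not required.

```python
from bs4 import BeautifulSoup

# HTML fragment copied from the question (a single benefit column).
HTML = (
    '<div class="why-amex__col">'
    '<span class="icons  why-amex__lrgIcon icon-Amex-Icons-2016-85"></span>'
    '<h4 class="why-amex__subtitle"><div><span>AMERICAN EXPRESS EMI</span></div></h4>'
    '<div class="why-amex__copy"><div class="description_text"><div>'
    '<span>Convert purchases into </span>'
    '<a href="https://www.americanexpress.com/india/membershiprewards/'
    'cardmember_offers/viewmore.html" target="_blank">EMI</a>'
    '<span> at the point of sale with an interest rate as low as 12% p.a. '
    'and zero foreclosure charges</span>'
    '</div></div></div></div>'
)


def extract_benefits(html):
    """Return 'title: description' strings, one per benefit column.

    Hypothetical helper: it reads the text of the whole description
    container, so text inside nested <a> tags is included as well.
    """
    soup = BeautifulSoup(html, "html.parser")
    benefits = []
    for col in soup.select(".why-amex__col"):
        title = col.select_one(".why-amex__subtitle")
        copy = col.select_one(".why-amex__copy")
        if title is None or copy is None:
            continue  # skip columns that lack a title or a description
        # separator=" " puts back the spaces that strip=True removes
        # at the boundaries between <span> and <a>.
        benefits.append(
            f'{title.get_text(" ", strip=True)}: {copy.get_text(" ", strip=True)}'
        )
    return benefits


benefits = extract_benefits(HTML)
print(benefits)
```

On the live page the same separator will presumably also put spaces around the footnote markers (the `2` in `Points2on`), which may or may not be what you want.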
