I wrote the following code to fetch a product description from a website using BeautifulSoup:
import logging

import requests
from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)


def get_soup(url):
    # Implicitly returns None if the request fails or the status is not 200.
    try:
        response = requests.get(url)
        if response.status_code == 200:
            html = response.content
            return BeautifulSoup(html, "html.parser")
    except Exception as ex:
        print("error from " + url + ": " + str(ex))


def get_product_details(url):
    try:
        soup = get_soup(url)
        prod_details = dict()
        # Every <ul> that is a later sibling of a <p>.
        desc_list = soup.select('p ~ ul')
        prod_details['description'] = ''.join(desc_list)
        return prod_details
    except Exception as ex:
        logger.warning('%s - %s', ex, url)


if __name__ == '__main__':
    get_product_details("http://www.aprisin.com.sg/p-748-littletikespoptunesguitar.html")
In the code above I am trying to convert the description (a list) into a string, but I get the following error:
[WARNING] aprisin.py:82 get_product_details() : sequence item 0: expected str instance, Tag found - http://www.aprisin.com.sg/p-748-littletikespoptunesguitar.html
This is the description output when I do not convert it to a string:
[<ul>
<li>Freestyle</li>
<li>Play along with 5 pre-set tunes: </li>
</ul>, <ul>
<li>Each string will play a note</li>
<li>Guitar has a whammy bar</li>
<li>2-in-1 volume control and power button </li>
<li>Simple and easy to use </li>
<li>Helps develop music appreciation </li>
<li>Requires 3 "AA" alkaline batteries (included)</li>
</ul>]
You are passing a list of Tag objects, not strings, to join(). join() only works on an iterable of strings. Change the join call to one of the following:
prod_details['description'] = ''.join([tag.get_text() for tag in desc_list])
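or (note that tag.string is None for a tag with more than one child, such as these <ul> elements, so fall back to an empty string):
prod_details['description'] = ''.join([tag.string or '' for tag in desc_list])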
If you need the description together with its HTML markup, you can use one of the following:
# this will preserve the html tags and indentation.
prod_details['description'] = ''.join([tag.prettify() for tag in desc_list])
or
# this will return the html content as string.
prod_details['description'] = ''.join([str(tag) for tag in desc_list])
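To see how the text-based options behave, here is a minimal sketch that parses a small inline snippet instead of fetching the page. The snippet is an assumption modelled on the output shown in the question, not the live page, and the names html, items and description are only illustrative. It also shows that .string is None for a <ul> with several children, and that collecting the text of each <li> keeps the items separated rather than run together.

```python
from bs4 import BeautifulSoup

# Assumed markup, modelled on the output in the question (not the live page).
html = """
<p>Features</p>
<ul>
<li>Freestyle</li>
<li>Play along with 5 pre-set tunes: </li>
</ul>
<ul>
<li>Each string will play a note</li>
<li>Guitar has a whammy bar</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
desc_list = soup.select("p ~ ul")  # both <ul> siblings that follow the <p>

# .string is None because each <ul> has more than one child node.
first_string = desc_list[0].string

# Collect the stripped text of every <li>, then join with newlines.
items = [
    li.get_text(strip=True)
    for ul in desc_list
    for li in ul.find_all("li")
]
description = "\n".join(items)
print(description)
```

Joining with "\n" is just one choice; any separator works once every element is a str.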
desc_list is a list of bs4.element.Tag objects. You should convert the tags to strings:
desc_list = soup.select('p ~ ul')
prod_details['description'] = str(desc_list[0])
You are trying to join a list of Tags, but the join method needs str arguments. Try:
''.join([str(i) for i in desc_list])
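The same behaviour can be reproduced without BeautifulSoup. The sketch below uses a hypothetical FakeTag class as a stand-in for bs4.element.Tag (it is not part of any library) to show that str.join() rejects non-str items and that converting each item with str() first avoids the error.

```python
class FakeTag:
    """Hypothetical stand-in for bs4.element.Tag; it only defines __str__."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


tags = [FakeTag("<ul><li>a</li></ul>"), FakeTag("<ul><li>b</li></ul>")]

# join() does not call str() on its items, so this raises TypeError.
try:
    "".join(tags)
    error_message = None
except TypeError as ex:
    error_message = str(ex)

# Converting each item explicitly makes the join succeed.
joined = "".join(str(t) for t in tags)

print(error_message)
print(joined)
```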