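Trouble getting all the names from a webpage with a messy layout

I've written a script to parse all the mobile home park names from a webpage. When I run it, I only get a few of them. How can I get all the names from that page, down to the last one, Parkway Mobile Home Park - Alabama?

Link to the webpage

This is what I've tried so far: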

import requests
from bs4 import BeautifulSoup
url = "replace with above link"
r = requests.get(url)
soup = BeautifulSoup(r.text,"lxml")
items = soup.select_one("table tr")
name = '\n'.join([item.get_text(strip=True) for item in items.select("td p strong") if "alabama" in item.text.lower()])
print(name)

This is the output I get:

Roberts Trailer Park - Alabama
Cloverleaf Trailer Park - Alabama
Longview Mobile Home Park - Alabama

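Try using html.parser instead of lxml. Also, rather than select_one('table tr'), which returns only the first matching row, try find_all('strong'). You will also need to strip the extra spaces and the carriage return and newline characters.

The following code returns the expected records (491):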

import re
import requests
from bs4 import BeautifulSoup
url = "http://www.chattelmortgage.net/Alabama_mobile_home_parks.html"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
items = soup.find_all('strong')
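# Drop CR/LF characters, then collapse runs of two or more whitespace characters into one space
name = '\n'.join([re.sub(r'\s{2,}', ' ', re.sub(r'[\r\n]', '', item.text)).strip() for item in items if 'alabama' in item.text.lower()])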
print(name)
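
The whitespace cleanup can be tried offline without fetching the page. The following is a minimal sketch, assuming a made-up HTML fragment (SAMPLE_HTML) that only imitates the messy layout, and a hypothetical helper named extract_names. It swaps the two regular expressions for " ".join(text.split()), which collapses any run of whitespace, including line breaks, into a single space.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the messy layout; it is not copied from the real page.
SAMPLE_HTML = """
<table>
  <tr><td><p><strong>Roberts Trailer
      Park - Alabama</strong></p></td></tr>
  <tr><td><strong>  Parkway Mobile Home Park   -  Alabama </strong></td></tr>
  <tr><td><strong>Contact us</strong></td></tr>
</table>
"""


def extract_names(html, keyword="alabama"):
    """Return the cleaned text of every <strong> tag that contains the keyword."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for tag in soup.find_all("strong"):
        # str.split() with no arguments splits on any whitespace run,
        # so joining with a single space normalizes newlines and repeated spaces.
        text = " ".join(tag.get_text().split())
        if keyword in text.lower():
            names.append(text)
    return names


names = extract_names(SAMPLE_HTML)
print("\n".join(names))
```

With the real page, the same helper could be called on r.text from the request above; whether it yields the same count depends on the page's actual markup.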

The page's HTML is very poorly formed, so this is ugly, but it works:

import requests
from bs4 import BeautifulSoup
url = "http://www.chattelmortgage.net/Alabama_mobile_home_parks.html"
r = requests.get(url)
soup = BeautifulSoup(r.text,"html")
table = soup.find('table', attrs={'class':'tablebg, tableBorder'})
print([item.text.strip()  for item in table.find_all("strong") if "alabama" in item.text.lower()])
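
As background on why the original attempt returned only a few names: select_one returns just the first element that matches the selector, so any later search is confined to that one row. The sketch below illustrates the difference on a made-up table (the HTML and the park names in it are placeholders, not data from the real page).

```python
from bs4 import BeautifulSoup

# Placeholder markup: three rows, one name per row.
HTML = """
<table>
  <tr><td><p><strong>Park A - Alabama</strong></p></td></tr>
  <tr><td><p><strong>Park B - Alabama</strong></p></td></tr>
  <tr><td><p><strong>Park C - Alabama</strong></p></td></tr>
</table>
"""
soup = BeautifulSoup(HTML, "html.parser")

# select_one returns only the first <tr>, so the search below sees one row.
first_row = soup.select_one("table tr")
only_first = [s.get_text(strip=True) for s in first_row.select("td p strong")]

# select (or find_all) walks every matching element in the document.
all_rows = [s.get_text(strip=True) for s in soup.select("table tr td p strong")]

print(only_first)
print(all_rows)
```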
