Python regular expression on HTML

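I'm going crazy and I hope someone can help.

I'm trying to match this URL with a regex: https://www.reddit.com/r/spacex/?count=50&after=t3_xxxxxxx, where each x is a digit or a letter.

The URL comes from an HTML file:

https://www.reddit.com/r/spacex/?count=25&after=t3_319905

I tried this: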

re.search(r'(<a href=")(https://www.reddit.com/r/spacex/?count=25.+?)(")', subreddit).group(2)

But I keep getting "'NoneType' object has no attribute 'group'".

Use an HTML parser such as BeautifulSoup. It lets you pass a compiled regular expression to match against an attribute value:

soup.find_all('a', href=re.compile(r"after=t3_\w+"))

Working example:

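import re

import requests
from bs4 import BeautifulSoup

url = "https://www.reddit.com/r/spacex/?count=25&after=t3_319905"
# Send a browser-like User-Agent with the request.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'}
response = requests.get(url, headers=headers)
# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(response.content, "html.parser")
# Find every <a> whose href contains "after=t3_" followed by word characters.
print(soup.find_all('a', href=re.compile(r"after=t3_\w+")))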
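
To try the same idea without a network request, here is a minimal sketch that runs on an inline HTML string. Note that it swaps BeautifulSoup for the standard library's html.parser.HTMLParser so it needs no third-party packages; the class name AfterLinkCollector and the sample markup are made up for illustration and are not part of the answer above.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE_HTML = '''
<a href="https://www.reddit.com/r/spacex/">front page</a>
<a href="https://www.reddit.com/r/spacex/?count=25&amp;after=t3_319905" rel="nofollow next">next</a>
'''

AFTER_RE = re.compile(r"after=t3_\w+")


class AfterLinkCollector(HTMLParser):
    """Collect href values of <a> tags that contain an after=t3_... parameter."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # The parser has already turned &amp; into & in attribute values.
        if href and AFTER_RE.search(href):
            self.links.append(href)


collector = AfterLinkCollector()
collector.feed(SAMPLE_HTML)
links = collector.links
print(links)
```

The regex is applied only to the href value the parser hands over, so it never has to deal with the surrounding tag syntax or attribute order.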

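See also the obligatory link for regex + HTML questions:

RegEx match open tags except XHTML self-contained tags

? is a special character in regular expressions: it makes the preceding token optional. You need to escape the ? in your regex so that it matches a literal ? character. You also need to escape the dots, except for the dot in .+?, which is meant to match any character.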

re.search(r'(<a href=")(https://www\.reddit\.com/r/spacex/\?count=25.+?)(")', subreddit).group(2)
                                                          ^
                                                          |
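
The difference is easy to see on a small string. This is only a sketch: the sample anchor tag and the names html, unescaped and escaped are made up for illustration.

```python
import re

# Hypothetical sample; in the question this text comes from the downloaded page.
html = '<a href="https://www.reddit.com/r/spacex/?count=25&after=t3_319905">next</a>'

# Unescaped: "/?" means "an optional slash", so the regex expects "count"
# right after "spacex" or "spacex/" and never consumes the literal "?".
unescaped = re.search(r'https://www.reddit.com/r/spacex/?count=25.+?"', html)

# Escaped: "\?" matches a literal question mark and "\." a literal dot.
escaped = re.search(r'https://www\.reddit\.com/r/spacex/\?count=25.+?"', html)

print(unescaped)  # None, which is why calling .group() on it fails
print(escaped.group(0))
```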

The extra capture groups are not needed here. A single capture group is enough.

re.search(r'<a href="(https://www\.reddit\.com/r/spacex/\?count=25.+?)"', subreddit).group(1)
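
If escaping by hand feels error-prone, one option is to let re.escape() do it for the fixed part of the URL and to check for None before calling .group(). A minimal sketch, assuming the page source is held in a string named subreddit as in the question; the sample value and the names prefix, pattern and next_url are hypothetical.

```python
import re

# Hypothetical sample standing in for the page source held in `subreddit`.
subreddit = '<a href="https://www.reddit.com/r/spacex/?count=25&after=t3_319905" rel="next">next</a>'

# re.escape() backslash-escapes the regex metacharacters in the fixed prefix.
prefix = re.escape("https://www.reddit.com/r/spacex/?count=25")
pattern = re.compile(r'<a href="(' + prefix + r'.+?)"')

match = pattern.search(subreddit)
# re.search returns None when nothing matches, so check before calling .group().
next_url = match.group(1) if match else None
print(next_url)
```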
