Python 2: searching for a phrase on multiple websites taken from a list file

So I have the following list of links in a file named "output":

https://web.archive.org/web/20180101003616/http://onet.pl
https://web.archive.org/web/20180102000139/http://onet.pl
[...]

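If you open the first link in the list and press Ctrl+F in Firefox, you can find the phrase "Katastrofa".

I just want a script that finds a phrase ("Katastrofa" is only an example; I want to use an argv parameter, but that is not important here), prints a success message, and moves on to the next link.

I'm stuck and don't know how to do it. The test script I have doesn't "see" the word "Katastrofa", even though it is definitely on the first page.

Please help :(

Here is what I have so far: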

f = open('output', 'r')
f2 = f.readlines()
for i in f2:
    r = requests.get(i)
    first_page = r.text
    soup = BeautifulSoup(first_page, 'html.parser')
    page_soup = soup
    fraza = "Katastrofa"
    boxes = page_soup.body.find_all(fraza)
    print(i)
    print(boxes)

Output:

https://web.archive.org/web/20180101003616/http://onet.pl
[]
https://web.archive.org/web/20180102000139/http://onet.pl
[]
https://web.archive.org/web/20180103002217/http://onet.pl

If you want to check whether the HTML string contains the text:

for i in f2:
    r = requests.get(i)
    fraza = "Katastrofa"
    # re.search scans the whole page; re.match would only test its very start
    if re.search(fraza, r.text, re.I):  # ignore case
        print(i)
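
A slightly fuller sketch of the same substring check, for readers who want to pass the phrase in from the command line. The helper names (`contains_phrase`, `read_urls`, `find_matching_urls`) and the injectable `fetch` argument are illustrative assumptions, not part of the original answer; the only behaviour taken from it is the case-insensitive regex search over the page text.

```python
import re


def contains_phrase(html, phrase):
    """Return True if phrase occurs anywhere in html, ignoring case."""
    # re.escape makes the phrase literal; re.search scans the whole string,
    # whereas re.match would only test the very beginning.
    return re.search(re.escape(phrase), html, re.IGNORECASE) is not None


def read_urls(path):
    """Read one URL per line, dropping blank lines and trailing newlines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def find_matching_urls(urls, phrase, fetch):
    """Return the URLs whose fetched HTML contains phrase.

    fetch is any callable that maps a URL to its HTML text
    (hypothetical hook, so the loop can be tried without network access).
    """
    return [url for url in urls if contains_phrase(fetch(url), phrase)]
```

With requests installed, this could be wired up as `find_matching_urls(read_urls('output'), sys.argv[1], lambda url: requests.get(url).text)`. Stripping each line matters because `readlines()` keeps the trailing newline on every URL.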

If you want to find the HTML elements that contain the text:

for i in f2:
    r = requests.get(i)
    soup = BeautifulSoup(r.text, 'html.parser')
    fraza = "Katastrofa"
    boxes = soup.find_all(True, text=re.compile(fraza, re.I))
    if boxes:
        print(i)
        print(boxes)

The result is a list of the innermost child elements that contain the text:

https://web.archive.org/web/20180101003616/http://onet.pl
[<span class="title"> Kostaryka: Katastrofa lotnicza. Media: są ofiary  </span>, 
<span class="title"> Australia: katastrofa samolotu, są ofiary śmiertelne  </span>]
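
As for why the script in the question printed empty lists: a bare string passed to `find_all` is treated as a tag name, so `find_all("Katastrofa")` looks for `<Katastrofa>` elements rather than for text. The minimal sketch below contrasts the two calls on a made-up HTML snippet; the snippet and the function name `elements_with_text` are assumptions for illustration. It uses the `string=` keyword, which is the newer spelling of `text=` in Beautiful Soup 4.4.0 and later.

```python
import re

from bs4 import BeautifulSoup

# Made-up markup standing in for an archived page.
HTML = (
    '<body>'
    '<span class="title"> Kostaryka: Katastrofa lotnicza </span>'
    '<p>pogoda</p>'
    '</body>'
)


def elements_with_text(html, phrase):
    """Return the tags whose own text matches phrase, ignoring case."""
    soup = BeautifulSoup(html, "html.parser")
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    # True matches every tag; string= then filters on each tag's .string.
    return soup.find_all(True, string=pattern)


soup = BeautifulSoup(HTML, "html.parser")
by_tag_name = soup.body.find_all("Katastrofa")  # searches for <Katastrofa> tags
by_text = elements_with_text(HTML, "katastrofa")

print(by_tag_name)
print([tag.name for tag in by_text])
```

Only the `<span>` is returned, not `<body>`, because a tag with several children has no single `.string` to match against. This is also why the answer's output lists only the innermost elements.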
