Scrapy finds words that do not exist on the website



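I am writing a Scrapy spider that should check whether specific strings appear in the content (text) of websites. I have a lot of websites (a few thousand) and many strings to look for, so the code uses lists bound to variables. Some of the lists are imported from other Python files.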

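The problem I am running into is that the code seems to produce a positive "hit" even when I cannot find the string on the page after manually checking the URL with dev tools. The code and a sample of the output are below.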

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from list_loop import *
import re

word_to_find = 'pharmacy'


class TestSpider(CrawlSpider):
    name = 'test'
    # these are lists of a lot of domains imported from another
    # file called list_loop.py
    allowed_domains = strip_url
    start_urls = merch_url

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Here I clean up the parsed text not to include /n or whitespace.
        words = response.xpath("//a//text()").getall()
        cleaned_words = [word.strip() for word in words]
        cleaned_words = [word.lower() for word in cleaned_words if len(word) > 0]

        # Then I loop through the cleaned_words in order to find a match
        for single_word in cleaned_words:
            re.search(r'\b%s\b' % word_to_find, single_word)
            yield {
                'Matching': 'Found the word {} in {}'.format(word_to_find, response.url)
            }
        else:
            pass

The allowed_domains and start_urls lists contain alibaba.com along with many other websites. After running the spider I get output like this:

{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},

The same thing happens on many other websites whose content or HTML does not actually contain the word "pharmacy". Any idea what is going wrong here?

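I believe you are missing an if statement. In your code the item is yielded for every entry in cleaned_words, whether or not re.search finds a match, because its return value is never checked.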

for single_word in cleaned_words:
    re.search(r'\b%s\b' % word_to_find, single_word)
    yield {
        'Matching': 'Found the word {} in {}'.format(word_to_find, response.url)
    }
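
To see the effect in isolation, here is a small sketch that runs without Scrapy. The link texts are made up for illustration and are not taken from your crawl; the point is only that re.search returns None when nothing matches, and that calling it without testing the result does not change what the loop does.

```python
import re

word_to_find = "pharmacy"
# Hypothetical link texts (an assumption); none of them contains the search word.
cleaned_words = ["home", "contact us", "sign in"]

# Same shape as the loop above: the result of re.search is discarded,
# so the loop body runs once for every entry.
unconditional = []
for single_word in cleaned_words:
    re.search(r"\b%s\b" % word_to_find, single_word)
    unconditional.append(single_word)

# re.search returns None (falsy) when there is no match.
results = [re.search(r"\b%s\b" % word_to_find, w) for w in cleaned_words]
```

This would also explain why the same URL shows up many times in your output: one item is produced per link text on the page.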

I believe you want something like this:

for single_word in cleaned_words:
    if re.search(r'\b%s\b' % word_to_find, single_word):
        yield {
            'Matching': 'Found the word {} in {}'.format(word_to_find, response.url)
        }
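
Since you mention having many strings to look for, one possible way to structure the matching is sketched below. It is only a sketch under assumptions: build_pattern and find_words are hypothetical helper names, the word list and link texts are invented, and it is kept free of Scrapy so it can be run on its own. It compiles all search words into a single whole-word pattern and collects each matched word once.

```python
import re


def build_pattern(words):
    """Compile one case-insensitive, whole-word pattern for all search words."""
    if not words:
        raise ValueError("at least one search word is required")
    # re.escape keeps characters such as '.' in a search term literal.
    alternation = "|".join(re.escape(word) for word in words)
    return re.compile(r"\b(?:%s)\b" % alternation, re.IGNORECASE)


def find_words(texts, pattern):
    """Return the sorted, de-duplicated search words found in any of the texts."""
    found = set()
    for text in texts:
        for match in pattern.finditer(text):
            found.add(match.group(0).lower())
    return sorted(found)


# Hypothetical link texts standing in for response.xpath("//a//text()").getall()
link_texts = ["  Online Pharmacy ", "Contact us", "pharmacy deals", "Pharmacies"]
pattern = build_pattern(["pharmacy", "clinic"])
hits = find_words(link_texts, pattern)
```

Inside parse_item you could then yield a single item per response, and only when hits is not empty, which should also avoid the repeated lines for the same URL. Note that the whole-word pattern does not match "Pharmacies"; whether that is what you want depends on your use case.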
