Same code gives different output depending on whether it uses a list comprehension or a generator



I am trying to clean this website and get every word. However, using generators gives me more words than using lists, and the extra words are inconsistent: sometimes there is 1 extra word, sometimes none, and sometimes more than 30. I have read about generators in the Python documentation and looked up some questions on generators. From what I understand, there should be no difference. I don't understand what is happening under the hood. I am using Python 3.6. I have also read a question here about a generator comprehension giving different output than a list comprehension, but I don't understand that case either.

Here is the first function, using generators.

import re
import requests
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

def text_cleaner1(website):
    '''
    This function just cleans up the raw html so that I can look at it.
    Inputs: a URL to investigate
    Outputs: Cleaned text only
    '''
    try:
        site = requests.get(website).text # Connect to the job posting
    except: 
        return   # Need this in case the website isn't there anymore or some other weird connection problem 
    soup_obj = BeautifulSoup(site, "lxml") # Get the html from the site

    for script in soup_obj(["script", "style"]):
        script.extract() # Remove these two elements from the BS4 object
    text = soup_obj.get_text() # Get the text from this
    lines = (line.strip() for line in text.splitlines()) # break into lines
    print(type(lines))
    chunks = (phrase.strip() for line in lines for phrase in line.split("  ")) # break multi-headlines into a line each
    print(type(chunks))
    def chunk_space(chunk):
        chunk_out = chunk + ' ' # Need to fix spacing issue
        return chunk_out  
    text = ''.join(chunk_space(chunk) for chunk in chunks if chunk).encode('utf-8') # Get rid of all blank lines and ends of line
    # Now clean out all of the unicode junk (this line works great!!!)

    try:
        text = text.decode('unicode_escape').encode('ascii', 'ignore') # Need this as some websites aren't formatted
    except:                                                            # in a way that this works, can occasionally throw
        return                                                         # an exception  
    text = str(text)
    text = re.sub("[^a-zA-Z.+3]"," ", text)  # Now get rid of any terms that aren't words (include 3 for d3.js)
                                             # Also include + for C++

    text = text.lower().split()  # Go to lower case and split them apart

    stop_words = set(stopwords.words("english")) # Filter out any stop words
    text = [w for w in text if not w in stop_words]

    text = set(text) # Last, just get the set of these. Ignore counts (we are just looking at whether a term existed
                            # or not on the website)
    return text

Here is the second function, using list comprehensions.

def text_cleaner2(website):
    '''
    This function just cleans up the raw html so that I can look at it.
    Inputs: a URL to investigate
    Outputs: Cleaned text only
    '''
    try:
        site = requests.get(website).text # Connect to the job posting
    except: 
        return   # Need this in case the website isn't there anymore or some other weird connection problem 
    soup_obj = BeautifulSoup(site, "lxml") # Get the html from the site

    for script in soup_obj(["script", "style"]):
        script.extract() # Remove these two elements from the BS4 object
    text = soup_obj.get_text() # Get the text from this
    lines = [line.strip() for line in text.splitlines()] # break into lines
    chunks = [phrase.strip() for line in lines for phrase in line.split("  ")] # break multi-headlines into a line each
    def chunk_space(chunk):
        chunk_out = chunk + ' ' # Need to fix spacing issue
        return chunk_out  
    text = ''.join(chunk_space(chunk) for chunk in chunks if chunk).encode('utf-8') # Get rid of all blank lines and ends of line
    # Now clean out all of the unicode junk (this line works great!!!)

    try:
        text = text.decode('unicode_escape').encode('ascii', 'ignore') # Need this as some websites aren't formatted
    except:                                                            # in a way that this works, can occasionally throw
        return                                                         # an exception  
    text = str(text)
    text = re.sub("[^a-zA-Z.+3]"," ", text)  # Now get rid of any terms that aren't words (include 3 for d3.js)
                                             # Also include + for C++

    text = text.lower().split()  # Go to lower case and split them apart

    stop_words = set(stopwords.words("english")) # Filter out any stop words
    text = [w for w in text if not w in stop_words]

    text = set(text) # Last, just get the set of these. Ignore counts (we are just looking at whether a term existed
                            # or not on the website)
    return text

This code gives me different results at random.

text_cleaner1("https://www.indeed.com/rc/clk?jk=02ecc871f377f959&fccid=c46d0116f6e69eae") - text_cleaner2("https://www.indeed.com/rc/clk?jk=02ecc871f377f959&fccid=c46d0116f6e69eae")
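For reference, both functions return sets, so the - above is a set difference: the words the first call found that the second did not. A tiny sketch of that operator (with made-up words):

first = {'python', 'sql', 'c++'}
second = {'python', 'sql'}
print(first - second)  # {'c++'} - words present only in the first set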

A generator is "lazy" - it doesn't execute its code at once, but only later, when the result is needed. This means it doesn't take the values from variables or functions immediately; instead it keeps references to those variables and functions.
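A minimal sketch of that laziness (the variable names are made up): a generator expression only keeps a reference to its source, so changing the source before the generator is consumed changes the output.

nums = [1, 2, 3]
squares_list = [n * n for n in nums]  # evaluated immediately: [1, 4, 9]
squares_gen = (n * n for n in nums)   # nothing computed yet, only a reference to nums
nums.append(4)                        # mutate the source afterwards
print(squares_list)       # [1, 4, 9]
print(list(squares_gen))  # [1, 4, 9, 16] - the generator sees the change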

An example from the link:

all_configs = [
    {'a': 1, 'b':3},
    {'a': 2, 'b':2}
]
unique_keys = ['a','b']

for x in zip( *([c[k] for k in unique_keys] for c in all_configs) ):
    print(x)
print('---')
for x in zip( *((c[k] for k in unique_keys) for c in all_configs) ):
    print(list(x))
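Running it, the first loop (list comprehension inside) prints the values as expected, while the second (generator inside) repeats the last dict's values:

(1, 2)
(3, 2)
---
[2, 2]
[2, 2]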

In the generator version there is one for loop inside another for loop.

The inner generator gets a reference to c, not the value in c, and it only gets the value later.

Later (when it has to get the results from the generators), it starts executing the outer generator for c in all_configs. As the outer generator loops, it creates two inner generators that hold the reference to c, not the value of c - but the loop also keeps changing the value in c. So in the end you have a "list" of two inner generators, and c holds {'a': 2, 'b':2}.

After that, it executes the inner generators, which finally get the value from c - but at that moment c already holds {'a': 2, 'b':2}.
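One way to avoid this is to force the inner part to be evaluated while c still holds the right value, for example by wrapping it in tuple() (a sketch of the same loop as above):

for x in zip( *(tuple(c[k] for k in unique_keys) for c in all_configs) ):
    print(x)  # (1, 2) then (3, 2), same as the list-comprehension version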


btw: there is a similar problem with lambda in tkinter when you use it with a Button.
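A minimal sketch of that tkinter case (a made-up three-button window): every lambda closes over the loop variable i, so without capturing it as a default argument they all see its final value.

import tkinter as tk

root = tk.Tk()
for i in range(3):
    # Late binding: all three of these callbacks share the same i and print 2.
    tk.Button(root, text=f"broken {i}", command=lambda: print(i)).pack()
    # Workaround: capture the current value of i as a default argument.
    tk.Button(root, text=f"fixed {i}", command=lambda i=i: print(i)).pack()
root.mainloop()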
