Why is my code not recognizing all parts of my list?



I created a list of websites, and my goal is to find the most commonly used words across all of them (excluding common words such as the, as, are, etc.). I set this up as best I could, but I realized my function is only giving me words for the first website in my list. I created a loop, but I believe I may have messed something up and am looking for a fix.

My current code is:

from bs4 import BeautifulSoup
import pandas as pd
from nltk.corpus import stopwords
import nltk
import urllib.request

list1 = ["https://en.wikipedia.org/wiki/Netflix", "https://en.wikipedia.org/wiki/Disney+", "https://en.wikipedia.org/wiki/Prime_Video",
         "https://en.wikipedia.org/wiki/Apple_TV", "https://en.wikipedia.org/wiki/Hulu", "https://en.wikipedia.org/wiki/Crunchyroll",
         "https://en.wikipedia.org/wiki/ESPN%2B", "https://en.wikipedia.org/wiki/HBO_Max", "https://en.wikipedia.org/wiki/Paramount%2B",
         "https://en.wikipedia.org/wiki/PBS", "https://en.wikipedia.org/wiki/Peacock_(streaming_service)", "https://en.wikipedia.org/wiki/Discovery%2B",
         "https://en.wikipedia.org/wiki/Showtime_(TV_network)", "https://en.wikipedia.org/wiki/Epix", "https://en.wikipedia.org/wiki/Starz",
         "https://en.wikipedia.org/wiki/Acorn_TV", "https://en.wikipedia.org/wiki/BritBox", "https://en.wikipedia.org/wiki/Kocowa",
         "https://en.wikipedia.org/wiki/Pantelion_Films", "https://en.wikipedia.org/wiki/Spuul"]
freq = []
for i in list1:
    response = urllib.request.urlopen(i)
    html = response.read()
    soup = BeautifulSoup(html, 'html5lib')
    text = soup.get_text(strip=True)
    tokens = [t for t in text.split()]
    sr = stopwords.words('english')
    clean_tokens = tokens[:]
    for token in tokens:
        if token in stopwords.words('english'):
            clean_tokens.remove(token)
    freq = nltk.FreqDist(clean_tokens)
    freq1 = freq.most_common(20)
    freq[i] = pd.DataFrame(freq.items(), columns=['word', 'frequency'])
    freq1
    print("This is for the website: ", i, "\n", freq1)

The output of freq1 only gives me the 20 most popular words for a single website, but I'm looking for the most popular words across all the websites.

Your program is overwriting the list items in the line: freq1 = freq.most_common(20)
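A minimal illustration of that overwrite, with hypothetical values in place of your word counts: a name reassigned inside a loop keeps only the last iteration's value, while an appended list keeps them all.

```python
results = []
last = None
for n in [1, 2, 3]:
    last = n * 10           # reassigned on every pass; earlier values are lost
    results.append(n * 10)  # appended on every pass; all values are kept

print(last)     # 30 - only the final iteration survives
print(results)  # [10, 20, 30]
```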

Since Python provides a nice way to append lists, see (2). Sorry for editing your code; see the lines and comments marked (1), (2), and (3).

freq = []
freq_list = []  # added new structure for list, as want to avoid conflict with your variables --(1)
for i in list1:
    response = urllib.request.urlopen(i)
    html = response.read()
    soup = BeautifulSoup(html, 'html5lib')
    text = soup.get_text(strip=True)
    tokens = [t for t in text.split()]
    sr = stopwords.words('english')
    clean_tokens = tokens[:]
    for token in tokens:
        if token in stopwords.words('english'):
            clean_tokens.remove(token)
    freq = nltk.FreqDist(clean_tokens)
    freq1 = freq.most_common(20)
    freq_list = freq_list + freq1  # python way to add list items --(2)
    freq[i] = pd.DataFrame(freq.items(), columns=['word', 'frequency'])
    #freq1
    print("This is for the website: ", i, "\n", freq1)
df = pd.DataFrame(freq_list, columns=['word', 'frequency'])  # build the dataframe --(3)
df.shape
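One caveat: since freq_list concatenates each site's top-20 list, the same word can appear in several rows of df. If the goal is a single combined ranking, one option (a sketch, not part of the answer above; the site_counts values are hypothetical stand-ins for each site's nltk.FreqDist, which is a Counter subclass) is to sum the per-site counts before taking most_common:

```python
from collections import Counter

# hypothetical per-site word counts, standing in for each site's FreqDist
site_counts = [
    Counter({"streaming": 5, "service": 3, "Netflix": 2}),
    Counter({"streaming": 4, "Hulu": 2, "service": 1}),
]

total = Counter()
for c in site_counts:
    total += c  # Counter addition sums counts word by word

print(total.most_common(2))  # [('streaming', 9), ('service', 4)]
```

The same pattern works on the real loop by accumulating `total += freq` inside it, giving one overall frequency table instead of twenty concatenated ones.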
