import requests
from bs4 import BeautifulSoup
url = 'https://www.basketball-reference.com/players/a/'
urlb = 'https://www.basketball-reference.com/players/b/'
urlc = 'https://www.basketball-reference.com/players/c/'
result = requests.get(url)
doc = BeautifulSoup(result.text, 'lxml')
college = doc.find_all(string="Kentucky")
result = requests.get(urlb)
doc = BeautifulSoup(result.text, 'lxml')
collegeb = doc.find_all(string='Kentucky')
result = requests.get(urlc)
doc = BeautifulSoup(result.text, 'lxml')
collegec = doc.find_all(string='Kentucky')
print(college)
print(collegeb)
print(collegec)
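I need to do this for every letter of the alphabet, for at least 30 schools, and I'd really like to know how to do it more efficiently.
Remove the duplication in the nearly identical code by looping over the inputs and collecting the results in a list or dict: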
import requests
from bs4 import BeautifulSoup
url_template = 'https://www.basketball-reference.com/players/{}/'
folders = ['a', 'b', 'c'] # The only varying thing in your original tripled code
colleges = [] # Store the results for each varied thing here in same order
for folder in folders:  # Loop over varying component
    result = requests.get(url_template.format(folder))  # Substitute it in template
    doc = BeautifulSoup(result.text, 'lxml')
    colleges.append(doc.find_all(string="Kentucky"))  # Append result in same order
# Loop over results to print them
for college in colleges:
    print(college)
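If you make this work for many schools, for every letter of the alphabet, you'd probably want a dict (better, a defaultdict) instead of a list, so you can group the results by school, with an inner loop that parses the data per school: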
import requests
from bs4 import BeautifulSoup
from collections import defaultdict
from string import ascii_lowercase
url_template = 'https://www.basketball-reference.com/players/{}/'
folders = ascii_lowercase # Will run for every lowercase alphabet letter
schoolnames = ("Kentucky", "Gonzaga", ...)  # Replace ... with the rest of your schools
colleges = defaultdict(list) # Store a list of results for each school
for folder in folders:  # Loop over varying component
    result = requests.get(url_template.format(folder))  # Substitute it in template
    doc = BeautifulSoup(result.text, 'lxml')
    for schoolname in schoolnames:
        # Search each page for text nodes that exactly match the school name
        colleges[schoolname].append(doc.find_all(string=schoolname))
# Loop over results to print them
for collegename, results in colleges.items():
    print(collegename)
    for result in results:
        print(result)
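With the structure above, each colleges[schoolname] ends up holding one list of matches per letter page, so getting a single total per school takes one more pass. A minimal sketch, assuming the dictionary has exactly that shape; the helper name total_matches and the sample data are made up for illustration and are not scraped results:

```python
from collections import defaultdict


def total_matches(colleges):
    """Sum the per-letter match lists into one count per school."""
    return {
        school: sum(len(matches) for matches in per_letter)
        for school, per_letter in colleges.items()
    }


# Hypothetical stand-in for scraped results: one inner list per letter page.
sample = defaultdict(list)
sample["Kentucky"].append(["Kentucky", "Kentucky"])  # matches on page 'a'
sample["Kentucky"].append(["Kentucky"])              # matches on page 'b'
sample["Gonzaga"].append([])                         # no matches on page 'a'

print(total_matches(sample))
```

Keeping the per-letter lists and summing afterwards means the scraping loop stays unchanged, and the totals can be recomputed without fetching anything again.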
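Here is some slightly simpler code. All I do is pull in all the player tables, then call .value_counts() on the 'Colleges' column. That gives you every school. Then, if you only want to look at one school, just look it up by its index value: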
import pandas as pd
from string import ascii_lowercase
dfs_list = []
for letter in ascii_lowercase:
    url = f'https://www.basketball-reference.com/players/{letter}/'
    dfs_list.append(pd.read_html(url)[0])  # First table on the page is the player table
    print(url)
results = pd.concat(dfs_list, axis=0)
colleges_count = results['Colleges'].value_counts()
You can even condense it into fewer lines with a list comprehension:
import pandas as pd
from string import ascii_lowercase
results = pd.concat([pd.read_html(f'https://www.basketball-reference.com/players/{letter}/')[0] for letter in ascii_lowercase], axis=0)
colleges_count = results['Colleges'].value_counts()
Output:
print(colleges_count)
Kentucky 112
UCLA 91
UNC 91
Duke 84
Kansas 72
Kansas, Houston 1
California Western Uiversity 1
Florida, Louisiana 1
NC State, Iona College 1
Seattle Pacific University, Washington 1
Name: Colleges, Length: 806, dtype: int64
Or to look at just one school:
print(colleges_count['Kentucky'])
112
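One thing the output above suggests: players with more than one college appear as combined entries such as "Kansas, Houston", so colleges_count['Kentucky'] likely counts only the rows whose entry is exactly "Kentucky". A possible way to count every player who attended a school is to split the column before counting. This is a sketch, assuming the separator is always ", " as in the output shown; the small DataFrame is hypothetical sample data standing in for the scraped table:

```python
import pandas as pd

# Hypothetical sample standing in for the scraped 'Colleges' column.
results = pd.DataFrame({
    "Colleges": ["Kentucky", "Kansas, Houston", "Kentucky, Kansas", None],
})

per_school = (
    results["Colleges"]
    .dropna()          # skip players with no college listed
    .str.split(", ")   # "Kansas, Houston" -> ["Kansas", "Houston"]
    .explode()         # one row per (player, school) pair
    .value_counts()
)

print(per_school["Kansas"])
```

A school whose own name contains ", " would be split incorrectly by this approach, so it is worth checking the resulting index for fragments.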
You can simply use a for loop.
import requests
from bs4 import BeautifulSoup
colleges = []
for char in "abcdefghijklmnopqrstuvwxyz":
    url = f"https://www.basketball-reference.com/players/{char}/"
    result = requests.get(url)
    doc = BeautifulSoup(result.text, 'lxml')
    college = doc.find_all(string="Kentucky")
    colleges.append(college)
# Print each letter's matches on its own line
print(*colleges, sep="\n")
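You can write a function containing the instructions you need to "redo" for each school. Then use a main() function that calls it for each school with that school's parameters or characteristics. Your code reads as one big block of lines; you should split it into separate functions and rely more on "clean coding".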
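The function-plus-main() layout described above might look roughly like this. It is only a sketch: the names fetch_page, count_school and main, the default school tuple, and the use of the built-in "html.parser" (instead of lxml) are assumptions for illustration.

```python
from string import ascii_lowercase

from bs4 import BeautifulSoup

URL_TEMPLATE = "https://www.basketball-reference.com/players/{}/"


def fetch_page(letter):
    """Download one letter page and return it as a parsed document."""
    import requests  # imported here so count_school can be used offline
    html = requests.get(URL_TEMPLATE.format(letter)).text
    return BeautifulSoup(html, "html.parser")


def count_school(doc, school):
    """Count text nodes in the document that are exactly the school name."""
    return len(doc.find_all(string=school))


def main(schools=("Kentucky", "Gonzaga"), letters=ascii_lowercase):
    """Fetch each letter page once and tally matches for every school."""
    totals = dict.fromkeys(schools, 0)
    for letter in letters:
        doc = fetch_page(letter)  # one request per letter, reused per school
        for school in schools:
            totals[school] += count_school(doc, school)
    for school, total in totals.items():
        print(school, total)


# main()  # uncomment to run against the live site (26 requests)
```

Separating the download from the counting means the counting function can be tried on a saved or hand-written HTML snippet before any requests are sent.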