Filtering a list of results in Python



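I'm just starting to learn Python, but I'm very familiar with Google Sheets. I'm basically trying to replicate the FILTER function, but I can't find anything on how to do it.

The goal of my script is to extract NBA players' social media handles (from URLs).

I have it pulling all the links, but I want to clean up my code, so basically I want an if statement that says:

If my result contains "https://www.facebook.com", "https://www.twitter.com", or "https://www.instagram.com", that is the only information that gets extracted.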

Right now, it looks more like this:

[Image: code results]

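It's not the end of the world, since I can paste the output into a Google Sheet and clean it up there, but it would be really good to learn how to do this.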

from bs4 import BeautifulSoup
import requests

def get_profile(url):
    profiles = []
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    # attrs must be a dict ({'class': ...}); the original used a set literal
    container = soup.find('div', attrs={'class': 'main-container'})
    # Collect the href of every link inside the container
    for profile in container.find_all('a'):
        profiles.append(profile.get('href'))
    for profile in profiles:
        print(profile)

get_profile('https://basketball.realgm.com/player/Carmelo-Anthony/Summary/452')
get_profile('https://basketball.realgm.com/player/LeBron-James/Summary/250')

You can use the in keyword to search for a substring. In your example, you could check each profile like this:

if "https://www.facebook.com" in profile:
    print(profile)

in returns True if the substring is found.
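As a minimal sketch of how that check turns into a filter, here is the same idea applied to a small list. The sample links are made up for illustration and stand in for whatever your scraper collected:

```python
# Hypothetical sample links, standing in for the hrefs scraped from a page.
links = [
    "https://www.facebook.com/CarmeloAnthony",
    "https://basketball.realgm.com/nba/teams",
    "https://www.instagram.com/carmeloanthony",
]

# `in` tests whether the left string occurs anywhere inside the right one.
print("https://www.facebook.com" in links[0])  # True
print("https://www.facebook.com" in links[1])  # False

# A list comprehension keeps only the matching items,
# which is roughly what FILTER does in a spreadsheet.
facebook_links = [link for link in links if "https://www.facebook.com" in link]
print(facebook_links)
```

The comprehension builds a new list and leaves `links` untouched, so you can run several different filters over the same scraped data.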

You can check whether any item from a list appears in the particular href you are examining, like so:

from bs4 import BeautifulSoup
import requests

def get_profile(url):
    profiles = []
    urls_to_keep = ['https://www.facebook.com', 'https://www.twitter.com', 'https://www.instagram.com']
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    # attrs must be a dict ({'class': ...}); the original used a set literal
    container = soup.find('div', attrs={'class': 'main-container'})
    for profile in container.find_all('a'):
        href = profile.get('href')
        # str() guards against anchors with no href (get returns None)
        if any(word in str(href) for word in urls_to_keep):
            profiles.append(href)
    for profile in profiles:
        print(profile)

get_profile('https://basketball.realgm.com/player/Carmelo-Anthony/Summary/452')
get_profile('https://basketball.realgm.com/player/LeBron-James/Summary/250')
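A note on the `str(href)` call above: `profile.get('href')` returns `None` for an `<a>` tag that has no `href`, and using `in` against `None` raises a `TypeError`. The sketch below shows the same `any()` check with an explicit `None` test instead. It runs offline, and both the sample hrefs and the helper name `is_social` are made up for illustration:

```python
urls_to_keep = [
    "https://www.facebook.com",
    "https://www.twitter.com",
    "https://www.instagram.com",
]

# Hypothetical hrefs; None stands in for an <a> tag with no href attribute.
hrefs = [
    "https://www.twitter.com/example",
    None,
    "/player/LeBron-James/Summary/250",
]

def is_social(href):
    # Reject missing hrefs first, then keep the link if any URL occurs in it.
    return href is not None and any(word in href for word in urls_to_keep)

kept = [href for href in hrefs if is_social(href)]
print(kept)  # ['https://www.twitter.com/example']
```

`any()` stops at the first match, so later entries in `urls_to_keep` are not checked once one has matched.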

You can look for several values at once. The built-in any() function is used for this.

from bs4 import BeautifulSoup
import requests

def get_profile(url):
    profiles = []
    social_networks = ["https://www.facebook.com", "https://www.twitter.com", "https://www.instagram.com"]
    req = requests.get(url)
    # attrs must be a dict ({'class': ...}); the original used a set literal
    container = BeautifulSoup(req.text, 'html.parser').find('div', attrs={'class': 'main-container'})
    for profile in container.find_all('a'):
        href = profile.get('href')
        # Skip anchors without an href, then keep only social-network links
        if href and any(link in href for link in social_networks):
            profiles.append(href)
    return profiles

print(get_profile('https://basketball.realgm.com/player/Carmelo-Anthony/Summary/452'))
print(get_profile('https://basketball.realgm.com/player/LeBron-James/Summary/250'))

Output:

['https://www.facebook.com/CarmeloAnthony', 'https://www.instagram.com/carmeloanthony']
['https://www.facebook.com/LeBron', 'https://www.instagram.com/kingjames']
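One possible refinement, offered as a sketch rather than something the answers above do: a substring check for "https://www.twitter.com" would miss a link written without "www." or with "http://", and it would match a URL that merely mentions the site in its query string. Comparing the parsed hostname avoids both. The names `SOCIAL_HOSTS` and `is_social_link` and the sample URLs are assumptions made for this example:

```python
from urllib.parse import urlparse

# Assumed set of hostnames to keep; adjust to taste.
SOCIAL_HOSTS = {"facebook.com", "twitter.com", "instagram.com"}

def is_social_link(href):
    """Return True if href points at a host in SOCIAL_HOSTS (with or without 'www.')."""
    if not href:
        return False
    # hostname is None for relative links such as "/player/..."
    host = urlparse(href).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host in SOCIAL_HOSTS

# Hypothetical sample hrefs for illustration.
samples = [
    "https://www.facebook.com/LeBron",
    "http://instagram.com/kingjames",
    "https://basketball.realgm.com/?next=https://www.twitter.com",
    None,
]
print([s for s in samples if is_social_link(s)])
```

With these samples only the first two links are kept. The third mentions Twitter in its query string, but its host is basketball.realgm.com.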
