InvalidSchema: No connection adapters were found for "link"?



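I have a dataset containing several links, and I am trying to get the text of every link with the code below, but I get the error message "InvalidSchema: No connection adapters were found for 'https://en.wikipedia.org/wiki/Wagner_Group'".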

Dataset:

links
'https://en.wikipedia.org/wiki/Wagner_Group'
'https://en.wikipedia.org/wiki/Vladimir_Putin'
'https://en.wikipedia.org/wiki/Islam_in_Russia'

The code I use to scrape the pages is:

import requests
from bs4 import BeautifulSoup

def get_data(url):
    # Download the page and concatenate the text of every <p> element
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    text = ""
    for paragraph in soup.find_all('p'):
        text += paragraph.text
    return text

# Works fine
url = 'https://en.wikipedia.org/wiki/M142_HIMARS'
get_data(url)

# Doesn't work
df['links'].apply(get_data)

Error: InvalidSchema: No connection adapters were found for "'https://en.wikipedia.org/wiki/Wagner_Group'"

Thanks in advance.

It works fine when I apply it to a single URL, but it does not work when I apply it to the DataFrame column.

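The problem is not apply itself. The error message shows the URL wrapped in an extra pair of single quotes ("'https://en.wikipedia.org/wiki/Wagner_Group'"), which means the strings stored in df['links'] contain literal quote characters. Such a string does not start with http:// or https://, so requests cannot find a connection adapter for it and raises InvalidSchema. Building the list of links from clean strings and looping over it avoids the error. You can try the following approach: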

Example:

import requests
from bs4 import BeautifulSoup
import pandas as pd

links = [
    'https://en.wikipedia.org/wiki/Wagner_Group',
    'https://en.wikipedia.org/wiki/Vladimir_Putin',
    'https://en.wikipedia.org/wiki/Islam_in_Russia']

data = []
for url in links:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'lxml')

    # Paragraphs that follow a table inside the article body
    for pra in soup.select('div[class="mw-parser-output"] > table~p'):
        paragraph = pra.get_text(strip=True)
        data.append({
            'paragraph': paragraph
        })
# print(data)
df = pd.DataFrame(data)
print(df)

Output:

paragraph
0    TheWagner Group(Russian:Группа Вагнера,romaniz...
1    The group came to global prominence during the...
2    Because it often operates in support of Russia...
3    The Wagner Group first appeared in Ukraine in ...
4    The Wagner Group itself was first active in 20...
..                                                 ...
440  A record 18,000 Russian Muslim pilgrims from a...
441  For centuries, theTatarsconstituted the only M...
442  A survey published in 2019 by thePew Research ...
443         Percentage of Muslims in Russia by region:
444  According to the 2010 Russian census, Moscow h...
[445 rows x 1 columns]
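
If you would rather keep the DataFrame and keep using apply, one option is to clean the URL strings first. The following is only a minimal sketch: clean_url is a hypothetical helper name, not part of requests or pandas, and it assumes the only problem with the values is surrounding whitespace and quote characters, as the error message suggests.

```python
from urllib.parse import urlparse


def clean_url(raw):
    """Strip surrounding whitespace and quote characters from a URL string.

    Hypothetical helper for illustration; raises ValueError if the result
    is still not an http(s) URL.
    """
    url = str(raw).strip().strip("'\"").strip()
    scheme = urlparse(url).scheme
    if scheme not in ("http", "https"):
        raise ValueError(f"not an http(s) URL: {raw!r}")
    return url


# The value from the error message: the quotes are part of the string itself.
raw = "'https://en.wikipedia.org/wiki/Wagner_Group'"
print(repr(raw))
print(repr(clean_url(raw)))
```

With a helper like this, the original call could become df['links'].apply(clean_url).apply(get_data), so that get_data only ever receives strings that start with http:// or https://.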
