Loop over a pandas DataFrame containing 10 URLs and extract content from each one (BeautifulSoup)

I have a CSV file named "df" with a single column. It contains a header row and 10 URLs.

Col
"http://www.cnn.com"
"http://www.fark.com"
etc 
etc

Here is my broken code:

import bs4 as bs
import pandas as pd
import urllib2

df_link = pd.read_csv('df.csv')
for link in df_link:
    x = urllib2.urlopen(link[0])
    new = x.read()
    # Code does not even get past here as far as I checked
    soup = bs.BeautifulSoup(new, "lxml")
    for text in soup.find_all('a', href=True):
        text.append((text.get('href')))

I get an error that says:

ValueError: unknown url type: C

I have also seen other variants of this error.

The problem is that execution never even gets past this line:

x = urllib2.urlopen(link[0])

On the other hand, this code works:

url = "http://www.cnn.com"
x = urllib2.urlopen(url)
new = x.read()
soup = bs.BeautifulSoup(new, "lxml")
links = []  # collected href values
for link in soup.find_all('a', href=True):
    links.append(link.get('href'))

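Corrected answer

I did not realize you were using pandas, so what I said earlier was not very helpful.

With pandas, the way to do this is to iterate over the rows and pull the URL out of each one. The following should work without removing the header: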

import bs4 as bs
import pandas as pd
import urllib2

df_link = pd.read_csv('df.csv')
links = []  # collected href values from every page
for index, row in df_link.iterrows():
    url = row['Col']  # iterrows() yields (index, row) pairs
    x = urllib2.urlopen(url)
    new = x.read()
    soup = bs.BeautifulSoup(new, "lxml")
    for a in soup.find_all('a', href=True):
        links.append(a.get('href'))
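
The code above targets Python 2 (urllib2). For readers on Python 3, here is a minimal sketch of the same idea, not a definitive implementation. It assumes urllib.request in place of urllib2 and the built-in "html.parser" in place of "lxml"; the helper names extract_links and collect_links, the fetch parameter, and the in-memory csv_text standing in for df.csv are all hypothetical and introduced only for illustration.

```python
import io
from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup


def extract_links(html):
    """Return the href of every <a> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


def collect_links(urls, fetch=None):
    """Map each URL to the list of links found on that page."""
    if fetch is None:
        # Default fetcher (assumption): download the page body over the network.
        def fetch(url):
            with urlopen(url) as response:
                return response.read()
    return {url: extract_links(fetch(url)) for url in urls}


# Stand-in for df.csv so the sketch runs without the real file.
csv_text = 'Col\n"http://www.cnn.com"\n"http://www.fark.com"\n'
df_link = pd.read_csv(io.StringIO(csv_text))

# Selecting the column yields the URL strings, never the header.
urls = df_link["Col"].tolist()
```

Calling collect_links(urls) would then fetch each page in turn. Passing a custom fetch function makes it possible to try the parsing step without network access.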

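Original misleading answer below

It looks like the header of your CSV file is not being handled separately, so on the first iteration over df_link, link[0] is "Col", which is not a valid URL.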
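
The error message reports "C" rather than "Col", which suggests a slightly different cause: iterating over a DataFrame directly yields its column labels, not its rows, so link is the string "Col" and link[0] is its first character. The sketch below illustrates this under a few assumptions: it uses Python 3 (urllib.request instead of urllib2) and an in-memory csv_text as a stand-in for df.csv.

```python
import io
from urllib.request import urlopen

import pandas as pd

# Stand-in for df.csv (assumption): a header plus two of the URLs.
csv_text = 'Col\n"http://www.cnn.com"\n"http://www.fark.com"\n'
df_link = pd.read_csv(io.StringIO(csv_text))

# Iterating over the DataFrame itself walks the column labels.
labels = [link for link in df_link]

# Indexing a label with [0] gives its first character.
first_chars = [link[0] for link in df_link]

# Selecting the column gives the actual URL values.
urls = list(df_link["Col"])

# Passing that single character to urlopen fails before any request is made.
error = None
try:
    urlopen(first_chars[0])
except ValueError as exc:
    error = str(exc)

print(labels, first_chars, urls, error)
```

Seen this way, the header is parsed correctly by read_csv; the loop simply never reaches the rows.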
