Downloading multiple URLs in parallel with pd.read_html in Pandas

I know I can download a CSV file from a web page like this:

import pandas as pd
import numpy as np
from io import StringIO    
URL = "http://www.something.com"
data = pd.read_html(URL)[0].to_csv(index=False, header=True)
file = pd.read_csv(StringIO(data), sep=',')

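Now I would like to do the same for several URLs at once, like opening different tabs in a browser. In other words, I am looking for a way to parallelize the download across different URLs instead of looping over them one at a time. My idea was to put the list of URLs in a dataframe and then create a new column holding the "data" string, one string per URL.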

list_URL = ["http://www.something.com", "http://www.something2.com", 
"http://www.something3.com"]
df = pd.DataFrame(list_URL, columns =['URL'])    
df['data'] = pd.read_html(df['URL'])[0].to_csv(index=False, header=True)

But it gives me the error: cannot parse from 'Series'

Is there a better syntax, or does this mean I cannot do this for multiple URLs in parallel?

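pd.read_html expects a single URL, path, or file-like object, not a Series, so apply it to each element of the column with map. Note that map still fetches the URLs one after another; it fixes the error but is not parallel. You can try it like this: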

import pandas as pd
URLS = [
    "https://en.wikipedia.org/wiki/Periodic_table#Presentation_forms",
    "https://en.wikipedia.org/wiki/Planet#Planetary_attributes",
]
df = pd.DataFrame(URLS, columns=["URL"])
df["data"] = df["URL"].map(
    lambda x: pd.read_html(x)[0].to_csv(index=False, header=True)
)

print(df)
# Output
                                           URL                                         data
0  https://en.wikipedia.org/wiki/Periodic_t...  0\r\nPart of a series on the\r\nPeriodic...
1  https://en.wikipedia.org/wiki/Planet#Pla...  0\r\n"The eight known planets of the Sol...
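
If the downloads really need to overlap, as the question asks, one option is a thread pool from the standard library. The following is a minimal sketch, not part of the original answer: the helper names `first_table_as_csv` and `fetch_all` and the `max_workers=8` default are assumptions chosen for illustration, and it assumes every page contains at least one HTML table.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def first_table_as_csv(url):
    """Download the first HTML table found at `url` and return it as CSV text."""
    return pd.read_html(url)[0].to_csv(index=False, header=True)


def fetch_all(urls, fetch=first_table_as_csv, max_workers=8):
    """Run `fetch` on every URL concurrently, keeping results in input order.

    `fetch` and `max_workers` are hypothetical parameters for this sketch.
    """
    urls = list(urls)
    if not urls:
        return []
    # Threads suit this job because each call mostly waits on the network.
    workers = min(max_workers, len(urls))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the same order as `urls`.
        return list(pool.map(fetch, urls))
```

With the dataframe from the answer, this would be used as `df["data"] = fetch_all(df["URL"])`. If any single download raises, the exception propagates when the results are collected, so a production version would probably want per-URL error handling.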
