How do I scrape the web by manipulating URLs? Python 3.5



I want to scrape a stock data table from a website. In my code I generate an array of stock symbols. The finviz site uses the last part of the URL to generate the table for each specific stock (i.e. https://finviz.com/quote.ashx?t=MBOT for the ticker MBOT). I want to feed the generated array in as the final part of each URL (i.e. /quote.ashx?t=MBOT), scrape the output table from each URL, and then write the scraped information to a CSV file (in this case titled "output.csv"). Here is my code:

import csv
import urllib.request
from bs4 import BeautifulSoup
twiturl = "https://twitter.com/ACInvestorBlog"
twitpage = urllib.request.urlopen(twiturl)
soup = BeautifulSoup(twitpage,"html.parser")
print(soup.title.text)
tweets = [i.text for i in soup.select('a.twitter-cashtag.pretty-link.js-nav b')]
print(tweets)
url_base = "https://finviz.com/quote.ashx?t="
url_list = [url_base + tckr for tckr in tweets]
fpage = urllib.request.urlopen(url_list)
fsoup = BeautifulSoup(fpage, 'html.parser')
with open('output.csv', 'wt') as file:
    writer = csv.writer(file)
    # write header row
    writer.writerow(map(lambda e : e.text, fsoup.find_all('td', {'class':'snapshot-td2-cp'})))
    # write body row
    writer.writerow(map(lambda e : e.text, fsoup.find_all('td', {'class':'snapshot-td2'}))) 

Here is my error output:

"C:UsersTaylor .DESKTOP-0SBM378venvhelloworldScriptspython.exe" "C:/Users/Taylor .DESKTOP-0SBM378/PycharmProjects/helloworld/helloworld"
Antonio Costa (@ACInvestorBlog) | Twitter
Traceback (most recent call last):
['LINU', 'FOSL', 'LINU', 'PETZ', 'NETE', 'DCIX', 'DCIX', 'KDMN', 'KDMN', 'LINU', 'CNET', 'AMD', 'CNET', 'AMD', 'NETE', 'NETE', 'AAPL', 'PETZ', 'CNET', 'PETZ', 'PETZ', 'MNGA', 'KDMN', 'CNET', 'ITUS', 'CNET']
  File "C:/Users/Taylor .DESKTOP-0SBM378/PycharmProjects/helloworld/helloworld", line 17, in <module>
    fpage = urllib.request.urlopen(url_list)
  File "C:UsersTaylor .DESKTOP-0SBM378AppDataLocalProgramsPythonPython36-32Liburllibrequest.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "C:UsersTaylor .DESKTOP-0SBM378AppDataLocalProgramsPythonPython36-32Liburllibrequest.py", line 517, in open
    req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
Process finished with exit code 1

You're passing a list to urllib.request.urlopen() instead of a string, and that's the whole problem! So you're already very close.

To open all of the different URLs, just use a for loop:

for url in url_list:
    fpage = urllib.request.urlopen(url)
    fsoup = BeautifulSoup(fpage, 'html.parser')
    #scrape single page and add data to list
with open('output.csv', 'wt') as file:
    writer = csv.writer(file)
    #write datalist
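
Putting those pieces together, a minimal sketch might look like the following. It assumes the header cells (snapshot-td2-cp) are identical on every finviz quote page and that the site accepts plain urllib requests; the tickers shown are only placeholders for the list scraped from Twitter.

import csv
import urllib.request
from bs4 import BeautifulSoup

url_base = "https://finviz.com/quote.ashx?t="
tweets = ['LINU', 'FOSL', 'PETZ']  # placeholder; use the list scraped from Twitter
url_list = [url_base + tckr for tckr in set(tweets)]  # set() drops duplicate tickers

header = None
rows = []
for url in url_list:
    fpage = urllib.request.urlopen(url)          # one request per ticker
    fsoup = BeautifulSoup(fpage, 'html.parser')
    if header is None:
        # capture the header row once, assuming it is the same on every page
        header = [e.text for e in fsoup.find_all('td', {'class': 'snapshot-td2-cp'})]
    rows.append([e.text for e in fsoup.find_all('td', {'class': 'snapshot-td2'})])

with open('output.csv', 'wt', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(header)
    writer.writerows(rows)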

You're passing a list to the urlopen method. Try the following; it will retrieve the data from the first URL only:

fpage = urllib.request.urlopen(url_list[0])
fsoup = BeautifulSoup(fpage, 'html.parser')
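
If you then want to walk the whole list rather than just the first element, one possible variant is to wrap each request in a try/except so a single rejected ticker does not abort the run (urllib.error.HTTPError is the exception urllib raises for HTTP error responses; skipping and continuing is just one policy you could choose):

import urllib.error
import urllib.request
from bs4 import BeautifulSoup

for url in url_list:
    try:
        fpage = urllib.request.urlopen(url)
    except urllib.error.HTTPError as err:
        # skip tickers the server rejects (e.g. unknown symbols) and keep going
        print('skipping', url, ':', err)
        continue
    fsoup = BeautifulSoup(fpage, 'html.parser')
    # extract the snapshot-td2 cells here, as in the snippets above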
