Scraping a website's XML daily to update a CSV file in Python



I know many versions of this basic question have been asked, but I haven't found anything that helps with this particular project.

https://www.treasury.gov/resource-center/data-chart-center/interest-rates/datasets/yield.xml

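I need to scrape this site to get the daily Treasury yields and write them to a CSV file with headers.

I need to repeat this once a day so the CSV is always up to date.

I'm working in Python 3.6.3.

So far I have gotten the headers to write, and I can parse the XML in Python, but I can't get the XML data written to the CSV.

I used this tutorial as a guide and was able to get the headers to write.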

https://www.digitalocean.com/community/tutorials/how-to-scrape-web-pages-with-beautiful-soup-soup-and-python-3

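Other posts here on Stack Exchange then helped me read the XML, but bridging the two (writing the XML headers and data to a CSV, and then keeping it updated) is where I'm stuck.

Here is the current code, such as it is.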

# Import libraries
import csv
import requests
from bs4 import BeautifulSoup
# Write the header row; newline='' avoids blank lines between rows on Windows
with open('treasury_yieldsV5.csv', 'w', newline='') as out:
    f = csv.writer(out)
    f.writerow(['Date', '1 Mo', '3 Mo', '6 Mo', '1 Yr', '2 Yr', '3 Yr', '5 Yr', '7 Yr', '10 Yr', '20 Yr', '30 Yr'])
# Parse a locally saved copy of the feed and print the text of each entry
with open("yield.xml", "r") as infile:
    contents = infile.read()
soup = BeautifulSoup(contents, 'xml')
titles = soup.find_all('m:properties')
for title in titles:
    print(title.get_text())
print(soup.prettify())

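Consider XSLT, the special-purpose language designed to transform XML files into other XML, HTML, or even text files (CSV/TAB/JSON). With Python's lxml module you can run XSLT 1.0 scripts and avoid any for loops. Otherwise, have Python call a dedicated third-party XSLT processor such as Saxon or Xalan; xsltproc on Linux/Mac; or, on Windows, .NET's System.Xml.Xsl via PowerShell.

XSLT (save as an .xsl file, which is a special kind of .xml file)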
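
As a rough sketch of the external-processor option, Python can hand the transformation to a command-line tool through `subprocess`. This assumes `xsltproc` is installed and on the PATH; the file names and the helper names `build_xsltproc_command` and `run_xsltproc` are hypothetical, chosen only for illustration.

```python
import subprocess


def build_xsltproc_command(xsl_path, xml_path, out_path):
    """Build the argument list for: xsltproc -o OUT STYLESHEET XML."""
    return ["xsltproc", "-o", out_path, xsl_path, xml_path]


def run_xsltproc(xsl_path, xml_path, out_path):
    """Run xsltproc; raises CalledProcessError if the tool reports a failure."""
    cmd = build_xsltproc_command(xsl_path, xml_path, out_path)
    subprocess.run(cmd, check=True)


# Example call (hypothetical file names; yield.xml is a locally saved copy):
# run_xsltproc("TreasuryYields.xsl", "yield.xml", "TreasuryYields.csv")
```

Passing the command as a list rather than a single string avoids shell quoting problems with paths that contain spaces.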

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="text"/>
  <xsl:strip-space elements="*"/>
  <xsl:param name="delim">,</xsl:param>
  <xsl:param name="quote">"</xsl:param>
  <xsl:template match="/QR_BC_CM">
       <!-- HEADERS -->
       <xsl:value-of select="concat($quote, 'Date', $quote, $delim, $quote, '1 Mo', $quote, $delim, $quote, '3 Mo', $quote, $delim,  
                                    $quote, '6 Mo', $quote, $delim, $quote, '1 Yr', $quote, $delim, $quote, '2 Yr', $quote, $delim,  
                                    $quote, '3 Yr', $quote, $delim, $quote, '5 Yr', $quote, $delim, $quote, '7 Yr', $quote, $delim,  
                                    $quote, '10 Yr', $quote, $delim, $quote, '20 Yr', $quote, $delim, $quote, '30 Yr', $quote)"/><xsl:text>&#xa;</xsl:text>
       <xsl:apply-templates select="LIST_G_WEEK_OF_MONTH"/>
  </xsl:template> 
  <xsl:template match="LIST_G_WEEK_OF_MONTH|LIST_G_NEW_DATE|LIST_G_BC_CAT">
       <xsl:apply-templates select="*"/>
  </xsl:template>
  <xsl:template match="G_WEEK_OF_MONTH">
       <xsl:apply-templates select="LIST_G_NEW_DATE"/>
  </xsl:template>
  <xsl:template match="G_NEW_DATE">
       <xsl:apply-templates select="LIST_G_BC_CAT"/>
  </xsl:template>
  <xsl:template match="G_BC_CAT">
        <!-- DATA ROWS -->
        <xsl:value-of select="concat($quote, ancestor::G_NEW_DATE/BID_CURVE_DATE, $quote, $delim,
                                     $quote, BC_1MONTH, $quote, $delim, $quote, BC_3MONTH, $quote, $delim,
                                     $quote, BC_6MONTH, $quote, $delim, $quote, BC_1YEAR, $quote, $delim,
                                     $quote, BC_2YEAR, $quote, $delim, $quote, BC_3YEAR, $quote, $delim,
                                     $quote, BC_5YEAR, $quote, $delim, $quote, BC_7YEAR, $quote, $delim,
                                     $quote, BC_10YEAR, $quote, $delim, $quote, BC_20YEAR, $quote, $delim,
                                     $quote, BC_30YEAR, $quote)"/><xsl:text>&#xa;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
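
To make the traversal in the stylesheet above easier to follow, here is roughly the same logic in plain Python using only the standard library. The element nesting is inferred from the stylesheet's template matches, not checked against the live feed, so treat it as an assumption; `yields_to_csv`, `HEADERS` and `FIELDS` are names made up for this sketch.

```python
import csv
import io
import xml.etree.ElementTree as ET

HEADERS = ["Date", "1 Mo", "3 Mo", "6 Mo", "1 Yr", "2 Yr", "3 Yr",
           "5 Yr", "7 Yr", "10 Yr", "20 Yr", "30 Yr"]
FIELDS = ["BC_1MONTH", "BC_3MONTH", "BC_6MONTH", "BC_1YEAR", "BC_2YEAR",
          "BC_3YEAR", "BC_5YEAR", "BC_7YEAR", "BC_10YEAR", "BC_20YEAR",
          "BC_30YEAR"]


def yields_to_csv(xml_text):
    """Return CSV text with one quoted row per G_BC_CAT element."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(HEADERS)
    # Assumed layout: each G_NEW_DATE holds a BID_CURVE_DATE plus nested
    # G_BC_CAT elements carrying the BC_* yield values.
    for day in root.iter("G_NEW_DATE"):
        date = day.findtext("BID_CURVE_DATE", default="")
        for cat in day.iter("G_BC_CAT"):
            values = [cat.findtext(name, default="") for name in FIELDS]
            writer.writerow([date] + values)
    return out.getvalue()
```

Missing or empty yield elements come out as empty strings, which matches how the stylesheet renders a day with no data.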

Python (reads directly from the URL and transforms the XML to CSV)

import requests as rq
import lxml.etree as et
# RETRIEVE WEB CONTENT
data = rq.get("https://www.treasury.gov/resource-center/data-chart-center/interest-rates/Datasets/yield.xml")
# LOAD XML AND XSL FILES
# Pass bytes (data.content): lxml rejects a str that carries an XML encoding declaration
doc = et.fromstring(data.content)
xsl = et.parse("TreasuryYields.xsl")
# TRANSFORM XML
transformer = et.XSLT(xsl)
result = transformer(doc)
# OUTPUT TO CONSOLE AND FILE
print(str(result))
with open("TreasuryYields.csv", 'w') as f:
    f.write(str(result))

Output (Thanksgiving is a US federal holiday, so there are no yields for that day)

"Date","1 Mo","3 Mo","6 Mo","1 Yr","2 Yr","3 Yr","5 Yr","7 Yr","10 Yr","20 Yr","30 Yr"
"01-NOV-17","1.06","1.18","1.3","1.46","1.61","1.74","2.01","2.22","2.37","2.63","2.85"
"02-NOV-17","1.02","1.17","1.29","1.46","1.61","1.73","2","2.21","2.35","2.61","2.83"
"03-NOV-17","1.02","1.18","1.31","1.49","1.63","1.74","1.99","2.19","2.34","2.59","2.82"
"06-NOV-17","1.03","1.19","1.3","1.5","1.61","1.73","1.99","2.17","2.32","2.58","2.8"
"07-NOV-17","1.05","1.22","1.33","1.49","1.63","1.75","1.99","2.17","2.32","2.56","2.77"
"08-NOV-17","1.05","1.23","1.35","1.53","1.65","1.77","2.01","2.19","2.32","2.57","2.79"
"09-NOV-17","1.07","1.24","1.36","1.53","1.63","1.75","2.01","2.2","2.33","2.59","2.81"
"10-NOV-17","1.06","1.23","1.37","1.54","1.67","1.79","2.06","2.27","2.4","2.67","2.88"
"13-NOV-17","1.07","1.24","1.37","1.55","1.7","1.82","2.08","2.27","2.4","2.67","2.87"
"14-NOV-17","1.06","1.26","1.4","1.55","1.68","1.81","2.06","2.26","2.38","2.64","2.84"
"15-NOV-17","1.08","1.25","1.39","1.55","1.68","1.79","2.04","2.21","2.33","2.58","2.77"
"16-NOV-17","1.08","1.27","1.42","1.59","1.72","1.83","2.07","2.25","2.37","2.62","2.81"
"17-NOV-17","1.08","1.29","1.42","1.6","1.73","1.83","2.06","2.23","2.35","2.59","2.78"
"20-NOV-17","1.09","1.3","1.46","1.62","1.77","1.86","2.09","2.26","2.37","2.6","2.78"
"21-NOV-17","1.15","1.3","1.45","1.62","1.77","1.88","2.11","2.27","2.36","2.58","2.76"
"22-NOV-17","1.16","1.29","1.45","1.61","1.74","1.84","2.05","2.22","2.32","2.57","2.75"
"23-NOV-17","","","","","","","","","","",""
"24-NOV-17","1.14","1.29","1.45","1.61","1.75","1.85","2.07","2.23","2.34","2.58","2.76"

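
The question also asks for the CSV to stay current, while a script that opens the file in 'w' mode overwrites it on every run. One possible way to handle that, sketched here with the standard library only, is to merge each day's CSV text into the existing file keyed on the Date column. The function name `merge_csv_text` is hypothetical, and the sketch assumes the first column uniquely identifies a row.

```python
import csv
import io
import os


def merge_csv_text(csv_text, csv_path):
    """Merge rows from csv_text into csv_path, keyed on the first column.

    Returns the number of dates that were not in the file before.
    """
    new_rows = list(csv.reader(io.StringIO(csv_text)))
    if not new_rows:
        return 0
    header, new_rows = new_rows[0], new_rows[1:]

    # Load rows already on disk, keyed by date. Dicts keep insertion order
    # in Python 3.7+ (and in CPython 3.6 as an implementation detail).
    merged = {}
    if os.path.exists(csv_path):
        with open(csv_path, newline="") as f:
            existing = list(csv.reader(f))
        for row in existing[1:]:
            if row:
                merged[row[0]] = row

    added = 0
    for row in new_rows:
        if not row:
            continue
        if row[0] not in merged:
            added += 1
        merged[row[0]] = row  # the latest download wins for a repeated date

    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        writer.writerow(header)
        writer.writerows(merged.values())
    return added
```

In a script that produces the CSV as a string, a call such as `merge_csv_text(str(result), "TreasuryYields.csv")` could take the place of the plain file write. Running it once a day is then left to a scheduler such as cron or Windows Task Scheduler.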