如何在HTML表中刮擦所有值,包括跨度和HREF



我有以下HTML代码:

<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>
<body>
    <table border="1" cellspacing="0" class="Quote xaltrow" id="MainContent_Quote1_Table1_Table1" style="border-collapse:collapse;border-collapse:collapse;">
        <thead>
            <tr class="xheader">
                <th>
                    <span>Sym</span> <a class="ToggleNames" href="/Analytics/MostActive.aspx">-Names</a> <span class="arrow"></span>
                </th>
                <th colspan="3">Bid - Ask</th>
                <th>Last <span class="arrow"></span></th>
                <th>Chg <span class="arrow"></span></th>
                <th>%Ch <span class="arrow"></span></th>
                <th>Vol <span class="arrow"></span></th>
                <th>$Vol <span class="arrow"></span></th>
                <th>#Tr <span class="arrow"></span></th>
                <th>Open-Hi-Lo</th>
                <th>Year Hi-Lo</th>
                <th>Last Tr</th>
                <th>News</th>
                <th>Delay</th>
            </tr>
        </thead>
        <tbody>
            <tr class="Upd UpdURHHBY-">
                <td class="sym">
                    <a class="qn Name" href="/Quote/Detail.aspx?symbol=RHHBY&amp;region=U">RHHBY</a> <span>- Q</span> <span class="Name">- ROCHE HLDG LTD SPONS</span>
                </td>
                <td class="bac" colspan="3">no orders</td>
                <td class="UpdL">31.92</td>
                <td class=" xred UpdC">-0.31</td>
                <td class="xsmall UpdCP xred">-1.0</td>
                <td class="q-regright UpdV">851.0</td>
                <td class="q-smright UpdW">27,163</td>
                <td class="xsmall UpdT">1,461</td>
                <td class="xsmall xcentre"><span class="UpdO">32.03</span>&nbsp;&nbsp;<span class="UpdH">32.067</span>&nbsp;&nbsp;<span class="UpdI">31.84</span></td>
                <td class="xsmall xcentre">33.74&nbsp;&nbsp;27.09</td>
                <td class="xsmall xcentre UpdE"></td>
                <td class="xsmall xcentre"></td>
                <td class="xsmall xcentre">realtime</td>
            </tr>
            <tr class="Upd UpdUNSRGY-">
                <td class="sym">
                    <a class="qn Name" href="/Quote/Detail.aspx?symbol=NSRGY&amp;region=U">NSRGY</a> <span>- Q</span> <span class="Name">- NESTLE SA REG SHRS S</span>
                </td>
                <td class="bac" colspan="3">no orders</td>
                <td class="UpdL">76.07</td>
                <td class=" xred UpdC">-0.23</td>
                <td class="xsmall UpdCP xred">-0.3</td>
                <td class="q-regright UpdV">336.2</td>
                <td class="q-smright UpdW">25,574</td>
                <td class="xsmall UpdT">1,785</td>
                <td class="xsmall xcentre"><span class="UpdO">75.89</span>&nbsp;&nbsp;<span class="UpdH">76.07</span>&nbsp;&nbsp;<span class="UpdI">75.66</span></td>
                <td class="xsmall xcentre">83.00&nbsp;&nbsp;66.28</td>
                <td class="xsmall xcentre UpdE"></td>
                <td class="xsmall xcentre"></td>
                <td class="xsmall xcentre">realtime</td>
            </tr>
            <tr class="Upd UpdUNTTYY-">
                <td class="sym">
                    <a class="qn Name" href="/Quote/Detail.aspx?symbol=NTTYY&amp;region=U">NTTYY</a> <span>- Q</span> <span class="Name">- NIPPON TELEGRAPH AND TELEPHONE C</span>
                </td>
                <td class="bac" colspan="3">no orders</td>
                <td class="UpdL">43.90</td>
                <td class=" xred UpdC">-0.56</td>
                <td class="xsmall UpdCP xred">-1.3</td>
                <td class="q-regright UpdV">316.2</td>
                <td class="q-smright UpdW">13,883</td>
                <td class="xsmall UpdT">889</td>
                <td class="xsmall xcentre"><span class="UpdO">44.145</span>&nbsp;&nbsp;<span class="UpdH">44.15</span>&nbsp;&nbsp;<span class="UpdI">43.89</span></td>
                <td class="xsmall xcentre">44.57&nbsp;&nbsp;43.00</td>
                <td class="xsmall xcentre UpdE"></td>
                <td class="xsmall xcentre"></td>
                <td class="xsmall xcentre">realtime</td>
            </tr>
            <tr class="Upd UpdUTCEHY-">
                <td class="sym">
                    <a class="qn Name" href="/Quote/Detail.aspx?symbol=TCEHY&amp;region=U">TCEHY</a> <span>- Q</span> <span class="Name">- TENCENT HOLDINGS ADR</span>
                </td>
                <td class="bac" colspan="3">no orders</td>
                <td class="UpdL">29.63</td>
                <td class=" xgreen UpdC">+0.06</td>
                <td class="xsmall UpdCP xgreen">0.2</td>
                <td class="q-regright UpdV">380.1</td>
                <td class="q-smright UpdW">11,263</td>
                <td class="xsmall UpdT">1,341</td>
                <td class="xsmall xcentre"><span class="UpdO">29.65</span>&nbsp;&nbsp;<span class="UpdH">29.78</span>&nbsp;&nbsp;<span class="UpdI">29.60</span></td>
                <td class="xsmall xcentre">29.85&nbsp;&nbsp;19.74</td>
                <td class="xsmall xcentre UpdE"></td>
                <td class="xsmall xcentre"></td>
                <td class="xsmall xcentre">realtime</td>
            </tr>
            <tr class="Upd UpdUATLKY-">
                <td class="sym">
                    <a class="qn Name" href="/Quote/Detail.aspx?symbol=ATLKY&amp;region=U">ATLKY</a> <span>- Q</span> <span class="Name">- ATLAS COPCO AB SER A</span>
                </td>
                <td class="bac" colspan="3">no orders</td>
                <td class="UpdL">35.46</td>
                <td class=" xred UpdC">-0.23</td>
                <td class="xsmall UpdCP xred">-0.6</td>
                <td class="q-regright UpdV">316.2</td>
                <td class="q-smright UpdW">11,213</td>
                <td class="xsmall UpdT">209</td>
                <td class="xsmall xcentre"><span class="UpdO">35.74</span>&nbsp;&nbsp;<span class="UpdH">35.81</span>&nbsp;&nbsp;<span class="UpdI">35.46</span></td>
                <td class="xsmall xcentre">35.72&nbsp;&nbsp;23.58</td>
                <td class="xsmall xcentre UpdE"></td>
                <td class="xsmall xcentre"></td>
                <td class="xsmall xcentre">realtime</td>
            </tr>
            <tr class="Upd UpdUVLKAY-">
                <td class="sym">
                    <a class="qn Name" href="/Quote/Detail.aspx?symbol=VLKAY&amp;region=U">VLKAY</a> <span>- Q</span> <span class="Name">- VOLKSWAGEN A G SPONS</span>
                </td>
                <td class="bac" colspan="3">no orders</td>
                <td class="UpdL">29.15</td>
                <td class=" xred UpdC">-0.34</td>
                <td class="xsmall UpdCP xred">-1.2</td>
                <td class="q-regright UpdV">323.6</td>
                <td class="q-smright UpdW">9,432</td>
                <td class="xsmall UpdT">782</td>
                <td class="xsmall xcentre"><span class="UpdO">28.935</span>&nbsp;&nbsp;<span class="UpdH">29.25</span>&nbsp;&nbsp;<span class="UpdI">28.90</span></td>
                <td class="xsmall xcentre">33.60&nbsp;&nbsp;25.88</td>
                <td class="xsmall xcentre UpdE"></td>
                <td class="xsmall xcentre"></td>
                <td class="xsmall xcentre">realtime</td>
            </tr>
            <tr class="Upd UpdUTMICY-">
                <td class="sym">
                    <a class="qn Name" href="/Quote/Detail.aspx?symbol=TMICY&amp;region=U">TMICY</a> <span>- Q</span> <span class="Name">- TREND MICRO ADR #</span>
                </td>
                <td class="bac" colspan="3">no orders</td>
                <td class="UpdL">42.78</td>
                <td class=" xred UpdC">-0.64</td>
                <td class="xsmall UpdCP xred">-1.5</td>
                <td class="q-regright UpdV">210.6</td>
                <td class="q-smright UpdW">9,011</td>
                <td class="xsmall UpdT">155</td>
                <td class="xsmall xcentre"><span class="UpdO">42.905</span>&nbsp;&nbsp;<span class="UpdH">42.93</span>&nbsp;&nbsp;<span class="UpdI">42.78</span></td>
                <td class="xsmall xcentre">44.75&nbsp;&nbsp;32.04</td>
                <td class="xsmall xcentre UpdE"></td>
                <td class="xsmall xcentre"></td>
                <td class="xsmall xcentre">realtime</td>
            </tr>
            <tr class="Upd UpdUALIOY-">
                <td class="sym">
                    <a class="qn Name" href="/Quote/Detail.aspx?symbol=ALIOY&amp;region=U">ALIOY</a> <span>- Q</span> <span class="Name">- ACTELION LTD</span>
                </td>
                <td class="bac" colspan="3">no orders</td>
                <td class="UpdL">70.66</td>
                <td class=" xgreen UpdC">+0.06</td>
                <td class="xsmall UpdCP xgreen">0.1</td>
                <td class="q-regright UpdV">123.3</td>
                <td class="q-smright UpdW">8,715</td>
                <td class="xsmall UpdT">56</td>
                <td class="xsmall xcentre"><span class="UpdO">70.538</span>&nbsp;&nbsp;<span class="UpdH">70.70</span>&nbsp;&nbsp;<span class="UpdI">70.50</span></td>
                <td class="xsmall xcentre">70.89&nbsp;&nbsp;34.83</td>
                <td class="xsmall xcentre UpdE"></td>
                <td class="xsmall xcentre"></td>
                <td class="xsmall xcentre">realtime</td>
            </tr>
        </tbody>
    </table>
</body>
</html>

这是我抓取桌子的代码:

import lxml.html
response = open('test.html')
html2 = response.read()
root = lxml.html.fromstring(html2)
for row in root.xpath('//*[@id="MainContent_Quote1_Table1_Table1"]/tbody/tr'):
    cells = row.xpath('.//td/text()')
    print cells

这是结果:

['no orders', '31.92', '-0.31', '-1.0', '851.0', '27,163', '1,461', u'xa0xa0', u'xa0xa0', u'33.74xa0xa027.09', 'realtime']
['no orders', '76.07', '-0.23', '-0.3', '336.2', '25,574', '1,785', u'xa0xa0', u'xa0xa0', u'83.00xa0xa066.28', 'realtime']
['no orders', '43.90', '-0.56', '-1.3', '316.2', '13,883', '889', u'xa0xa0', u'xa0xa0', u'44.57xa0xa043.00', 'realtime']
['no orders', '29.63', '+0.06', '0.2', '380.1', '11,263', '1,341', u'xa0xa0', u'xa0xa0', u'29.85xa0xa019.74', 'realtime']
['no orders', '35.46', '-0.23', '-0.6', '316.2', '11,213', '209', u'xa0xa0', u'xa0xa0', u'35.72xa0xa023.58', 'realtime']
['no orders', '29.15', '-0.34', '-1.2', '323.6', '9,432', '782', u'xa0xa0', u'xa0xa0', u'33.60xa0xa025.88', 'realtime']
['no orders', '42.78', '-0.64', '-1.5', '210.6', '9,011', '155', u'xa0xa0', u'xa0xa0', u'44.75xa0xa032.04', 'realtime']
['no orders', '70.66', '+0.06', '0.1', '123.3', '8,715', '56', u'xa0xa0', u'xa0xa0', u'70.89xa0xa034.83', 'realtime']

我希望它成为以下内容:

['RHHBY', 'ROCHE HLDG LTD SPONS', 'no orders', '31.92', '-0.31', '-1.0', '851.0', '27,163', '1,461', '32.03', '32.067', '31.84', '33.74', '27.09', '', '', 'realtime']
['NSRGY', 'NESTLE SA REG SHRS S', 'no orders', '76.07', '-0.23', '-0.3', '336.2', '25,574', '1,785', '75.89', '76.07', '75.66', '83.00', '66.28', '', '', 'realtime']
['NTTYY', 'NIPPON TELEGRAPH AND TELEPHONE C', 'no orders', '43.90', '-0.56', '-1.3', '316.2', '13,883', '889', '44.145', '44.15', '43.89', '44.57', '43.00', 'realtime']
...

如何在<td>的跨度和/或<a>'S?

中获得值

这张桌子可能很大。我希望它是上面的快速脚本。

我打算将此数组进行此数组,并将其写入数据库或CSV。

不知道这是否是您想要的,但是您可以在//td/spana元素中获取文本:

import lxml.html
response = open('test.html')
html2 = response.read()
root = lxml.html.fromstring(html2)
for row in root.xpath('//*[@id="MainContent_Quote1_Table1_Table1"]/tbody/tr'):
    cells=[]
    cells = row.xpath('.//td/a/text()')
    cells = cells + row.xpath('.//td/text()')
    cells = cells + row.xpath('.//td/span/text()')
    print(cells)

要消除格式,您可以使用以下内容:

print([c.replace('xa0','') for c in cells])

最新更新