美丽的汤网页刮刀



我正在尝试使用以下网址抓取网页https://www.bseindia.com/corporates/shpSecurities.aspx?scripcd=500209&qtrid=96.00

我想用以下 html 代码抓取一个表格。我尝试了几件事,但无法实现所需的表格以插入 csv。在这里,<"tr"> 标记没有为数据关闭,因此将数据隔离到不同的行中是一个问题。

感谢您的帮助--J

<table border='0' width='900' align='center' cellspacing='1' cellpadding='4'>
                <tr>
                    <td class='innertable_header1' rowspan='3'>Category of shareholder</td>
                    <td class='innertable_header1' rowspan='3'>Nos. of shareholders</td>
                    <td class='innertable_header1' rowspan='3'>No. of fully paid up equity shares held</td>
                    <td class='innertable_header1' rowspan='3'>No. of shares underlying Depository Receipts</td>
                    <td class='innertable_header1' rowspan='3'>Total nos. shares held</td>
                    <td class='innertable_header1' rowspan='3'>Shareholding as a % of total no. of shares (calculated as per SCRR, 1957)As a % of (A+B+C2)</td>
                    <td class='innertable_header1' rowspan='3'> Number of equity shares held in dematerialized form</td>
                </tr>
                <tr></tr>
                <tr></tr>
                <tr>
                    <td class='TTRow_left'>(A) Promoter & Promoter Group</td>
                    <td class='TTRow_right'>19</td>
                    <td class='TTRow_right'>28,17,02,889</td>
                    <td class='TTRow_right'></td>
                    <td class='TTRow_right'>28,17,02,889</td>
                    <td class='TTRow_right'>12.90</td>
                    <td class='TTRow_right'>28,17,02,889</td>
                    <tr>
                        <td class='TTRow_left'>(B) Public</td>
                        <td class='TTRow_right'>9,16,058</td>
                        <td class='TTRow_right'>1,87,81,45,362</td>
                        <td class='TTRow_right'>1,32,95,642</td>
                        <td class='TTRow_right'>1,89,14,41,004</td>
                        <td class='TTRow_right'>86.61</td>
                        <td class='TTRow_right'>1,88,74,40,959</td>
                        <tr>
                            <td class='TTRow_left'>(C1) Shares underlying DRs</td>
                            <td class='TTRow_right'></td>
                            <td class='TTRow_right'></td>
                            <td class='TTRow_right'></td>
                            <td class='TTRow_right'></td>
                            <td class='TTRow_right'>0.00</td>
                            <td class='TTRow_right'></td>
                            <tr>
                                <td class='TTRow_left'>(C2) Shares held by Employee Trust</td>
                                <td class='TTRow_right'>1</td>
                                <td class='TTRow_right'>1,08,05,896</td>
                                <td class='TTRow_right'></td>
                                <td class='TTRow_right'>1,08,05,896</td>
                                <td class='TTRow_right'>0.49</td>
                                <td class='TTRow_right'>1,08,05,896</td>
                                <tr>
                                    <td class='TTRow_left'>(C) Non Promoter-Non Public</td>
                                    <td class='TTRow_right'>1</td>
                                    <td class='TTRow_right'>1,08,05,896</td>
                                    <td class='TTRow_right'></td>
                                    <td class='TTRow_right'>1,08,05,896</td>
                                    <td class='TTRow_right'>0.49</td>
                                    <td class='TTRow_right'>1,08,05,896</td>
                                    <tr>
                                        <td class='TTRow_left'>Grand Total</td>
                                        <td class='TTRow_right'>9,16,078</td>
                                        <td class='TTRow_right'>2,17,06,54,147</td>
                                        <td class='TTRow_right'>1,32,95,642</td>
                                        <td class='TTRow_right'>2,18,39,49,789</td>
                                        <td class='TTRow_right'>100.00</td>
                                        <td class='TTRow_right'>2,17,99,49,744</td>
                                    </tr>
            </table>

你可以试试这个:

from bs4 import BeautifulSoup as soup
import urllib
import re
s = soup(str(urllib.urlopen('https://www.bseindia.com/corporates/shpSecurities.aspx?scripcd=500209&qtrid=96.00').read()), 'lxml')
results = filter(None, [re.sub('[nr]+|s{2,}', '', i.text) for i in s.find_all('td', {'class':re.compile('TTRow_right|TTRow_left')})])

输出:

[u'(A) Promoter & Promoter Group', u'19', u'28,17,02,889', u'28,17,02,889', u'12.90', u'28,17,02,889', u'(B) Public', u'9,16,058', u'1,87,81,45,362', u'1,32,95,642', u'1,89,14,41,004', u'86.61', u'1,88,74,40,959', u'(C1) Shares underlying DRs', u'0.00', u'(C2) Shares held by Employee Trust', u'1', u'1,08,05,896', u'1,08,05,896', u'0.49', u'1,08,05,896', u'(C) Non Promoter-Non Public', u'1', u'1,08,05,896', u'1,08,05,896', u'0.49', u'1,08,05,896', u'Grand Total', u'9,16,078', u'2,17,06,54,147', u'1,32,95,642', u'2,18,39,49,789', u'100.00', u'2,17,99,49,744']

最新更新