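Getting data from a <TD> element using Python

I am writing an agent for Plex and I am scraping the following HTML table. I am new to Python and to web scraping in general.

I am trying to get the data XXXXXXXXXX

Data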

<table class="d">
<tbody>
<tr>
<th class="ch">title</th>
<th class="ch">released</th>
<th class="ch">company</th>
<th class="ch">type</th>
<th class="ch">rating</th>
<th class="ch">category</th>
</tr>
<tr>
<td class="cd" valign="top">
<a href="/V/6/58996.html">XXXXXXXXXX</a>
</td>
<td class="cd">2015</td>
<td class="cd">My Films</td>
<td class="cd">&nbsp;</td>
<td class="cd">&nbsp;</td>
<td class="cd">General Hardcore</td>
</tr>
</tbody>
</table>

Code

Here is the code snippet I am using:

myTable = HTML.ElementFromURL(searchQuery, sleep=REQUEST_DELAY).xpath('//table[contains(@class,"d")]/tr')
self.log('SEARCH:: My Table: %s', myTable)
# This logs the following:
# 2019-12-26 00:26:49,329 (17a4) :  INFO (logkit:16) - GEVI - SEARCH:: My Table: [<Element tr at 0x5225c30>, <Element tr at 0x5225c00>]

for myRow in myTable:
    siteTitle = title[0]
    self.log('SEARCH:: Site Title: %s', siteTitle)
    siteTitle = title[0].text_content().strip()
    self.log('SEARCH:: Site Title: %s', siteTitle)
    # This logs the following for <tr>/<th> - ROW 1
    # 2019-12-26 00:26:49,335 (17a4) :  INFO (logkit:16) - GEVI - SEARCH:: Site Title: <Element th at 0x5225180>
    # 2019-12-26 00:26:49,342 (17a4) :  INFO (logkit:16) - GEVI - SEARCH:: Site Title: title
    # This logs the following for <tr>/<td> - ROW 2
    # 2019-12-26 00:26:49,362 (17a4) :  INFO (logkit:16) - GEVI - SEARCH:: Site Title: <Element td at 0x52256f0>
    # 2019-12-26 00:26:49,369 (17a4) :  INFO (logkit:16) - GEVI - SEARCH:: Site Title:                              #### this is my issue... should be XXXXXXXXXX

    # I can get the href using the following code
    siteURL = myRow.xpath('.//td/a')[0].get('href')

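Questions

A. How do I get the value "XXXXXXXXXX"? I tried using XPath, but it picks up data from another table on the same page.
B. Is there a better way to get the href attribute?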
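
One possible reason for matching the wrong table is that contains(@class,"d") is true for any class value containing the letter "d", not only for class="d". A minimal sketch of scoping the lookup to one table and one row follows. It is an illustration under assumptions, not the Plex API: it uses the standard library's xml.etree.ElementTree (Python 3) as a stand-in, the page markup is a trimmed, well-formed sample with a hypothetical second table, and extract_titles is a made-up helper name. Judging by the log output, the elements returned by HTML.ElementFromURL look like lxml elements, where a comparable relative query would be myRow.xpath('./td[1]/a').

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the page (assumption): two tables,
# only one of which has class "d".
PAGE = """
<html><body>
<table class="other">
  <tr><td><a href="/elsewhere.html">Not wanted</a></td></tr>
</table>
<table class="d">
  <tr><th class="ch">title</th><th class="ch">released</th></tr>
  <tr>
    <td class="cd" valign="top">
      <a href="/V/6/58996.html">XXXXXXXXXX</a>
    </td>
    <td class="cd">2015</td>
  </tr>
</table>
</body></html>
"""


def extract_titles(page):
    """Return (title, href) pairs for rows of the class="d" table only."""
    root = ET.fromstring(page)
    results = []
    # Exact match on the class attribute, so other tables are skipped.
    for table in root.findall(".//table[@class='d']"):
        for row in table.findall(".//tr"):
            # Relative path: the link inside the first <td> of THIS row.
            link = row.find("./td[1]/a")
            if link is None:
                continue  # header row (<th> cells) or a row without a link
            results.append((link.text.strip(), link.get("href")))
    return results


print(extract_titles(PAGE))
```

Reading the text and the href from the same link element keeps the two values paired per row, which also covers question B.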

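Other

The Python libraries I am importing are: datetime, linecache, platform, os, re, string, sys, urllib

I cannot use Beautiful Soup because this is an agent for Plex, and I assume anyone who wants to use the agent would then have to install BeautifulSoup. So that is a no-go.

How about this?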

from simplified_scrapy.simplified_doc import SimplifiedDoc 
html = '''<table class="d">
<tbody>
<tr>
<th class="ch">title</th>
<th class="ch">released</th>
<th class="ch">company</th>
<th class="ch">type</th>
<th class="ch">rating</th>
<th class="ch">category</th>
</tr>
<tr>
<td class="cd" valign="top">
<a href="/V/6/58996.html">XXXXXXXXXX</a>
</td>
<td class="cd">2015</td>
<td class="cd">My Films</td>
<td class="cd">&nbsp;</td>
<td class="cd">&nbsp;</td>
<td class="cd">General Hardcore</td>
</tr>
</tbody>
</table>'''
doc = SimplifiedDoc(html)
table = doc.getElement('table','d') # doc.getElement(tag='table',attr='class',value='d')
trs = table.trs.contains('<a ') # table.getElementsByTag('tr').contains('<a ')
for tr in trs:
    a = tr.a
    print(a)
    print(a.text) # XXXXXXXXXX
