I have a table in the following format:
<tr class="style6"><td>SomeStuff</td></tr>
<tr><td>Some other stuff</td></tr>
<tr><td>Some other stuff</td></tr>
<tr><td>Some other stuff</td></tr>
<tr><td>Some other stuff</td></tr>
<tr><td>Some other stuff</td></tr>
<tr class="style6"><td>SomeStuff</td></tr>
<tr><td>Some other stuff</td></tr>
<tr><td>Some other stuff</td></tr>
<tr><td>Some other stuff</td></tr>
<tr><td>Some other stuff</td></tr>
<tr><td>Some other stuff</td></tr>
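I want to split the rows into groups that I can iterate over, where each group starts at a row with class style6 and ends at the last row before the next style6 row. Is there a way to split the table into chunks like that? I know about the XPath position() function, but I'm not sure it makes sense in this case.
Any ideas?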
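A useful pattern is to count the preceding <tr class="style6"><td>SomeStuff</td></tr> sibling rows: every ordinary row in the n-th group has exactly n such rows before it.
For the first group in the example, that would be: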
//tr[not(@class="style6")][count(preceding-sibling::tr[@class="style6"])=1]
For the second group:
//tr[not(@class="style6")][count(preceding-sibling::tr[@class="style6"])=2]
And so on.
I don't use Nokogiri, so here is an example using Python and lxml:
>>> import lxml.html
>>> from pprint import pprint
>>> doc = lxml.html.fromstring('''<tr class="style6"><td>SomeStuff</td></tr>
... <tr><td>Some other stuff group 1</td></tr>
... <tr><td>Some other stuff group 1</td></tr>
... <tr><td>Some other stuff group 1</td></tr>
... <tr><td>Some other stuff group 1</td></tr>
... <tr><td>Some other stuff group 1</td></tr>
... <tr class="style6"><td>SomeStuff</td></tr>
... <tr><td>Some other stuff group 2</td></tr>
... <tr><td>Some other stuff group 2</td></tr>
... <tr><td>Some other stuff group 2</td></tr>
... <tr><td>Some other stuff group 2</td></tr>
... <tr><td>Some other stuff group 2</td></tr>
... <tr class="style6"><td>SomeStuff</td></tr>
... <tr><td>Some other stuff group 3</td></tr>
... <tr><td>Some other stuff group 3</td></tr>
... <tr><td>Some other stuff group 3</td></tr>
... <tr><td>Some other stuff group 3</td></tr>
... <tr><td>Some other stuff group 3</td></tr>''')
>>> pprint(list(lxml.html.tostring(row)
... for row in doc.xpath('''
... //tr[not(@class="style6")]
... [count(preceding-sibling::tr[@class="style6"])=1]''')))
[b'<tr><td>Some other stuff group 1</td></tr>\n',
 b'<tr><td>Some other stuff group 1</td></tr>\n',
 b'<tr><td>Some other stuff group 1</td></tr>\n',
 b'<tr><td>Some other stuff group 1</td></tr>\n',
 b'<tr><td>Some other stuff group 1</td></tr>\n']
>>> pprint(list(lxml.html.tostring(row)
... for row in doc.xpath('''
... //tr[not(@class="style6")]
... [count(preceding-sibling::tr[@class="style6"])=2]''')))
[b'<tr><td>Some other stuff group 2</td></tr>\n',
 b'<tr><td>Some other stuff group 2</td></tr>\n',
 b'<tr><td>Some other stuff group 2</td></tr>\n',
 b'<tr><td>Some other stuff group 2</td></tr>\n',
 b'<tr><td>Some other stuff group 2</td></tr>\n']
>>> pprint(list(lxml.html.tostring(row)
... for row in doc.xpath('''
... //tr[not(@class="style6")]
... [count(preceding-sibling::tr[@class="style6"])=3]''')))
[b'<tr><td>Some other stuff group 3</td></tr>\n',
 b'<tr><td>Some other stuff group 3</td></tr>\n',
 b'<tr><td>Some other stuff group 3</td></tr>\n',
 b'<tr><td>Some other stuff group 3</td></tr>\n',
 b'<tr><td>Some other stuff group 3</td></tr>']
>>>
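
The session above hard-codes the group number in each query. To iterate over every group, one option is to count the style6 rows first and then run the same expression once per group, passing the number in as an XPath variable. The following is only a sketch under a few assumptions: the rows are wrapped in a single well-formed <table> so they share one parent, and the names TABLE, GROUP_XPATH and split_into_groups are made up for this illustration.

```python
from lxml import etree

# Assumption: the rows live inside one well-formed <table> element, so the
# snippet can be parsed as XML and all <tr> elements are siblings.
TABLE = """<table>
<tr class="style6"><td>Header 1</td></tr>
<tr><td>group 1, row 1</td></tr>
<tr><td>group 1, row 2</td></tr>
<tr class="style6"><td>Header 2</td></tr>
<tr><td>group 2, row 1</td></tr>
<tr class="style6"><td>Header 3</td></tr>
<tr><td>group 3, row 1</td></tr>
<tr><td>group 3, row 2</td></tr>
<tr><td>group 3, row 3</td></tr>
</table>"""

# Same expression as in the answer, with the group number replaced by the
# XPath variable $n (lxml binds it from a keyword argument).
GROUP_XPATH = (
    '//tr[not(@class="style6")]'
    '[count(preceding-sibling::tr[@class="style6"]) = $n]'
)


def split_into_groups(root):
    """Return one list of non-header <tr> elements per style6 row."""
    # count() comes back from lxml as a float, so convert it for range().
    header_count = int(root.xpath('count(//tr[@class="style6"])'))
    return [
        root.xpath(GROUP_XPATH, n=index)
        for index in range(1, header_count + 1)
    ]


groups = split_into_groups(etree.fromstring(TABLE))
for number, rows in enumerate(groups, start=1):
    print(number, [row.findtext("td") for row in rows])
```

Each query rescans the preceding siblings, so this is fine for small tables but may be slow for very large ones. It also assumes a single table: with several tables in one document, the document-wide header count would no longer match the per-table sibling counts.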