I'm trying to parse a table, but I don't know how to save the data from it. I want to save the data from each row like this:
['Raw name 1', 2,094, 0,017, 0,098, 0,113, 0,452]
The sample table is:
html = <<EOT
<table class="open">
<tr>
<th>Table name</th>
<th>Column name 1</th>
<th>Column name 2</th>
<th>Column name 3</th>
<th>Column name 4</th>
<th>Column name 5</th>
</tr>
<tr>
<th>Raw name 1</th>
<td>2,094</td>
<td>0,017</td>
<td>0,098</td>
<td>0,113</td>
<td>0,452</td>
</tr>
.
.
.
<tr>
<th>Raw name 5</th>
<td>2,094</td>
<td>0,017</td>
<td>0,098</td>
<td>0,113</td>
<td>0,452</td>
</tr>
</table>
EOT
My scraper code is:
doc = Nokogiri::HTML(open(html), nil, 'UTF-8')
tables = doc.css('div.open')
@tablesArray = []
tables.each do |table|
title = table.css('tr[1] > th').text
cell_data = table.css('tr > td').text
raw_name = table.css('tr > th').text
@tablesArray << Table.new(cell_data, raw_name)
end
render template: 'scrape_krasecology'
end
end
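When I try to display the data in an HTML page, it looks like all the column names are stored in a single element of an array, and all the data is stored the same way.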
The crux of the problem is that calling #text on multiple results returns the concatenation of each element's #text.
Let's examine what each step does:
# Finds all <table>s with class open
# I'm assuming you have only one <table> so
# you don't actually have to loop through
# all tables, instead you can just operate
# on the first one. If that is not the case,
# you can use a loop the way you did
tables = doc.css('table.open')
# The text of all <th>s in the first <tr> of the table
title = table.css('tr[1] > th').text
# The text of all <td>s in all <tr>s in the table
# You obviously wanted just the <td>s in one <tr>
cell_data = table.css('tr > td').text
# The text of all <th>s in all <tr>s in the table
# You obviously wanted just the <th>s in one <tr>
raw_name = table.css('tr > th').text
Now that we know what went wrong, here is a possible solution:
html = <<EOT
<table class="open">
<tr>
<th>Table name</th>
<th>Column name 1</th>
<th>Column name 2</th>
<th>Column name 3</th>
<th>Column name 4</th>
<th>Column name 5</th>
</tr>
<tr>
<th>Raw name 1</th>
<td>1001</td>
<td>1002</td>
<td>1003</td>
<td>1004</td>
<td>1005</td>
</tr>
<tr>
<th>Raw name 2</th>
<td>2001</td>
<td>2002</td>
<td>2003</td>
<td>2004</td>
<td>2005</td>
</tr>
<tr>
<th>Raw name 3</th>
<td>3001</td>
<td>3002</td>
<td>3003</td>
<td>3004</td>
<td>3005</td>
</tr>
</table>
EOT
doc = Nokogiri::HTML(html, nil, 'UTF-8')
# Fetches only the first <table>. If you have
# more than one, you can loop the way you
# originally did.
table = doc.css('table.open').first
# Fetches all rows (<tr>s)
rows = table.css('tr')
# The column names are the first row (shift returns
# the first element and removes it from the array).
# On that row we get the text of each individual <th>
# This will be Table name, Column name 1, Column name 2...
column_names = rows.shift.css('th').map(&:text)
# On each of the remaining rows
text_all_rows = rows.map do |row|
# We get the name (<th>)
# On the first row this will be Raw name 1
# on the second - Raw name 2, etc.
row_name = row.css('th').text
# We get the text of each individual value (<td>)
# On the first row this will be 1001, 1002, 1003...
# on the second - 2001, 2002, 2003... etc
row_values = row.css('td').map(&:text)
# We map the name, followed by all the values
[row_name, *row_values]
end
p column_names # => ["Table name", "Column name 1", "Column name 2",
# "Column name 3", "Column name 4", "Column name 5"]
p text_all_rows # => [["Raw name 1", "1001", "1002", "1003", "1004", "1005"],
# ["Raw name 2", "2001", "2002", "2003", "2004", "2005"],
# ["Raw name 3", "3001", "3002", "3003", "3004", "3005"]]
# If you want to combine them
text_all_rows.each do |row_as_text|
p column_names.zip(row_as_text).to_h
end # =>
# {"Table name"=>"Raw name 1", "Column name 1"=>"1001", "Column name 2"=>"1002", "Column name 3"=>"1003", "Column name 4"=>"1004", "Column name 5"=>"1005"}
# {"Table name"=>"Raw name 2", "Column name 1"=>"2001", "Column name 2"=>"2002", "Column name 3"=>"2003", "Column name 4"=>"2004", "Column name 5"=>"2005"}
# {"Table name"=>"Raw name 3", "Column name 1"=>"3001", "Column name 2"=>"3002", "Column name 3"=>"3003", "Column name 4"=>"3004", "Column name 5"=>"3005"}
Your desired output is not valid Ruby:
['Raw name 1', 2,094, 0,017, 0,098, 0,113, 0,452]
# ~> -:1: Invalid octal digit
# ~> ['Raw name 1', 2,094, 0,017, 0,098, 0,113, 0,452]
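I assume you want the numbers quoted.
After stripping out the parts that kept the code from working and reducing the HTML to a more manageable example, I ran it: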
require 'nokogiri'
html = <<EOT
<table class="open">
<tr>
<th>Table name</th>
<th>Column name 1</th>
<th>Column name 2</th>
</tr>
<tr>
<th>Raw name 1</th>
<td>2,094</td>
<td>0,017</td>
</tr>
<tr>
<th>Raw name 5</th>
<td>2,094</td>
<td>0,017</td>
</tr>
</table>
EOT
doc = Nokogiri::HTML(html)
tables = doc.css('table.open')
tables_data = []
tables.each do |table|
title = table.css('tr[1] > th').text # !> assigned but unused variable - title
cell_data = table.css('tr > td').text
raw_name = table.css('tr > th').text
tables_data << [cell_data, raw_name]
end
This results in:
tables_data
# => [["2,0940,0172,0940,017",
# "Table nameColumn name 1Column name 2Raw name 1Raw name 5"]]
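The first thing to notice is that you never use title, even though you assign to it. That can happen when you trim code down into an example.
css, like search and xpath, returns a NodeSet, which is akin to an array of nodes. When you call text or inner_text on a NodeSet, it returns the text of every node concatenated into a single string. The documentation says:
"Get the inner text of all contained Node objects."
This is how it behaves: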
require 'nokogiri'
doc = Nokogiri::HTML('<html><body><p>foo</p><p>bar</p></body></html>')
doc.css('p').text # => "foobar"
Instead, you should iterate over each node that was found and extract its text individually. This has been covered many times on SO:
doc.css('p').map{ |node| node.text } # => ["foo", "bar"]
This can be shortened to:
doc.css('p').map(&:text) # => ["foo", "bar"]
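See also "How to avoid joining all text from Nodes when scraping".
The documentation says this about content, text and inner_text when used with a Node:
"Returns the content for this Node."
So you need to go after the text of each individual node: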
require 'nokogiri'
html = <<EOT
<table class="open">
<tr>
<th>Table name</th>
<th>Column name 1</th>
<th>Column name 2</th>
<th>Column name 3</th>
<th>Column name 4</th>
<th>Column name 5</th>
</tr>
<tr>
<th>Raw name 1</th>
<td>2,094</td>
<td>0,017</td>
<td>0,098</td>
<td>0,113</td>
<td>0,452</td>
</tr>
<tr>
<th>Raw name 5</th>
<td>2,094</td>
<td>0,017</td>
<td>0,098</td>
<td>0,113</td>
<td>0,452</td>
</tr>
</table>
EOT
tables_data = []
doc = Nokogiri::HTML(html)
doc.css('table.open').each do |table|
# find all rows in the current table, then iterate over the second all the way to the final one...
table.css('tr')[1..-1].each do |tr|
# collect the cell data and raw names from the remaining rows' cells...
raw_name = tr.at('th').text
cell_data = tr.css('td').map(&:text)
# aggregate it...
tables_data += [raw_name, cell_data]
end
end
Now the result is:
tables_data
# => ["Raw name 1",
# ["2,094", "0,017", "0,098", "0,113", "0,452"],
# "Raw name 5",
# ["2,094", "0,017", "0,098", "0,113", "0,452"]]
You can figure out how to coerce the quoted numbers into decimals Ruby accepts, or manipulate the inner arrays however you like.
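The coercion mentioned above could look like the following minimal sketch. It assumes the comma in values such as "2,094" is a decimal mark rather than a thousands separator, which the question does not state. The helper name to_decimal is hypothetical, and tables_data is written out by hand here so the example runs without Nokogiri.

```ruby
# Hypothetical helper (not part of the answers above): converts a string
# such as "2,094" to a Float. ASSUMPTION: the comma is a decimal mark.
# Float() raises ArgumentError on malformed input instead of returning 0.0.
def to_decimal(text)
  Float(text.strip.tr(',', '.'))
end

# The flat structure produced by the loop above: name, values, name, values...
tables_data = [
  'Raw name 1', ['2,094', '0,017', '0,098', '0,113', '0,452'],
  'Raw name 5', ['2,094', '0,017', '0,098', '0,113', '0,452']
]

# Take the flat array two elements at a time and build one array per row:
# the row name followed by its converted values.
rows = tables_data.each_slice(2).map do |raw_name, cell_data|
  [raw_name, *cell_data.map { |cell| to_decimal(cell) }]
end

p rows.first # => ["Raw name 1", 2.094, 0.017, 0.098, 0.113, 0.452]
```

If exact decimal arithmetic matters more than convenience, BigDecimal from the standard library may be a better target type than Float.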
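I assume you borrowed some code from here or from another related reference (apologies if I have linked the wrong one) - http://quabr.com/34781600/ruby-nokogiri-parse-html-table.
However, if you want to capture all the rows, you can change the code as follows.
I hope this helps you solve your problem.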
# html already holds the markup string, so parse it directly instead of calling open on it
doc = Nokogiri::HTML(html, nil, 'UTF-8')
# We need .open tr, because we want to capture all the columns from a specific table's row
@tablesArray = doc.css('table.open tr').reduce([]) do |array, row|
# This builds each row in the shape you illustrated, with the values as strings,
# i.e. ['Raw name 1', '2,094', '0,017', '0,098', '0,113', '0,452']
array << row.css('th, td').map(&:text)
end
render template: 'scrape_krasecology'