Split a CSV file, name the files by content, and save as HTML

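I thought this would be a simple task, but I'm a biologist who knows only a little code, and after several days of trying I'm at my wits' end.

I'm using Terminal on a Mac. I have a CSV file that I want to split by row (162 rows) into separate files, and I want to name each file after the contents of its first and second columns (genus_species). Then I need to save all 162 genus_species files as HTML.

So far I've only tried the "split" part, in Ruby (following suggestions from StackExchange/Overflow). Below are some of my attempts. They are Frankensteins of helpful forum posts, and after each one I've added a comment about why it didn't work.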

Example HTML

<!DOCTYPE html>
<html><head>
<meta charset="UTF-8">
<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script></head>
<body>
<h1><em><!-- Species name --></em> - <!-- Common name --></h1>
<h2>Status</h2>
<p></p>
<h2>Info</h2>
<p></p>
<h2>Time of year this bee is seen</h2>
<p></p>
<h2>Identification</h2>
<p></p>
<h3>Similar Species</h3>
<p></p>
<h2>Flowers</h2>
<p></p>
<h2>Sociality</h2>
<p></p>
<h2>Nest</h2>
<p></p>
<div id="refs" class="references">
--<br>More information:<br> <!-- <a href="https://bugguide.net/node/view/70932">Bug Guide</a> --></div>
</body></html>

More information based on comments

Here are some lines copied from the text file:

Genus,species,Common name,Status,Info,Time of year this bee is seen,Identification,Similar Species,Flowers,Sociality,Nest,Bug Guide,Discover Life,Other,
Agapostemon,melliventris,Honey-tailed Striped-Sweat bee,Secure G5,Excavates into deep burrows in ground nests,March-December,Agapostemon males have black and yellow stripes on the abdomen. Females have a yellow band on the lower margin of the clypeus.,All other Agapostemon species,Wide variety of plants,Solitary,"Deep, underground excavation",https://bugguide.net/node/view/70932,https://www.discoverlife.org/20/q?search=Agapostemon+melliventris,https://explorer.natureserve.org/Taxon/ELEMENT_GLOBAL.2.928401/Agapostemon_melliventris,
Agapostemon,sericeus,Silky Striped Sweat Bee,Secure G5,"Not choosy about lawn, as long as flowers are present",April-October,Agapostemon males have black and yellow stripes on the abdomen. A. sericeus males have a tooth on its hind femur. Female has metallic green abdomen.,All other Agapostemon species,Wide variety of plants,Solitary,Ground-nester in loamy soils,https://bugguide.net/node/view/83023,https://www.discoverlife.org/mp/20q?search=Agapostemon+sericeus,https://www.sharpeatmanguides.com/sweat-bees,
Agapostemon,splendens,Brown-winged Striped-Sweat Bee,Secure G5,This is the most common Agapostemon found in the southeast region,April-October,Agapostemon males have black and yellow stripes on the abdomen. A. splendens have brown wings. The female abdomen is often somewhat bluish.,All other Agapostemon species,"Jacquemontia reclinata, wide variety of plants",Solitary,Ground-nester in sandy soils,https://bugguide.net/node/view/74478,https://www.discoverlife.org/mp/20q?search=Agapostemon+splendens,,

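Updated with code I tried based on a comment. This works, and I think it's heading in the direction I want, but it's hard to tell in the terminal window: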

f = File.new("bee_key_fact_sheet .csv")
f.each_line { |line| puts line }
Currently playing with some kind of File.write line to add here and then close?

Attempt #1

file = File.open("bee_key_fact_sheet.csv")
awk   
'(NR==1){header=$0;next}
(NR%l==2) {
close(file); 
file=sprintf("%s.%0.5d.csv",FILENAME,++c)
sub(/csv[.]/,"",file)
print header > file
}
{f.write}' 
File.close

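# awk was not recognized; it showed "Display all possibilities? (y/n)"; I tried answering "y" and "yes", and both times it said my answer wasn't recognized

Attempt #2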

file_data = File.read("bee_key_fact_sheet.csv").split

# This works, but it splits at every comma

Attempt #3

file_data = File.foreach("bee_key_fact_sheet.csv") { |line| puts line}.split

# This returned something slightly less chaotic than splitting at every comma, but I got this error message: "undefined method `split' for nil:NilClass"

Attempt #4

bee_key_fact_sheet.csv.foreach('so1.csv', :headers => true, :col_sep => ",", :skip_blanks => true) do |row|
id, name = row[0], row[1]
unless (id =~ /#/)
names = name.split
end

# This returned nothing

CSV input example (bee_key_fact_sheet.csv):

Genus,species,Common name,Status,Info,Time of year this bee is seen,Identification,Similar Species,Flowers,Sociality,Nest,Bug Guide,Discover Life,Other,
Agapostemon,melliventris,Honey-tailed Striped-Sweat bee,Secure G5,Excavates into deep burrows in ground nests,March-December,Agapostemon males have black and yellow stripes on the abdomen. Females have a yellow band on the lower margin of the clypeus.,All other Agapostemon species,Wide variety of plants,Solitary,"Deep, underground excavation",https://bugguide.net/node/view/70932,https://www.discoverlife.org/20/q?search=Agapostemon+melliventris,https://explorer.natureserve.org/Taxon/ELEMENT_GLOBAL.2.928401/Agapostemon_melliventris,
Agapostemon,sericeus,Silky Striped Sweat Bee,Secure G5,"Not choosy about lawn, as long as flowers are present",April-October,Agapostemon males have black and yellow stripes on the abdomen. A. sericeus males have a tooth on its hind femur. Female has metallic green abdomen.,All other Agapostemon species,Wide variety of plants,Solitary,Ground-nester in loamy soils,https://bugguide.net/node/view/83023,https://www.discoverlife.org/mp/20q?search=Agapostemon+sericeus,https://www.sharpeatmanguides.com/sweat-bees,
Agapostemon,splendens,Brown-winged Striped-Sweat Bee,Secure G5,This is the most common Agapostemon found in the southeast region,April-October,Agapostemon males have black and yellow stripes on the abdomen. A. splendens have brown wings. The female abdomen is often somewhat bluish.,All other Agapostemon species,"Jacquemontia reclinata, wide variety of plants",Solitary,Ground-nester in sandy soils,https://bugguide.net/node/view/74478,https://www.discoverlife.org/mp/20q?search=Agapostemon+splendens,,

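In this CSV, every row (including the header) ends with a comma, so the last column carries no meaning and will be discarded.
Also, the data contains commas (inside double-quoted fields), so you need a real CSV parser to read the file. BTW, you were right to choose Ruby for this task, since its standard library includes a CSV parser.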
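
To make the point about quoted commas concrete, here is a minimal sketch. The record is a shortened version of the sericeus row, and the variable names are only for illustration:

```ruby
require 'csv'

# Shortened record in the same shape as the sample data (illustration only).
line = 'Agapostemon,sericeus,"Not choosy about lawn, as long as flowers are present",Solitary,'

naive  = line.split(',')       # cuts the quoted field at its inner comma
parsed = CSV.parse_line(line)  # treats the double-quoted field as one value

puts naive[2]            # the field is truncated and still carries its opening quote
puts parsed[2]           # the whole field, without the quotes
puts parsed.last.inspect # the trailing comma yields an empty (nil) last column
```

The nil at the end is the meaningless last column mentioned above, which is why it gets dropped.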

Here is one way to read the CSV (edit: fixed the CSV::Row conversion for old Rubies):

require 'csv'

filepath = 'bee_key_fact_sheet.csv'

CSV.foreach(filepath, headers: true) do |row|
  genus, species = row[0], row[1]
  #data = row[0...-1] # NOTE: not sure about the Ruby version compatibility
  data = row.to_hash.values[0...-1]

  filename = "#{genus}_#{species}.txt".tr("\0/",'')
  filecontent = "  * #{data.join("\n  * ")}"

  puts "\n#{filename}:\n#{filecontent}"
end

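About tr("\0/",''): the characters allowed in a file name depend on the file system. All file systems (as far as I know) forbid at least the NUL byte and the slash character, so I strip those (but you may want to strip a few more).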
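
If you do want to strip more characters, something along these lines could work. This is only a sketch: `safe_filename` is a hypothetical helper, and the extra set of characters it removes is my assumption about what tends to cause trouble, not a rule of any particular file system.

```ruby
# Hypothetical helper (not part of the code above): builds a file name from
# two fields and removes characters that are commonly problematic.
def safe_filename(genus, species, ext = 'html')
  base = "#{genus}_#{species}"
  # NUL and "/" are forbidden almost everywhere; the rest of this set
  # (\ : * ? " < > |) is an assumption covering stricter file systems.
  base = base.gsub(/[\x00\/\\:*?"<>|]/, '')
  # Trim the ends and turn inner whitespace into underscores.
  base = base.strip.gsub(/\s+/, '_')
  "#{base}.#{ext}"
end

puts safe_filename('Agapostemon', 'melliventris')
puts safe_filename('A/b', 'c:d e')
```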

~~Question: what exactly is the expected HTML output? A table row?~~

Update: HTML generation

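When you generate content programmatically, it is very important to escape the data for the right format/language/context. In Ruby, you can use CGI.escapeHTML to escape HTML.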
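
As a quick illustration (the input string is made up, it is not from your CSV):

```ruby
require 'cgi'

# Made-up field value containing HTML metacharacters.
raw = 'Bees & wasps <Apoidea> "sweat bee"'

escaped = CGI.escapeHTML(raw)
puts escaped
# prints: Bees &amp; wasps &lt;Apoidea&gt; &quot;sweat bee&quot;
```

Without the escaping, a value like this would be interpreted as markup by the browser instead of being displayed as text.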

Your example HTML output:

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
</head>
<body>
<h1><em><!-- Species name --></em> - <!-- Common name --></h1>
<h2>Status</h2>
<p></p>
<h2>Info</h2>
<p></p>
<h2>Time of year this bee is seen</h2>
<p></p>
<h2>Identification</h2>
<p></p>
<h3>Similar Species</h3>
<p></p>
<h2>Flowers</h2>
<p></p>
<h2>Sociality</h2>
<p></p>
<h2>Nest</h2>
<p></p>
<div id="refs" class="references">
--
<br>More information:
<br> <!-- <a href="https://bugguide.net/node/view/70932">Bug Guide</a> -->
</div>
</body>
</html>

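I would make a few changes to the HTML:

Add a title to the page.
Remove MathJax, which you don't seem to need.
Convert the <h3> tag to <h2>, since you only use it for "Similar Species". Changing it also makes it possible to use a loop when generating the HTML.
You have two links in the CSV that aren't used in the HTML: "Discover Life" and "Other". Don't you want to show them? I added code for them ;-)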

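OK, first, create a function that generates the corresponding HTML for a given CSV row. Here I use an ERB template, but you could use string literals directly (edit: fixed the ERB#result arguments for Ruby < 2.4.0):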

require 'cgi'
require 'erb'

def renderHTML row
  htmlsafe = row.each_with_object({}) { |(k,v),h| h[k] = CGI.escapeHTML v if v }
  template = <<-'EOF'
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title><%= "#{htmlsafe['Genus']} #{htmlsafe['species']}" %></title>
</head>
<body>
<h1><em><%= "#{htmlsafe['Genus']} #{htmlsafe['species']}" %></em> - <%= htmlsafe['Common name'] %></h1>
<% for key in ['Status','Info','Time of year this bee is seen','Identification','Similar Species','Flowers','Sociality','Nest'] %>
<h2><%= key %></h2>
<p><%= htmlsafe[key] %></p>
<% end %>
<div id="refs" class="references">
--
<br>More information:
<% for key in ['Bug Guide', 'Discover Life', 'Other'].select{ |k| htmlsafe[k] } %>
<br><a href="<%= htmlsafe[key] %>"><%= key %></a>
<% end %>
</div>
</body>
</html>
EOF
  #ERB.new(template, trim_mode: "<>").result(binding) # NOTE: only for Ruby >= 2.4.0
  ERB.new(template, nil, "<>").result(binding)
end
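
For comparison, the string-literal route mentioned above could look roughly like this. It is an abbreviated sketch, not a replacement for the ERB version: `render_html_plain` is a hypothetical name, only two of the sections are rendered, and the links block is left out.

```ruby
require 'cgi'

# Hypothetical alternative to the ERB version, using plain string interpolation.
# `row` is anything that yields header/value pairs (a Hash here, a CSV::Row in practice).
def render_html_plain(row)
  safe = {}
  row.each { |k, v| safe[k] = CGI.escapeHTML(v) if v }

  # Only two sections, to keep the sketch short.
  sections = ['Status', 'Info'].map do |key|
    "<h2>#{key}</h2>\n<p>#{safe[key]}</p>"
  end

  <<-HTML
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>#{safe['Genus']} #{safe['species']}</title>
</head>
<body>
<h1><em>#{safe['Genus']} #{safe['species']}</em> - #{safe['Common name']}</h1>
#{sections.join("\n")}
</body>
</html>
  HTML
end

puts render_html_plain(
  'Genus' => 'Agapostemon', 'species' => 'sericeus',
  'Common name' => 'Silky Striped Sweat Bee',
  'Status' => 'Secure G5', 'Info' => 'Lawns & flowers'
)
```

The template engine mostly pays off once you have loops and conditionals in the page, which is why I went with ERB.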

Then you can call renderHTML for each row as you read the CSV file:

require 'csv'

filepath = 'bee_key_fact_sheet.csv'

CSV.foreach(filepath, headers: true) do |row|
  filename = "#{row['Genus']}_#{row['species']}.html".tr("\0/",'')
  html = renderHTML row
  puts "\n# #{filename}\n#{html}"
  #File.write(filename, html)
end

Note: I commented out the File.write line that would create the HTML files.
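
Once the printed output looks right, the writing step could be sketched like this. The assumptions are mine: the rows come from an inline string instead of your file so the snippet runs on its own, the pages go into a hypothetical `bee_pages` folder under the system temp directory, and the page body is a placeholder standing in for the rendered HTML.

```ruby
require 'cgi'
require 'csv'
require 'fileutils'
require 'tmpdir'

# Two shortened records in the shape of the sample data (illustration only).
csv_text = <<~CSV
  Genus,species,Common name,
  Agapostemon,melliventris,Honey-tailed Striped-Sweat bee,
  Agapostemon,sericeus,Silky Striped Sweat Bee,
CSV

out_dir = File.join(Dir.tmpdir, 'bee_pages') # hypothetical output folder
FileUtils.mkdir_p(out_dir)

written = []
CSV.parse(csv_text, headers: true) do |row|
  filename = "#{row['Genus']}_#{row['species']}.html".tr("\0/", '')
  path = File.join(out_dir, filename)
  # Placeholder body; the real script would write the full rendered page here.
  name = CGI.escapeHTML("#{row['Genus']} #{row['species']}")
  File.write(path, "<h1><em>#{name}</em></h1>\n")
  written << path
end

puts written
```

Writing into a separate folder keeps the 162 generated pages away from the CSV and the script.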

Can you try this? It should read the lines of the file:

f = File.new("name_of_file")
f.each_line { |line| puts line }

You can save them as new files afterwards; for details, see: How to create a file in Ruby
