Split a CSV file, name the pieces by content, and save them as HTML




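I thought this would be a simple task, but I am a biologist who knows only a little code, and after several days of trying I am at my wits' end.

I am using Terminal on a Mac. I have a CSV file that I want to split by row (162 rows) into separate files, and I want to name each file after the contents of the first and second columns (genus_species). I then need all 162 genus_species files saved as HTML files.

So far I have only attempted the "split" part, in Ruby (as suggested on StackExchange/Overflow). Below are some of my attempts. They are Frankensteins of helpful forum posts, and after each one I have added a comment about why it did not work.

Example HTML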

<!DOCTYPE html>
<html><head>
<meta charset="UTF-8">
<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script></head>
<body>
<h1><em><!-- Species name --></em> - <!-- Common name --></h1>
<h2>Status</h2>
<p></p>
<h2>Info</h2>
<p></p>
<h2>Time of year this bee is seen</h2>
<p></p>
<h2>Identification</h2>
<p></p>
<h3>Similar Species</h3>
<p></p>
<h2>Flowers</h2>
<p></p>
<h2>Sociality</h2>
<p></p>
<h2>Nest</h2>
<p></p>
<div id="refs" class="references">
--<br>More information:<br> <!-- <a href="https://bugguide.net/node/view/70932">Bug Guide</a> --></div>
</body></html>

More information based on the comments

Here are a few lines copied from the text file:

Genus,species,Common name,Status,Info,Time of year this bee is seen,Identification,Similar Species,Flowers,Sociality,Nest,Bug Guide,Discover Life,Other,
Agapostemon,melliventris,Honey-tailed Striped-Sweat bee,Secure G5,Excavates into deep burrows in ground nests,March-December,Agapostemon males have black and yellow stripes on the abdomen. Females have a yellow band on the lower margin of the clypeus.,All other Agapostemon species,Wide variety of plants,Solitary,"Deep, underground excavation",https://bugguide.net/node/view/70932,https://www.discoverlife.org/20/q?search=Agapostemon+melliventris,https://explorer.natureserve.org/Taxon/ELEMENT_GLOBAL.2.928401/Agapostemon_melliventris,
Agapostemon,sericeus,Silky Striped Sweat Bee,Secure G5,"Not choosy about lawn, as long as flowers are present",April-October,Agapostemon males have black and yellow stripes on the abdomen. A. sericeus males have a tooth on its hind femur. Female has metallic green abdomen.,All other Agapostemon species,Wide variety of plants,Solitary,Ground-nester in loamy soils,https://bugguide.net/node/view/83023,https://www.discoverlife.org/mp/20q?search=Agapostemon+sericeus,https://www.sharpeatmanguides.com/sweat-bees,
Agapostemon,splendens,Brown-winged Striped-Sweat Bee,Secure G5,This is the most common Agapostemon found in the southeast region,April-October,Agapostemon males have black and yellow stripes on the abdomen. A. splendens have brown wings. The female abdomen is often somewhat bluish.,All other Agapostemon species,"Jacquemontia reclinata, wide variety of plants",Solitary,Ground-nester in sandy soils,https://bugguide.net/node/view/74478,https://www.discoverlife.org/mp/20q?search=Agapostemon+splendens,,

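Updated, based on the comments, with code I have tried. This works, and I think it is heading in the direction I want, but it is hard to tell in the terminal window: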

f = File.new("bee_key_fact_sheet .csv")
f.each_line { |line| puts line }
# Currently playing with some kind of File.write line to add here and then close?

Attempt #1

file = File.open("bee_key_fact_sheet.csv")
awk   
'(NR==1){header=$0;next}
(NR%l==2) {
close(file); 
file=sprintf("%s.%0.5d.csv",FILENAME,++c)
sub(/csv[.]/,"",file)
print header > file
}
{f.write}' 
File.close

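# awk was not recognized; it showed "Display all possibilities? (y/n)". I tried answering "y" and "yes", and both times it said my answer was not recognized.

Attempt #2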

file_data = File.read("bee_key_fact_sheet.csv").split 

# This works, but it splits on every comma

Attempt #3

file_data = File.foreach("bee_key_fact_sheet.csv") { |line| puts line}.split  

# This returned something slightly less messy than splitting on every comma, but I got this error message: "undefined method `split' for nil:NilClass"

Attempt #4

bee_key_fact_sheet.csv.foreach('so1.csv', :headers => true, :col_sep => ",", :skip_blanks => true) do |row|
id, name = row[0], row[1]
unless (id =~ /#/)
names = name.split
end

# This returned nothing

Example CSV input (bee_key_fact_sheet.csv):

Genus,species,Common name,Status,Info,Time of year this bee is seen,Identification,Similar Species,Flowers,Sociality,Nest,Bug Guide,Discover Life,Other,
Agapostemon,melliventris,Honey-tailed Striped-Sweat bee,Secure G5,Excavates into deep burrows in ground nests,March-December,Agapostemon males have black and yellow stripes on the abdomen. Females have a yellow band on the lower margin of the clypeus.,All other Agapostemon species,Wide variety of plants,Solitary,"Deep, underground excavation",https://bugguide.net/node/view/70932,https://www.discoverlife.org/20/q?search=Agapostemon+melliventris,https://explorer.natureserve.org/Taxon/ELEMENT_GLOBAL.2.928401/Agapostemon_melliventris,
Agapostemon,sericeus,Silky Striped Sweat Bee,Secure G5,"Not choosy about lawn, as long as flowers are present",April-October,Agapostemon males have black and yellow stripes on the abdomen. A. sericeus males have a tooth on its hind femur. Female has metallic green abdomen.,All other Agapostemon species,Wide variety of plants,Solitary,Ground-nester in loamy soils,https://bugguide.net/node/view/83023,https://www.discoverlife.org/mp/20q?search=Agapostemon+sericeus,https://www.sharpeatmanguides.com/sweat-bees,
Agapostemon,splendens,Brown-winged Striped-Sweat Bee,Secure G5,This is the most common Agapostemon found in the southeast region,April-October,Agapostemon males have black and yellow stripes on the abdomen. A. splendens have brown wings. The female abdomen is often somewhat bluish.,All other Agapostemon species,"Jacquemontia reclinata, wide variety of plants",Solitary,Ground-nester in sandy soils,https://bugguide.net/node/view/74478,https://www.discoverlife.org/mp/20q?search=Agapostemon+splendens,,

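In this CSV, every row (including the header) ends with a comma, so the last column carries no meaning and will be discarded.
Also, the data contains commas (inside double-quoted fields), so you need a real CSV parser to read the file. BTW, you were right to choose Ruby for this task, because its standard library includes a CSV parser.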
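To see why a plain split on commas is not enough, here is a small sketch. The line is a shortened, made-up variant of the sample data, and the variable names are only for illustration.

```ruby
require 'csv'

# A shortened line modeled on the sample data; the quoted field contains a comma.
line = 'Agapostemon,melliventris,Solitary,"Deep, underground excavation",'

# Naive splitting cuts the quoted field in two and keeps the quote characters.
naive_fields = line.split(',')

# CSV.parse_line honors the double quotes and keeps the field whole.
# The trailing comma shows up as a final nil field.
parsed_fields = CSV.parse_line(line)

puts naive_fields.inspect
puts parsed_fields.inspect
```

The parsed version has the nest description as a single field, plus the empty last field that the trailing comma produces.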

Here is one way to read the CSV (edit: fixed the CSV::Row conversion for old Rubies):

require 'csv'

filepath = 'bee_key_fact_sheet.csv'

CSV.foreach(filepath, headers: true) do |row|
  genus, species = row[0], row[1]
  #data = row[0...-1] # NOTE: not sure about the Ruby version compatibility
  # Drop the last (empty) column produced by the trailing comma
  data = row.to_hash.values[0...-1]

  # Strip the NULL byte and the slash, which are not allowed in file names
  filename = "#{genus}_#{species}.txt".tr("\0/", '')
  filecontent = "  * #{data.join("\n  * ")}"

  puts "\n#{filename}:\n#{filecontent}"
end

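About the tr call: the characters allowed in a file name depend on the file system. All file systems (as far as I know) forbid at least the NULL byte and the slash character, so I strip them (but you may want to strip a few more).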
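If you prefer to be stricter, one option is an allow-list instead of a deny-list. This is only a sketch: safe_filename is a hypothetical helper, not part of Ruby, and the second species name is a made-up value to show the effect.

```ruby
# Hypothetical helper: keeps letters, digits, dot, underscore and hyphen,
# and replaces every other character with an underscore.
def safe_filename(genus, species, ext = "html")
  base = "#{genus}_#{species}".gsub(/[^A-Za-z0-9._-]/, "_")
  "#{base}.#{ext}"
end

puts safe_filename("Agapostemon", "melliventris")  # prints Agapostemon_melliventris.html
puts safe_filename("Lasioglossum", "sp. 1/a")      # prints Lasioglossum_sp._1_a.html
```

Note that an allow-list like this would also replace accented letters, so adjust the character class if your names contain any.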

Question: what exactly is the desired HTML output? A table row?


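Update: HTML generation

When you generate content programmatically, it is very important to escape the data for the target format/language/context. In Ruby, you can use CGI.escapeHTML to escape text for HTML.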
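As a quick illustration of what the escaping does, here is a minimal sketch; the field value is made up and does not come from the CSV.

```ruby
require 'cgi'

# Made-up field value containing characters that are special in HTML.
raw = 'Size < 10 mm & "metallic" green'

escaped = CGI.escapeHTML(raw)
puts escaped  # prints Size &lt; 10 mm &amp; &quot;metallic&quot; green
```

Without this step, a stray < or & in the data could break the page or be interpreted as markup.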

Example HTML output:

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
</head>
<body>
<h1><em><!-- Species name --></em> - <!-- Common name --></h1>
<h2>Status</h2>
<p></p>
<h2>Info</h2>
<p></p>
<h2>Time of year this bee is seen</h2>
<p></p>
<h2>Identification</h2>
<p></p>
<h3>Similar Species</h3>
<p></p>
<h2>Flowers</h2>
<p></p>
<h2>Sociality</h2>
<p></p>
<h2>Nest</h2>
<p></p>
<div id="refs" class="references">
--
<br>More information:
<br> <!-- <a href="https://bugguide.net/node/view/70932">Bug Guide</a> -->
</div>
</body>
</html>

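I would make a few changes to the HTML:

  • Add a title to the page
  • Remove MathJax, which seems unneeded
  • Convert the <h3> tag to <h2>, since you only use it for "Similar Species". Changing it also makes it possible to use a loop when generating the HTML
  • You have two links in the CSV that are not used in the HTML, "Discover Life" and "Other"; don't you want to display them? I added code for them ;-)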

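OK, first create a function that, given a CSV row, generates the corresponding HTML. Here I use an ERB template, but you could use a string literal directly (edit: fixed the ERB.new arguments for older Ruby versions):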

require 'cgi'
require 'erb'

def renderHTML row
  # Escape every non-empty field so it is safe to embed in HTML
  htmlsafe = row.each_with_object({}) { |(k,v),h| h[k] = CGI.escapeHTML v if v }
  template = <<-'EOF'
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title><%= "#{htmlsafe['Genus']} #{htmlsafe['species']}" %></title>
</head>
<body>
<h1><em><%= "#{htmlsafe['Genus']} #{htmlsafe['species']}" %></em> - <%= htmlsafe['Common name'] %></h1>
<% for key in ['Status','Info','Time of year this bee is seen','Identification','Similar Species','Flowers','Sociality','Nest'] %>
<h2><%= key %></h2>
<p><%= htmlsafe[key] %></p>
<% end %>
<div id="refs" class="references">
--
<br>More information:
<% for key in ['Bug Guide', 'Discover Life', 'Other'].select{ |k| htmlsafe[k] } %>
<br><a href="<%= htmlsafe[key] %>"><%= key %></a>
<% end %>
</div>
</body>
</html>
  EOF
  # The "<>" trim mode drops the newline after lines that start with <% and end with %>
  #ERB.new(template, trim_mode: "<>").result(binding) # NOTE: keyword form, only for Ruby >= 2.6
  ERB.new(template, nil, "<>").result(binding)
end

Then you can call the function above while reading each row of the CSV file:

require 'csv'

filepath = 'bee_key_fact_sheet.csv'

CSV.foreach(filepath, headers: true) do |row|
  # Strip the NULL byte and the slash, which are not allowed in file names
  filename = "#{row['Genus']}_#{row['species']}.html".tr("\0/", '')
  html = renderHTML row
  puts "\n# #{filename}\n#{html}"
  #File.write(filename, html)
end

Note: I commented out the File.write line that would create the HTML files.
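Once the printed output looks right, the files can be written for real. The following is a minimal sketch, not a drop-in replacement: write_pages and the fact_sheets directory name are hypothetical, the two-row sample stands in for the real file, and the block is a placeholder for a real renderer such as renderHTML.

```ruby
require 'csv'
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: writes one HTML file per CSV row into out_dir and
# returns the list of paths written. The block receives each row and must
# return the HTML string for that row.
def write_pages(csv_text, out_dir)
  FileUtils.mkdir_p(out_dir)
  CSV.parse(csv_text, headers: true).map do |row|
    filename = "#{row['Genus']}_#{row['species']}.html".tr("\0/", '')
    path = File.join(out_dir, filename)
    File.write(path, yield(row))
    path
  end
end

# Two-row stand-in for the real CSV file.
sample = <<~ROWS
  Genus,species,Common name
  Agapostemon,melliventris,Honey-tailed Striped-Sweat bee
  Agapostemon,sericeus,Silky Striped Sweat Bee
ROWS

# Placeholder renderer; writes into a fact_sheets folder under the temp directory.
written = write_pages(sample, File.join(Dir.tmpdir, "fact_sheets")) do |row|
  "<h1>#{row['Genus']} #{row['species']}</h1>"
end
puts written
```

With the real data, the call would look something like write_pages(File.read('bee_key_fact_sheet.csv'), 'fact_sheets') { |row| renderHTML(row) }, which puts all the generated pages into one folder instead of scattering them in the current directory.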

Can you try this? It should read the lines of the file:

f = File.new("name_of_file")
f.each_line { |line| puts line }

You can save them as new files later; for details, see: How to create a file in Ruby
