Looking for a better way to process many large .txt files across multiple directories with a Ruby script

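I am collecting test measurement data for products in a manufacturing environment. The test system generates the measurement results for each unit under test as a 2Mb .txt file, which is saved to a shared folder organized by product.

The folder structure looks like this...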

LOGS
|-Product1
|  |-log_p1_1.txt
|  |-log_p1_2.txt
|  |..
|-Product2
|  |-log_p2_1.txt
|  |-log_p2_2.txt
|  |..
|-...

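My Ruby script iterates over every Product directory under LOGS, reads each log_px_n.txt file, parses the data I need from it, and updates the database.

The problem is that all log_px_n.txt files, old and new, must stay in their current directories, and I need to update the database as soon as a new log_px_n.txt file is generated.

What I do today is iterate over every Product directory, read each individual .txt file, and then put its data into the database if it is not already there.

My script looks like this...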

# BASE_DIR (the LOGS path) and SIX_HOURS_AGO (21600 seconds) are defined elsewhere
Dir['*'].each do |product|
  product_dir = File.join(BASE_DIR, product)
  Dir.chdir(product_dir)
  Dir['*.txt'].each do |log|
    # take only files modified in the last six hours
    if Time.now - File.mtime(log) < SIX_HOURS_AGO
      # Here we do..
      # - read each 2Mb .txt file
      # - extract information from the .txt file
      # - update the database
    end
  end
end

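There are up to 30 different product directories, each containing about 1000 .txt files (2Mb each), and they keep growing!

Disk space for storing the .txt files is not a problem, but the time the operation takes is. Each run of the code block above takes more than 45 minutes to finish.

Is there a better way to handle this situation?

Update: Following Iced's suggestion I tried a profiler, so I ran the code below and got the following results...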

require 'profiler'

class MyCollector
  def initialize(dir, period, *filetypes)
    @dir = dir
    @filetypes = filetypes.join(',')
    @period = period   # maximum file age, in seconds
  end

  def collect
    Dir.chdir(@dir)
    Dir.glob('*').each do |product|
      products_dir = File.join(@dir, product)
      Dir.chdir(products_dir)
      puts "at product #{product}"
      Dir.glob("**/*.{#{@filetypes}}").each do |log|
        # print a timestamp for every file modified within the period
        if Time.now - File.mtime(log) < @period
          puts Time.new
        end
      end
    end
  end
end

path = '//10.1.2.54/Shares/Talend/PRODFILES/LOGS'
SIX_HOURS_AGO = 21600   # six hours, in seconds
Profiler__::start_profile
collector = MyCollector.new(path, SIX_HOURS_AGO, "LOG")
collector.collect
Profiler__::stop_profile
Profiler__::print_profile(STDOUT)

The result shows...

at product ABU43E
.. .. ..
at product AXF40J
at product ACZ16C
2014-04-21 17:32:07 +0700
at product ABZ14C
at product AXF90E
at product ABZ14B
at product ABK43E
at product ABK01A
2014-04-21 17:32:24 +0700
2014-04-21 17:32:24 +0700
at product ABU05G
at product ABZABF
2014-04-21 17:32:28 +0700
2014-04-21 17:32:28 +0700
2014-04-21 17:32:28 +0700
2014-04-21 17:32:28 +0700
2014-04-21 17:32:28 +0700
2014-04-21 17:32:28 +0700

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 32.54     1.99      1.99       43    46.40   265.60  Array#each
 24.17     3.48      1.48    41075     0.04     0.04  File#mtime
 13.72     4.32      0.84       43    19.AX    19.AX  Dir#glob
  9.13     4.88      0.AX    41075     0.01     0.03  Time#-
  8.14     5.38      0.50    41075     0.01     0.01  Float#quo
  6.65     5.79      0.41    41075     0.01     0.01  Time#now
  2.06     5.91      0.13    41084     0.00     0.00  Time#initialize
  1.79     6.02      0.11    41075     0.00     0.00  Float#<
  1.79     6.13      0.11    41075     0.00     0.00  Float#/
  0.00     6.13      0.00        1     0.00     0.00  Array#join
  0.00     6.13      0.00       51     0.00     0.00  Kernel.puts
  0.00     6.13      0.00       51     0.00     0.00  IO#puts
  0.00     6.13      0.00      102     0.00     0.00  IO#write
  0.00     6.13      0.00       42     0.00     0.00  File#join
  0.00     6.13      0.00       43     0.00     0.00  Dir#chdir
  0.00     6.13      0.00       10     0.00     0.00  Class#new
  0.00     6.13      0.00        1     0.00     0.00  MyCollector#initialize
  0.00     6.13      0.00        9     0.00     0.00  Integer#round
  0.00     6.13      0.00        9     0.00     0.00  Time#to_s
  0.00     6.13      0.00        1     0.00  6131.00  MyCollector#collect
  0.00     6.13      0.00        1     0.00  6131.00  #toplevel
[Finished in 477.5s]

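It turns out that just walking through every file in every directory and calling mtime on it takes up to 7 minutes. Even though my .txt files are 2Mb each, it shouldn't take that long, should it?

Any suggestions?

Relying on mtime is not robust. In fact, Rails switched from mtime to a hash when naming versions of asset files.

You should keep a list of file-hash pairs. It can be obtained like this: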

require "digest"

file_hash_pair =
  Dir.glob("LOGS/**/*")
     .select { |f| File.file?(f) }
     .map { |f| [f, Digest::SHA1.hexdigest(File.read(f))] }

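You could save its contents in a file, perhaps as YAML. Run the code above each time; whenever file_hash_pair differs from the previous value, you know something has changed. If file_hash_pair.transpose[0] (the list of file names) has changed, a file was added, removed, or renamed. If, for a particular [file, hash] pair, the hash has changed, then that file's contents have changed.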
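
A minimal sketch of that save-and-compare cycle might look like the following. The helper names (snapshot, diff_snapshots, detect_changes) and the state file are hypothetical, not part of the original answer. It also assumes a Hash of path => digest instead of an array of pairs, so lookups are simpler, and it uses Digest::SHA1.file so each file is hashed without building the whole content as one string first.

```ruby
require "digest"
require "yaml"

# Hypothetical helper: map every regular file under root to its SHA1 digest.
def snapshot(root)
  Dir.glob(File.join(root, "**", "*"))
     .select { |f| File.file?(f) }
     .map { |f| [f, Digest::SHA1.file(f).hexdigest] }
     .to_h
end

# Hypothetical helper: compare two {path => digest} hashes.
def diff_snapshots(old, new)
  {
    added:   new.keys - old.keys,                                     # new file names
    removed: old.keys - new.keys,                                     # names that disappeared
    changed: (old.keys & new.keys).select { |f| old[f] != new[f] }    # same name, new digest
  }
end

# Hypothetical helper: load the previous snapshot from state_file (YAML),
# take a fresh snapshot, save it, and return what differs.
def detect_changes(root, state_file)
  old = File.exist?(state_file) ? (YAML.load_file(state_file) || {}) : {}
  new = snapshot(root)
  File.write(state_file, new.to_yaml)
  diff_snapshots(old, new)
end
```

Calling detect_changes("LOGS", "log_state.yml") on each run would then return the added, removed, and changed paths, and only those files would need to be parsed and written to the database. The state file should live outside the scanned directory, otherwise it gets hashed as well.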
