What is an efficient way to transpose a matrix in a text file?



I have a text file that contains a two-dimensional matrix. It looks like the following.

01 02 03 04 05
06 07 08 09 10
11 12 13 14 15
16 17 18 19 20

As you can see, rows are separated by newlines and columns by spaces. I need an efficient way to transpose this matrix, so that it becomes:

01 06 11 16
02 07 12 17
03 08 13 18
04 09 14 19
05 10 15 20

In reality, the matrix is 10000 by 14000 and each element is a double/float. Trying to load the whole file/matrix into memory would be very expensive, if not impossible.

Does anyone know of a utility API that can do something like this, or an efficient approach?

What I have tried: my naive approach was to create a temporary file for each row of the transposed matrix (one per column of the original). So, for 10000 such rows, I would have 10000 temporary files. As I read each row of the source, I tokenize each value and append it to the corresponding file. For the example above, I would end up with something like the following.

file-0: 01 06 11 16
file-1: 02 07 12 17
file-2: 03 08 13 18
file-3: 04 09 14 19
file-4: 05 10 15 20

Then I read each file back and append them all into a single output file. I was wondering whether there is a smarter way, since I know the file I/O operations will be the pain point.

A solution with minimal memory consumption and very poor performance:

import org.apache.commons.io.FileUtils;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
public class MatrixTransposer {
  private static final String TMP_DIR = System.getProperty("java.io.tmpdir") + "/";
  private static final String EXTENSION = ".matrix.tmp.result";
  private final String original;
  private final String dst;
  public MatrixTransposer(String original, String dst) {
    this.original = original;
    this.dst = dst;
  }
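  // Streams the source file row by row; each value is appended to the temp file for
  // its column, so temp file i ends up holding row i of the transposed matrix.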
  public void transpose() throws IOException {
    deleteTempFiles();
    int max = 0;
    FileReader fileReader = null;
    BufferedReader reader = null;
    try {
      fileReader = new FileReader(original);
      reader = new BufferedReader(fileReader);
      String row;
      while((row = reader.readLine()) != null) {
        max = appendRow(max, row, 0);
      }
    } finally {
      if (null != reader) reader.close();
      if (null != fileReader) fileReader.close();
    }

    mergeResultingRows(max);
  }
  private void deleteTempFiles() {
    for (String tmp : new File(TMP_DIR).list()) {
      if (tmp.endsWith(EXTENSION)) {
        FileUtils.deleteQuietly(new File(TMP_DIR + "/" + tmp));
      }
    }
  }
  private void mergeResultingRows(int max) throws IOException {
    FileUtils.deleteQuietly(new File(dst));
    FileWriter writer = null;
    BufferedWriter out = null;
    try {
      writer = new FileWriter(new File(dst), true);
      out = new BufferedWriter(writer);
      for (int i = 0; i <= max; i++) {
        out.write(FileUtils.readFileToString(new File(TMP_DIR + i + EXTENSION)) + "\r\n");
      }
    } finally {
      if (null != out) out.close();
      if (null != writer) writer.close();
    }
  }
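  // Appends each value of the given row to its column's temp file. i is the running
  // column index; the returned value tracks the highest column index seen so far.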
  private int appendRow(int max, String row, int i) throws IOException {
    for (String element : row.split(" ")) {
      FileWriter writer = null;
      BufferedWriter out = null;
      try {
        writer = new FileWriter(TMP_DIR + i + EXTENSION, true);
        out = new BufferedWriter(writer);
        out.write(columnPrefix(i) + element);
      } finally {
        if (null != out) out.close();
        if (null != writer) writer.close();
      }
      max = Math.max(i++, max);
    }
    return max;
  }
  private String columnPrefix(int i) {
    // A space separates values within a file, but not before the first value.
    return new File(TMP_DIR + i + EXTENSION).length() == 0 ? "" : " ";
  }
  public static void main(String[] args) throws IOException {
    new MatrixTransposer("c:/temp/mt/original.txt", "c:/temp/mt/transposed.txt").transpose();
  }
}

The total size is 1.12 GB if the elements are doubles, half that for floats. That is small enough for today's machines that you could do the whole thing in memory. Alternatively, you might want to transpose in place, which is a decidedly non-trivial task; the Wikipedia article on in-place matrix transposition gives further links.
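For illustration, here is a minimal sketch of the straightforward in-memory approach (not in-place); the file names are placeholders, and it assumes space-separated values as in the example above.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class InMemoryTranspose {
    public static void main(String[] args) throws IOException {
        // Read every row into memory; note that storing the values as Java strings
        // costs considerably more than the raw 1.12 GB mentioned above.
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader br = Files.newBufferedReader(Paths.get("original.txt"))) {
            String line;
            while ((line = br.readLine()) != null) {
                rows.add(line.split(" "));
            }
        }
        try (BufferedWriter bw = Files.newBufferedWriter(Paths.get("transposed.txt"))) {
            // Column col of the source becomes row col of the destination.
            for (int col = 0; col < rows.get(0).length; col++) {
                for (int row = 0; row < rows.size(); row++) {
                    if (row > 0) bw.write(' ');
                    bw.write(rows.get(row)[col]);
                }
                bw.newLine();
            }
        }
    }
}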

I suggest you assess how many columns you can read without consuming too much memory, and then write the final file by reading the source file several times, one block of columns per pass. Suppose you have 10000 columns: first read columns 0 to 250 of the source into collections and write them to the final file, then do the same for columns 250 to 500, and so on.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class TransposeMatrixUtils {
    private static final Logger logger = LoggerFactory.getLogger(TransposeMatrixUtils.class);
    // Max number of bytes of the src file involved in each chunk
    public static int MAX_BYTES_PER_CHUNK = 1024 * 50_000;// 50 MB
    public static File transposeMatrix(File srcFile, String separator) throws IOException {
        File output = File.createTempFile("output", ".txt");
        transposeMatrix(srcFile, output, separator);
        return output;
    }
    public static void transposeMatrix(File srcFile, File destFile, String separator) throws IOException {
        long bytesPerColumn = assessBytesPerColumn(srcFile, separator);// rough assessment of bytes per column
        int nbColsPerChunk = (int) (MAX_BYTES_PER_CHUNK / bytesPerColumn);// number of columns per chunk according to the limit of bytes to be used per chunk
        if (nbColsPerChunk == 0) nbColsPerChunk = 1;// in case a single column has more bytes than the limit ...
        logger.debug("file length : {} bytes. max bytes per chunk : {}. nb columns per chunk : {}.", srcFile.length(), MAX_BYTES_PER_CHUNK, nbColsPerChunk);
        try (FileWriter fw = new FileWriter(destFile); BufferedWriter bw = new BufferedWriter(fw)) {
            boolean remainingColumns = true;
            int offset = 0;
            while (remainingColumns) {
                remainingColumns = writeColumnsInRows(srcFile, bw, separator, offset, nbColsPerChunk);
                offset += nbColsPerChunk;
            }
        }
    }
    private static boolean writeColumnsInRows(File srcFile, BufferedWriter bw, String separator, int offset, int nbColumns) throws IOException {
        List<String>[] newRows;
        boolean remainingColumns = true;
        try (FileReader fr = new FileReader(srcFile); BufferedReader br = new BufferedReader(fr)) {
            String[] split0 = br.readLine().split(separator);
            if (split0.length <= offset + nbColumns) remainingColumns = false;
            int lastColumnIndex = Math.min(split0.length, offset + nbColumns);
            logger.debug("chunk for column {} to {} among {}", offset, lastColumnIndex, split0.length);
            newRows = new List[lastColumnIndex - offset];
            for (int i = 0; i < newRows.length; i++) {
                newRows[i] = new ArrayList<>();
                newRows[i].add(split0[i + offset]);
            }
            String line;
            while ((line = br.readLine()) != null) {
                String[] split = line.split(separator);
                for (int i = 0; i < newRows.length; i++) {
                    newRows[i].add(split[i + offset]);
                }
            }
        }
        for (int i = 0; i < newRows.length; i++) {
            bw.write(newRows[i].get(0));
            for (int j = 1; j < newRows[i].size(); j++) {
                bw.write(separator);
                bw.write(newRows[i].get(j));
            }
            bw.newLine();
        }
        return remainingColumns;
    }
    private static long assessBytesPerColumn(File file, String separator) throws IOException {
        try (FileReader fr = new FileReader(file); BufferedReader br = new BufferedReader(fr)) {
            int nbColumns = br.readLine().split(separator).length;
            return file.length() / nbColumns;
        }
    }
}

It should be more efficient than creating lots of temporary files, which generates a great deal of I/O.

For your 10000 x 14000 example, this code took 3 minutes to create the transposed file. With MAX_BYTES_PER_CHUNK = 1024 * 100_000 instead of 1024 * 50_000 it took 2 minutes, but of course it consumed more RAM.
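For reference, a hypothetical call site (the paths are placeholders) that trades RAM for speed by raising the chunk size could look like this:

import java.io.File;
import java.io.IOException;

public class TransposeDemo {
    public static void main(String[] args) throws IOException {
        // Bigger chunks mean fewer full passes over the source file,
        // at the cost of holding more columns in memory per pass.
        TransposeMatrixUtils.MAX_BYTES_PER_CHUNK = 1024 * 100_000; // ~100 MB per chunk
        TransposeMatrixUtils.transposeMatrix(
                new File("c:/temp/mt/original.txt"),
                new File("c:/temp/mt/transposed.txt"),
                " ");
    }
}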
