FileInputStream and Huffman Tree



我正在创建一个霍夫曼树来压缩文本文件,但我遇到了一些问题。我所做的这个方法应该采用一个FileInputStream,它输入文本数据并返回字符和计数的Map。然而,要做到这一点,我需要定义byte[]的大小来存储数据。问题是byte[]数组的大小需要正好合适的长度,否则Map也会有一些不需要的数据。有没有办法使byte[]正好合适的大小?

下面是我的代码:
// provides a count of characters in an input file and place in map
public static Map<Character, Integer> getCounts(FileInputStream input)
        throws IOException {
    Map<Character, Integer> output = new TreeMap<Character, Integer>(); // treemap keeps keys in sorted order (chars alphabetized)
    byte[] fileContent = new byte[100]; // creates a byte[]
    //ArrayList<Byte> test = new ArrayList<Byte>();
    input.read(fileContent);                // reads the input into fileContent
    String test = new String(fileContent);  // contains entire file into this string to process
    // goes through each character of String to put chars as keys and occurrences as keys
    for (int i = 0; i < test.length(); i++) {
        char temp = test.charAt(i);
        if (output.containsKey(temp)) { // seen this character before; increase count
            int count = output.get(temp);
            System.out.println("repeat; char is: " + temp + "count is: " + count);
            output.put(temp, count + 1);
        } else {                        // Haven't seen this character before; create count of 1
            System.out.println("new; char is: " + temp + "count is: 1");
            output.put(temp, 1);
        }
    }
    return output;
}

FileInputStream.read()的返回值是实际读取的字节数,在EOF的情况下是-1。您可以在for循环中使用此值而不是test.length()

注意,read()不能保证读取缓冲区长度的字节值,即使没有到达文件的末尾,所以它通常在循环中使用:

int bytesRead;
//Read until there is no more bytes to read.
while((bytesRead = input.read(buf))!=-1)
{
    //You have next bytesRead bytes in a buffer here      
} 

最后,如果字符串是Unicode,这种方法将不起作用,因为read()可以在字符中间终止。考虑使用InputStreamReader来包装FileInputStream:

Reader fileReader = new InputStreamReader(input, "UTF-8");
int charsRead;
char buf[] = new char[256];
while ((charsRead = fileReader.read(buf)) > 0) {
   //You have charsRead characters in a buffer here
}

最新更新