Counting the number of unique lines in a huge CSV file

I have a huge CSV file (about 5-6 GB in size) in Hive. Is there a way to count the number of unique lines present in the file?

I don't have any clue how to approach this.

I need to compare the result with another Hive table that has similar content but only unique values. So basically I need the count of distinct lines. A naive baseline would build an in-memory set of all lines, which presumably won't scale to 5-6 GB (the path below is a placeholder):
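# Naive baseline: builds a set of every distinct line in memory at once.
# Simple and correct, but can exhaust RAM on a 5-6 GB input.
with open('input_file.txt') as fp:
    print('Distinct lines:', len(set(fp)))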

The following logic is based on hashing. Instead of keeping whole lines, it stores the hash of each line, which keeps memory usage small, and then compares the hashes. Equal strings always produce equal hashes, but in rare cases two different strings can collide on the same hash, so for those cases the actual lines are re-read and compared as strings to be certain. This should also work for large files.

from collections import Counter

input_file = r'input_file.txt'

# Main logic:
# - If two hashes differ, the lines are guaranteed to differ.
# - If two hashes are equal, the lines may still differ (a collision),
#   so only those lines are re-read and compared as actual strings.

def count_with_index(values):
    '''Returns a dict mapping each value to (count, [indexes]).'''
    result = {}
    for i, v in enumerate(values):
        count, indexes = result.get(v, (0, []))
        indexes.append(i)
        result[v] = (count + 1, indexes)
    return result

def get_lines(fp, line_numbers):
    '''Yields the lines of fp whose 0-based index is in line_numbers.'''
    wanted = set(line_numbers)  # a set gives O(1) membership tests
    return (line for i, line in enumerate(fp) if i in wanted)

# First pass: store one hash per line instead of the line itself
with open(input_file) as fp:
    hashes = count_with_index(map(hash, fp))

# Every hash that occurs exactly once corresponds to a unique line
total_sum = sum(1 for count, _ in hashes.values() if count == 1)

# Second pass: for each hash that occurs more than once, re-read the
# actual lines and count the distinct strings among them. One hash is
# handled at a time, so little extra memory is consumed.
for h, (count, line_numbers) in hashes.items():
    if count == 1:
        continue
    with open(input_file) as fp:
        distinct_lines = Counter(get_lines(fp, line_numbers))
    total_sum += len(distinct_lines)

print('Total number of unique lines is:', total_sum)
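
A quick way to sanity-check the logic is a tiny hand-made file with known duplicates (a hypothetical sample, reusing the same placeholder path):

# Hypothetical smoke test: write 5 lines, of which 3 are distinct.
with open('input_file.txt', 'w') as fp:
    fp.write('a\nb\na\nc\nb\n')
# Re-running the script above should then print:
# Total number of unique lines is: 3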
