Openpyxl - 在 ReadOnlyWorksheet 中查找隐藏的行



我在以只读模式打开工作簿时检测隐藏了哪些行时遇到问题。

当我在加载工作簿时将read_only参数设置为 False 时,它可以完美运行,因为这样我就可以遍历row_dimensions以检查哪些行被隐藏 - 但在读写模式下打开工作簿需要更长的时间(~2 分钟与只读模式下的 ~20 秒(并消耗超过 1GB 的 RAM。

不幸的是,只读工作表没有row_dimensions属性。

欢迎任何帮助。

根本问题是解析器只使用一次,并在迭代所有行后被丢弃。这就是read_only模式优化内存分配并根据请求生成行的方式。有趣的是,解析器本身仍在创建包含行属性的row_dimensions

您可以尝试一些解决方法。为了代替分叉和创建公开 ReadOnlyWorksheet 解析器的官方修复程序,我使用了猴子补丁:

from openpyxl.worksheet._read_only import ReadOnlyWorksheet, WorkSheetParser, EMPTY_CELL
# The override:
class MyReadOnlyWorksheet(ReadOnlyWorksheet):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.parser = None
    def row_is_hidden(self, row_index):
        str_row_index = str(row_index)
        if self.parser and str_row_index in self.parser.row_dimensions:
            return self.parser.row_dimensions[str_row_index].get('hidden') == '1'
        if self.parser is None or row_index > self.parser.row_counter:
            raise RuntimeError('Must generate the row before calling')
        return False
    def _cells_by_row(self, min_col, min_row, max_col, max_row, values_only=False):
        """
        The source worksheet file may have columns or rows missing.
        Missing cells will be created.
        Logically the same but saves the parser to "self" during row iteration
        """
        filler = EMPTY_CELL
        if values_only:
            filler = None
        max_col = max_col or self.max_column
        max_row = max_row or self.max_row
        empty_row = []
        if max_col is not None:
            empty_row = (filler,) * (max_col + 1 - min_col)
        counter = min_row
        idx = 1
        src = self._get_source()
        parser = WorkSheetParser(src, self._shared_strings,
                                 data_only=self.parent.data_only, epoch=self.parent.epoch,
                                 date_formats=self.parent._date_formats)
        ### Cache parser in order to check generated row attrs ###
        self.parser = parser
        for idx, row in parser.parse():
            if max_row is not None and idx > max_row:
                break
            # some rows are missing
            for _ in range(counter, idx):
                counter += 1
                yield empty_row
            # return cells from a row
            if counter <= idx:
                row = self._get_row(row, min_col, max_col, values_only)
                counter += 1
                yield row
        if max_row is not None and max_row < idx:
            for _ in range(counter, max_row+1):
                yield empty_row
        src.close()
# the monkey patch:
import openpyxl.reader.excel
openpyxl.reader.excel.ReadOnlyWorksheet = MyReadOnlyWorksheet
# the test drive:
from openpyxl import load_workbook
file_location = ''  # load your file
workbook = load_workbook(file_location, data_only=True, keep_links=False, read_only=True)
for worksheet in workbook.worksheets:
    row_gen = worksheet.rows
    for i, row in enumerate(row_gen, start=1):
        if worksheet.row_is_hidden(i):
            continue  # do not process hidden rows.

这符合您的需求,但要小心!在生产中使用之前,我会添加足够的测试覆盖率(想想未来版本重新键入row_dimension字典、从read_only解析中删除row_dimensions等(。同样,您可以将自己的访问器添加到公开其他行属性的工作表中(或返回整个字典(。

祝您编码愉快!

最新更新