Speeding up openpyxl read times



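I'm working on a fairly large program that in turn pulls data from a fairly large Excel spreadsheet. For some odd reason, loading this large Excel file takes a long time, and I'd like to speed it up somehow. I did my own research and tried read_only, but that doesn't let me read any cells (at least not the way shown below). Any suggestions would be very welcome.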

import time as t

import openpyxl as xl

def LoadSheet(Filename, worksheet_nr):
    st = t.perf_counter()
    wb = xl.load_workbook(str(Filename))  # load the workbook
    Sheets = wb.sheetnames  # names of all worksheets in the loaded workbook
    ActiveSheet = wb[Sheets[worksheet_nr]]  # select the worksheet at index worksheet_nr
    print(round(t.perf_counter() - st, 3), "seconds of loading time.")
    print("Sheet is loaded and reading...")
    return ActiveSheet

def CountRC_nr(ActiveSheet, StartR, StartC):
    R = StartR  # starting row number
    C = StartC  # starting column number
    Val = None  # value of the current cell (nothing read yet)
    # Walk down the starting column until the first empty cell
    while Val != "None":
        Val = str(ActiveSheet.cell(row=R, column=C).value)
        R += 1
    NumberRows = R - StartR - 1
    Val = None
    R = StartR
    # Walk along the starting row until the first empty cell
    while Val != "None":
        Val = str(ActiveSheet.cell(row=R, column=C).value)
        C += 1
    NumberColumns = C - StartC - 1
    return NumberRows, NumberColumns

Since some of you asked to see how I used read_only, here is the modified code:

import time as t

import openpyxl as xl

def LoadSheet(Filename, worksheet_nr):
    st = t.perf_counter()
    wb = xl.load_workbook(str(Filename), read_only=True)  # load the workbook in read-only mode
    Sheets = wb.sheetnames  # names of all worksheets in the loaded workbook
    ActiveSheet = wb[Sheets[worksheet_nr]]  # select the worksheet at index worksheet_nr
    print(round(t.perf_counter() - st, 3), "seconds of loading time.")
    print("Sheet is loaded and reading...")
    return ActiveSheet

def CountRC_nr(ActiveSheet, StartR, StartC):
    R = StartR  # starting row number
    C = StartC  # starting column number
    Val = None  # value of the current cell (nothing read yet)
    # Walk down the starting column until the first empty cell
    while Val != "None":
        Val = str(ActiveSheet.cell(row=R, column=C).value)
        R += 1
    NumberRows = R - StartR - 1
    Val = None
    R = StartR
    # Walk along the starting row until the first empty cell
    while Val != "None":
        Val = str(ActiveSheet.cell(row=R, column=C).value)
        C += 1
    NumberColumns = C - StartC - 1
    return NumberRows, NumberColumns

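As I've already said, you're working against the API. If the dimensions of the worksheet are known, you can access them directly; otherwise you end up building your own counters, which makes little sense.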

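Read-only mode is fast for individual lookups, but every lookup involves parsing all the cells in the rows up to the cell of interest, so it is very inefficient for a large number of lookups such as the ones your code performs. Iterate over the rows once instead: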

from openpyxl import load_workbook

wb = load_workbook(filename, read_only=True)  # filename: path to your workbook
for ws in wb:
    print(ws.dimensions)
    for row in ws.rows:
        for cell in row:
            print(cell.value)  # do whatever you want with the cell
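
If you still need the row and column counts that CountRC_nr computes, here is a minimal sketch of getting them in a single pass over the rows rather than through repeated cell() lookups. The names count_block and count_block_in_sheet are made up for illustration, and the sketch assumes an openpyxl version whose iter_rows accepts values_only (2.6 or later, as far as I know).

```python
from itertools import takewhile


def count_block(rows):
    """Count the filled rows and columns of a block of row tuples.

    `rows` is an iterable of tuples of plain values whose first element is
    the top-left cell of interest. Rows are counted until the first column
    is empty; columns are counted along the first row until an empty value.
    """
    n_rows = 0
    n_cols = 0
    for i, row in enumerate(rows):
        if not row or row[0] is None:
            break  # first empty cell in the starting column
        if i == 0:
            # Leading non-empty values in the first row
            n_cols = sum(1 for _ in takewhile(lambda v: v is not None, row))
        n_rows += 1
    return n_rows, n_cols


def count_block_in_sheet(path, sheet_index=0, start_row=1, start_col=1):
    """Hypothetical wrapper: stream one worksheet in read-only mode."""
    # Imported here so count_block stays usable without openpyxl installed
    from openpyxl import load_workbook

    wb = load_workbook(path, read_only=True)
    try:
        ws = wb.worksheets[sheet_index]
        rows = ws.iter_rows(min_row=start_row, min_col=start_col,
                            values_only=True)
        return count_block(rows)
    finally:
        wb.close()  # read-only workbooks keep the file open until closed
```

Each row is parsed once here, so the cost grows with the number of rows read rather than with the number of individual lookups.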

You can easily convert rows into NumPy arrays using (cell.value for cell in row), but the Pandas library will also read worksheets directly into dataframes.
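
A rough sketch of the NumPy route, in case it helps. The helper names rows_to_array and sheet_to_array are hypothetical, and the sketch assumes every row has the same length; dtype=object is used so that mixed cell types and empty cells (None) survive the conversion.

```python
import numpy as np


def rows_to_array(rows):
    """Build a 2-D object array from an iterable of equal-length row tuples."""
    return np.array([tuple(row) for row in rows], dtype=object)


def sheet_to_array(path, sheet_index=0):
    """Hypothetical helper: read one worksheet into an object array."""
    # Imported here so rows_to_array can be used without openpyxl installed
    from openpyxl import load_workbook

    wb = load_workbook(path, read_only=True)
    try:
        ws = wb.worksheets[sheet_index]
        # values_only=True yields plain values instead of cell objects
        return rows_to_array(ws.iter_rows(values_only=True))
    finally:
        wb.close()
```

For the Pandas route, pandas.read_excel(path, sheet_name=0) returns a DataFrame for the first worksheet without any of this code.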
