Efficient search algorithm in Python to search for a string in all sheets of an Excel workbook and return the matching sheet numbers



How can I search for a string/pattern in all sheets of a workbook and return the numbers of all matching sheets?

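I can iterate over all the sheets in an Excel workbook one by one and search each sheet for the string (like a linear search), but that is inefficient and takes a long time, and I have to process hundreds of workbooks or even more.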

Update 1: Sample code

import glob
import os
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

def searchSheets(fnames):
    # Search Logic here
    # Loop over each Sheet
    # Search for string 'Balance' in each Sheet
    # Return matching Sheet Number
    pass

if __name__ == '__main__':
    __spec__ = None
    folder = "C://AB//"
    if os.path.exists(folder):
        files = glob.glob(folder + "*.xlsx")

    # Multi threading
    pool = Pool()
    pool = ThreadPool(processes=10)
    # Suggested by @Dan D
    pool.map(searchSheets, files)  # It did not work
    pool.close()

Update 2: Error

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:ProgramDataAnaconda3libmultiprocessingpool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "C:ProgramDataAnaconda3libmultiprocessingpool.py", line 44, in mapstar
    return list(map(*args))
  File "C:temp3.py", line 36, in searchSheet
    wb = xl_wb(f)
  File "C:ProgramDataAnaconda3libsite-packagesxlrd__init__.py", line 116, in open_workbook
    with open(filename, "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'C'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "C:temp3.py", line 167, in <module>
    pool.map(searchSheet,files)
  File "C:ProgramDataAnaconda3libmultiprocessingpool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:ProgramDataAnaconda3libmultiprocessingpool.py", line 644, in get
    raise self._value
FileNotFoundError: [Errno 2] No such file or directory: 'C'
>>>
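
About the error in Update 2: the file name 'C' is the first character of a path such as "C://AB//...". One plausible explanation, since the body of searchSheet is not shown, is that the function loops over its argument as if it were a list, while pool.map hands it one file name per call. A minimal sketch of that effect (buggy_search, fixed_search and the file names are made up for illustration):

```python
from multiprocessing.dummy import Pool as ThreadPool

def buggy_search(fnames):
    # Assumes a list of names, but pool.map passes ONE name per call,
    # so iterating here walks over the characters of that string.
    return [f for f in fnames][0]

def fixed_search(fname):
    # Treats the argument as the single file name it actually is.
    return fname

# Hypothetical file names, only used as strings here (nothing is opened).
files = ["C://AB//a.xlsx", "C://AB//b.xlsx"]

with ThreadPool(processes=2) as pool:
    first_seen_buggy = pool.map(buggy_search, files)
    first_seen_fixed = pool.map(fixed_search, files)

print(first_seen_buggy)  # ['C', 'C']
print(first_seen_fixed)  # the full file names, unchanged
```

If that is what happens, opening the first "file" fails with exactly the FileNotFoundError shown above.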

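The search in one sheet does not depend on the search in any other sheet, and the search in one workbook does not depend on the search in any other workbook. This is a typical case where you can use multithreading.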

This post describes how to do that in Python: How to use threading in Python?

So, in pseudocode:

  • Search every sheet of every workbook in parallel
  • Aggregate and present the results
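
The two steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: WORKBOOKS is a made-up in-memory stand-in for real files, and search_workbook and search_all are hypothetical names. The match is a case-sensitive substring test on string cells.

```python
from multiprocessing.dummy import Pool as ThreadPool

# Hypothetical stand-in for real workbooks: file name -> list of sheets,
# each sheet being a list of rows of cell values.
WORKBOOKS = {
    "a.xlsx": [[["Name", "Amount"]], [["Balance", 10]]],
    "b.xlsx": [[["Opening Balance"]], [["x"]], [[None, "balance due"]]],
    "c.xlsx": [[["nothing here"]]],
}

def search_workbook(fname, needle="Balance"):
    """Return the 1-based numbers of the sheets that contain needle."""
    hits = []
    for number, rows in enumerate(WORKBOOKS[fname], start=1):
        if any(isinstance(cell, str) and needle in cell
               for row in rows for cell in row):
            hits.append(number)
    return hits

def search_all(fnames, workers=4):
    """Search the workbooks in parallel and aggregate results per file."""
    with ThreadPool(processes=workers) as pool:
        # map() blocks until every workbook is done and keeps input order.
        results = pool.map(search_workbook, fnames)
    return dict(zip(fnames, results))

matches = search_all(sorted(WORKBOOKS))
for fname, sheets in matches.items():
    print(fname, sheets)
```

Threads mainly help while workers are waiting on file I/O. If parsing the workbooks dominates the run time, a process-based multiprocessing.Pool may scale better, at the cost of extra start-up overhead.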

Solution

import glob
import multiprocessing
import os
from multiprocessing.dummy import Pool as ThreadPool

def searchSheets(fnames):
    # Called by pool.map once per workbook, with a single file name
    # Search Logic here
    # Loop over each Sheet
    # Search for string 'Balance' in each Sheet
    # Return matching Sheet Number
    pass

if __name__ == '__main__':
    multiprocessing.freeze_support()  # this line is needed on Windows
    # only, found it in many other posts
    __spec__ = None
    folder = "C://AB//"
    files = []
    if os.path.exists(folder):
        files = glob.glob(folder + "*.xlsx")

    # Multi threading
    pool = ThreadPool(processes=10)
    # Suggested by @Dan D
    # pool.map(searchSheets, files)  # It did not work
    results = pool.map(searchSheets, [workbook for workbook in files])
    pool.close()
    # pool.join()  # Removed this from code as it made all the workers wait