python-pptx: Extracting text from slides yields odd splits



I am using the "Extract all text from slides in presentation" example at https://python-pptx.readthedocs.io/en/latest/user/quickstart.html to extract text from some PowerPoint slides.

from pptx import Presentation
prs = Presentation(path_to_presentation)
# text_runs will be populated with a list of strings,
# one for each text run in presentation
text_runs = []
for slide in prs.slides:
    for shape in slide.shapes:
        if not shape.has_text_frame:
            continue
        for paragraph in shape.text_frame.paragraphs:
            for run in paragraph.runs:
                text_runs.append(run.text)

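It seems to work fine, except that I get odd splits in some of the text runs. Text I would expect to be grouped together is being split, with no obvious pattern I can detect. For example, sometimes a slide title is split into two parts and sometimes it is not.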

I have found that I can eliminate the odd splits by retyping the text on the slide.

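I can't, or at least don't want to, merge the two parts of the split text back together, because sometimes the second part of the text has been merged with a different text run. For example, on the title slide of a slide deck, the title will be split into two parts, with the second part of the title merged with the title slide's subtitle text.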

Any suggestions on how to eliminate the odd/unwanted splits? Or is this behavior more or less to be expected when reading text from PowerPoint?

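I'd say this is definitely to be expected. PowerPoint will split runs whenever it pleases, perhaps to highlight a misspelled word, or just because you paused while typing or went back in to fix a typo or something.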

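The only thing certain about a run is that all the characters it contains share the same character formatting. There is no guarantee, for example, that a run is what one might call "greedy", including as many characters as possible that do share the same character formatting.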

If you want to reconstruct that "greedy" coherence in the runs, that will be up to you, perhaps using an algorithm like this:

last_run = None
for run in paragraph.runs:
    if last_run is None:
        last_run = run
        continue
    if has_same_formatting(run, last_run):
        # Merge in place; last_run remains the surviving (base) run.
        combine_runs(last_run, run)
        continue
    last_run = run

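That leaves you to implement has_same_formatting() and combine_runs(). There is a certain advantage here, because runs can contain differences you don't care about, such as a dirty attribute or whatever, and you can pick and choose which differences matter to you.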

A start at an implementation of has_same_formatting() would be:

def has_same_formatting(run, run_2):
    font, font_2 = run.font, run_2.font
    if font.bold != font_2.bold:
        return False
    if font.italic != font_2.italic:
        return False
    # ---same with color, size, type-face, whatever you want---
    return True

and combine_runs(base, suffix) would look something like this:

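def combine_runs(base, suffix):
    # Append the suffix text to the base run, then remove the suffix's XML element.
    base.text = base.text + suffix.text
    r_to_remove = suffix._r
    r_to_remove.getparent().remove(r_to_remove)
    # Return the surviving run so the caller can keep using it as last_run.
    return base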

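@thegreat - Here is my final code block. I'm not sure how thoroughly it was tested. As I mentioned elsewhere, IIRC something else came up and I never really got back to this "spare time" project.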

import sys

try:
    import pptx
except ImportError:
    print("Error when trying to import the pptx module to bobs_useful_functions.py.")
    print("Please install a current version of the python-pptx library.")
    sys.exit(1)
try:
    import pptx.exc
except ImportError:
    print("Error when trying to import the pptx.exc module to bobs_useful_functions.py.")
    print("Please install a current version of the python-pptx library.")
    sys.exit(1)
from pptx import Presentation
from pptx.exc import PackageNotFoundError


def read_text_from_powerpoint(path_to_presentation, only_first_slide=True):
    # Adapted from an example at https://python-pptx.readthedocs.io/en/latest/user/quickstart.html
    # and the StackOverflow question "python-pptx Extract text from slide titles".
    #
    # Note: Using the "runs" method described in the python-pptx QuickStart example occasionally
    #       resulted in breaks in the text read from the slide, for no obvious reason.
    try:
        prs = Presentation(path_to_presentation)
        # text_runs will be populated with a list of strings,
        # one for each text run in presentation
        text_runs = []
        slide_counter = 0
        for slide in prs.slides:
            slide_counter += 1
            if slide_counter == 1:
                # First slide: read whole-shape text to avoid run-level splits.
                for shape in slide.shapes:
                    if not shape.has_text_frame:
                        continue
                    text_runs.append(shape.text)
            else:
                if only_first_slide:
                    break
                else:
                    for shape in slide.shapes:
                        if not shape.has_text_frame:
                            continue
                        for paragraph in shape.text_frame.paragraphs:
                            for run in paragraph.runs:
                                text_runs.append(run.text)
        if only_first_slide:
            # This assumes the first string in "text_runs" is the title, which in turn assumes
            # the first slide HAS a title.
            title = ''.join(text_runs[:1])  # Basically, convert from a one-element list to a string
            # Join with a space between the elements of 'text_runs'.  For the first slide, this would
            # be what's typically thought of as the slide subtitle, plus any notes or comments also on
            # the first slide.
            subtitle = ' '.join(text_runs[1:])
            output = [title, subtitle]
        else:
            output = text_runs
    except PackageNotFoundError:
        print("\nWARNING: Unable to open the presentation:\n    %s" % path_to_presentation)
        print("The presentation may be password protected.")
        # Note that this output text is treated as a flag value.
        # For that reason, be EXTREMELY careful about changing this output text.
        output = ['PackageNotFoundError - Possible password-protected PowerPoint']
    return output
