I'm writing a Python script that searches a docx file for certain words, e.g. finds the word "car", and highlights each occurrence in a defined color.
I use the python-docx module to locate and highlight the text. I can apply the change at the run level (`run.font.highlight_color`), but because MS Word stores text in an XML file that tracks every edit, a word I'm looking for may be split across several runs, or sit inside a long sentence. Since my end goal is to target one or more specific words, I'm struggling to get the expected result.
My main idea is to run a function that "cleans up" the runs or the XML so that each target word sits in a run of its own before highlighting it, but I haven't found any documentation on this, and I'm worried about losing font attributes, styles, etc.
This is the code I have so far:
import re

import docx
from docx.enum.text import WD_COLOR_INDEX

doc = docx.Document('demo.docx')

words = {'car': 'RED',
         'bus': 'GREEN',
         'train station': 'BLUE'}

for word, color in words.items():
    w = re.compile(fr'\b{word}\b')  # note: \b word boundaries, not a literal "b"
    for par in doc.paragraphs:
        for run in par.runs:
            s = re.findall(w, run.text)
            if s:
                run.font.highlight_color = getattr(WD_COLOR_INDEX, color)

doc.save('new.docx')
Has anyone run into the same problem, or have ideas for a different approach?
Thanks
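For context on why the per-run matching above fails: Word may split a single word across several runs, so the pattern never matches any one `run.text`, even though it matches `paragraph.text`. A minimal stdlib sketch (no python-docx here; the run texts are just plain strings I made up for illustration) of matching against the joined text and mapping the match offset back to a run index with `itertools.accumulate`:

```python
import itertools
import re

# Hypothetical run texts: Word has split "car" across two runs.
run_texts = ["The c", "ar is parked", " outside."]
paragraph_text = "".join(run_texts)

# Matching per run finds nothing; matching the joined text succeeds.
assert not any(re.search(r"\bcar\b", t) for t in run_texts)
match = re.search(r"\bcar\b", paragraph_text)

# Cumulative end offsets of the runs let us map match.start() back
# to the run it falls in, plus a character offset within that run.
r_ends = list(itertools.accumulate(len(t) for t in run_texts))
r_idx = next(i for i, end in enumerate(r_ends) if match.start() < end)
offset = match.start() - (r_ends[r_idx - 1] if r_idx else 0)
print(r_idx, offset)  # run 0, offset 4 ("c" is the fifth char of run 0)
```

This offset arithmetic is exactly what the answer below builds on when it splits runs apart.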
This function can be used to isolate a run within a paragraph based on the `match.start()` and `match.end()` values of a regex match against `paragraph.text`. From there, you can change the attributes of the returned run at will without affecting the adjacent text:
import copy
import itertools

from docx.text.run import Run

def isolate_run(paragraph, start, end):
    """Return docx.text.Run object containing only `paragraph.text[start:end]`.

    Runs are split as required to produce a new run at the `start` that ends at `end`.
    Runs are unchanged if the indicated range of text already occupies its own run. The
    resulting run object is returned.

    `start` and `end` are as in Python slice notation. For example, the first three
    characters of the paragraph have (start, end) of (0, 3). `end` is not the index of
    the last character. These correspond to `match.start()` and `match.end()` of a regex
    match object and `s[start:end]` of Python slice notation.
    """
    rs = tuple(paragraph._p.r_lst)

    def advance_to_run_containing_start(start, end):
        """Return (r, r_idx, start, end) quadruple indicating start run and adjusted offsets.

        The start run is the run the `start` offset occurs in. The returned `start` and
        `end` values are adjusted to be relative to the start of `r_idx`.
        """
        # --- add 0 at end so `r_ends[-1] == 0` when `r_idx == 0` ---
        r_ends = tuple(itertools.accumulate(len(r.text) for r in rs)) + (0,)
        r_idx = 0
        while start >= r_ends[r_idx]:
            r_idx += 1
        skipped_rs_offset = r_ends[r_idx - 1]
        return rs[r_idx], r_idx, start - skipped_rs_offset, end - skipped_rs_offset

    def split_off_prefix(r, start, end):
        """Return adjusted `end` after splitting prefix off into separate run.

        Does nothing if `r` is already the start of the isolated run.
        """
        if start > 0:
            prefix_r = copy.deepcopy(r)
            r.addprevious(prefix_r)
            r.text = r.text[start:]
            prefix_r.text = prefix_r.text[:start]
        return end - start

    def split_off_suffix(r, end):
        """Split `r` at `end` such that suffix is in separate following run."""
        suffix_r = copy.deepcopy(r)
        r.addnext(suffix_r)
        r.text = r.text[:end]
        suffix_r.text = suffix_r.text[end:]

    def lengthen_run(r, r_idx, end):
        """Add prefixes of following runs to `r` until `end` is reached."""
        while len(r.text) < end:
            suffix_len_reqd = end - len(r.text)
            r_idx += 1
            next_r = rs[r_idx]
            if len(next_r.text) <= suffix_len_reqd:
                # --- subsume next run ---
                r.text = r.text + next_r.text
                next_r.getparent().remove(next_r)
                continue
            if len(next_r.text) > suffix_len_reqd:
                # --- take prefix from next run ---
                r.text = r.text + next_r.text[:suffix_len_reqd]
                next_r.text = next_r.text[suffix_len_reqd:]

    r, r_idx, start, end = advance_to_run_containing_start(start, end)

    end = split_off_prefix(r, start, end)

    # --- if run is longer than isolation-range we need to split-off a suffix run ---
    if len(r.text) > end:
        split_off_suffix(r, end)
    # --- if run is shorter than isolation-range we need to lengthen it by taking text
    # --- from subsequent runs ---
    elif len(r.text) < end:
        lengthen_run(r, r_idx, end)

    return Run(r, paragraph)
This is more complicated than one might expect; it's certainly more complicated than I expected when I started on it. In any case, it comes in handy from time to time.
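To see the split-and-merge behavior in isolation, the same algorithm can be exercised without python-docx by modeling a paragraph as a plain list of run texts. This is a simplified sketch of the logic only (not the XML-backed version above, which also preserves per-run formatting by deep-copying run elements); `isolate_span` and its sample data are names I've made up for illustration:

```python
import itertools

def isolate_span(run_texts, start, end):
    """Split/merge a list of run strings so that exactly
    `"".join(run_texts)[start:end]` occupies one run.
    Return (new_run_texts, index_of_isolated_run)."""
    # Find the run containing `start` and make offsets run-relative.
    r_ends = list(itertools.accumulate(len(t) for t in run_texts))
    r_idx = next(i for i, e in enumerate(r_ends) if start < e)
    skipped = r_ends[r_idx - 1] if r_idx else 0
    start, end = start - skipped, end - skipped

    runs = list(run_texts)
    # Split off a prefix run if the span starts mid-run.
    if start > 0:
        runs[r_idx:r_idx + 1] = [runs[r_idx][:start], runs[r_idx][start:]]
        r_idx += 1
        end -= start
    # Absorb following runs until the isolated run reaches `end`.
    while len(runs[r_idx]) < end:
        runs[r_idx:r_idx + 2] = [runs[r_idx] + runs[r_idx + 1]]
    # Split off a suffix run if the run extends past `end`.
    if len(runs[r_idx]) > end:
        runs[r_idx:r_idx + 1] = [runs[r_idx][:end], runs[r_idx][end:]]
    return runs, r_idx

runs, i = isolate_span(["The c", "ar is parked", " outside."], 4, 7)
print(runs, i)  # ['The ', 'car', ' is parked', ' outside.'] 1
```

With the real function, the returned `Run` at that index is the one whose `font.highlight_color` you would set.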