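I have been trying to use pypandoc to convert the HTML string question_text_html in the code below (a math question written in HTML) into a LaTeX string. However, the converted string keeps including unrelated fragments such as "\protect\hypertarget{MJX-…}……".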
import pypandoc
from selenium import webdriver

# Assumption: the original snippet never created the driver; Firefox is used here as an example.
driver = webdriver.Firefox()
driver.get("https://nigerianscholars.com/past-questions/mathematics/?show_answers=yes")
# Selenium 3 style lookups; Selenium 4 uses find_elements(By.CLASS_NAME, ...) instead.
question_blocks = driver.find_elements_by_class_name('question_block')
for question_block in question_blocks:
    question_text = question_block.find_element_by_class_name('question_text')
    question_text_html = question_text.get_attribute('innerHTML')
    # Convert the HTML fragment to LaTeX with pandoc.
    question_latex = pypandoc.convert_text(question_text_html, 'tex', format='html')
    print(f'Question Html is {question_text_html}')
    print(f'Question latex is {question_latex}')
It typically outputs:
Question Html is <html><body><p class="q_question">Differentiate <span class="MathJax_Preview" style="color: inherit;"></span><span class="mjx-chtml MathJax_CHTML" data-mathml='<math xmlns="http://www.w3.org/1998/Math/MathML"><mo stretchy="false">(</mo><mn>2</mn><mi>x</mi><mo>+</mo><mn>5</mn><msup><mo stretchy="false">)</mo><mn>2</mn></msup><mo stretchy="false">(</mo><mi>x</mi><mo>&#x2212;</mo><mn>4</mn><mo stretchy="false">)</mo></math>' id="MathJax-Element-1-Frame" role="presentation" style="font-size: 114%; position: relative;" tabindex="0"><span aria-hidden="true" class="mjx-math" id="MJXc-Node-1"><span class="mjx-mrow" id="MJXc-Node-2"><span class="mjx-mo" id="MJXc-Node-3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.485em; padding-bottom: 0.572em;">(</span></span><span class="mjx-mn" id="MJXc-Node-4"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.397em; padding-bottom: 0.353em;">2</span></span><span class="mjx-mi" id="MJXc-Node-5"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.221em; padding-bottom: 0.309em;">x</span></span><span class="mjx-mo MJXc-space2" id="MJXc-Node-6"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.309em; padding-bottom: 0.441em;">+</span></span><span class="mjx-mn MJXc-space2" id="MJXc-Node-7"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.397em; padding-bottom: 0.353em;">5</span></span><span class="mjx-msubsup" id="MJXc-Node-8"><span class="mjx-base"><span class="mjx-mo" id="MJXc-Node-9"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.485em; padding-bottom: 0.572em;">)</span></span></span><span class="mjx-sup" style="font-size: 70.7%; vertical-align: 0.513em; padding-left: 0px; padding-right: 0.071em;"><span class="mjx-mn" id="MJXc-Node-10" style=""><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.397em; padding-bottom: 0.353em;">2</span></span></span></span><span class="mjx-mo" id="MJXc-Node-11"><span class="mjx-char MJXc-TeX-main-R" 
style="padding-top: 0.485em; padding-bottom: 0.572em;">(</span></span><span class="mjx-mi" id="MJXc-Node-12"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.221em; padding-bottom: 0.309em;">x</span></span><span class="mjx-mo MJXc-space2" id="MJXc-Node-13"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.309em; padding-bottom: 0.441em;">−</span></span><span class="mjx-mn MJXc-space2" id="MJXc-Node-14"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.397em; padding-bottom: 0.353em;">4</span></span><span class="mjx-mo" id="MJXc-Node-15"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.485em; padding-bottom: 0.572em;">)</span></span></span></span><span class="MJX_Assistive_MathML" role="presentation"><math xmlns="http://www.w3.org/1998/Math/MathML"><mo stretchy="false">(</mo><mn>2</mn><mi>x</mi><mo>+</mo><mn>5</mn><msup><mo stretchy="false">)</mo><mn>2</mn></msup><mo stretchy="false">(</mo><mi>x</mi><mo>−</mo><mn>4</mn><mo stretchy="false">)</mo></math></span></span><script id="MathJax-Element-1" type="math/tex">(2x+5)^2(x-4)</script> with respect to x.</p></body></html>
Question latex is Differentiate
{}\protect\hypertarget{MathJax-Element-1-Frame}{}{\protect\hypertarget{MJXc-Node-1}{}{\protect\hypertarget{MJXc-Node-2}{}{\protect\hypertarget{MJXc-Node-3}{}{{(}}\protect\hypertarget{MJXc-Node-4}{}{{2}}\protect\hypertarget{MJXc-Node-5}{}{{x}}\protect\hypertarget{MJXc-Node-6}{}{{+}}\protect\hypertarget{MJXc-Node-7}{}{{5}}\protect\hypertarget{MJXc-Node-8}{}{{\protect\hypertarget{MJXc-Node-9}{}{{)}}}{\protect\hypertarget{MJXc-Node-10}{}{{2}}}}\protect\hypertarget{MJXc-Node-11}{}{{(}}\protect\hypertarget{MJXc-Node-12}{}{{x}}\protect\hypertarget{MJXc-Node-13}{}{{−}}\protect\hypertarget{MJXc-Node-14}{}{{4}}\protect\hypertarget{MJXc-Node-15}{}{{)}}}}{\((2x + 5)^{2}(x - 4)\)}}\((2x+5)^2(x-4)\)
with respect to x.
How can I remove all of the "\protect\hypertarget{MJXc-Node-10}" fragments from the LaTeX, so that only this is left:
Differentiate {\((2x + 5)^{2}(x - 4)\)}}\((2x+5)^2(x-4)\)
with respect to x.
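With MathJax, the equation actually exists in TeX notation to begin with. The spans are created by the MathJax JavaScript to lay out the formula in HTML. Right now you let MathJax render the equation first, grab the rendered equation, and then try to convert it back into the original TeX. It would be simpler to read the TeX equation directly, without the detour through JavaScript rendering.
To do that, you just need to disable JavaScript in Selenium. For the Firefox driver, for example, this should do it: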
from selenium.webdriver.firefox.options import Options
from selenium import webdriver
opts = Options()
opts.preferences.update({
    "javascript.enabled": False,
})
driver = webdriver.Firefox(options=opts)
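Alternatively, if for some reason you need to work with the rendered version with JavaScript enabled, you can try reading the content of the script element inside the <p>. It contains the complete equation, but without the TeX math delimiters: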
<p class="q_question">...<script type="math/tex">(2x+5)^2(x-4)</script>...</p>
That way you do not have to strip out the spans. You then need to wrap it in the TeX math delimiters \(...\) for the PDF.
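As a rough illustration of the script-element approach, here is a minimal sketch that uses only Python's standard html.parser module. The class name TexScriptExtractor, the function html_to_tex_text and the shortened sample_html string are made up for this example. It also assumes the rendered output sits in spans with the classes MathJax_Preview and MathJax_CHTML, as in the HTML shown in the question. It keeps the ordinary text, skips the rendered spans, and wraps each math/tex script body in \(...\) (or \[...\] for display mode):

```python
from html.parser import HTMLParser


class TexScriptExtractor(HTMLParser):
    """Keep plain text and the TeX stored in <script type="math/tex"> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []        # output fragments in document order
        self.tex = []          # text collected inside the current math script
        self.delims = None     # TeX delimiters while inside a math/tex script
        self.in_script = False
        self.skip_depth = 0    # > 0 while inside a rendered MathJax span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script":
            self.in_script = True
            script_type = attrs.get("type") or ""
            if script_type.startswith("math/tex"):
                display = "mode=display" in script_type
                self.delims = (r"\[", r"\]") if display else (r"\(", r"\)")
                self.tex = []
        elif tag == "span":
            classes = (attrs.get("class") or "").split()
            rendered = "MathJax_Preview" in classes or "MathJax_CHTML" in classes
            if self.skip_depth or rendered:
                self.skip_depth += 1   # count nested spans so we know when to stop

    def handle_endtag(self, tag):
        if tag == "script":
            if self.delims is not None:
                left, right = self.delims
                self.parts.append(left + "".join(self.tex).strip() + right)
            self.in_script = False
            self.delims = None
        elif tag == "span" and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_script:
            if self.delims is not None:
                self.tex.append(data)  # TeX source; other scripts are ignored
        elif not self.skip_depth:
            self.parts.append(data)


def html_to_tex_text(html):
    """Return the text of an HTML fragment with MathJax equations as TeX."""
    parser = TexScriptExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Shortened, hypothetical version of the innerHTML from the question.
sample_html = (
    '<p class="q_question">Differentiate '
    '<span class="MathJax_Preview"></span>'
    '<span class="mjx-chtml MathJax_CHTML"><span class="mjx-math">'
    '<span class="mjx-char">(</span></span></span>'
    '<script id="MathJax-Element-1" type="math/tex">(2x+5)^2(x-4)</script>'
    ' with respect to x.</p>'
)
print(html_to_tex_text(sample_html))
```

For the shortened sample this prints Differentiate \((2x+5)^2(x-4)\) with respect to x. The result is plain text with TeX math, so any other HTML formatting in the question is lost. If that formatting matters, disabling JavaScript as described above is probably the better route.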