Extracting text between p elements



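I have a batch of downloaded HTML files and want to extract the text of the <p> elements that sit between two <p><strong> headings: all the <p> text between the <p><strong> "Executives" and the <p><strong> "Analysts". A sample of the HTML is included below. I know how to load the HTML, but I don't know how to extract this data with BS4: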

import textwrap
import os
from bs4 import BeautifulSoup

directory = 'C:/test/out'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        # Parse each downloaded file
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')

Example of the HTML:

</header><div id="a-cont"><div class="p p1"></div><div class="sa-art article-width" id="a-body"><p>Apple, Inc. (NASDAQ:<a href="https://seekingalpha.com/symbol/AAPL" title="Apple Inc.">AAPL</a>)</p>
<p>Q4 2016 Earnings Call</p>
<p>October 25, 2016 5:00 pm ET</p>
<p><strong>Executives</strong></p>
<p>Nancy Paxton - Apple, Inc.</p>
<p>Timothy Donald Cook - Apple, Inc.</p>
<p>Luca Maestri - Apple, Inc.</p>
<p><strong>Analysts</strong></p>
<p>Eugene Charles Munster - Piper Jaffray &amp; Co.</p>
<p>Kathryn Lynn Huberty - Morgan Stanley &amp; Co. LLC</p>
<p>Shannon S. Cross - Cross Research LLC</p>
<p>Antonio M. Sacconaghi - Sanford C. Bernstein &amp; Co. LLC</p>
<p>Simona K. Jankowski - Goldman Sachs &amp; Co.</p>
<p>Steven M. Milunovich - UBS Securities LLC</p>
<p>Wamsi Mohan - Bank of America Merrill Lynch</p>
<p>James D. Suva - Citigroup Global Markets, Inc. (Broker)</p>
<p>Rod B. Hall - JPMorgan Securities LLC</p>

IIUC, a very rough solution could be:

from bs4 import BeautifulSoup
s = '''
<div id="a-cont"><div class="p p1"></div><div class="sa-art article-width" id="a-body"><p>Apple, Inc. (NASDAQ:<a href="https://seekingalpha.com/symbol/AAPL" title="Apple Inc.">AAPL</a>)</p>
<p>Q4 2016 Earnings Call</p>
<p>October 25, 2016 5:00 pm ET</p>
<p><strong>Executives</strong></p>
<p>Nancy Paxton - Apple, Inc.</p>
<p>Timothy Donald Cook - Apple, Inc.</p>
<p>Luca Maestri - Apple, Inc.</p>
<p><strong>Analysts</strong></p>
<p>Eugene Charles Munster - Piper Jaffray &amp; Co.</p>
<p>Kathryn Lynn Huberty - Morgan Stanley &amp; Co. LLC</p>
<p>Shannon S. Cross - Cross Research LLC</p>
<p>Antonio M. Sacconaghi - Sanford C. Bernstein &amp; Co. LLC</p>
<p>Simona K. Jankowski - Goldman Sachs &amp; Co.</p>
<p>Steven M. Milunovich - UBS Securities LLC</p>
<p>Wamsi Mohan - Bank of America Merrill Lynch</p>
<p>James D. Suva - Citigroup Global Markets, Inc. (Broker)</p>
<p>Rod B. Hall - JPMorgan Securities LLC</p>
'''
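bsobj = BeautifulSoup(s, "lxml")  # the "lxml" parser needs the lxml package installed
res = []
# Start at the first <strong> (the "Executives" heading) and walk the <p> tags after it
for i in bsobj.find('strong').find_all_next('p'):
    if i.text == 'Analysts':
        break  # stop at the "Analysts" heading
    res.append(i.text)
res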

This gives:

['Nancy Paxton - Apple, Inc.',
'Timothy Donald Cook - Apple, Inc.',
'Luca Maestri - Apple, Inc.']
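The rough version above relies on "Executives" being the first <strong> in the page. A slightly more defensive sketch, assuming the headings always read exactly "Executives" and "Analysts", looks the start heading up by its text instead. The function name paragraphs_between and the shortened sample are my own and are not from the OP's files:

```python
from bs4 import BeautifulSoup


def paragraphs_between(html, start_label, end_label):
    """Return the text of each <p> after the <strong> start_label, up to end_label."""
    soup = BeautifulSoup(html, "html.parser")

    # Find the <strong> whose text matches the start label.
    start = None
    for strong in soup.find_all("strong"):
        if strong.get_text(strip=True) == start_label:
            start = strong
            break
    if start is None:
        return []  # heading not present in this document

    result = []
    for p in start.find_all_next("p"):
        text = p.get_text(strip=True)
        if text == end_label:
            break  # reached the closing heading
        result.append(text)
    return result


# Shortened, hypothetical sample in the same shape as the OP's HTML.
sample = """
<p>Q4 2016 Earnings Call</p>
<p><strong>Executives</strong></p>
<p>Nancy Paxton - Apple, Inc.</p>
<p>Luca Maestri - Apple, Inc.</p>
<p><strong>Analysts</strong></p>
<p>Shannon S. Cross - Cross Research LLC</p>
"""

executives = paragraphs_between(sample, "Executives", "Analysts")
print(executives)
```

Note that if the end heading is missing, this collects every <p> up to the end of the document, so you may want to check for that case in real transcripts.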

After further clarification from the OP, the final code should look like this:

import textwrap
import os
from bs4 import BeautifulSoup

res = {}  # maps each filename to the list of lines under "Executives"
directory = 'C:/Research syntheses - Meta analysis/Transcripts/test/1/'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
        res[filename] = []
        # Walk the <p> tags after the first <strong> until the "Analysts" heading
        for i in soup.find('strong').find_all_next('p'):
            if i.text == 'Analysts':
                break
            res[filename].append(i.text)
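One thing to watch in a batch run: soup.find('strong') returns None for a file with no <strong> tag, and the loop then raises AttributeError. Below is a minimal sketch of a guarded variant. The function name executives_by_file, the UTF-8 encoding, and the two demo files are assumptions for illustration, not something the OP stated:

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def executives_by_file(directory):
    """Map each .html filename in directory to the lines under "Executives"."""
    res = {}
    for path in sorted(Path(directory).glob("*.html")):
        # Assumption: the files are UTF-8 encoded.
        soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
        # Look the heading up by its text rather than taking the first <strong>.
        heading = soup.find("strong", string="Executives")
        names = []
        if heading is not None:  # skip files without the heading
            for p in heading.find_all_next("p"):
                text = p.get_text(strip=True)
                if text == "Analysts":
                    break
                names.append(text)
        res[path.name] = names
    return res


# Demo on two throwaway files (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "call.html").write_text(
        "<p><strong>Executives</strong></p>"
        "<p>Luca Maestri - Apple, Inc.</p>"
        "<p><strong>Analysts</strong></p>"
        "<p>Rod B. Hall - JPMorgan Securities LLC</p>",
        encoding="utf-8",
    )
    Path(tmp, "empty.html").write_text("<p>No headings here</p>", encoding="utf-8")
    demo = executives_by_file(tmp)

print(demo)
```

Files without the heading end up with an empty list instead of stopping the whole run, which makes them easy to spot afterwards.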
