How to iterate over un-nested HTML in BeautifulSoup to extract content in list format



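My question is a bit specific. I've been looking through all the other BeautifulSoup questions on SO, but I haven't found an answer to mine. I took a PDF file and converted it into reasonably clean HTML, and I plan to transcribe it further into a CSV file.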

The page I'm working with looks like this, except that I've redacted a number of things I'm not sure I want the average Google user to find:

(RUSI) US Foundation
Last Updated: 2014-12-29
At A Glance
[st # redacted] I St. N.W.
Washington, DC United States 20006
Type of Grantmaker
Independent foundation
Financial Data
(yr. ended 2013-12-31)
Assets: $3,085 Total giving: $0
EIN
[redacted]
990
[redacted]
Application Information
Unsolicited requests for funds not accepted.
Application form not required.
Directors Michael Clarke Sean Murphy Timothy Voake
Financial Data
Year ended 2013-12-31
Assets: $3,085 (market value)
Expenditures: $387
Total giving: $0
Qualifying distributions: $387
Additional Location Information
County: District of Columbia
Metropolitan area: Washington-Arlington-Alexandria, DC-VA-MD-WV Congressional district: District of Columbia District At-large
04Arts Foundation
Last Updated: 2013-05-15
At A Glance
P.O. Box [redacted]
San Antonio, TX United States 78283-1253 Telephone:(210) [redacted] Contact: Penelope Speier URL: www.04arts.org
Type of Grantmaker
Independent foundation
Financial Data
(yr. ended 2012-12-31)
Assets: $40,957 Total giving: $1,698
EIN
[redacted]
990
[redacted]
Additional Contact Information
Application Address: [redacted] Dallas, New Braunfels, TX 78130
Background
Established in 1995 in TX.
Limitations
No grants to individuals.
Fields of Interest Subjects
Arts
Application Information
Application form not required.
Initial approach: Proposal Deadline(s): None
Donor(s)
Note: If a donor is deceased, the symbol (f) follows the name.
Penelope Gallagher William Gallagher Edward Everett Collins, III Edwards Aquifer Authority
Officer
Penelope Speier, Pres.
Directors Wendy W. Atwell Jon Cochran
Financial Data
Year ended 2012-12-31
Assets: $40,957 (market value)
Gifts received: $[redacted] Expenditures: $[redacted] Total giving: $[redacted] Qualifying distributions: $[redacted] Giving activities include:
$[redacted] for grants
Additional Location Information
County: Bexar
Metropolitan area: San Antonio, TX Congressional district: Texas District 35
1 in 9: The Long Island Breast Cancer Action Coalition, Inc
Last Updated: 2011-12-19
At A Glance
[redacted] E. Rockaway Rd.
Hewlett, NY United States 11557-1736 Telephone:(516) [redacted] Fax: (516) [redacted] E-mail: [redacted]
Type of Grantmaker
Public charity
Additional Descriptor
Organization that normally receives a substantial part of its support from a governmental unit or from the general public
EIN
[redacted]
990
[redacted]
Purpose and Activities
The coalition's mission is to promote awareness of the breast cancer epidemic through education, outreach, advocacy, and direct support of research which is being done to find the causes of and cures for breast cancer and other related cancers.
Fields of Interest Subjects
Breast cancer
Breast cancer research
Cancer
Cancer research
Types of Support
Research
Publications
Newsletter
Officers and Directors
Note: An asterisk (*) following an individual's name indicates an officer who is also a trustee or director.
Geri Barish *, Pres.
Louise Levrie, V.P.
Larry Slatky *, Treas.
Caroline Boss Fran Kritchek Frank P. Naudus Leon Newman
Additional Location Information
County: Nassau
Metropolitan area: New York-Northern New Jersey-Long Island, NY-NJ-PA Congressional district: New York District 04

My HTML currently looks like this (exactly like this, so be warned, it's horrible):

<p style="text-align:justify;"><span class="font7" style="color:#CB4810;">FOUNDATION</span></p><a name="caption1"></a><h1 style="text-align:justify;"><a name="bookmark0"></a><span class="font7" style="color:#CB4810;"><a href="https://fconline.foundationcenter.org/">DIRECTORY</a></span></h1><div style="float:right;layout-flow:horizontal;">
<p><span class="font4"><a href="https://fconline.foundationcenter.org/grantmaker-profile/save?html_id=54c1468ec37a7">Save this Page</a></span></p></div>
<p style="text-align:justify;"><span class="font1" style="color:#ED977A;">ONLINE </span><span class="font1" style="color:#9D9D9D;">.*&gt;. </span><span class="font1" style="font-weight:bold;color:#9D9D9D;">A </span><span class="font1" style="color:#9D9D9D;">service of the &nbsp;&nbsp;&nbsp;</span><span class="font1" style="color:#808080;">_ </span><span class="font1">...... _</span></p>
<p style="text-align:right;padding:0pt 0pt 23pt 0pt;"><span class="font4" style="text-decoration:underline;">Print this Page</span></p>
<p style="text-align:justify;padding:23pt 0pt 9pt 0pt;"><span class="font4">(</span><span class="font4" style="font-weight:bold;">Refinements: </span><span class="font4">Grantmaker Name: *)</span></p><h2 style="text-align:justify;padding:9pt 0pt 0pt 0pt;"><a name="bookmark1"></a><span class="font6" style="font-weight:bold;">(RUSI) US Foundation</span></h2>
<p style="text-align:justify;padding:0pt 0pt 14pt 0pt;"><span class="font1" style="font-weight:bold;">Last Updated: </span><span class="font2">2014</span><span class="font0">-</span><span class="font2">12-29</span></p><h3 style="text-align:justify;padding:14pt 0pt 0pt 0pt;"><a name="bookmark2"></a><span class="font5" style="font-weight:bold;">At A Glance</span></h3>
<p style="text-align:justify;"><span class="font4">1776 I St. N.W.</span></p>
<p style="text-align:justify;padding:0pt 0pt 9pt 0pt;"><span class="font4">Washington, DC United States 20006</span></p><h4 style="text-align:justify;padding:9pt 0pt 0pt 0pt;"><a name="bookmark3"></a><span class="font4" style="font-weight:bold;">Type of Grantmaker</span></h4>
<p style="text-align:justify;padding:0pt 0pt 9pt 0pt;"><span class="font4">Independent foundation</span></p><h4 style="text-align:justify;padding:9pt 0pt 0pt 0pt;"><a name="bookmark4"></a><span class="font4" style="font-weight:bold;">Financial Data</span></h4>
<p style="text-align:justify;"><span class="font4">(yr. ended 2013-12-31)</span></p>
<p style="padding:0pt 421pt 9pt 0pt;"><span class="font4">Assets: $3,085 Total giving: $0</span></p><h4 style="text-align:justify;padding:9pt 0pt 0pt 0pt;"><a name="bookmark5"></a><span class="font4" style="font-weight:bold;">EIN</span></h4>
<p style="text-align:justify;padding:0pt 0pt 9pt 0pt;"><span class="font4">721374719</span></p><h4 style="text-align:justify;padding:9pt 0pt 0pt 0pt;"><a name="bookmark6"></a><span class="font4" style="font-weight:bold;">990</span></h4>
<p style="text-align:justify;padding:0pt 0pt 9pt 0pt;"><span class="font4"><a href="http://990s.foundationcenter.org/990pf_pdf_archive/721/721374719/721374719_201312_990PF.pdf">2013 </a><a href="http://990s.foundationcenter.org/990pf_pdf_archive/721/721374719/721374719_200412_990PF.pdf">2004</a><a href="http://990s.foundationcenter.org/990_pdf_archive/721/721374719/721374719_200312_990EZ.pdf"> 2003 </a><a href="http://990s.foundationcenter.org/990pf_pdf_archive/721/721374719/721374719_200212_990PF.pdf">2002</a></span></p><h4 style="text-align:justify;padding:9pt 0pt 0pt 0pt;"><a name="bookmark7"></a><span class="font4" style="font-weight:bold;">Application Information</span></h4>
<p style="text-align:justify;padding:0pt 0pt 9pt 0pt;"><span class="font4">Unsolicited requests for funds not accepted.</span></p>
<p style="text-align:justify;padding:9pt 0pt 14pt 0pt;"><span class="font4">Application form not required.</span></p>
<p style="padding:14pt 421pt 14pt 0pt;"><span class="font4" style="font-weight:bold;">Directors Michael Clarke&nbsp;Sean Murphy&nbsp;Timothy Voake</span></p><h4 style="text-align:justify;padding:14pt 0pt 0pt 0pt;"><a name="bookmark8"></a><span class="font4" style="font-weight:bold;">Financial Data</span></h4>
<p style="text-align:justify;padding:0pt 0pt 9pt 0pt;"><span class="font4" style="font-weight:bold;">Year ended 2013-12-31</span></p>
<p style="text-align:justify;padding:9pt 0pt 0pt 0pt;"><span class="font4">Assets: $3,085 (market value)</span></p>
<p style="text-align:justify;"><span class="font4">Expenditures: $387</span></p>
<p style="text-align:justify;"><span class="font4">Total giving: $0</span></p>
<p style="text-align:justify;padding:0pt 0pt 9pt 0pt;"><span class="font4">Qualifying distributions: $387</span></p><h4 style="text-align:justify;padding:9pt 0pt 0pt 0pt;"><a name="bookmark9"></a><span class="font4" style="font-weight:bold;">Additional Location Information</span></h4>
<p style="text-align:justify;"><span class="font4">County: District of Columbia</span></p>

Now, when I use BS by running this code:

from bs4 import BeautifulSoup as Soup

# Name the parser explicitly so the result does not depend on which parsers are installed.
with open('found1.html') as f:
    html = Soup(f, 'html.parser')
titles = html.find_all('h2', style="text-align:justify;padding:9pt 0pt 0pt 0pt;")
print(titles[0].find(string=True))
print(titles[0].find_next('p', style="text-align:justify;padding:0pt 0pt 14pt 0pt;").
     find_all(string=True))
print(titles[0].find_next('span', class_="font5",
                          style="font-weight:bold;").find(string=True))

I get:

(RUSI) US Foundation
['Last Updated: ', '2014', '-', '12-29']
At A Glance

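This is great! The next part is where I'm having trouble. I need to grab everything between "At A Glance" and "Type of Grantmaker". Then I need to do the same for "Type of Grantmaker" and the next heading. One advantage here is that the tags are almost always the same for similar headings. For example, that is how I get the names of all the titles with the titles = html.... code.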
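To make "everything between two headings" concrete: since the headings and paragraphs in the HTML above are siblings rather than nested, it can be phrased as a walk over the siblings that follow a heading, stopping at the next heading tag. The following is only a minimal sketch under that assumption; the helper name `text_until_next_heading` and the trimmed-down `sample` markup are made up for illustration.

```python
from bs4 import BeautifulSoup

# Tag names treated as section boundaries (an assumption based on the HTML above).
HEADINGS = ("h1", "h2", "h3", "h4")

def text_until_next_heading(heading):
    """Collect the text of each sibling tag after `heading`, up to the next heading."""
    chunks = []
    # find_next_siblings(True) yields only tags, skipping bare whitespace strings.
    for sibling in heading.find_next_siblings(True):
        if sibling.name in HEADINGS:
            break
        # Join the text nodes, then collapse runs of whitespace (including &nbsp;).
        text = " ".join(sibling.get_text().split())
        if text:
            chunks.append(text)
    return chunks

# Simplified stand-in for the real file: same sibling layout, styles removed.
sample = """
<h3><a name="bookmark2"></a><span>At A Glance</span></h3>
<p><span>[st # redacted] I St. N.W.</span></p>
<p><span>Washington, DC United States 20006</span></p><h4><span>Type of Grantmaker</span></h4>
<p><span>Independent foundation</span></p>
"""

soup = BeautifulSoup(sample, "html.parser")
at_a_glance = text_until_next_heading(soup.find("h3"))
print(at_a_glance)
```

The same call on the `h4` would give the content of "Type of Grantmaker", and so on for each later heading.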

The output I want is a list like this:

[[first organization, last_updated, at_a_glance, type_of_grantmaker, financial_data, ...], 
[second organization, ...], [third organization, ...], ...]

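Any step in the right direction would be greatly appreciated! If you think my question is bad for any reason, I'd appreciate a comment along with the -1 so I can fix it. I'm new here, and my last question didn't get a great response...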

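It turned out that the easiest approach for me was to split the HTML before feeding it into BeautifulSoup. So what I did was split it with the following code, and I'm (currently) writing a function to handle the split text cleanly.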

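from bs4 import BeautifulSoup as Soup

with open('found1.html', 'r') as f:
    html = f.read()
# Every organization title starts with this markup, so splitting on it gives
# one chunk per organization (sections[0] is whatever precedes the first title).
sections = html.split('</a><span class="font6" style="font-weight:bold;">')

# Still developing this bit to extract text cleanly.
def extract(fragment):
    soup = Soup(fragment, 'html.parser')
    # Every text node in the fragment, minus the whitespace-only ones.
    return [s.strip() for s in soup.find_all(string=True) if s.strip()]

# Gives me the whole html between the first title and the second
print(sections[1])
print(extract(sections[1]))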
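
For comparison, here is a rough sketch of an alternative that stays inside BeautifulSoup instead of splitting the raw string. It assumes every organization starts at an `h2` and that the section headings (`h3`/`h4`) and paragraphs are all siblings of it, which matches the HTML shown in the question but has not been checked against the full file. The function name `parse_organizations` and the shortened `sample` markup are illustrative only.

```python
from bs4 import BeautifulSoup

def clean(tag):
    """Join a tag's text nodes and collapse whitespace (including &nbsp;)."""
    return " ".join(tag.get_text().split())

def parse_organizations(markup):
    """Return one list per organization: [name, text..., [heading, text...], ...]."""
    soup = BeautifulSoup(markup, "html.parser")
    organizations = []
    for title in soup.find_all("h2"):
        record = [clean(title)]
        section = None  # current [heading, text, ...] list, if any
        for sibling in title.find_next_siblings(True):
            if sibling.name == "h2":  # the next organization starts here
                break
            text = clean(sibling)
            if not text:
                continue
            if sibling.name in ("h3", "h4"):
                section = [text]
                record.append(section)
            elif section is None:
                record.append(text)   # e.g. the "Last Updated" line
            else:
                section.append(text)
        organizations.append(record)
    return organizations

# Simplified stand-in for found1.html: same layout, styles removed.
sample = """
<h2><a name="bookmark1"></a><span>(RUSI) US Foundation</span></h2>
<p><span>Last Updated: </span><span>2014</span><span>-</span><span>12-29</span></p><h3><span>At A Glance</span></h3>
<p><span>Washington, DC United States 20006</span></p><h4><span>Type of Grantmaker</span></h4>
<p><span>Independent foundation</span></p>
<h2><span>04Arts Foundation</span></h2>
<p><span>Last Updated: </span><span>2013-05-15</span></p>
"""

rows = parse_organizations(sample)
for row in rows:
    print(row)
```

Each section stays grouped with its heading, so flattening a record into a single CSV row (or picking specific headings as columns) can be decided afterwards.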
