Remove whitespace/spaces/newlines from scraped data



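I am using Beautiful Soup to scrape data from a URL. After cleaning, the resulting text still contains many spaces, blank runs, and newlines. I tried removing them with .strip(), but they are still there.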

from bs4 import BeautifulSoup
import requests
import re
URL="https://www.flagstaffsymphony.org/event/a-flag-on-fourth/"
html_content = requests.get(URL).text
cleantext = BeautifulSoup(html_content, "lxml").text
cleanr = re.compile('<.*?>')
clean_data = re.sub(cleanr, ' ', cleantext)
text = re.sub('([^\x00-\x7F]+)|(\n)|(\t)', ' ', clean_data)
with open('read.txt', 'w') as file:
    file.writelines(text)

America the Beautiful: A Virtual Patriotic Salute   Flagstaff Symphony Orchestra                                                                                           Contact             Hit enter to search or ESC to close                                     About  Our Team Our Conductor Orchestra Members   Concerts & Events  Season 72 Concerts Subscribe Venue, Parking & Concerts FAQs   Support The FSO  Donate to FSO Sponsor a Chair Funding and Impact   Videos Donate Subscription Tickets                  All Events   This event has passed. America the Beautiful: A Virtual Patriotic Salute  July 4, 2020         Violin Virtuoso Beethoven Virtual 5k             In place of our traditional 4th of July concert at the Pepsi Amphitheater, the Flagstaff Symphony Orchestra will present a virtual patriotic salute to be released HERE and our Facebook page at on July 4, 2020 at 11am. The FSO is proud to offer a special rendition of  America the Beautiful  performed by 60 of their professional musicians, coming together virtually, to celebrate our nation s independence. CLICK HERE FOR DETAILS   + Google Calendar+ iCal Export     Details    Date:    July 4, 2020   Event Category: Concerts and Events             Violin Virtuoso Beethoven Virtual 5k                   Concert InfoConcerts Concerts and Events FAQs     FSO InfoAbout FSO Mission and History Our Team Our Conductor Orchestra Members     Support FSOMake a Donation Underwriting a Concert Sponsor a Chair Advertise with FSO Volunteer Leave a Legacy Donor Bill of Rights Code of Ethical Standards  (Used by permission of the Association of Fundraising Professionals)     ResourcesCommunity & Education For Musicians For Board Members             2021 Flagstaff Symphony Orchestra. 
Copyright 2019 Flagstaff Symphony Association                             About  Our Team Our Conductor Orchestra Members   Concerts & Events  Season 72 Concerts Subscribe Venue, Parking & Concerts FAQs   Support The FSO  Donate to FSO Sponsor a Chair Funding and Impact   Videos Donate Subscription Tickets   Contact  
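In the code above, I replaced the non-ASCII (Unicode) characters with ' ' (a space). If I do not replace them with a space, several words get joined together. What I want is a single string with no unnecessary whitespace or newlines.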

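Added question: I tried various methods, such as strip() and re.sub(), to remove the whitespace at the beginning of some lines in the text, but they do not work for the following data:
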
Subscription Tickets
All Events
This event has passed.
America the Beautiful: A Virtual Patriotic Salute
July 4, 2020
Violin Virtuoso
Beethoven Virtual 5k 

How can I remove these spaces?

You can try:

print(re.sub(r'\s+', ' ', text))
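
To see why strip() alone is not enough, here is a minimal sketch (the helper name collapse_whitespace and the sample string are made up for illustration, they are not from the scraped page). strip() only trims the two ends of a string, while the pattern \s+ (note the backslash) matches every run of spaces, tabs, newlines, and non-breaking spaces anywhere inside it:

```python
import re

def collapse_whitespace(text):
    """Replace each run of whitespace with one space, then trim the ends."""
    # In Python 3, \s on a str pattern also matches '\xa0' (non-breaking space).
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical sample resembling the scraped text
sample = "  Subscription Tickets\n\n\t All Events \xa0 This event has passed.  \n"

print(repr(sample.strip()))              # inner newlines and tabs are still there
print(repr(collapse_whitespace(sample))) # everything on one line, single spaces
```

If you would rather avoid a regex, ' '.join(text.split()) should give the same result, since split() with no arguments also splits on any run of whitespace.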

Try this:

from bs4 import BeautifulSoup
import requests
import re

URL="https://www.flagstaffsymphony.org/event/a-flag-on-fourth/"
html_content = requests.get(URL).text
cleantext = BeautifulSoup(html_content, "lxml").text
cleanr = re.compile('<.*?>')
clean_data = re.sub(cleanr, ' ', cleantext)
text = re.sub(r'\s+', ' ', clean_data)
print(text)
with open('read.txt', 'w') as file:
    file.writelines(text)

Output:

America the Beautiful: A Virtual Patriotic Salute – Flagstaff Symphony Orchestra Contact Hit enter to search or ESC to close About Our Team Our Conductor Orchestra Members Concerts & Events Season 72 Concerts Subscribe Venue, Parking & Concerts FAQs Support The FSO Donate to FSO Sponsor a Chair Funding and Impact Videos Donate Subscription Tickets « All Events This event has passed. America the Beautiful: A Virtual Patriotic Salute July 4, 2020 « Violin Virtuoso Beethoven Virtual 5k » In place of our traditional 4th of July concert at the Pepsi Amphitheater, the Flagstaff Symphony Orchestra will present a virtual patriotic salute to be released HERE and our Facebook page at on July 4, 2020 at 11am. The FSO is proud to offer a special rendition of “America the Beautiful” performed by 60 of their professional musicians, coming together virtually, to celebrate our nation’s independence. CLICK HERE FOR DETAILS + Google Calendar+ iCal Export Details Date: July 4, 2020 Event Category: Concerts and Events « Violin Virtuoso Beethoven Virtual 5k » Concert InfoConcerts Concerts and Events FAQs FSO InfoAbout FSO Mission and History Our Team Our Conductor Orchestra Members Support FSOMake a Donation Underwriting a Concert Sponsor a Chair Advertise with FSO Volunteer Leave a Legacy Donor Bill of Rights Code of Ethical Standards (Used by permission of the Association of Fundraising Professionals) ResourcesCommunity & Education For Musicians For Board Members © 2021 Flagstaff Symphony Orchestra. © Copyright 2019 Flagstaff Symphony Association About Our Team Our Conductor Orchestra Members Concerts & Events Season 72 Concerts Subscribe Venue, Parking & Concerts FAQs Support The FSO Donate to FSO Sponsor a Chair Funding and Impact Videos Donate Subscription Tickets Contact
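
Some words in that output are still glued together (for example "InfoConcerts"), because .text concatenates the strings of adjacent tags with nothing in between. A possible way around this, sketched here on a small made-up HTML fragment rather than the real page, is to let Beautiful Soup do the stripping and joining itself through get_text() with its separator and strip arguments. The parser is swapped to the built-in "html.parser" only so the sketch runs without lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the page's footer markup
html = "<h3>Concert Info</h3><ul><li>Concerts</li><li>FAQs</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# .text joins the pieces with no separator, so words run together
print(repr(soup.text))

# strip=True trims each piece and drops empty ones; separator goes between pieces
print(repr(soup.get_text(separator=" ", strip=True)))
```

With this approach the extra regex passes may not be needed at all, although the real page could still contain whitespace inside a single text node, which re.sub(r'\s+', ' ', ...) would then tidy up.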

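It is not clear whether you want to keep some whitespace for readability. If you do, you can try this approach:

Update: the code now retains only alphanumeric characters, plus the characters in an exclusion list.

Code: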

from bs4 import BeautifulSoup
import requests

def clean_scraped_text(raw_text):
    # strip whitespace from the start and end of the raw text
    stripped_text = raw_text.strip()
    processed_text = ''
    for i, char in enumerate(stripped_text):
        # add a single '\n' to processed_text for every sequence of '\n'
        if char == '\n':
            if stripped_text[i - 1] != '\n':
                processed_text += '\n'
        else:
            # if the character is not '\n', add it to processed_text
            processed_text += char
    # clean whitespace from each line in processed_text
    cleaned_text = ''
    for line in processed_text.splitlines():
        # only retain alphanumeric characters and the listed characters
        exclude_list = [' ', '\xa0', '-']
        line = ''.join(x for x in line if x.isalnum() or (x in exclude_list))
        cleaned_text += line.strip() + '\n'
    return cleaned_text

URL="https://www.flagstaffsymphony.org/event/a-flag-on-fourth/"
html_content = requests.get(URL).text
text = BeautifulSoup(html_content, "lxml").text
print(clean_scraped_text(text))

Output:

America the Beautiful A Virtual Patriotic Salute  Flagstaff Symphony Orchestra
Contact
Hit enter to search or ESC to close

About
Our Team
Our Conductor
Orchestra Members
Concerts  Events
Season 72 Concerts
Subscribe
Venue Parking  Concerts FAQs
Support The FSO
Donate to FSO
Sponsor a Chair
Funding and Impact
Videos
Donate
Subscription Tickets
All Events
This event has passed
America the Beautiful A Virtual Patriotic Salute
July 4 2020
Violin Virtuoso
Beethoven Virtual 5k
In place of our traditional 4th of July concert at the Pepsi Amphitheater the Flagstaff Symphony Orchestra will present a virtual patriotic salute to be released HERE and our Facebook page at on July 4 2020 at 11am The FSO is proud to offer a special rendition of America the Beautiful performed by 60 of their professional musicians coming together virtually to celebrate our nations independence
CLICK HERE FOR DETAILS
Google Calendar iCal Export
Details
Date
July 4 2020
Event Category Concerts and Events
Violin Virtuoso
Beethoven Virtual 5k
Concert InfoConcerts
Concerts and Events FAQs
FSO InfoAbout FSO Mission and History
Our Team
Our Conductor
Orchestra Members
Support FSOMake a Donation
Underwriting a Concert
Sponsor a Chair
Advertise with FSO
Volunteer
Leave a Legacy
Donor Bill of Rights
Code of Ethical Standards  Used by permission of the Association of Fundraising Professionals
ResourcesCommunity  Education
For Musicians
For Board Members
2021 Flagstaff Symphony Orchestra
Copyright 2019 Flagstaff Symphony Association

About
Our Team
Our Conductor
Orchestra Members
Concerts  Events
Season 72 Concerts
Subscribe
Venue Parking  Concerts FAQs
Support The FSO
Donate to FSO
Sponsor a Chair
Funding and Impact
Videos
Donate
Subscription Tickets
Contact
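
If the punctuation should survive and only the leading whitespace on each line needs to go, a shorter variant might look like the following. This is a sketch under assumptions, not the answer's method: the name tidy_lines and the sample text are made up for illustration. It keeps one item per line, collapses whitespace inside each line, and drops lines that end up empty:

```python
def tidy_lines(raw_text):
    """Trim and collapse whitespace on every line, dropping blank lines."""
    # split() with no arguments splits on any whitespace run, including '\xa0'
    lines = (' '.join(line.split()) for line in raw_text.splitlines())
    return '\n'.join(line for line in lines if line)

# Hypothetical sample with leading spaces, tabs, and blank lines
sample = "   Subscription Tickets\n\n\n  \xa0 All Events\n\t This   event has passed. \n"

print(tidy_lines(sample))
```

Unlike the character filter above, this leaves characters such as ':' and ',' untouched, so "July 4, 2020" keeps its comma.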
