Remove all potentially unwanted characters from a Python string at once

I am using the Python module newspaper3k to extract an article summary from its web URL. For example:

from newspaper import Article
article = Article('https://www.abcd....vnn.com/dhdhd')
article.download()
article.parse()
article.nlp()
text = article.summary
print (text)

gives:

Often hailed as Hollywood\xe2\x80\x99s long standing, commercially successful filmmaker, Spielberg\xe2\x80\x99s lifetime gross, if you include his productions, reaches a mammoth\xc2\xa0$17.2 billion\xc2\xa0\xc2\xad\xe2\x80\x93 unadjusted for inflation.
\r\rThe original\xc2\xa0Jurassic Park\xc2\xa0($983.8 million worldwide), which released in 1993, remains Spielberg\xe2\x80\x99s highest grossing film.
Ready Player One,\xc2\xa0currently advancing at a running total of $476.1 million, has become Spielberg\xe2\x80\x99s seventh highest grossing film of his career.It will eventually supplant Aamir\xe2\x80\x99s 2017 blockbuster\xc2\xa0Dangal\xc2\xa0(1.29 billion yuan) if it achieves the Maoyan\xe2\x80\x99s lifetime forecast of 1.31 billion yuan ($208 million) in the PRC.

I just want to remove all the unwanted characters, such as \xe2\x80\x99. I would like to avoid chaining multiple replace calls. What I want is just:

Often hailed as Hollywood long standing, commercially successful filmmaker, 
Spielberg lifetime gross, if you include his productions, reaches a 
mammoth $17.2 billion unadjusted for inflation.
The original Jurassic Park ($983.8 million worldwide), 
which released in 1993, remains Spielberg highest grossing film.
Ready Player One,currently advancing at a running total of $476.1 million, 
has become Spielberg seventh highest grossing film of his career.
It will eventually supplant Aamir 2017 blockbuster Dangal (1.29 billion yuan) 
if it achieves the Maoyan lifetime forecast of 1.31 billion yuan ($208 million) in the PRC

Try using a regular expression:

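import re
# Remove every character outside the ASCII range. A class such as [\xe2\x80\x99s]
# would also delete every ordinary "s", because a class matches single characters.
clear_str = re.sub(r'[^\x00-\x7F]', '', your_input)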

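re.sub replaces every occurrence of the pattern in your_input with the second argument. A pattern like [abc] matches any single a, b, or c character; a leading ^, as in [^abc], matches any single character that is not listed.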
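To make the character-class behaviour concrete, here is a minimal sketch. The helper `strip_chars` and the `sample` string are made up for illustration, and the sample assumes the summary contains the three stray characters U+00E2, U+0080, U+0099 rather than literal backslash sequences.

```python
import re

def strip_chars(text: str, pattern: str) -> str:
    """Remove every match of `pattern` from `text` (hypothetical helper)."""
    return re.sub(pattern, '', text)

# Assumed sample: three stray characters sit between "Hollywood" and "s".
sample = 'Hollywood\xe2\x80\x99s films'

# A class lists single characters, so including "s" removes every "s".
too_greedy = strip_chars(sample, r'[\xe2\x80\x99s]')

# Listing only the three stray characters leaves ordinary letters alone.
targeted = strip_chars(sample, r'[\xe2\x80\x99]')

# A negated class removes anything outside the ASCII range.
ascii_only = strip_chars(sample, r'[^\x00-\x7F]')

print(too_greedy)
print(targeted)
print(ascii_only)
```

The negated class is the more general choice here, since it does not require listing each stray character by hand.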

You can use Python's encode/decode to get rid of every non-Latin character:

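# Assumes `text` is a byte string (a Python 2 str, or bytes in Python 3)
data = text.decode('utf-8')
# Characters with no Latin-1 equivalent are silently dropped
text = data.encode('latin-1', 'ignore')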
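As a rough Python 3 illustration of this approach, the sketch below wraps the two calls in a hypothetical helper, `drop_non_latin1`, and feeds it an assumed byte string modelled on the question's output. It only works when the input really is UTF-8 encoded bytes.

```python
def drop_non_latin1(raw: bytes) -> bytes:
    """Decode UTF-8 bytes, then re-encode as Latin-1, skipping unmappable characters."""
    return raw.decode('utf-8').encode('latin-1', 'ignore')

# Assumed sample: a right single quote (U+2019) and a no-break space (U+00A0).
raw = b'Hollywood\xe2\x80\x99s mammoth\xc2\xa0$17.2 billion'
cleaned = drop_non_latin1(raw)
print(cleaned)
```

Note that the no-break space survives as the single byte `\xa0`, because U+00A0 is part of Latin-1; only characters outside Latin-1, such as the curly apostrophe, are removed.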

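First, use .encode('ascii', errors='ignore') to drop all non-ASCII characters.

If you need this text for some kind of sentiment analysis, you may also want to remove special characters such as \n, \r, and so on. This can be done by first escaping the escape characters and then replacing them with a regular expression.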

from newspaper import Article
import re

article = Article('https://www.abcd....vnn.com/dhdhd')
article.download()
article.parse()
article.nlp()
text = article.summary
# Drop every non-ASCII character; the result is a bytes object
text = text.encode('ascii', errors='ignore')
# The bytes repr turns a newline into the two characters \n;
# the [2:-1] slice strips the b'...' wrapper that str() adds
text = str(text)[2:-1]
# Remove every backslash escape such as \n or \r
text = re.sub(r'\\.', '', text)
print(text)
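A variant of the same idea that stays in `str` the whole time is sketched below. The helper name `to_plain_ascii` and the sample string are assumptions for illustration, and collapsing whitespace to a single space is a design choice of this sketch, not something the answer above prescribes.

```python
import re

def to_plain_ascii(text: str) -> str:
    """Drop non-ASCII characters and collapse whitespace runs (hypothetical helper)."""
    # Encode with errors='ignore' to drop non-ASCII, then decode back to str.
    ascii_text = text.encode('ascii', errors='ignore').decode('ascii')
    # Replace runs of whitespace, including \r and \n, with one space.
    return re.sub(r'\s+', ' ', ascii_text).strip()

# Assumed sample modelled on the summary in the question.
sample = 'mammoth\xc2\xa0$17.2 billion.\r\rThe original\xc2\xa0Jurassic Park'
result = to_plain_ascii(sample)
print(result)
```

One caveat is visible in the output: words that were separated only by a non-ASCII space end up glued together ("mammoth$17.2"), because the separator is deleted rather than replaced.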

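The article was decoded incorrectly. The website probably declares the wrong encoding, but without a valid URL in the question to reproduce the output, that is hard to prove.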

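The escape codes indicate that UTF-8 is the correct encoding, so use the following to encode the text directly back to bytes (latin1 is a 1:1 mapping from the first 256 Unicode code points to bytes) and then decode it as UTF-8: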

text = text.encode('latin1').decode('utf8')

Result:

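Often hailed as Hollywood’s long standing, commercially successful filmmaker, Spielberg’s lifetime gross, if you include his productions, reaches a mammoth $17.2 billion – unadjusted for inflation.
The original Jurassic Park ($983.8 million worldwide), which released in 1993, remains Spielberg’s highest grossing film.
Ready Player One, currently advancing at a running total of $476.1 million, has become Spielberg’s seventh highest grossing film of his career.It will eventually supplant Aamir’s 2017 blockbuster Dangal (1.29 billion yuan) if it achieves the Maoyan’s lifetime forecast of 1.31 billion yuan ($208 million) in the PRC.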
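The round trip can be checked on a short string without downloading anything. This is a minimal sketch: `fix_mojibake` and `garbled` are illustrative names, and the sample assumes the text consists of UTF-8 bytes that were mistakenly decoded as Latin-1.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1 (hypothetical helper)."""
    # latin1 maps code points 0-255 straight back to the original bytes,
    # which can then be decoded with the encoding they were written in.
    return text.encode('latin1').decode('utf8')

# Assumed sample: a mis-decoded curly apostrophe and no-break space.
garbled = 'Spielberg\xe2\x80\x99s mammoth\xc2\xa0$17.2 billion'
fixed = fix_mojibake(garbled)
print(fixed)
```

If the text contains characters above U+00FF, the encode step raises UnicodeEncodeError, and if the recovered bytes are not valid UTF-8, the decode step raises UnicodeDecodeError. Either error suggests the text was not garbled in this particular way.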
