I've run into a strange regex problem: my regex works in pythex but not in Python itself. I'm on 2.7. I want to strip out all the embedded unicode escapes such as \x92, of which there are many (as in 'Thomas Bradley \x93Brad\x94 Garza'):
import re, requests

def purify(string):
    strange_issue = r"""\t<td><font size=2>G<td><a href="http://facebook.com/KilledByPolice/posts/625590984135709" target=new><font size=2><center>facebook.com/KilledByPolice/posts/625590984135709t</a><td><a href="http://www.orlandosentinel.com/news/local/lake/os-leesburg-officer-involved-shooting-20130507"""
    unicode_chars_rgx = r"[\\][x]\d+"
    unicode_matches = re.findall(unicode_chars_rgx, string)
    bad_list = [strange_issue]
    bad_list.extend(unicode_matches)
    for item in bad_list:
        string = string.replace(item, "")
    return string
name_rgx = r"(?:[<][TDtd][>])|(?:target[=]new[>])(?P<the_deceased>[A-Z].*?)[,]"
urls = {2013: "http://www.killedbypolice.net/kbp2013.html",
        2014: "http://www.killedbypolice.net/kbp2014.html",
        2015: "http://www.killedbypolice.net/"}

names_of_the_dead = []
for url in urls.values():
    response = requests.get(url)
    content = response.content
    people_killed_by_police_that_year_alone = re.findall(name_rgx, content)
    for dead_person in people_killed_by_police_that_year_alone:
        names_of_the_dead.append(purify(dead_person))

dead_americans_as_string = ", ".join(names_of_the_dead)
print("RIP, {} since 2013:\n".format(len(names_of_the_dead)))  # 3085! :)
print(dead_americans_as_string)
In [95]: unicode_chars_rgx = r"[\\][x]\d+"
In [96]: testcase = "Myron De\x92Shawn May"
In [97]: x = purify(testcase)
In [98]: x
Out[98]: 'Myron De\x92Shawn May'
In [103]: match = re.match(unicode_chars_rgx, testcase)
In [104]: match
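For what it's worth, the transcript above can be reproduced without pythex. A minimal sketch (Python 3 syntax, reusing the question's pattern) showing why the pattern never fires: `\x92` in a string literal is a single character, not the four literal characters `\`, `x`, `9`, `2` that the regex is looking for.

```python
import re

# The pattern from the question: a literal backslash, a literal "x",
# then one or more digits.
unicode_chars_rgx = r"[\\][x]\d+"

# "\x92" in source code is ONE character (code point 0x92), so the
# pattern has nothing to match in the real data.
testcase = "Myron De\x92Shawn May"
print(len("\x92"))                              # 1
print(re.findall(unicode_chars_rgx, testcase))  # no matches

# It only matches when a backslash is genuinely in the string:
print(re.findall(unicode_chars_rgx, r"Myron De\x92Shawn May"))
```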
How can I get these \x00 characters out? Thanks.

Certainly not by trying to find things that look like "\x00".
If you want to destroy the data:
>>> re.sub('[\x7f-\xff]', '', "Myron De\x92Shawn May")
'Myron DeShawn May'
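The same strip-every-high-byte idea, wrapped in a small helper for reuse (a sketch; `strip_high_bytes` is a name of my choosing, and the 0x7F-0xFF range assumes cp1252-style single-byte junk):

```python
import re

def strip_high_bytes(text):
    """Destructively drop every character in the 0x7F-0xFF range."""
    return re.sub('[\x7f-\xff]', '', text)

print(strip_high_bytes("Myron De\x92Shawn May"))            # Myron DeShawn May
print(strip_high_bytes("Thomas Bradley \x93Brad\x94 Garza"))  # Thomas Bradley Brad Garza
```

Note that this throws the apostrophe away entirely rather than replacing it, which is why the answer calls it destructive.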
More work, but it tries to preserve the text:
>>> import unidecode
>>> unidecode.unidecode("Myron De\x92Shawn May".decode('cp1251'))
"Myron De'Shawn May"
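If installing the third-party unidecode package isn't an option, a stdlib-only sketch of the same idea in Python 3: decode the bytes with the right codec so 0x92 becomes a real Unicode quote, then downgrade to ASCII by hand (the replacement table and the `asciify` name are my own; it covers only the Windows "smart" quotes, not everything unidecode handles):

```python
# Map the Windows-125x "smart" punctuation to plain ASCII equivalents.
SMART_PUNCT = {
    "\u2018": "'",  # left single quote  (byte 0x91 in cp1252)
    "\u2019": "'",  # right single quote (byte 0x92 in cp1252)
    "\u201c": '"',  # left double quote  (byte 0x93 in cp1252)
    "\u201d": '"',  # right double quote (byte 0x94 in cp1252)
}

def asciify(raw):
    """Decode cp1252 bytes, then flatten smart quotes to ASCII."""
    text = raw.decode('cp1252')
    for smart, plain in SMART_PUNCT.items():
        text = text.replace(smart, plain)
    return text

print(asciify(b"Myron De\x92Shawn May"))  # Myron De'Shawn May
```

cp1251 works in the answer above for the same reason: both cp1251 and cp1252 map bytes 0x91-0x94 to the same curly-quote code points.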