Regex returns a match, but the Japanese characters should not match

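I am using a form to check whether a postal code in the Japanese format has been entered. Today I realized that some input gets through even though it should not "pass" the regex match test.

Here is the regex: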

".*([0-9０-９]{3}[-ー]{1}[0-9０-９]{4}).*"

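It accepts both regular digits and Japanese (full-width) digits (likewise, in addition to "-", the Japanese "ー" may be entered). The format should be: 123-4567.

When only Latin letters and digits are entered, it works fine. But some Japanese characters that do not match at all... are returned as a match:

(Note: a match returns a match object; no match returns nothing.)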

>>> import re
>>> regstr = ".*([0-9０-９]{3}[-ー]{1}[0-9０-９]{4}).*"
>>> re.match( regstr, "this is obviously not going to work")
>>> re.match( regstr, "this is going to work 123-4567")
<_sre.SRE_Match object at 0x7fced8b485d0>
>>> re.match( regstr, "this is going to work too １２３ー４５６７")
<_sre.SRE_Match object at 0x7fced8b48648>
>>> re.match( regstr, "This will not work, as it should not :  1234-567")
>>> re.match( regstr, "This should not work, but it does :  １２３４ー５６７")
<_sre.SRE_Match object at 0x7fced8b48648>
>>> re.match( regstr, "Now just seems crazy ....... 京都府")
<_sre.SRE_Match object at 0x7fced8b485d0>
>>> re.match( regstr, "京都府")
<_sre.SRE_Match object at 0x7fced8b48648>
>>> "京都府"
'\xe4\xba\xac\xe9\x83\xbd\xe5\xba\x9c'
>>> re.match( regstr, "\xe4\xba\xac\xe9\x83\xbd\xe5\xba\x9c")
<_sre.SRE_Match object at 0x7fced8b48648>

I tried entering other kanji, but the few characters I tried did not match.

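So anyone living in 京都府 (Kyoto Prefecture)... can "bypass" the regex, because "京都府" alone is enough to make the whole string valid. Using only two of these three characters does not produce a match.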

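I tried using the escaped byte codes of these three characters, and that matches as well. (I wondered whether I could use the codes instead of the characters themselves to inspect the string, and wanted to make sure it does not contain something that actually fits "000-0000". It does not, but it still matches the regex.)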

People living in Tokyo, "東京府", would be "less" lucky, haha:

>>> re.match( regstr, "東京府")
>>> "東京府"
'\xe6\x9d\xb1\xe4\xba\xac\xe5\xba\x9c'

I checked it at https://regex101.com/, and there these 3 characters do not match.

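So... I am pretty much lost here. With the simpler ".*([0-9]{3}[-]{1}[0-9]{4}).*" as the regex, it seems fine, but I really do not want to restrict users to typing only [0-9-], since many people will enter the Japanese versions ０１２３４５６７８９ー (the longer ones). If it matters: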

# 'Japanese numbers' code
>>> "０１２３４５６７８９ー"
'\xef\xbc\x90\xef\xbc\x91\xef\xbc\x92\xef\xbc\x93\xef\xbc\x94\xef\xbc\x95\xef\xbc\x96\xef\xbc\x97\xef\xbc\x98\xef\xbc\x99\xe3\x83\xbc'

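For now I simply convert the Japanese ０１２３４５６７８９ー to 0123456789- and apply a regex that contains no Japanese characters at all, but... I would really like to know what is going on with regexes and Japanese characters.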
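The conversion workaround described above could look roughly like the following. This is only a sketch: it assumes the input is already a text (unicode) string rather than raw bytes, and the names `_TO_ASCII`, `ZIP_RE` and `find_zipcode` are made up for illustration.

```python
import re

# Map full-width digits (U+FF10..U+FF19) to ASCII digits, and the
# long-vowel mark "ー" (U+30FC) to an ASCII hyphen.
_TO_ASCII = {0xFF10 + i: ord("0") + i for i in range(10)}
_TO_ASCII[0x30FC] = ord("-")

# ASCII-only pattern, applied after normalization.
ZIP_RE = re.compile(r"[0-9]{3}-[0-9]{4}")


def find_zipcode(text):
    """Return the first NNN-NNNN code found in text (as ASCII), or None."""
    normalized = text.translate(_TO_ASCII)
    match = ZIP_RE.search(normalized)
    return match.group(0) if match else None
```

Because the pattern itself contains only ASCII, its meaning does not depend on the encoding of the source file.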

If anyone has a clue, it would be much appreciated.

Cheers

EDIT: Python 2.7

regstr = ".*([0-9０-９]{3}[-ー]{1}[0-9０-９]{4}).*"

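In Python 3, regstr would be a unicode string containing some non-ASCII characters. In Python 2, it is a byte string in some encoding, which depends on what you declared at the top of the module (see PEP 263) and on the encoding actually used to save the file. To avoid this kind of problem, I suggest you never put literal unicode characters in a regex. It is too hard to debug. Escape them instead.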
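As a hedged illustration of this point, the following runs on Python 3 and encodes the pattern to UTF-8, which approximates what Python 2 sees when the source file is saved as UTF-8 (an assumption about the asker's setup). In bytes, `０-９` is no longer a range of ten digits: it becomes the bytes \xef and \xbc, the byte range \x90-\xef, and the bytes \xbc and \x99, so the class accepts most bytes of UTF-8 encoded Japanese text. 京都府 happens to contain \x83, one of the bytes of the encoded ー, which is why it matches while 東京府 does not.

```python
import re

pattern_text = ".*([0-9０-９]{3}[-ー]{1}[0-9０-９]{4}).*"

# Byte pattern: roughly what Python 2 compiles from a UTF-8 source file
# when the string has no u prefix (assumption about the original setup).
byte_pattern = re.compile(pattern_text.encode("utf-8"))

# Text pattern: what Python 3 compiles from the same literal.
text_pattern = re.compile(pattern_text)

kyoto = "京都府"  # UTF-8: e4 ba ac e9 83 bd e5 ba 9c
tokyo = "東京府"  # UTF-8: e6 9d b1 e4 ba ac e5 ba 9c

# ba ac e9 | 83 | bd e5 ba 9c fits "3 class bytes, separator, 4 class bytes".
bytes_match_kyoto = byte_pattern.match(kyoto.encode("utf-8")) is not None
# No "-", e3, 83 or bc byte is present, so the separator class never matches.
bytes_match_tokyo = byte_pattern.match(tokyo.encode("utf-8")) is not None
# On real characters the class only contains the twenty digits.
text_match_kyoto = text_pattern.match(kyoto) is not None

print(bytes_match_kyoto, bytes_match_tokyo, text_match_kyoto)
```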

The characters ０１２３４５６７８９ are the unicode characters '\uff10' through '\uff19', so I suggest you write them that way.

Also, if you are using a unicode regex, you should define it with the u prefix for unicode strings:

regstr = u".*([0-9\uff10-\uff19]{3}[-\u30fc]{1}[0-9\uff10-\uff19]{4}).*"

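Later, when you match this regex against some string, that other string should also be a unicode string, not a plain str. For that, you have to know which encoding the input uses. For example, if the input is utf-8, use: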

input_string_as_unicode = unicode(input_string_as_utf8, 'utf-8')
re.match(regstr, input_string_as_unicode)

Note that you may already be receiving the input as unicode, if some framework behind the scenes does this for you. If you are unsure, check type(input_string).
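Putting the advice together, a minimal sketch might decode the input only when it arrives as bytes and then match it against the escaped pattern. The helper names `to_text` and `has_zipcode` are hypothetical, UTF-8 is assumed as the input encoding, and the surrounding `.*` is dropped in favour of `search`.

```python
import re

# Escaped pattern: full-width digits are \uff10-\uff19, "ー" is \u30fc.
ZIP_RE = re.compile(u"[0-9\uff10-\uff19]{3}[-\u30fc][0-9\uff10-\uff19]{4}")


def to_text(value, encoding="utf-8"):
    """Decode byte input to text; pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def has_zipcode(value):
    """Return True if value contains something shaped like NNN-NNNN."""
    return ZIP_RE.search(to_text(value)) is not None
```

Since `bytes` is an alias of `str` on Python 2.7, the same check should behave the same way there.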

I just tried your tests on Python 3.6.6 and they work as expected. The only thing I did differently was to use re.compile. See:

Python 3.6.6 (default, Jul 19 2018, 14:25:17) 
[GCC 8.1.1 20180712 (Red Hat 8.1.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> zipcode = re.compile(r'.*([0-9０-９]{3}[-ー]{1}[0-9０-９]{4}).*')
>>> zipcode.match("this is obviously not going to work")
>>> zipcode.match("this is going to work 123-4567")
<_sre.SRE_Match object; span=(0, 30), match='this is going to work 123-4567'>
>>> zipcode.match("this is going to work 123-4567").group(0)
'this is going to work 123-4567'
>>> zipcode.match("this is going to work 123-4567").group(1)
'123-4567'
>>> zipcode.match("this is going to work too １２３ー４５６７").group(1)
'１２３ー４５６７'
>>> zipcode.match("This should not work, but it does :  １２３４ー５６７").group(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> zipcode.match("This should not work, but it does :  １２３４ー５６７")
>>> zipcode.match("Now just seems crazy ....... 京都府")
>>> zipcode.match("京都府")
>>>

EDIT

Here is what I have so far:

$ cat ziptest.py 
# -*- coding: utf-8 -*-
import re
zipcode = re.compile(r'.*([0-9０１２３４５６７８９]{3}[-ー]{1}[0-9０１２３４５６７８９]{4}).*')
tests = (
    "this is obviously not going to work",
    "this is going to work 123-4567",
    "this is going to work too １２３ー４５６７",
    "This will not work, as it should not :  1234-567",
    "This should not work, but it does :  １２３４ー５６７",
    "Now just seems crazy ....... 京都府",
    "京都府",
    "\xe4\xba\xac\xe9\x83\xbd\xe5\xba\x9c"
)
for test in tests:
    print('%s: %s' % (test, "Match" if zipcode.match(test) else "No match"))
$

Here are the results:

$ python2.7 ziptest.py 
this is obviously not going to work: No match
this is going to work 123-4567: Match
this is going to work too １２３ー４５６７: Match
This will not work, as it should not :  1234-567: No match
This should not work, but it does :  １２３４ー５６７: Match
Now just seems crazy ....... 京都府: No match
京都府: No match
京都府: No match
$ python3.6 ziptest.py 
this is obviously not going to work: No match
this is going to work 123-4567: Match
this is going to work too １２３ー４５６７: Match
This will not work, as it should not :  1234-567: No match
This should not work, but it does :  １２３４ー５６７: No match
Now just seems crazy ....... 京都府: No match
京都府: No match
äº¬é½åº: No match

I hope that helps.
