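I'm trying to parse names out of a bunch of semi-unpredictable strings. More specifically, I'm using Ruby, but I don't think that matters much. This is a contrived example, but some sample strings are: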
Eagles vs Bears
NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN
NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN
Philadelphia Eagles vs Chicago Bears - NFL Match
Phil.Eagles vs Chic.Bears
3agles vs B3ars
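The regex I've come up with is
/([0-9A-Z .]*) vs ([0-9A-Z .]*)(?:[ -:]*tune)?/i
but for "NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN" I'm receiving Chicago Bears TUNE as the second match. I'm trying to split off the "TUNE" part so that it ends up in its own group.
I thought that adding (?:[ -:]*tune)? would separate the ending section of the expression, the same way adding vs in the middle does, but that's not the case. If I remove the ? at the end, it matches the example above correctly, but then it no longer matches Eagles vs Bears.
If anyone could help me, I'd greatly appreciate it if you could break your regex down piece by piece.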
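You can capture the second group up to a -, a :, or tune preceded by zero or more whitespace characters, or up to the end of the line, while making the second group pattern lazy:
([\w .]*) vs ([\w .]*?)(?=\s*(?:[:-]|tune|$))
See the regex demo.
Details:
([\w .]*) - Group 1: zero or more word, space, or . characters, as many as possible
 vs  - the literal string " vs "
([\w .]*?) - Group 2: zero or more word, space, or . characters, as few as possible
(?=\s*(?:[:-]|tune|$)) - a positive lookahead that requires the following pattern to appear immediately to the right of the current location:
  \s* - zero or more whitespace characters
  (?:[:-]|tune|$) - a : or -, or tune, or the end of the line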
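As a minimal sketch of how this pattern could be used from Ruby: the helper name extract_teams and the constant TEAMS_RGX are hypothetical, the /i flag is assumed to be carried over from the question's regex (without it, "VS" and "TUNE" would not match), and the strip call is an addition, because group 1 can pick up the space that follows "NFL Matchup:".

```ruby
# Pattern from the answer above. The /i flag is an assumption carried over
# from the question so that "VS" and "TUNE" match as well.
TEAMS_RGX = /([\w .]*) vs ([\w .]*?)(?=\s*(?:[:-]|tune|$))/i

# Hypothetical helper: returns [team1, team2], or nil when nothing matches.
def extract_teams(str)
  m = TEAMS_RGX.match(str)
  return nil unless m

  # Group 1 may start with the space after a prefix such as "NFL Matchup:",
  # so strip surrounding whitespace from both captures.
  m.captures.map(&:strip)
end

p extract_teams("Eagles vs Bears")
#=> ["Eagles", "Bears"]
p extract_teams("NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN")
#=> ["Philadelphia Eagles", "Chicago Bears"]
```

Returning nil for non-matching input keeps the caller from calling a method on a nil MatchData.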
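You can use the following regular expression, which I've written in free-spacing mode to make it self-documenting (search the linked page for "free-spacing mode").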
rgx = /
  (?:[ ]|\A)            # match a space or the beginning of the string
  (?<team1>             # begin capture group team1
    (?<team>            # begin capture group team
      (?<word>          # begin capture group word
        (?:\p{Lu}|\d)   # word begins with an uppercase letter or digit
        (?:\p{Ll}|\d)+  # ...followed by 1+ lowercase letters or digits
      )                 # end capture group word
      (?:               # begin non-capture group
        [ .]            # match a space or period
        \g<word>        # match another word
      )*                # end non-capture group and execute 0+ times
    )                   # end capture group team
  )                     # end capture group team1
  [ ]+                  # match one or more spaces
  (?:VS|vs)             # match literal
  [ ]+                  # match one or more spaces
  (?<team2>             # begin capture group team2
    \g<team>            # match the second team name
  )                     # end capture group team2
  (?:                   # begin non-capture group
    [ ]                 # match a space
    (?:                 # begin non-capture group
      (?:-[ ])?         # optionally match literal
      TUNE[ ]IN         # match literal
      |                 # or
      -[ ]NFL[ ]Match   # match literal
    )                   # end inner non-capture group
  )?                    # end outer non-capture group and make it optional
  \z                    # match end of string
/x                      # free-spacing regex definition mode
examples = [
"Eagles vs Bears",
"NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN",
"NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN",
"Philadelphia Eagles vs Chicago Bears - NFL Match",
"Phil.Eagles vs Chic.Bears",
"3agles vs B3ars"
]
examples.map do |s|
m = s.match(rgx)
[m[:team1], m[:team2]]
end
#=> [["Eagles", "Bears"],
# ["Philadelphia Eagles", "Chicago Bears"],
# ["Philadelphia Eagles", "Chicago Bears"],
# ["Philadelphia Eagles", "Chicago Bears"],
# ["Phil.Eagles", "Chic.Bears"],
# ["3agles", "B3ars"]]
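See Regexp#match and MatchData#[].
Note that \g<word> and \g<team> effectively copy the code contained in capture groups word and team, respectively. These are called "subexpression calls". For more information, search for that term at Regexp. Using subexpression calls has two advantages: less code is needed, and there is less opportunity for coding errors.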
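To illustrate what "copying the code" means, here is a small sketch that contrasts a subexpression call (\g<name>) with a backreference (\k<name>). The dotted-number patterns and the constant names are hypothetical and are not part of the team-name regex above.

```ruby
# \g<octet> re-runs the *pattern* of group "octet", so each part may differ.
DOTTED_CALL = /\A(?<octet>\d{1,3})(?:\.\g<octet>){3}\z/

# \k<octet> is a backreference: it must repeat the *text* the group captured.
DOTTED_BACKREF = /\A(?<octet>\d{1,3})(?:\.\k<octet>){3}\z/

p DOTTED_CALL.match?("192.168.0.1")     #=> true
p DOTTED_BACKREF.match?("192.168.0.1")  #=> false
p DOTTED_BACKREF.match?("7.7.7.7")      #=> true
```

This is why \g<team> can match "Chicago Bears" even though the team group first matched "Philadelphia Eagles": the call reuses the pattern, not the matched text.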