Use regex to parse data between a delimiter and the end of a specified substring

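I'm trying to parse names out of a bunch of semi-unpredictable strings. More specifically, I'm using Ruby, but I don't think that matters. This is a contrived example, but some sample strings are: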

Eagles vs Bears
NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN
NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN
Philadelphia Eagles vs Chicago Bears - NFL Match
Phil.Eagles vs Chic.Bears
3agles vs B3ars

The regular expression I came up with is

/([0-9A-Z .]*) vs ([0-9A-Z .]*)(?:[ -:]*tune)?/i

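But for "NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN" I get Chicago Bears TUNE as the second match. I'm trying to strip off "TUNE" so that it ends up in a group of its own.

I thought that adding (?:[ -:]*tune)? would split off the end of the expression the same way the vs in the middle does, but that is not the case. If I remove the trailing ?, it matches the example above correctly, but it no longer matches Eagles vs Bears.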
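To see why the optional trailing group changes nothing, it can help to print what the second group captures. This is a minimal sketch, assuming the pattern is written as a Ruby regex literal with the i flag; the variable names are only illustrative.

```ruby
# The pattern from the question, as a Ruby regex literal (assumed form).
original = /([0-9A-Z .]*) vs ([0-9A-Z .]*)(?:[ -:]*tune)?/i

m = original.match("NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN")

# The second group is greedy, so it keeps consuming letters and spaces past
# "Bears". The optional group that follows is then satisfied by matching
# nothing, which is why the trailing text stays inside group 2.
p m[2]
```

Because the last group is optional, the engine never has a reason to backtrack out of the greedy second group.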

If anyone can help, I would really appreciate it if you could break your regex down piece by piece.

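You can capture the second group up to a -, a :, or tune preceded by zero or more whitespace characters, or up to the end of the line, while making the second group pattern lazy:

([\w .]*) vs ([\w .]*?)(?=\s*(?:[:-]|tune|$))

See the regex demo.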

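Details:

([\w .]*) - Group 1: zero or more word, space, or . characters, as many as possible
 vs  - the literal string vs with a space on each side
([\w .]*?) - Group 2: zero or more word, space, or . characters, as few as possible
(?=\s*(?:[:-]|tune|$)) - a positive lookahead that requires the following pattern to appear immediately to the right of the current location:
- \s* - zero or more whitespace characters
- (?:[:-]|tune|$) - a : or -, or tune, or the end of the line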
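As a quick check of the breakdown above, here is a minimal Ruby sketch that runs the pattern over the sample strings from the question. It assumes the i flag is set (so that VS and TUNE also match) and strips the captures, since group 1 can pick up a leading space after "NFL Matchup:". The names pattern and results are only illustrative.

```ruby
# Lazy second group plus lookahead; the i flag is an assumption.
pattern = /([\w .]*) vs ([\w .]*?)(?=\s*(?:[:-]|tune|$))/i

examples = [
  "Eagles vs Bears",
  "NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN",
  "NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN",
  "Philadelphia Eagles vs Chicago Bears - NFL Match",
  "Phil.Eagles vs Chic.Bears",
  "3agles vs B3ars"
]

results = examples.map do |s|
  m = s.match(pattern)
  # Strip both captures: group 1 may start with the space that follows ":".
  m && [m[1].strip, m[2].strip]
end

results.each { |pair| p pair }
```

The lookahead does not consume anything, so the trailing "TUNE IN" or "- NFL Match" text is checked for but never becomes part of group 2.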

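You can use the following regular expression, which I have written in free-spacing mode to make it self-documenting (search the Regexp documentation for "Free-Spacing Mode").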

rgx = /
  (?:[ ]|\A)            # match space or beginning of string
  (?<team1>             # begin capture group team1
    (?<team>            # begin capture group team
      (?<word>          # begin capture group word
        (?:\p{Lu}|\d)   # word begins with an uppercase letter or digit
        (?:\p{Ll}|\d)+  # ...followed by 1+ lowercase letters or digits
      )                 # end capture group word
      (?:               # begin non-capture group
        [ .]            # match a space or period
        \g<word>        # match another word
      )*                # end non-capture group and execute 0+ times
    )                   # end capture group team
  )                     # end capture group team1
  [ ]+                  # match one or more spaces
  (?:VS|vs)             # match literal
  [ ]+                  # match one or more spaces
  (?<team2>             # begin capture group team2
    \g<team>            # match the second team name
  )                     # end capture group team2
  (?:                   # begin non-capture group
    [ ]                 # match a space
    (?:                 # begin non-capture group
      (?:-[ ])?         # optionally match literal
      TUNE[ ]IN         # match literal
      |                 # or
      -[ ]NFL[ ]Match   # match literal
    )                   # end inner non-capture group
  )?                    # end outer non-capture group and make it optional
  \z                    # match end of string
/x                      # free-spacing regex definition mode

examples = [
  "Eagles vs Bears",
  "NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN",
  "NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN",
  "Philadelphia Eagles vs Chicago Bears - NFL Match",
  "Phil.Eagles vs Chic.Bears",
  "3agles vs B3ars"
]

examples.map do |s|
  m = s.match(rgx)
  [m[:team1], m[:team2]]
end
#=> [["Eagles", "Bears"],
#    ["Philadelphia Eagles", "Chicago Bears"],
#    ["Philadelphia Eagles", "Chicago Bears"],
#    ["Philadelphia Eagles", "Chicago Bears"],
#    ["Phil.Eagles", "Chic.Bears"],
#    ["3agles", "B3ars"]]

See Regexp#match and MatchData#[].

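Note that \g<word> and \g<team> effectively replicate the code contained in the capture groups word and team, respectively. These are called "subexpression calls". For more information, search for that term in the Regexp documentation. Using subexpression calls has two advantages: less code is needed, and there is less opportunity for coding errors.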
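The difference between a subexpression call and a backreference can be shown with a small sketch. The two patterns below are hypothetical and are not part of the regex above: \g<word> re-runs the pattern of the group word, whereas \k<word> requires the same text that the group matched.

```ruby
# Subexpression call: the second word only has to fit the same pattern.
pair = /\A(?<word>\p{Lu}\p{Ll}+) \g<word>\z/

# Backreference: the second word has to repeat the first word's text.
same = /\A(?<word>\p{Lu}\p{Ll}+) \k<word>\z/

puts pair.match?("Chicago Bears")  # two different capitalized words
puts same.match?("Chicago Bears")  # fails, "Bears" is not "Chicago"
puts same.match?("Bears Bears")    # succeeds, the text is repeated
```

This is why \g<team> can match "Chicago Bears" even though the group team first matched "Philadelphia Eagles".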
