Using a regex to parse data between a delimiter and the end of a specified substring



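I am trying to parse names out of a bunch of semi-unpredictable strings. More specifically, I am using Ruby, but I don't think that matters much. This is a contrived example, but some sample strings are: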

Eagles vs Bears
NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN
NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN
Philadelphia Eagles vs Chicago Bears - NFL Match
Phil.Eagles vs Chic.Bears
3agles vs B3ars

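The regex I came up with is

/([0-9A-Z .]*) vs ([0-9A-Z .]*)(?:[ -:]*tune)?/i

But for the "NFL Matchup: Philadelphia Eagles VS Chicago Bears" strings I get Chicago Bears TUNE as the second match. I am trying to strip off the "tune" part so that it ends up in its own group.

I thought that adding (?:[ -:]*tune)? would split off the end of the expression the same way the vs in the middle does, but it doesn't. If I remove the trailing ?, it matches the example above correctly, but it no longer matches Eagles vs Bears.

If anyone can help, I would really appreciate it if you could break your regex down piece by piece.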

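You can capture the second group up to a -, a : or tune preceded by zero or more whitespace characters, or up to the end of the line, while making the second group's pattern lazy:

([\w .]*) vs ([\w .]*?)(?=\s*(?:[:-]|tune|$))

See the regex demo.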

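Details

  • ([\w .]*) - Group 1: zero or more word, space or . characters, as many as possible
  • vs - the literal string vs, with a single space on each side
  • ([\w .]*?) - Group 2: zero or more word, space or . characters, as few as possible
  • (?=\s*(?:[:-]|tune|$)) - a positive lookahead that requires the following patterns to appear immediately to the right of the current location:
    • \s* - zero or more whitespace characters
    • (?:[:-]|tune|$) - a :, a -, the string tune, or the end of the line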
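
To see how the lazy second group and the lookahead behave together, here is a minimal Ruby sketch that runs the pattern over the sample strings from the question. Two things in it are assumptions rather than part of the pattern above: the /i flag (carried over from the question's regex so that VS and TUNE match in any case) and the strip call, which is needed because group 1 also accepts spaces and can pick up the space that follows "NFL Matchup:".

```ruby
# Lookahead-based pattern; the /i flag is an assumption taken from the question.
rgx = /([\w .]*) vs ([\w .]*?)(?=\s*(?:[:-]|tune|$))/i

examples = [
  "Eagles vs Bears",
  "NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN",
  "NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN",
  "Philadelphia Eagles vs Chicago Bears - NFL Match",
  "Phil.Eagles vs Chic.Bears",
  "3agles vs B3ars"
]

results = examples.map do |s|
  m = s.match(rgx)
  # Group 1 may start with the space after "NFL Matchup:", so trim both captures.
  m && m.captures.map(&:strip)
end

results.each { |pair| p pair }
# ["Eagles", "Bears"]
# ["Philadelphia Eagles", "Chicago Bears"]
# ["Philadelphia Eagles", "Chicago Bears"]
# ["Philadelphia Eagles", "Chicago Bears"]
# ["Phil.Eagles", "Chic.Bears"]
# ["3agles", "B3ars"]
```

Because the lookahead consumes nothing, the trailing "TUNE IN" or "- NFL Match" text is never part of the match; it only marks where group 2 has to stop.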

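You could use the following regex, which I have written in free-spacing mode to make it self-documenting (search for "Free-spacing mode" at the link).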

rgx = /
      (?:[ ]|\A)            # match a space or the beginning of the string
      (?<team1>             # begin capture group team1
        (?<team>            # begin capture group team
          (?<word>          # begin capture group word
            (?:\p{Lu}|\d)   # word begins with an uppercase letter or digit
            (?:\p{Ll}|\d)+  # ...followed by 1+ lowercase letters or digits
          )                 # end capture group word
          (?:               # begin non-capture group
            [ .]            # match a space or period
            \g<word>        # match another word
          )*                # end non-capture group and execute 0+ times
        )                   # end capture group team
      )                     # end capture group team1
      [ ]+                  # match one or more spaces
      (?:VS|vs)             # match literal
      [ ]+                  # match one or more spaces
      (?<team2>             # begin capture group team2
        \g<team>            # match the second team name
      )                     # end capture group team2
      (?:                   # begin non-capture group
        [ ]                 # match a space
        (?:                 # begin non-capture group
          (?:-[ ])?         # optionally match literal
          TUNE[ ]IN         # match literal
          |                 # or
          -[ ]NFL[ ]Match   # match literal
        )                   # end inner non-capture group
      )?                    # end outer non-capture group and make it optional
      \z                    # match end of string
      /x                    # free-spacing regex definition mode

examples = [
  "Eagles vs Bears",
  "NFL Matchup: Philadelphia Eagles VS Chicago Bears TUNE IN",
  "NFL Matchup: Philadelphia Eagles VS Chicago Bears - TUNE IN",
  "Philadelphia Eagles vs Chicago Bears - NFL Match",
  "Phil.Eagles vs Chic.Bears",
  "3agles vs B3ars"
]

examples.map do |s|
  m = s.match(rgx)
  [m[:team1], m[:team2]]
end
  #=> [["Eagles", "Bears"],
  #    ["Philadelphia Eagles", "Chicago Bears"],
  #    ["Philadelphia Eagles", "Chicago Bears"],
  #    ["Philadelphia Eagles", "Chicago Bears"],
  #    ["Phil.Eagles", "Chic.Bears"],
  #    ["3agles", "B3ars"]]

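See Regexp#match and MatchData#[].

Note that \g<word> and \g<team> effectively copy the code contained in the capture groups word and team, respectively. These are called "subexpression calls". For more information, search for that term at Regexp. Using subexpression calls has two advantages: less code is required, and there is less chance of coding errors.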
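
As a small illustration of that point, the sketch below compares a pattern that spells the "word" sub-pattern out twice with one that defines it once and reuses it through \g<word>. The patterns and sample inputs are hypothetical, made up for this example, and much simpler than the team regex above.

```ruby
# Hypothetical "word": one uppercase letter followed by 1+ lowercase letters.
# Version 1 writes the sub-pattern out twice.
repeated  = /\A\p{Lu}\p{Ll}+ \p{Lu}\p{Ll}+\z/

# Version 2 defines it once as the named group word and re-runs it via \g<word>.
with_call = /\A(?<word>\p{Lu}\p{Ll}+) \g<word>\z/

inputs = ["Chicago Bears", "chicago bears", "Chicago"]

# Both patterns should accept and reject exactly the same inputs.
agree = inputs.all? { |s| repeated.match?(s) == with_call.match?(s) }

inputs.each do |s|
  puts "#{s.inspect}: repeated=#{repeated.match?(s)} with_call=#{with_call.match?(s)}"
end
```

If the definition of a word changes, only the named group has to be edited in the second version, which is the reduced chance of error mentioned above.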
