Elisp mechanism for converting PCRE regexps to Emacs regexps



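I admit to a strong bias toward PCRE regexps over Emacs ones, if for no other reason than that when I type a "(" I almost always want a grouping operator. And, of course, \w and similar escapes are far more convenient than the equivalents.

It would be crazy to expect to change the internals of Emacs, of course. But I'd think it should be possible to convert a PCRE expression into an Emacs expression, doing all the needed conversions, so that I can write: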

(defun my-super-regexp-function ...
   (re-search-forward (pcre-convert "__\\w: \\d+")))

(or similar).

Does anyone know of an elisp library that can do this?


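Edit: Selecting a response from the answers below

Wow, I love coming back from 4 days of vacation to find a slew of interesting answers to sort through! I like both types of solution.

In the end, it looks like both the exec-a-script and the straight elisp versions of the solution would work, but from a pure speed and "correctness" standpoint the elisp version is certainly the one that people would prefer (myself included).

https://github.com/joddie/pcre2el is the up-to-date version of this answer.

pcre2el or rxt (RegeXp Translator or RegeXp Tools) is a utility for working with regular expressions in Emacs, based on a recursive-descent parser for regexp syntax. In addition to converting (a subset of) PCRE syntax into its Emacs equivalent, it can do the following: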

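  • convert Emacs syntax to PCRE
  • convert either syntax to rx, an S-expression based regexp syntax
  • untangle complex regexps by showing the parse tree in rx form and highlighting the corresponding chunks of code
  • show the complete list of strings (productions) matching a regexp, provided the list is finite
  • provide live font-locking of regexp syntax (so far only for Elisp buffers; other modes are on the TODO list)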
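
As a rough illustration of the main use case (PCRE to Emacs syntax), the sketch below calls pcre2el's `rxt-pcre-to-elisp` on a pattern like the one in the question. It assumes pcre2el is installed and on the `load-path`. `my-pcre-match-p` is a name made up for this example, and the guard makes it return nil when the package is missing.

```elisp
;; Sketch only: assumes the pcre2el package is installed.
;; `my-pcre-match-p' is a hypothetical helper name, not part of pcre2el.
(defun my-pcre-match-p (pcre string)
  "Return non-nil if PCRE (a Perl-style regexp) matches STRING.
Return nil without signalling an error when pcre2el is unavailable."
  (when (require 'pcre2el nil t)
    ;; Translate the PCRE to Emacs syntax, then match as usual.
    (string-match-p (rxt-pcre-to-elisp pcre) string)))

;; Backslashes are doubled because this is an Elisp string literal.
(my-pcre-match-p "__\\w+: \\d+" "__name: 42")
```

The translated string itself is not shown here, since its exact form depends on the pcre2el version.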

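The original answer follows...


Here's a quick and ugly Emacs Lisp solution (EDIT: it now has a more permanent home; the pcre2el link above is its up-to-date version). It's based mostly on the description in the pcrepattern man page, works token by token, and converts only the following constructions: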

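  • parenthesis grouping ( .. )
  • alternation |
  • numerical repeats {M,N}
  • string quoting \Q .. \E
  • simple character escapes: \a, \c, \e, \f, \n, \r, \t, \x, and \ followed by octal digits
  • character classes: \d, \D, \h, \H, \s, \S, \v, \V
  • \w and \W left as they are (using Emacs' own idea of word and non-word characters)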

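It doesn't do anything with more complicated PCRE assertions, but it does try to convert escapes inside character classes. For character classes that include something like \D, this is done by converting the class into a non-capturing group with alternation.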
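
To make the character-class trick concrete, here is a small hand-written sketch (not output copied from the converter). The PCRE class [a\D] has no direct bracket form in Emacs, so it can be written as a shy group whose alternatives are the ordinary part of the class and the negated digit class. `my-translated-class` is an illustrative name.

```elisp
;; PCRE:  [a\D]   (the letter "a", or any character that is not a digit)
;; Emacs has no \D inside brackets, so split the class into alternatives
;; wrapped in a shy (non-capturing) group.
(defvar my-translated-class "\\(?:[a]\\|[^0-9]\\)"
  "Hand-written Emacs equivalent of the PCRE class [a\\D] (illustrative).")

(string-match-p my-translated-class "a")   ; matches via the first branch
(string-match-p my-translated-class "x")   ; matches: "x" is not a digit
(string-match-p my-translated-class "7")   ; no match: a digit, and not "a"
```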

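It passes the tests I wrote for it, but there are certainly bugs, and the token-by-token scanning method is probably slow. In other words, no warranty. But perhaps it will do enough of the simpler part of the job for some purposes. Interested parties are invited to improve it ;-)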
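
The token-by-token approach can be sketched in isolation with a deliberately tiny translator: insert the pattern into a temporary buffer, look at the text at point, emit the Emacs form of whatever token is there, and move on. The function name `my-toy-pcre-to-elisp` is hypothetical. It only swaps the escaping of ( ) { } | and copies everything else, so it is not a substitute for the full code that follows.

```elisp
;; Toy illustration only; `my-toy-pcre-to-elisp' is a hypothetical name.
;; It ignores character classes, \d, \Q..\E and everything else the
;; full converter handles.
(defun my-toy-pcre-to-elisp (pcre)
  "Swap the escaping of grouping, alternation and brace tokens in PCRE."
  (with-temp-buffer
    (insert pcre)
    (goto-char (point-min))
    (let ((case-fold-search nil)
          (accum '()))
      (while (not (eobp))
        (cond
         ;; Backslash + any character: PCRE \( is a literal paren, which
         ;; Emacs writes unescaped; other escapes are copied unchanged.
         ((looking-at "\\\\\\(.\\)")
          (push (if (string-match-p "[(){}|]" (match-string 1))
                    (match-string 1)
                  (match-string 0))
                accum))
         ;; Bare operator: Emacs needs a backslash in front of it.
         ((looking-at "[(){}|]")
          (push (concat "\\" (match-string 0)) accum))
         ;; Anything else (including a newline): copy one character.
         (t
          (looking-at "\\(?:.\\|\n\\)")
          (push (match-string 0) accum)))
        ;; Move point past whichever token was just consumed.
        (goto-char (match-end 0)))
      (apply #'concat (nreverse accum)))))

(my-toy-pcre-to-elisp "(a|b){2}")   ; => "\\(a\\|b\\)\\{2\\}"
```

Handling the backslash case first matters: it keeps an escaped operator such as \( from being treated as a grouping token.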

(eval-when-compile (require 'cl))
(defvar pcre-horizontal-whitespace-chars
  (mapconcat 'char-to-string
             '(#x0009 #x0020 #x00A0 #x1680 #x180E #x2000 #x2001 #x2002 #x2003
                      #x2004 #x2005 #x2006 #x2007 #x2008 #x2009 #x200A #x202F
                      #x205F #x3000)
             ""))
(defvar pcre-vertical-whitespace-chars
  (mapconcat 'char-to-string
             '(#x000A #x000B #x000C #x000D #x0085 #x2028 #x2029) ""))
(defvar pcre-whitespace-chars
  (mapconcat 'char-to-string '(9 10 12 13 32) ""))
(defvar pcre-horizontal-whitespace
  (concat "[" pcre-horizontal-whitespace-chars "]"))
(defvar pcre-non-horizontal-whitespace
  (concat "[^" pcre-horizontal-whitespace-chars "]"))
(defvar pcre-vertical-whitespace
  (concat "[" pcre-vertical-whitespace-chars "]"))
(defvar pcre-non-vertical-whitespace
  (concat "[^" pcre-vertical-whitespace-chars "]"))
(defvar pcre-whitespace (concat "[" pcre-whitespace-chars "]"))
(defvar pcre-non-whitespace (concat "[^" pcre-whitespace-chars "]"))
(eval-when-compile
  (defmacro pcre-token-case (&rest cases)
    "Consume a token at point and evaluate corresponding forms.
CASES is a list of `cond'-like clauses, (REGEXP FORMS
...). Considering CASES in order, if the text at point matches
REGEXP then moves point over the matched string and returns the
value of FORMS. Returns `nil' if none of the CASES matches."
    (declare (debug (&rest (sexp &rest form))))
    `(cond
      ,@(mapcar
         (lambda (case)
           (let ((token (car case))
                 (action (cdr case)))
             `((looking-at ,token)
               (goto-char (match-end 0))
               ,@action)))
         cases)
      (t nil))))
(defun pcre-to-elisp (pcre)
  "Convert PCRE, a regexp in PCRE notation, into Elisp string form."
  (with-temp-buffer
    (insert pcre)
    (goto-char (point-min))
    (let ((capture-count 0) (accum '())
          (case-fold-search nil))
      (while (not (eobp))
        (let ((translated
               (or
                ;; Handle tokens that are treated the same in
                ;; character classes
                (pcre-re-or-class-token-to-elisp)   
                ;; Other tokens
                (pcre-token-case
                 ("|" "\\|")
                 ("(" (incf capture-count) "\\(")
                 (")" "\\)")
                 ("{" "\\{")
                 ("}" "\\}")
                 ;; Character class
                 ("\\[" (pcre-char-class-to-elisp))
                 ;; Backslash + digits => backreference or octal char?
                 ("\\\\\\([0-9]+\\)"
                  (let* ((digits (match-string 1))
                         (dec (string-to-number digits)))
                    ;; from "man pcrepattern": If the number is
                    ;; less than 10, or if there have been at
                    ;; least that many previous capturing left
                    ;; parentheses in the expression, the entire
                    ;; sequence is taken as a back reference.   
                    (cond ((< dec 10) (concat "\\" digits))
                          ((>= capture-count dec)
                           (error "backreference \\%s can't be used in Emacs regexps"
                                  digits))
                          (t
                           ;; from "man pcrepattern": if the
                           ;; decimal number is greater than 9 and
                           ;; there have not been that many
                           ;; capturing subpatterns, PCRE re-reads
                           ;; up to three octal digits following
                           ;; the backslash, and uses them to
                           ;; generate a data character. Any
                           ;; subsequent digits stand for
                           ;; themselves.
                           (goto-char (match-beginning 1))
                           (re-search-forward "[0-7]\\{0,3\\}")
                           (char-to-string (string-to-number (match-string 0) 8))))))
                 ;; Regexp quoting.
                 ("\\\\Q"
                  (let ((beginning (point)))
                    (search-forward "\\E")
                    (regexp-quote (buffer-substring beginning (match-beginning 0)))))
                 ;; Various character classes
                 ("\\\\d" "[0-9]")
                 ("\\\\D" "[^0-9]")
                 ("\\\\h" pcre-horizontal-whitespace)
                 ("\\\\H" pcre-non-horizontal-whitespace)
                 ("\\\\s" pcre-whitespace)
                 ("\\\\S" pcre-non-whitespace)
                 ("\\\\v" pcre-vertical-whitespace)
                 ("\\\\V" pcre-non-vertical-whitespace)
                 ;; Use Emacs' native notion of word characters
                 ("\\\\[Ww]" (match-string 0))
                 ;; Any other escaped character
                 ("\\\\\\(.\\)" (regexp-quote (match-string 1)))
                 ;; Any normal character
                 ("." (match-string 0))))))
          (push translated accum)))
      (apply 'concat (reverse accum)))))
(defun pcre-re-or-class-token-to-elisp ()
  "Consume the PCRE token at point and return its Elisp equivalent.
Handles only tokens which have the same meaning in character
classes as outside them."
  (pcre-token-case
   ("\\\\a" (char-to-string #x07))  ; bell
   ("\\\\c\\(.\\)"                  ; control character
    (char-to-string
     (- (string-to-char (upcase (match-string 1))) 64)))
   ("\\\\e" (char-to-string #x1b))  ; escape
   ("\\\\f" (char-to-string #x0c))  ; formfeed
   ("\\\\n" (char-to-string #x0a))  ; linefeed
   ("\\\\r" (char-to-string #x0d))  ; carriage return
   ("\\\\t" (char-to-string #x09))  ; tab
   ("\\\\x\\([A-Za-z0-9]\\{2\\}\\)"
    (char-to-string (string-to-number (match-string 1) 16)))
   ("\\\\x{\\([A-Za-z0-9]*\\)}"
    (char-to-string (string-to-number (match-string 1) 16)))))
(defun pcre-char-class-to-elisp ()
  "Consume the remaining PCRE character class at point and return its Elisp equivalent.
Point should be after the opening \"[\" when this is called, and
will be just after the closing \"]\" when it returns."
  (let ((accum '("["))
        (pcre-char-class-alternatives '())
        (negated nil))
    (when (looking-at "\\^")
      (setq negated t)
      (push "^" accum)
      (forward-char))
    (when (looking-at "\\]") (push "]" accum) (forward-char))
    (while (not (looking-at "\\]"))
      (let ((translated
             (or
              (pcre-re-or-class-token-to-elisp)
              (pcre-token-case              
               ;; Backslash + digits => always an octal char
               ("\\\\\\([0-7]\\{1,3\\}\\)"
                (char-to-string (string-to-number (match-string 1) 8)))
               ;; Various character classes. To implement negative char classes,
               ;; we cons them onto the list `pcre-char-class-alternatives' and
               ;; transform the char class into a shy group with alternation
               ("\\\\d" "0-9")
               ("\\\\D" (push (if negated "[0-9]" "[^0-9]")
                              pcre-char-class-alternatives) "")
               ("\\\\h" pcre-horizontal-whitespace-chars)
               ("\\\\H" (push (if negated
                                  pcre-horizontal-whitespace
                                pcre-non-horizontal-whitespace)
                              pcre-char-class-alternatives) "")
               ("\\\\s" pcre-whitespace-chars)
               ("\\\\S" (push (if negated
                                  pcre-whitespace
                                pcre-non-whitespace)
                              pcre-char-class-alternatives) "")
               ("\\\\v" pcre-vertical-whitespace-chars)
               ("\\\\V" (push (if negated
                                  pcre-vertical-whitespace
                                pcre-non-vertical-whitespace)
                              pcre-char-class-alternatives) "")
               ("\\\\w" (push (if negated "\\W" "\\w")
                              pcre-char-class-alternatives) "")
               ("\\\\W" (push (if negated "\\w" "\\W")
                              pcre-char-class-alternatives) "")
               ;; Leave POSIX syntax unchanged
               ("\\[:[a-z]*:\\]" (match-string 0))
               ;; Ignore other escapes
               ("\\\\\\(.\\)" (match-string 0))
               ;; Copy everything else
               ("." (match-string 0))))))
        (push translated accum)))
    (push "]" accum)
    (forward-char)
    (let ((class
           (apply 'concat (reverse accum))))
      (when (or (equal class "[]")
                (equal class "[^]"))
        (setq class ""))
      (if (not pcre-char-class-alternatives)
          class
        (concat "\\(?:"
                class "\\|"
                (mapconcat 'identity
                           pcre-char-class-alternatives
                           "\\|")
                "\\)")))))

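I made a few minor modifications to a Perl script I found on perlmonks (so that it takes its values from the command line) and saved it as re_pl2el.pl (given below). With that in place, the following does a decent job of converting PCRE to elisp regexps, at least for the non-exotic cases I tested.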

(defun pcre-to-elre (regex)
  (interactive "MPCRE expression: ")
  (shell-command-to-string (concat "re_pl2el.pl -i -n "
                                   (shell-quote-argument regex))))
(pcre-to-elre "__\\w: \\d+") ;-> "__[[:word:]]: [[:digit:]]+"

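It doesn't handle a few "corner" cases such as Perl's non-greedy {N,M}? construct, and of course not code execution and the like, but it might serve your needs or be a good starting point. Since you like PCRE, I presume you know enough Perl to fix any cases you use often. If not, let me know and we can probably fix them.

I would be happier with a script that parsed the regex into an AST and then spat it back out in elisp format (since it could then spit it out in rx format too), but I couldn't find anything that does that, and it seemed like a lot of work when I should be working on my thesis. :-) I find it hard to believe that no one has done it, though.

Below is my "improved" version of re_pl2el.pl. -i means don't double-escape backslashes for use in strings, and -n means don't print a final newline.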
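
A rough sketch of why an AST helps: Emacs' built-in rx notation is already a tree, and `rx-to-string` turns such a tree into a regexp string. The parse tree below was written by hand for the PCRE `__\w+: \d+`; no parser is involved. The names `my-example-ast` and `my-ast-to-elisp-regexp` are made up for this example.

```elisp
(require 'rx)

;; Hand-built parse tree for the PCRE  __\w+: \d+  in rx notation.
;; A real tool would produce this from a parser; here it is an assumption.
(defvar my-example-ast '(seq "__" (+ wordchar) ": " (+ digit))
  "Hypothetical AST for the example pattern, expressed as an rx form.")

(defun my-ast-to-elisp-regexp (ast)
  "Render AST (an rx form) as an Emacs regexp string."
  ;; The second argument suppresses an enclosing shy group.
  (rx-to-string ast t))

;; The same tree serves as both the rx output and the string-regexp source.
(string-match-p (my-ast-to-elisp-regexp my-example-ast) "__foo: 123")
```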

#! /usr/bin/perl
#
# File: re_pl2el.pl
# Modified from http://perlmonks.org/?node_id=796020
#
# Description:
#
use strict;
use warnings;
# version 0.4

# TODO
# * wrap converter to function
# * testsuite
#--- flags
my $flag_interactive; # true => no extra escaping of backslashes
if ( int(@ARGV) >= 1 and $ARGV[0] eq '-i' ) {
    $flag_interactive = 1;
    shift @ARGV;
}
if ( int(@ARGV) >= 1 and $ARGV[0] eq '-n' ) {
    shift @ARGV;
} else {
    $\="\n";
}
if ( int(@ARGV) < 1 ) {
    print "usage: $0 [-i] [-n] REGEX";
    exit;
}
my $RE='\w*(a|b|c)\d\(';
$RE='\d{2,3}';
$RE='"(.*?)"';
$RE="\0".'\"\t(.*?)"';
$RE=$ARGV[0];
# print "Perlcode:\t $RE";
#--- encode all \0 chars as escape sequence
$RE=~s#\0#\\0#g;
#--- substitute pairs of backslashes with \0
$RE=~s#\\\\#\0#g;
#--- hide escape sequences of \t,\n,... with
#    corresponding ascii code
my %ascii=(
       t =>"\t",
       n=> "\n"
      );
my $kascii=join "|",keys %ascii;
$RE=~s#\\($kascii)#$ascii{$1}#g;

#---  normalize needless escaping
# e.g.  from /\"/ to /"/, since it's no difference in perl
# but might confuse elisp
$RE=~s#\\"#"#g;
#--- toggle escaping of 'backslash constructs'
my $bsc='(){}|';
$RE=~s#[$bsc]#\\$&#g;  # escape them once
$RE=~s#\\\\##g;        # and erase double-escaping

#--- replace character classes
my %charclass=(
        w => 'word' ,   # TODO: emacs22 already knows \w ???
        d => 'digit',
        s => 'space'
       );
my $kc=join "|",keys %charclass;
$RE=~s#\\($kc)#[[:$charclass{$1}:]]#g;

#--- unhide pairs of backslashes
$RE=~s#\0#\\\\#g;
#--- escaping for elisp string
unless ($flag_interactive){
  $RE=~s#\\#\\\\#g; # ... backslashes
  $RE=~s#"#\\"#g;   # ... quotes
}
#--- unhide escape sequences of \t,\n,...
my %rascii= reverse %ascii;
my $vascii=join "|",keys %rascii;
$RE=~s#($vascii)#\\$rascii{$1}#g;
# print "Elispcode:\t $RE";
print "$RE";
#TODO whats the elisp syntax for \0 ???

Possibly the closest previous work on this has been the extensions to M-x re-builder, see

http://www.emacswiki.org/emacs/ReBuilder

or the work of Ye Wenbin on PDE:

http://cpansearch.perl.org/src/YEWENBIN/Emacs-PDE-0.2.16/lisp/doc/pde.html

Possibly relevant is visual-regexp-steroids, which extends query-replace with a live preview and lets you use different regexp backends, including PCRE.
