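I'll admit that I prefer PCRE regexps to Emacs's, if for no other reason than that when I type a "(" I almost always want a grouping operator. And, of course, \w and similar operators are far more convenient than the other equivalents.

Of course it would be crazy to expect to change the internals of Emacs. But I think it should be possible to convert a PCRE expression into an Emacs expression, doing all the needed conversions, so that I can write:

(defun my-super-regexp-function ...
   (search-forward (pcre-convert "__\\w: \\d+")))

(or similar).

Does anyone know of an elisp library that can do this?

Edit: selecting a response from the answers below

Wow, I love coming back from 4 days of vacation to find a slew of interesting answers to sort through! I like both types of solution.

In the end, it looks like both the exec-a-script and the straight elisp versions of the solution would work, but from a pure speed and "correctness" standpoint the elisp version is certainly the one people would prefer (myself included).

https://github.com/joddie/pcre2el is the up-to-date version of this answer.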
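pcre2el or rxt (RegeXp Translator or RegeXp Tools) is a utility for working with regular expressions in Emacs, based on a recursive-descent parser for regexp syntax. In addition to converting (a subset of) PCRE syntax into its Emacs equivalent, it can do the following:
- convert Emacs syntax to PCRE
- convert either syntax to rx, an S-expression based regexp syntax
- untangle complex regexps by showing the parse tree in rx form and highlighting the corresponding chunks of code
- show the complete list of strings (productions) matching a regexp, provided the list is finite
- provide live font-locking of regexp syntax (so far only for Elisp buffers; other modes are on the TODO list)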
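As a rough illustration of the conversions listed above, here is a minimal sketch that assumes the pcre2el package is installed and that it provides `rxt-pcre-to-elisp`, `rxt-pcre-to-rx` and `rxt-elisp-to-pcre` as described in its README. The helper name `my-pcre2el-demo` and the sample patterns are made up for this example, and no particular output is claimed.

```elisp
(defun my-pcre2el-demo ()
  "Return a list of sample conversions, or nil if pcre2el is unavailable.
Hypothetical helper: the patterns are arbitrary examples."
  ;; `require' with NOERROR so the sketch is harmless without the package.
  (when (require 'pcre2el nil t)
    (let ((pcre "(ab|cd)\\d{2,3}"))
      (list
       ;; PCRE string -> Emacs regexp string
       (rxt-pcre-to-elisp pcre)
       ;; PCRE string -> rx form
       (rxt-pcre-to-rx pcre)
       ;; Emacs regexp string -> PCRE string
       (rxt-elisp-to-pcre "\\(ab\\|cd\\)[0-9]+")))))

(my-pcre2el-demo)
```

Evaluating the last form in `*scratch*` is probably the quickest way to see what the package produces for a given pattern.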
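The original answer follows...

Here's a quick and ugly Emacs lisp solution (EDIT: now located more permanently here). It's based mostly on the description in the pcrepattern man page, and it works token by token, converting only the following constructions:
- parenthesis grouping ( .. )
- alternation |
- numerical repeats {M,N}
- string quoting \Q .. \E
- simple character escapes: \a, \c, \e, \f, \n, \r, \t, \x, and \ + octal digits
- character classes: \d, \D, \h, \H, \s, \S, \v, \V
- \w and \W are left as they are (using Emacs' own idea of word and non-word characters)

It doesn't do anything with more complicated PCRE assertions, but it does try to convert escapes inside character classes. For character classes that include something like \D, this is done by converting to a non-capturing group with alternation.

It passes the tests I wrote for it, but there are certainly bugs, and the token-by-token scanning method is probably slow. In other words, no warranty. But for some purposes it may do the simpler part of the job. Interested parties are invited to improve it ;-)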
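To make the character-class trick concrete, here is a small hand-worked sketch of what the two cases look like on the Emacs side. It is an illustration written for this explanation, not output captured from the converter; the names `my-class-with-digit`, `my-class-with-non-digit` and `my-class-matches-p`, and the sample classes `[x\d]` and `[x\D]`, are assumptions chosen for the example.

```elisp
;; PCRE [x\d]: \d can simply be folded into the bracket expression.
(defconst my-class-with-digit "[x0-9]")

;; PCRE [x\D]: "anything but a digit" cannot be written inside a
;; non-negated bracket expression, so the class becomes a shy
;; (non-capturing) group that alternates with [^0-9].
(defconst my-class-with-non-digit "\\(?:[x]\\|[^0-9]\\)")

(defun my-class-matches-p (regexp char-string)
  "Return non-nil if REGEXP matches the whole of CHAR-STRING."
  (let ((case-fold-search nil))
    (string-match-p (concat "\\`" regexp "\\'") char-string)))

(my-class-matches-p my-class-with-digit "7")      ; non-nil: a digit
(my-class-matches-p my-class-with-non-digit "x")  ; non-nil: literal x
(my-class-matches-p my-class-with-non-digit "q")  ; non-nil: any non-digit
(my-class-matches-p my-class-with-non-digit "7")  ; nil: digits are excluded
```

The cost of this design is that the result is no longer a single bracket expression, so it cannot be nested inside another class.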
(eval-when-compile (require 'cl-lib))

(defvar pcre-horizontal-whitespace-chars
  (mapconcat 'char-to-string
             '(#x0009 #x0020 #x00A0 #x1680 #x180E #x2000 #x2001 #x2002 #x2003
                      #x2004 #x2005 #x2006 #x2007 #x2008 #x2009 #x200A #x202F
                      #x205F #x3000)
             ""))

(defvar pcre-vertical-whitespace-chars
  (mapconcat 'char-to-string
             '(#x000A #x000B #x000C #x000D #x0085 #x2028 #x2029) ""))

(defvar pcre-whitespace-chars
  (mapconcat 'char-to-string '(9 10 12 13 32) ""))

(defvar pcre-horizontal-whitespace
  (concat "[" pcre-horizontal-whitespace-chars "]"))

(defvar pcre-non-horizontal-whitespace
  (concat "[^" pcre-horizontal-whitespace-chars "]"))

(defvar pcre-vertical-whitespace
  (concat "[" pcre-vertical-whitespace-chars "]"))

(defvar pcre-non-vertical-whitespace
  (concat "[^" pcre-vertical-whitespace-chars "]"))

(defvar pcre-whitespace (concat "[" pcre-whitespace-chars "]"))

(defvar pcre-non-whitespace (concat "[^" pcre-whitespace-chars "]"))

(eval-when-compile
  (defmacro pcre-token-case (&rest cases)
    "Consume a token at point and evaluate corresponding forms.

CASES is a list of `cond'-like clauses, (REGEXP FORMS
...). Considering CASES in order, if the text at point matches
REGEXP then moves point over the matched string and returns the
value of FORMS. Returns `nil' if none of the CASES matches."
    (declare (debug (&rest (sexp &rest form))))
    `(cond
      ,@(mapcar
         (lambda (case)
           (let ((token (car case))
                 (action (cdr case)))
             `((looking-at ,token)
               (goto-char (match-end 0))
               ,@action)))
         cases)
      (t nil))))

(defun pcre-to-elisp (pcre)
  "Convert PCRE, a regexp in PCRE notation, into Elisp string form."
  (with-temp-buffer
    (insert pcre)
    (goto-char (point-min))
    (let ((capture-count 0) (accum '())
          (case-fold-search nil))
      (while (not (eobp))
        (let ((translated
               (or
                ;; Handle tokens that are treated the same in
                ;; character classes
                (pcre-re-or-class-token-to-elisp)

                ;; Other tokens
                (pcre-token-case
                 ("|" "\\|")
                 ("(" (cl-incf capture-count) "\\(")
                 (")" "\\)")
                 ("{" "\\{")
                 ("}" "\\}")

                 ;; Character class
                 ("\\[" (pcre-char-class-to-elisp))

                 ;; Backslash + digits => backreference or octal char?
                 ("\\\\\\([0-9]+\\)"
                  (let* ((digits (match-string 1))
                         (dec (string-to-number digits)))
                    ;; from "man pcrepattern": If the number is
                    ;; less than 10, or if there have been at
                    ;; least that many previous capturing left
                    ;; parentheses in the expression, the entire
                    ;; sequence is taken as a back reference.
                    (cond ((< dec 10) (concat "\\" digits))
                          ((>= capture-count dec)
                           (error "backreference \\%s can't be used in Emacs regexps"
                                  digits))
                          (t
                           ;; from "man pcrepattern": if the
                           ;; decimal number is greater than 9 and
                           ;; there have not been that many
                           ;; capturing subpatterns, PCRE re-reads
                           ;; up to three octal digits following
                           ;; the backslash, and uses them to
                           ;; generate a data character. Any
                           ;; subsequent digits stand for
                           ;; themselves.
                           (goto-char (match-beginning 1))
                           (re-search-forward "[0-7]\\{0,3\\}")
                           (char-to-string (string-to-number (match-string 0) 8))))))

                 ;; Regexp quoting.
                 ("\\\\Q"
                  (let ((beginning (point)))
                    (search-forward "\\E")
                    (regexp-quote (buffer-substring beginning (match-beginning 0)))))

                 ;; Various character classes
                 ("\\\\d" "[0-9]")
                 ("\\\\D" "[^0-9]")
                 ("\\\\h" pcre-horizontal-whitespace)
                 ("\\\\H" pcre-non-horizontal-whitespace)
                 ("\\\\s" pcre-whitespace)
                 ("\\\\S" pcre-non-whitespace)
                 ("\\\\v" pcre-vertical-whitespace)
                 ("\\\\V" pcre-non-vertical-whitespace)

                 ;; Use Emacs' native notion of word characters
                 ("\\\\[Ww]" (match-string 0))

                 ;; Any other escaped character
                 ("\\\\\\(.\\)" (regexp-quote (match-string 1)))

                 ;; Any normal character
                 ("." (match-string 0))))))
          (push translated accum)))
      (apply 'concat (reverse accum)))))

(defun pcre-re-or-class-token-to-elisp ()
  "Consume the PCRE token at point and return its Elisp equivalent.

Handles only tokens which have the same meaning in character
classes as outside them."
  (pcre-token-case
   ("\\\\a" (char-to-string #x07))  ; bell
   ("\\\\c\\(.\\)"                  ; control character
    (char-to-string
     (- (string-to-char (upcase (match-string 1))) 64)))
   ("\\\\e" (char-to-string #x1b))  ; escape
   ("\\\\f" (char-to-string #x0c))  ; formfeed
   ("\\\\n" (char-to-string #x0a))  ; linefeed
   ("\\\\r" (char-to-string #x0d))  ; carriage return
   ("\\\\t" (char-to-string #x09))  ; tab
   ("\\\\x\\([A-Za-z0-9]\\{2\\}\\)"
    (char-to-string (string-to-number (match-string 1) 16)))
   ("\\\\x{\\([A-Za-z0-9]*\\)}"
    (char-to-string (string-to-number (match-string 1) 16)))))

(defun pcre-char-class-to-elisp ()
  "Consume the remaining PCRE character class at point and return its Elisp equivalent.

Point should be after the opening \"[\" when this is called, and
will be just after the closing \"]\" when it returns."
  (let ((accum '("["))
        (pcre-char-class-alternatives '())
        (negated nil))
    (when (looking-at "\\^")
      (setq negated t)
      (push "^" accum)
      (forward-char))
    ;; A "]" right after the opening bracket is a literal "]"
    (when (looking-at "\\]") (push "]" accum) (forward-char))

    (while (not (looking-at "\\]"))
      (let ((translated
             (or
              (pcre-re-or-class-token-to-elisp)
              (pcre-token-case
               ;; Backslash + digits => always an octal char
               ("\\\\\\([0-7]\\{1,3\\}\\)"
                (char-to-string (string-to-number (match-string 1) 8)))

               ;; Various character classes. To implement negative char classes,
               ;; we cons them onto the list `pcre-char-class-alternatives' and
               ;; transform the char class into a shy group with alternation
               ("\\\\d" "0-9")
               ("\\\\D" (push (if negated "[0-9]" "[^0-9]")
                              pcre-char-class-alternatives) "")
               ("\\\\h" pcre-horizontal-whitespace-chars)
               ("\\\\H" (push (if negated
                                  pcre-horizontal-whitespace
                                pcre-non-horizontal-whitespace)
                              pcre-char-class-alternatives) "")
               ("\\\\s" pcre-whitespace-chars)
               ("\\\\S" (push (if negated
                                  pcre-whitespace
                                pcre-non-whitespace)
                              pcre-char-class-alternatives) "")
               ("\\\\v" pcre-vertical-whitespace-chars)
               ("\\\\V" (push (if negated
                                  pcre-vertical-whitespace
                                pcre-non-vertical-whitespace)
                              pcre-char-class-alternatives) "")
               ("\\\\w" (push (if negated "\\W" "\\w")
                              pcre-char-class-alternatives) "")
               ("\\\\W" (push (if negated "\\w" "\\W")
                              pcre-char-class-alternatives) "")

               ;; Leave POSIX syntax unchanged
               ("\\[:[a-z]*:\\]" (match-string 0))

               ;; Ignore other escapes
               ("\\\\\\(.\\)" (match-string 0))

               ;; Copy everything else
               ("." (match-string 0))))))
        (push translated accum)))
    (push "]" accum)
    (forward-char)
    (let ((class
           (apply 'concat (reverse accum))))
      (when (or (equal class "[]")
                (equal class "[^]"))
        (setq class ""))
      (if (not pcre-char-class-alternatives)
          class
        (concat "\\(?:"
                class "\\|"
                (mapconcat 'identity
                           pcre-char-class-alternatives
                           "\\|")
                "\\)")))))
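I made a few minor modifications to a perl script I found on perlmonks (so that it takes its input from the command line) and saved it as re_pl2el.pl (given below). With that in place, the following does a decent job of converting PCRE to elisp regexps, at least for the non-exotic cases I tested.

(defun pcre-to-elre (regex)
  (interactive "MPCRE expression: ")
  (shell-command-to-string (concat "re_pl2el.pl -i -n "
                                   (shell-quote-argument regex))))

(pcre-to-elre "__\\w: \\d+") ;-> "__[[:word:]]: [[:digit:]]+"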
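It doesn't handle a few "corner" cases such as perl's shy {N,M}? constructs, and of course it doesn't handle code execution and the like, but it might serve your needs or be a good starting point. Since you like PCRE, I presume you know enough perl to fix any cases you use often. If not, let me know and we can probably fix them.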
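For what it's worth, the converted string reported above can be fed straight to Emacs's regexp search. Note that `search-forward` looks for a literal string, so a regexp search needs `re-search-forward`. The sketch below is a minimal illustration; `my-converted-regexp` and `my-find-converted` are hypothetical names, and the sample text is made up.

```elisp
;; The result quoted above for the PCRE pattern "__\w: \d+".
(defconst my-converted-regexp "__[[:word:]]: [[:digit:]]+")

(defun my-find-converted (text)
  "Return the first match of `my-converted-regexp' in TEXT, or nil."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    ;; `re-search-forward' interprets its argument as a regexp;
    ;; the third argument t makes it return nil instead of signaling.
    (when (re-search-forward my-converted-regexp nil t)
      (match-string 0))))

(my-find-converted "see __k: 2024 and more")  ; => "__k: 2024"
```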
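I would be happier with a script that parsed the regex into an AST and then spat it back out in elisp format (since it could then spit it out in rx format too), but I couldn't find anything that does that, and it seemed like a lot of work when I should be working on my thesis. :-) I find it hard to believe that no one has done it, though.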
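For readers who haven't met rx: it is Emacs's built-in S-expression notation for regexps. Below is a minimal hand-written sketch of the question's pattern (PCRE "__\w: \d+") in rx form; it was not produced by any converter, and `my-rx-pattern` and `my-rx-demo` are names invented for this example.

```elisp
(require 'rx)

;; "__", one word character, ": ", then one or more digits.
(defconst my-rx-pattern
  (rx "__" wordchar ": " (one-or-more digit)))

(defun my-rx-demo (text)
  "Return the first match of `my-rx-pattern' in TEXT, or nil."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (when (re-search-forward my-rx-pattern nil t)
      (match-string 0))))

(my-rx-demo "x = __a: 42;")  ; => "__a: 42"
```

Because an rx form is ordinary list structure, an AST-based converter could emit it directly and let `rx` produce the final regexp string.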
Below is my "improved" version of re_pl2el.pl. -i means don't double-escape for strings, and -n means don't print a final newline.
#! /usr/bin/perl
#
# File: re_pl2el.pl
# Modified from http://perlmonks.org/?node_id=796020
#
# Description:
#
use strict;
use warnings;

# version 0.4

# TODO
# * wrap converter to function
# * testsuite

#--- flags
my $flag_interactive; # true => no extra escaping of backslashes
if ( int(@ARGV) >= 1 and $ARGV[0] eq '-i' ) {
    $flag_interactive = 1;
    shift @ARGV;
}

if ( int(@ARGV) >= 1 and $ARGV[0] eq '-n' ) {
    shift @ARGV;
} else {
    $\="\n";
}

if ( int(@ARGV) < 1 ) {
    print "usage: $0 [-i] [-n] REGEX";
    exit;
}

my $RE='\w*(a|b|c)\d\(';
$RE='\d{2,3}';
$RE='"(.*?)"';
$RE="\0".'\"\t(.*?)"';
$RE=$ARGV[0];

# print "Perlcode:\t $RE";

#--- encode all \0 chars as escape sequence
$RE=~s#\0#\\0#g;

#--- substitute pairs of backslashes with \0
$RE=~s#\\\\#\0#g;

#--- hide escape sequences of \t,\n,... with
#    corresponding ascii code
my %ascii=(
    t =>"\t",
    n=> "\n"
);
my $kascii=join "|",keys %ascii;

$RE=~s#\\($kascii)#$ascii{$1}#g;

#--- normalize needless escaping
# e.g. from /\"/ to /"/, since it's no difference in perl
# but might confuse elisp
$RE=~s#\\"#"#g;

#--- toggle escaping of 'backslash constructs'
my $bsc='(){}|';
$RE=~s#[$bsc]#\\$&#g;  # escape them once
$RE=~s#\\\\##g;        # and erase double-escaping

#--- replace character classes
my %charclass=(
    w => 'word' ,   # TODO: emacs22 already knows \w ???
    d => 'digit',
    s => 'space'
);

my $kc=join "|",keys %charclass;
$RE=~s#\\($kc)#[[:$charclass{$1}:]]#g;

#--- unhide pairs of backslashes
$RE=~s#\0#\\\\#g;

#--- escaping for elisp string
unless ($flag_interactive){
    $RE=~s#\\#\\\\#g; # ... backslashes
    $RE=~s#"#\\"#g;   # ... quotes
}

#--- unhide escape sequences of \t,\n,...
my %rascii= reverse %ascii;
my $vascii=join "|",keys %rascii;
$RE=~s#($vascii)#\\$rascii{$1}#g;

# print "Elispcode:\t $RE";
print "$RE";
#TODO whats the elisp syntax for \0 ???
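The closest previous work on this has been extensions to M-x re-builder, see
http://www.emacswiki.org/emacs/ReBuilder
or Ye Wenbin's work on PDE:
http://cpansearch.perl.org/src/YEWENBIN/Emacs-PDE-0.2.16/lisp/doc/pde.html

Possibly relevant is visual-regexp-steroids, which extends query-replace with a live preview and lets you use different regexp backends, including PCRE.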