Lua pattern matching



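I have a file that was produced by extracting values from a Microsoft Lync conversation, and the values still carry RTF formatting tags. An example of the file is below: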

Segoe UI{\colortbl;\red0\green0\blue0;}{\*\generator Riched20 15.0.4420}{\*\mmathPr\mwrapIndent1440}\viewkind4\uc1\pard \embo\f0\fs20 Craig...\embo0 \embo please\embo0 \embo close\embo0 \embo >out\embo0 \embo of\embo0 \embo your\embo0 \embo old\embo0 \embo client\embo0 \embo >and\embo0 \embo re-open\embo0\f1\par{\*\lyncflags rtf=1}}

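Using a Lua script, I am trying to remove the RTF tags and extract only the text of the conversation. So the result of my function should be: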

Craig... please close out of your old client and re-open

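I tried using string.gsub with a regular expression to match the tags and replace them with a blank space, leaving only the text, but it is not working. This is the string.gsub call I have so far: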

result = string.gsub(s, "{*?\[^{}]+}|[{}]|\n?[A-Za-z]+n?(?:-?d+)?[ ]?", " ")

Any suggestions would be greatly appreciated!

Additional samples:

user1@capital.com@2013-01-18 17:48:03Z (to: user2@capital.com)

{\rtf1\fbidis\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil\fcharset0 Segoe UI;}{\f1\fnil Segoe UI;}}{\colortbl;\red0\green0\blue0;}{\*\generator Riched20 15.0.4420}{\*\mmathPr\mwrapIndent1440}\viewkind4\uc1\pard\cf1\embo\f0\fs20 works\embo0 \embo for\embo0 \embo me..\embo0 \embo how\embo0 \embo about\embo0 \embo embedding\embo0 \embo pictures?\embo0\f1\par{\*\lyncflags rtf=1}}

user1@capital.com@2013-01-18 17:48:57Z (to: user2@capital.com)

{\rtf1\fbidis\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil\fcharset0 Segoe UI;}{\f1\fnil Segoe UI;}}{\colortbl;\red0\green0\blue0;}{\*\generator Riched20 15.0.4420}{\*\mmathPr\mwrapIndent1440}\viewkind4\uc1\pard{\*\lyncflags rtf=1}}

user1@capital.com@2013-01-18 17:49:27Z (to: user2@capital.com)

{\rtf1\fbidis\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil\fcharset0 Segoe UI;}{\f1\fnil Segoe UI;}}{\colortbl;\red0\green0\blue0;}{\*\generator Riched20 15.0.4420}{\*\mmathPr\mwrapIndent1440}\viewkind4\uc1\pard\cf1\embo\f0\fs20 let's\embo0 \embo try\embo0 \embo a\embo0 \embo meeting.\embo0\f1\par{\*\lyncflags rtf=1}}

Lua patterns have no alternation operator (|) and no optional groups ((?:...)?). Something like this might work:

s:match("{(.+)}"):gsub("%b{}", ""):gsub("\\%w+", "")

For the first sample this returns:

"    Craig...  please  close  >out  of  your  old  client  >and  re-open "

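The first gsub removes every balanced {} pair together with its contents, and the second gsub removes the RTF control words (although some control words appear to allow a space inside them, so you may need to adjust the pattern).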
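Because Lua patterns cannot express "A or B" in one pattern, each alternative of the original regex can be applied as its own gsub pass. The following is only a minimal sketch of that idea, not a full RTF parser: `strip_rtf` is a hypothetical helper name, the shortened `sample` string is made up for illustration, and the control-word pattern assumes a backslash followed by letters and optional digits.

```lua
-- Hypothetical helper (not part of any library): strips RTF markup in passes.
local function strip_rtf(s)
  -- Keep only what sits between the outermost braces (fall back to s if none).
  local body = s:match("{(.+)}") or s
  -- Pass 1: drop balanced {...} groups such as the font and colour tables.
  body = body:gsub("%b{}", "")
  -- Pass 2: drop control words: a backslash, letters, then optional digits.
  body = body:gsub("\\%a+%d*", "")
  -- Pass 3: collapse runs of whitespace and trim both ends.
  body = body:gsub("%s+", " "):gsub("^ ", ""):gsub(" $", "")
  return body
end

-- Shortened, made-up sample; long brackets keep the backslashes literal.
local sample = [[{\rtf1\ansi{\fonttbl{\f0\fnil Segoe UI;}} \pard\embo\f0\fs20 Craig...\embo0 \embo please\embo0 \embo close\embo0\f1\par {\*\lyncflags rtf=1}}]]
print(strip_rtf(sample))  --> Craig... please close
```

Assigning each gsub result to a single variable discards the substitution count that gsub returns as its second value, so the helper returns just the cleaned string.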

Try this:

-- Long brackets ([[...]]) keep the RTF backslashes literal, so nothing needs escaping.
local s = [[{\rtf1\fbidis\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil\fcharset0 >Segoe UI;}{\f1\fnil Segoe UI;}} {\colortbl ;\red0\green0\blue0;} {\*\generator Riched20 15.0.4420}{\*\mmathPr\mwrapIndent1440 }\viewkind4\uc1 \pard\cf1\embo\f0\fs20 Craig...\embo0 \embo please\embo0 \embo close\embo0 \embo >out\embo0 \embo of\embo0 \embo your\embo0 \embo old\embo0 \embo client\embo0 \embo >and\embo0 \embo re-open\embo0\f1\par {\*\lyncflags rtf=1}}]] .. '\n'
    .. [[{\rtf1\fbidis\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil\fcharset0 Segoe UI;}{\f1\fnil Segoe UI;}} {\colortbl ;\red0\green0\blue0;} {\*\generator Riched20 15.0.4420}{\*\mmathPr\mwrapIndent1440 }\viewkind4\uc1 \pard\cf1\embo\f0\fs20 works\embo0 \embo for\embo0 \embo me..\embo0 \embo how\embo0 \embo about\embo0 \embo embedding\embo0 \embo pictures?\embo0\f1\par {\*\lyncflags rtf=1}}]] .. '\n'
    .. [[{\rtf1\fbidis\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil\fcharset0 Segoe UI;}{\f1\fnil Segoe UI;}} {\colortbl ;\red0\green0\blue0;} {\*\generator Riched20 15.0.4420}{\*\mmathPr\mwrapIndent1440 }\viewkind4\uc1 \pard\cf1\embo\f0\fs20 I\embo0 \embo see\embo0 \embo it\embo0\f1\par {\*\lyncflags rtf=1}}]] .. '\n'
    .. [[{\rtf1\fbidis\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil\fcharset0 Segoe UI;}{\f1\fnil Segoe UI;}} {\colortbl ;\red0\green0\blue0;} {\*\generator Riched20 15.0.4420}{\*\mmathPr\mwrapIndent1440 }\viewkind4\uc1 \pard\cf1\embo\f0\fs20 let's\embo0 \embo try\embo0 \embo a\embo0 \embo meeting.\embo0\f1\par {\*\lyncflags rtf=1}}]] .. '\n'
local text = s:gsub('{(.-)}[}]?', '')                     -- drop the {...} groups (font table, colour table, ...)
    :gsub('\\%a+%d*', ''):gsub('>', '')                   -- drop leftover control words (\embo0, \par, ...) and stray ">"
    :gsub(' +', ' '):gsub(' ?\n ?', '\n'):gsub('^ ', '')  -- tidy the whitespace
print(text)

Output

Craig... please close out of your old client and re-open
works for me.. how about embedding pictures?
I see it
let's try a meeting.
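One caveat about the lazy pattern `{(.-)}[}]?` used above: it stops at the first closing brace and then swallows at most one more, so it copes with the two-level nesting in these samples but not with deeper nesting, whereas `%b{}` matches balanced braces at any depth. The short comparison below is only an illustration; the `nested` string is a made-up input, not Lync data.

```lua
-- Made-up input with three levels of nesting.
local nested = "a{b{c{d}}}e"

-- Lazy match: ends at the first "}", then optionally eats one more "}".
local lazy = nested:gsub("{(.-)}[}]?", "")
-- Balanced match: %b{} consumes the whole group, however deep it nests.
local balanced = nested:gsub("%b{}", "")

print(lazy)      --> a}e
print(balanced)  --> ae
```

If the RTF you receive can nest groups more deeply than in the samples, the `%b{}` form is probably the safer choice.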
