NodeJS RTF ANSI: find and replace words with special characters



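I have a find-and-replace script that works fine when the words contain no special characters. However, special characters will come up often, because the script is looking for names. Right now they break the script.

The script looks for {<some-text>} and tries to replace the contents (and remove the braces).

Example:

text.rtf

Here's a name with special char {Kotouč}

script.ts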

import * as fs from "fs";

// Ingest the rtf file.
const content: string = fs.readFileSync("./text.rtf", "utf8");
console.log("content::\n", content);
// The string we are looking to match in file text.
const plainText: string = "{Kotouč}";
// Look for all text that matches the pattern `{TEXT_HERE}`.
const anyMatchPattern: RegExp = /\{(.*?)\}/gi;
const matches: string[] = content.match(anyMatchPattern) || [];
const matchesLen: number = matches.length;
for (let i: number = 0; i < matchesLen; i++) {
  // It correctly identifies the targeted text.
  const currMatch: string = matches[i];
  // Skip RTF header groups such as the font and color tables.
  const isRtfMetadata: boolean = currMatch.endsWith(";}");
  if (isRtfMetadata) {
    continue;
  }
  // Here I need a way to escape `plainText` string so that it matches the source.
  console.log("currMatch::", currMatch);
  console.log("currMatch === plainText::", currMatch === plainText);
  if (currMatch === plainText) {
    const newContent: string = content.replace(currMatch, "IT_WORKS!");
    console.log("newContent:", newContent);
  }
}

Output

content::
{\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0
\f0\fs24 \cf0 Here's a name with special char \{Kotou\uc0\u269 \}.}
currMatch:: {Kotou\uc0\u269 \}
currMatch === plainText:: false

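It looks like ANSI escaping. I tried jsesc, but that produces a different string: {Kotou\u010D} instead of the {Kotou\uc0\u269 \} that the document contains.

How can I dynamically escape the plainText string variable so that it matches what is in the document?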

What I needed was to deepen my knowledge of the rtf format and of text encoding in general.

The raw RTF text read from the file gives us a few hints:

{\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600...

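This part of the rtf file metadata tells us a few things.

It is using RTF file format version 1. The encoding is ANSI, and specifically cpg1252, also known as Windows-1252 or CP-1252, which is:

...a single-byte character encoding of the Latin alphabet

(source)

The valuable piece of information here is that the document uses the Latin alphabet, which comes into play later.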
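As a small illustration of reading that header, here is a minimal sketch that pulls the declared code page out of the raw text. The helper name `getAnsiCodePage` is made up for this example, and it assumes the header uses the `\ansicpgN` control word as in the sample above.

```typescript
// Hypothetical helper (not part of the original script): read the ANSI code
// page declared in an RTF header, or return null when none is declared.
const getAnsiCodePage = (rtf: string): number | null => {
  // \ansicpgN declares the code page; N is a decimal number such as 1252.
  const match = /\\ansicpg(\d+)/.exec(rtf);
  return match ? Number(match[1]) : null;
};

const header: string = "{\\rtf1\\ansi\\ansicpg1252\\cocoartf1671\\cocoasubrtf600";
console.log(getAnsiCodePage(header)); // 1252
```

A header without `\ansicpg` yields `null`, so a caller can decide on a fallback instead of guessing.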

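Knowing the specific RTF version in use, I stumbled upon the RTF 1.5 spec.

A quick search of that spec for one of the escape sequences I was looking at showed that \uc0 is an RTF-specific control sequence: it declares how many fallback characters follow each unicode escape (here, zero). Knowing that, I could parse out what I was really after, \u269. Now I knew it was unicode, and I had a good hunch that \u269 stood for unicode character code 269. So I looked that up...

\u269 (character code 269) does show up on this page, which confirms it. Now I knew the character set and what had to be done to get the equivalent plain (unescaped) text, and I used a basic SO post here to get the function started.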
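To make the lookup concrete, here is a minimal sketch of decoding such an escape. `decodeRtfUnicode` is a hypothetical name, and the sketch assumes the text uses \uc0, so no fallback characters follow the escape. The RTF spec defines the \u argument as a signed 16-bit decimal, so negative values are wrapped back into the 0 to 65535 range.

```typescript
// Hypothetical helper: decode RTF \uN escapes into plain characters.
// Assumes \uc0 is in effect, i.e. no fallback characters follow each \uN.
const decodeRtfUnicode = (rtf: string): string =>
  rtf
    // Drop the \ucN control words (plus one delimiting space, if present).
    .replace(/\\uc\d+ ?/g, "")
    // \uN holds a signed 16-bit decimal; negative values wrap around by 65536.
    .replace(/\\u(-?\d+) ?/g, (_match: string, digits: string): string => {
      const n: number = Number(digits);
      return String.fromCharCode(n < 0 ? n + 65536 : n);
    });

console.log(decodeRtfUnicode("Kotou\\uc0\\u269 ")); // Kotouč
```

The single space after the number is consumed because RTF treats it as the delimiter that ends the control word, not as document text.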

With all of that knowledge I was able to piece the rest together. Here is the full corrected script and its output:

script.ts

import * as fs from "fs";

// Match the RTF unicode control prefix, with or without a leading \uc0: http://www.biblioscape.com/rtf15_spec.htm
const unicodeControlReg: RegExp = /(?:\\uc0)?\\u/g;
// Extracts the unicode character from an escape sequence with handling for rtf.
const matchEscapedChars: RegExp = /\\uc0\\u(\d{2,6})|\\u(\d{2,6})/g;
/**
 * Util function to strip junk characters (whitespace and backslashes) from string for comparison.
 * @param {string} str
 * @returns {string}
 */
const cleanupRtfStr = (str: string): string => {
  return str
    .replace(/\s/g, "")
    .replace(/\\/g, "");
};
/**
 * Detects escaped unicode and looks up the character by that code.
 * @param {string} str
 * @returns {string}
 */
const unescapeString = (str: string): string => {
  const unescaped = str.replace(matchEscapedChars, (cc: string) => {
    // Strip the control prefix so only the decimal character code remains.
    const stripped: string = cc.replace(unicodeControlReg, "");
    const charCode: number = Number(stripped);
    // See unicode character codes here:
    //  https://unicodelookup.com/#latin/11
    return String.fromCharCode(charCode);
  });
  // Whitespace is removed afterwards, in cleanupRtfStr.
  return unescaped;
};
// Ingest the rtf file.
const content: string = fs.readFileSync("./src/TEST.rtf", "binary");
console.log("content::\n", content);
// The string we are looking to match in file text.
const plainText: string = "{Kotouč}";
// Look for all text that matches the pattern `{TEXT_HERE}`.
const anyMatchPattern: RegExp = /\{(.*?)\}/gi;
const matches: string[] = content.match(anyMatchPattern) || [];
const matchesLen: number = matches.length;
for (let i: number = 0; i < matchesLen; i++) {
  const currMatch: string = matches[i];
  // Skip RTF header groups such as the font and color tables.
  const isRtfMetadata: boolean = currMatch.endsWith(";}");
  if (isRtfMetadata) {
    continue;
  }
  // Plain ASCII names match without any unescaping.
  if (currMatch === plainText) {
    const newContent: string = content.replace(currMatch, "IT_WORKS!");
    console.log("\n\nnewContent:", newContent);
    break;
  }
  // Otherwise unescape the match and strip the RTF leftovers before comparing.
  const unescapedMatch: string = unescapeString(currMatch);
  const cleanedMatch: string = cleanupRtfStr(unescapedMatch);
  if (cleanedMatch === plainText) {
    const newContent: string = content.replace(currMatch, "IT_WORKS_UNESCAPED!");
    console.log("\n\nnewContent:", newContent);
    break;
  }
}

Output

content::
{\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\partightenfactor0
\f0\fs24 \cf0 Here\'92s a name with special char \{Kotou\uc0\u269 \}}

newContent: {\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\partightenfactor0
\f0\fs24 \cf0 Here\'92s a name with special char \IT_WORKS_UNESCAPED!}
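For comparison, the question as asked (escaping plainText so that it matches the document) can also be approached from the opposite direction. The following is only a sketch of that alternative, not the method used in the script above. `toRtfText` is a hypothetical helper, and it assumes the writer emits every non-ASCII character as `\uc0\uN ` and escapes literal braces and backslashes with a backslash, as RTF requires. Writers may instead use a hex escape for characters that the code page can represent (the apostrophe in the second sample is written with the hex code 92), and text written that way would not match.

```typescript
// Hypothetical helper: write a plain string the way the sample document
// writes it, so it can be searched for directly in the raw RTF text.
const toRtfText = (plain: string): string => {
  let out: string = "";
  for (let i = 0; i < plain.length; i++) {
    const ch: string = plain.charAt(i);
    const code: number = plain.charCodeAt(i);
    if (ch === "\\" || ch === "{" || ch === "}") {
      // RTF escapes its own syntax characters with a backslash.
      out += "\\" + ch;
    } else if (code > 127) {
      // \uN takes a signed 16-bit decimal; \uc0 declares no fallback characters.
      out += "\\uc0\\u" + (code > 32767 ? code - 65536 : code) + " ";
    } else {
      out += ch;
    }
  }
  return out;
};

// Assumed sample body, shaped like the last line of the document above.
const body: string = "a name with special char \\{Kotou\\uc0\\u269 \\}}";
const needle: string = toRtfText("{Kotou\u010D}"); // "{Kotouč}"
console.log(needle); // \{Kotou\uc0\u269 \}
console.log(body.split(needle).join("IT_WORKS!")); // a name with special char IT_WORKS!}
```

Because the search string includes the escaped braces, the replacement consumes their backslashes as well, and no stray backslash is left in front of the new text.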

Hopefully this helps others who are unfamiliar with character encoding and escaping, and with how they are used in rtf-formatted documents!
