How do I convert HTML into an object structure with text and formatting?



I need to convert an HTML string with nested tags, like this:

const strHTML = "<p>Hello World</p><p>I am a text with <strong>bold</strong> word</p><p><strong>I am bold text with nested <em>italic</em> Word.</strong></p>"

into the following array:

const result = [{
  text: "Hello World",
  format: null
}, {
  text: "I am a text with",
  format: null
}, {
  text: "bold",
  format: ["strong"]
}, {
  text: " word",
  format: null
}, {
  text: "I am bold text with nested",
  format: ["strong"]
}, {
  text: "italic",
  format: ["strong", "em"]
}, {
  text: "Word.",
  format: ["strong"]
}];

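As long as there are no nested tags, I can manage the conversion with DOMParser(). I can't get it to work with nested tags, as in the last paragraph: my whole paragraph comes out bold, but the word "italic" should be both bold and italic. I can't get it to run recursively.

Any help would be appreciated.

The code I have written so far looks like this: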

export interface Phrase {
  text: string;
  format: string | string[] | null;
}

export class HTMLParser {
  public phrasesProcessed: Phrase[] = [];

  public parse(text: string): void {
    const parser = new DOMParser();
    const sourceDocument = parser.parseFromString(text, "text/html");
    this.parseChildren(sourceDocument.body.childNodes);
    // HERE SHOULD BE the result
    console.log("RESULT of CONVERSION", this.phrasesProcessed);
  }

  // Only looks one level below each top-level node, so nested tags are not handled
  private parseChildren(toParse: NodeListOf<ChildNode>) {
    this.phrasesProcessed = [];
    try {
      Array.from(toParse)
        .map(item => {
          if (item.nodeType === Node.ELEMENT_NODE && item instanceof HTMLElement) {
            return Array.from(item.childNodes).map(child => ({ text: child.textContent ?? "", format: (child.nodeType === Node.ELEMENT_NODE && child instanceof HTMLElement) ? child.tagName : null }));
          } else {
            return Array.from(item.childNodes).map(child => ({ text: child.textContent ?? "", format: null }));
          }
        })
        .filter(line => line.length) // only non-empty arrays
        .map(element => ([...element, { text: "\n", format: null }])) // add line break after each P
        .reduce((acc: Phrase[], val) => acc.concat(val), []) // flatten
        .forEach(element => {
          // console.log("ELEMENT", element);
          this.phrasesProcessed.push(element);
        });
    } catch (e) {
      console.warn(e);
    }
  }
}

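You can use recursion. This looks like a good case for a generator function. Since it is not clear which tags should be retained in format (apparently not p), I have made that configurable: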

// Tags that count as formatting; everything else (p, span, ...) is traversed but not recorded
const formatTags = new Set(["b", "big", "code", "del", "em", "i", "pre", "s", "small", "strike", "strong", "sub", "sup", "u"]);

function* iterLeafNodes(nodes, format=[]) {
  for (const node of nodes) {
    if (node.nodeType === Node.TEXT_NODE) {
      // Leaf: emit the text with the formatting tags collected on the way down
      yield ({text: node.nodeValue, format: format.length ? [...format] : null});
    } else if (node.nodeType === Node.ELEMENT_NODE) {
      // Element: recurse, extending the format list if this is a formatting tag
      // (other node types, such as comments, are skipped)
      const tag = node.tagName.toLowerCase();
      yield* iterLeafNodes(node.childNodes,
                           formatTags.has(tag) ? format.concat(tag) : format);
    }
  }
}

// Example input
const strHTML = "<p>Hello World</p><p>I am a text with <strong>bold</strong> word</p><p><strong>I am bold text with nested <em>italic</em> Word.</strong></p>"
const nodes = new DOMParser().parseFromString(strHTML, 'text/html').body.childNodes;
let result = [...iterLeafNodes(nodes)];
console.log(result);

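Note that the text is still split into separate entries when it is spread across several tags that are not treated as formatting tags, such as span.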
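
If that splitting is unwanted, one option is a post-processing step that joins neighbouring entries with identical formatting. The following is only a sketch: `mergeAdjacent` and `sameFormat` are hypothetical helpers, not part of the answer's code, and the sample `entries` array is written by hand in the shape that the generator would yield for `<p>a<span>b</span><em>c</em></p>`.

```javascript
// Hypothetical helper: two format values match if they hold the same tags
// in the same order (null is treated like an empty list).
const sameFormat = (a, b) => {
  const x = a ?? [];
  const y = b ?? [];
  return x.length === y.length && x.every((tag, i) => tag === y[i]);
};

// Hypothetical helper: join consecutive entries whose formatting is identical.
function mergeAdjacent(entries) {
  const merged = [];
  for (const { text, format } of entries) {
    const last = merged[merged.length - 1];
    if (last && sameFormat(last.format, format)) {
      last.text += text; // same formatting: extend the previous entry
    } else {
      merged.push({ text, format }); // formatting changed: start a new entry
    }
  }
  return merged;
}

// Hand-written sample in the {text, format} shape used above
const entries = [
  { text: "a", format: null },
  { text: "b", format: null }, // came from inside a <span>
  { text: "c", format: ["em"] },
];

console.log(mergeAdjacent(entries));
```

Because paragraph boundaries are not recorded in the entries, this sketch would also join unformatted text from two consecutive paragraphs, so it only fits cases where that is acceptable.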

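Secondly, I am not convinced that null is a more useful value for format than an empty array [], but in any case the code above produces null when no formatting applies.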
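
To illustrate the difference, here is a small sketch of how a consumer of the result might check for a tag under each convention. The names `isBold`, `isBoldNullable` and `normalize` are made up for this example.

```javascript
const plainWithArray = { text: "Hello World", format: [] };
const plainWithNull = { text: "Hello World", format: null };
const boldEntry = { text: "bold", format: ["strong"] };

// With an array-valued format, array methods can be called directly.
const isBold = (entry) => entry.format.includes("strong");

// With a nullable format, every consumer needs a guard first.
const isBoldNullable = (entry) =>
  entry.format !== null && entry.format.includes("strong");

// A nullable result can be normalized once, up front.
const normalize = (entry) => ({ ...entry, format: entry.format ?? [] });

console.log(isBold(plainWithArray));           // no guard needed
console.log(isBoldNullable(plainWithNull));    // guard required
console.log(isBold(normalize(plainWithNull))); // guard moved into normalize
console.log(isBold(boldEntry));
```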

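Special case - inserting \n

In the comments, it was requested to insert a line break after each p element.

The code below generates that extra entry. Here I have also used [] instead of null for format: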

const formatTags = new Set(["b", "big", "code", "del", "em", "i", "pre", "s", "small", "strike", "strong", "sub", "sup", "u"]);

function* iterLeafNodes(nodes, format=[]) {
  for (const node of nodes) {
    if (node.nodeType === Node.TEXT_NODE) {
      yield ({text: node.nodeValue, format: [...format]});
    } else if (node.nodeType === Node.ELEMENT_NODE) {
      const tag = node.tagName.toLowerCase();
      yield* iterLeafNodes(node.childNodes,
                           formatTags.has(tag) ? format.concat(tag) : format);
      // After all children of a p element have been emitted, add a line break entry
      if (tag === "p") yield ({text: "\n", format: [...format]});
    }
  }
}

// Example input
const strHTML = "<p>Hello World</p><p>I am a text with <strong>bold</strong> word</p><p><strong>I am bold text with nested <em>italic</em> Word.</strong></p>"
const nodes = new DOMParser().parseFromString(strHTML, 'text/html').body.childNodes;
let result = [...iterLeafNodes(nodes)];
console.log(result);

You can traverse the child nodes recursively and build the desired array with the help of a list of formatting tags such as FORMAT_NODES.

const FORMAT_NODES = ["strong", "em"];

function getText(node, parents = [], res = []) {
  if (node.nodeName === "#text") {
    // Text node: skip whitespace-only text, otherwise record it with its formatting ancestors
    const text = node.textContent.trim();
    if (text) {
      const format = parents.filter((p) => FORMAT_NODES.includes(p));
      res.push({ text, format: format.length ? format : null });
    }
  } else {
    // Element: recurse into each child, adding the child's own node name to the ancestor list
    node.childNodes.forEach((child) =>
      getText(child, parents.concat(child.nodeName.toLowerCase()), res)
    );
  }
  return res;
}

const container = document.querySelector("#container");
const result = getText(container);
console.log(result);

<div id="container">
  <p>Hello World</p>
  <p>I am a text with <strong>bold</strong> word</p>
  <p><strong>I am bold text with nested <em>italic</em> Word.</strong></p>
</div>

Relevant documentation:

  • Node.childNodes
  • Node.parentNode
  • Array.prototype.concat
  • Array.prototype.includes
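
One detail of getText above is that trim() removes the spaces at the edges of every text node, so "I am a text with " and " word" lose the spaces that separate them from "bold". If the spacing matters, a variant could keep the text as is and only skip whitespace-only nodes. The sketch below assumes that behaviour is wanted; `getTextKeepSpaces`, `FORMAT_TAGS`, `textNode` and `el` are hypothetical names, and the plain objects merely stand in for DOM nodes so the example runs outside a browser.

```javascript
const FORMAT_TAGS = ["strong", "em"]; // assumed configuration, same idea as FORMAT_NODES

// Variant of getText that keeps the original spacing of each text node
// and only skips nodes that consist entirely of whitespace.
function getTextKeepSpaces(node, parents = [], res = []) {
  if (node.nodeName === "#text") {
    if (node.textContent.trim() !== "") {
      const format = parents.filter((p) => FORMAT_TAGS.includes(p));
      res.push({ text: node.textContent, format: format.length ? format : null });
    }
  } else {
    for (const child of node.childNodes) {
      getTextKeepSpaces(child, parents.concat(child.nodeName.toLowerCase()), res);
    }
  }
  return res;
}

// Plain-object stand-ins for DOM nodes (hypothetical, for running without a DOM)
const textNode = (textContent) => ({ nodeName: "#text", textContent, childNodes: [] });
const el = (name, ...childNodes) => ({ nodeName: name.toUpperCase(), childNodes });

const tree = el("div",
  textNode("\n  "), // indentation between tags: skipped
  el("p", textNode("I am a text with "), el("strong", textNode("bold")), textNode(" word")),
  textNode("\n")
);

console.log(getTextKeepSpaces(tree));
```

In a browser the same function can be called with a real element, for example the #container element from the snippet above.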

This version is not very different from the other two posted here, but it breaks down the responsibilities differently.

// Collect every text node below `node`, each with the list of ancestor tag names leading to it
const getTextNodes = (node, path = []) =>
  node .nodeType === 3
    ? {text: node .nodeValue, path}
    : [... node .childNodes] .flatMap ((child) => getTextNodes (child, [... path, node .tagName .toLowerCase()]))

// Parse the HTML, collect the text nodes, then reduce each path to the tags listed in `keep`
const extract = (keep) => (html) =>
  [...new DOMParser () .parseFromString (html, 'text/html') .body .childNodes]
    .flatMap (node => getTextNodes (node))
    .map (({text, path = []}) => ({text, format: [...new Set (path .filter (p => keep .includes (p)))]}))

const reformat = extract (["em", "strong"])

const strHTML = "<p>Hello World</p><p>I am a text with <strong>bold</strong> word</p><p><strong>I am bold text with nested <em>italic</em> Word.</strong></p>"

console .log (reformat (strHTML))

This goes through an intermediate format, which may be useful for other purposes:

[
  {text: "Hello World", path: ["p"]},
  {text: "I am a text with ", path: ["p"]},
  {text: "bold", path: ["p", "strong"]},
  {text: " word", path: ["p"]},
  {text: "I am bold text with nested ", path: ["p", "strong"]},
  {text: "italic", path: ["p", "strong", "em"]},
  {text: " Word.", path: ["p", "strong"]}
]

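Although this looks similar to your final format, path holds the entire tag history down to the text node and can be used for a variety of purposes. getTextNodes extracts this format from a given node. So a path might look like ["div", "div", "div", "nav", "ol", "li", "a", "div", "div", "strong"], with repeated elements and many non-formatting tags.

The final map call in extract simply filters this path down to the set of formatting tags you configured.

While we could easily do all of this in a single pass, getTextNodes is a useful function in its own right that we might reuse elsewhere in a system.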
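
As a rough illustration of that reuse, the sketch below starts from records in the intermediate {text, path} shape and derives two different views from them. The records are copied by hand from the listing above so that the example runs without a DOM; `toFormat` and `textInside` are hypothetical names, with `toFormat` mirroring the final map step of extract.

```javascript
// Records in the intermediate {text, path} shape, copied from the listing above
const textNodes = [
  {text: "Hello World", path: ["p"]},
  {text: "I am a text with ", path: ["p"]},
  {text: "bold", path: ["p", "strong"]},
  {text: " word", path: ["p"]},
  {text: "I am bold text with nested ", path: ["p", "strong"]},
  {text: "italic", path: ["p", "strong", "em"]},
  {text: " Word.", path: ["p", "strong"]}
];

// View 1: the same filtering as the last step of extract
const toFormat = (keep) => ({text, path}) =>
  ({text, format: [...new Set(path.filter((p) => keep.includes(p)))]});

// View 2 (hypothetical other use): all text that sits somewhere inside a given tag
const textInside = (tag) => (records) =>
  records
    .filter(({path}) => path.includes(tag))
    .map(({text}) => text)
    .join("");

console.log(textNodes.map(toFormat(["em", "strong"])));
console.log(textInside("strong")(textNodes));
console.log(textInside("em")(textNodes));
```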
