How to cut HTML while preserving closing tags



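How can I create a preview of a blog post that is stored as HTML? In other words, how can I "cut" the HTML while making sure the tags are closed properly? Currently, I render the whole thing on the front end (using React's dangerouslySetInnerHTML) and then set overflow: hidden and height: 150px. I would much prefer a way to cut the HTML directly. That way I wouldn't need to send the entire HTML to the front end; if I show previews of 10 blog posts, that is a lot of HTML sent that the visitor never even sees.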

If I have this HTML (say this is the entire blog post):

<body>
<h1>Test</h1>
<p>This is a long string of text that I may want to cut.. blah blah blah foo bar bar foo bar bar</p>
</body>

Trying to slice it (to make the preview) doesn't work, because the tags become mismatched:

<body>
<h1>Test</h1>
<p>This is a long string of text <!-- Oops! unclosed tags -->

What I really want is:

<body>
<h1>Test</h1>
<p>This is a long string of text</p>
</body>

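I'm using next.js, so any node.js solution should work fine. Is there a way to do this (e.g. a library that runs on the next.js server side)? Or do I just have to parse the HTML myself (server-side) and then fix the unclosed tags?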

post-preview


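This was a challenging task that kept me struggling for about two days and led me to publish my first NPM package, post-preview, which can solve your problem. Everything is described in its README, but if you want to know how to use it for your specific problem:

First, install the package with NPM, or download its source code from GitHub.

Then you can run it before the user submits their blog post to the server, and send its result (the preview) together with the full post to the backend. There, validate the preview's length and sanitize its HTML, save it to your backend storage (DB etc.), and send it back whenever you want to show users a blog post preview instead of the full post.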
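The backend length check mentioned above could be as simple as counting the preview's visible characters. The sketch below is an assumption about how one might do it and is not part of post-preview: isPreviewLengthValid and its parameters are hypothetical names. Whether the package's own limit counts markup is not stated here, so this version counts text only. Stripping tags with a regex is just a rough way to measure length; it is not a substitute for sanitizing the HTML with a dedicated library.

```javascript
// Hypothetical helper: returns true when the preview's text (tags removed)
// fits within maxChars.
function isPreviewLengthValid(previewHtml, maxChars) {
  // Rough tag removal for counting purposes only; this does NOT make the HTML safe.
  const text = previewHtml.replace(/<[^>]*>/g, '');
  return text.length <= maxChars;
}

console.log(isPreviewLengthValid('<p>Lorem ipsum</p>', 200)); // true
console.log(isPreviewLengthValid('<p>Lorem ipsum</p>', 5)); // false
```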

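Example:

The following code takes the .blogPostContainer HTMLElement as input and returns a summarized version of it as an HTML string with a maximum length of 200 characters.

You can see the preview in the "preview container", .preview:

js: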

import postPreview from "post-preview";
const postContainer = document.querySelector(".blogPostContainer");
const previewContainer = document.querySelector(".preview");
previewContainer.innerHTML = postPreview(postContainer, 200);

html (the full blog post):

<div class="blogPostContainer">
<div>
<h2>Lorem ipsum</h2>
<p>
Lorem ipsum, dolor sit amet consectetur adipisicing elit. Neque, fugit hic! Quas similique
cupiditate illum vitae eligendi harum. Magnam quam ex dolor nihil natus dolore voluptates
accusantium. Reprehenderit, explicabo blanditiis?
</p>
</div>
<p>
Lorem ipsum dolor sit amet consectetur adipisicing elit. Ipsam non incidunt, corporis debitis
ducimus eum iure sed ab. Impedit, doloribus! Quos accusamus eos, incidunt enim amet maiores
doloribus placeat explicabo.Eaque dolores tempore, quia temporibus placeat, consequuntur hic
ullam quasi rem eveniet cupiditate est aliquam nisi aut suscipit fugit maiores ad neque sunt
atque explicabo unde! Explicabo quae quia voluptatem.
</p>
</div>
<div class="preview"></div>

Result (the blog post preview):

<div class="preview">
<div class="blogPostContainer">
<div>
<h2>Lorem ipsum</h2>
<p>
Lorem ipsum, dolor sit amet consectetur adipisicing elit. Neque, fugit hic! Quas similique
cupiditate illum vitae eligendi ha
</p>
</div>
</div>
</div>

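This is a synchronous task, so if you want to run it against several posts at once, you are better off running it in a worker for better performance.

Thanks for making me do some research!

Good luck!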

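Guessing the height of each pre-rendered element is very complicated. However, you can cut the entry by character count using the following pseudo-rules:

    1. First, define the maximum number of characters you want to keep
    2. Starting from the beginning: if you encounter an HTML tag (identified with a regex matching < .. > or < .. />), find its closing tag
    3. Then continue searching for tags from where you stopped

A quick suggestion in JavaScript that I just wrote (it can probably be improved, but this is the idea):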

let str = `<body>
<h1>Test</h1>
<p>This is a long string of text that I may want to cut.. blah blah blah foo bar bar foo bar bar</p>
</body>`;
const MAXIMUM = 100; // Maximum characters for the preview
let currentChars = 0; // How many characters we have kept so far
let list = str.split(/(<\/?[A-Za-z0-9]*>)/g); // Split by tags, keeping the tags as items
const isATag = (s) => (s[0] === '<'); // Returns true if the item is a tag
const tagName = (s) => (s.replace('<', '').replace('>', '').replace('/', '')); // Get the tag name
const findMatchingTag = (list, i) => {
  let name = tagName(list[i]);
  let searchingregex = new RegExp(`</ *${name} *>`, 'g'); // Regex for the matching closing tag
  let sametagregex = new RegExp(`< *${name} *>`, 'g'); // Regex for the same opening tag (nested tags with the same name must be skipped)
  let buffer = 0; // Counts same-name tags opened at a deeper nesting level
  for (let j = i + 1; j < list.length; j++) {
    if (list[j].match(sametagregex) != null) buffer++;
    if (list[j].match(searchingregex) != null) {
      if (buffer > 0) buffer--;
      else {
        return j;
      }
    }
  }
  return -1;
};
let k = 0;
let endCut = false;
let cutArray = new Array(list.length);
while (currentChars < MAXIMUM && !endCut && k < list.length) { // As long as we are within the character limit and within the array
  if (isATag(list[k])) { // Handling tags: find the matching closing tag
    if (cutArray[k] !== undefined) { // Closing tag already emitted (and counted) together with its opening tag
      k++;
      continue;
    }
    let matchingTagindex = findMatchingTag(list, k);
    if (matchingTagindex != -1) {
      if (list[k].length + list[matchingTagindex].length + currentChars < MAXIMUM) { // Keep the tag and its closing tag only if both fit within the limit
        currentChars += list[k].length + list[matchingTagindex].length;
        cutArray[k] = list[k];
        cutArray[matchingTagindex] = list[matchingTagindex];
      }
      else { // Otherwise include neither and end the cut process
        endCut = true;
      }
    }
    else {
      if (list[k].length + currentChars < MAXIMUM) { // Unmatched tag: keep it only if it fits within the limit
        currentChars += list[k].length;
        cutArray[k] = list[k];
      }
      else { // Otherwise leave it out and end the cut process
        endCut = true;
      }
    }
  }
  else { // Not a tag: trim the text to the remaining budget
    let cutstr = list[k].substring(0, MAXIMUM - currentChars);
    currentChars += cutstr.length;
    cutArray[k] = cutstr;
  }
  k++;
}
console.log(cutArray.join(''));

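I used the solution proposed by SomoKRoceS and it really helped me. But then I found a couple of problems:

  1. If the HTML content that exceeds the limit is wrapped in a single tag, it is omitted entirely
  2. If a tag has any attributes, such as class="width100" or style="text-align:center", it will not match the provided regExp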
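The second problem is easy to reproduce. The snippet below is a small illustration added for clarity (the variable names are made up for the example): it applies the tag-splitting pattern from the previous answer to the same paragraph with and without an attribute.

```javascript
// The tag pattern from the previous answer: only a bare tag name is allowed
// between < and >.
const simpleTagRegex = /(<\/?[A-Za-z0-9]*>)/g;

// Without attributes, the opening tag is split out as its own item.
const plain = '<p>Hello</p>'.split(simpleTagRegex);

// With an attribute, the opening tag is not recognized and stays glued to the text.
const withAttribute = '<p class="width100">Hello</p>'.split(simpleTagRegex);

console.log(plain); // [ '', '<p>', 'Hello', '</p>', '' ]
console.log(withAttribute); // [ '<p class="width100">Hello', '</p>', '' ]
```

Because the opening tag is never seen as a tag, its closing tag is treated as unmatched and the attribute text counts against the character budget.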

I made some adjustments to overcome these problems. This solution trims the plain text to exactly fit the limit and preserves all of the HTML wrappers.

class HtmlTrimmer {
  // Matches an opening or closing tag, with or without attributes, e.g.
  // <p style="align-items: center; width: 100%;">
  HTML_TAG_REGEXP = /(<\/?[a-zA-Z]+[\s a-zA-Z0-9="'-;:%]*[^<]*>)/g;
  // Captures the tag name of an opening or closing tag
  HTML_TAGNAME_REGEXP = /<\/?([a-zA-Z0-9]+)[\sa-zA-Z0-9="'-_:;%]*>/;

  // Returns the text of the HTML with all tags removed
  getPlainText(html) {
    return html
      .split(this.HTML_TAG_REGEXP)
      .filter(text => !this.isTag(text))
      .map(text => text.trim())
      .join('');
  }

  isTag(text) {
    return text[0] === '<';
  }

  getTagName(tag) {
    return tag.replace(this.HTML_TAGNAME_REGEXP, '$1');
  }

  findClosingTagIndex(list, openedTagIndex) {
    const name = this.getTagName(list[openedTagIndex]);
    // The regex for the matching closing tag
    const closingTagRegex = new RegExp(`</ *${name} *>`, 'g');
    // The regex for the same opening tag (nested tags with the same name must be skipped).
    // The backslash is doubled so that \s survives the template literal.
    const sameTagRegex = new RegExp(`< *${name}[\\sa-zA-Z0-9="'-_:;%]*>`, 'g');
    // Counts same-name tags opened at a deeper nesting level
    let sameTagsInsideCount = 0;
    for (let j = openedTagIndex + 1; j < list.length; j++) {
      if (list[j].match(sameTagRegex) !== null) sameTagsInsideCount++;
      if (list[j].match(closingTagRegex) !== null) {
        if (sameTagsInsideCount > 0) sameTagsInsideCount--;
        else {
          return j;
        }
      }
    }
    return -1;
  }

  trimHtmlContent(html, limit) {
    let trimmed = '';
    const innerItems = html.split(this.HTML_TAG_REGEXP);
    for (let i = 0; i < innerItems.length; i++) {
      const item = innerItems[i];
      const trimmedTextLength = this.getPlainText(trimmed).length;
      // Stop once the text budget is used up; otherwise the slice below
      // could be called with a negative end index
      if (trimmedTextLength >= limit) return trimmed;
      if (this.isTag(item)) {
        const closingTagIndex = this.findClosingTagIndex(innerItems, i);
        if (closingTagIndex === -1) {
          trimmed = trimmed + item;
        } else {
          // Trim the tag's content recursively with the remaining budget
          const innerHtml = innerItems.slice(i + 1, closingTagIndex).join('');
          trimmed = trimmed
            + item
            + this.trimHtmlContent(innerHtml, limit - trimmedTextLength)
            + innerItems[closingTagIndex];
          i = closingTagIndex;
        }
      } else {
        if (trimmedTextLength + item.length > limit) {
          trimmed = trimmed + item.slice(0, limit - trimmedTextLength);
          return trimmed + '...';
        } else {
          trimmed = trimmed + item;
        }
      }
    }
    return trimmed;
  }
}

const htmlTrimmer = new HtmlTrimmer();
const trimmedHtml = htmlTrimmer.trimHtmlContent(html, 100); // html: the full post as an HTML string
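Both regex-based versions above look ahead for each tag's closing partner. Another way to get the same guarantee, shown here only as a minimal sketch and not taken from any of the answers, is to walk the tokens once and keep a stack of the tags that are currently open; when the character budget runs out, whatever is left on the stack is closed in reverse order. The names truncateHtml and VOID_TAGS are made up for this example, and the tokenizer is deliberately simple: it assumes no > inside attribute values, treats comments as text, and counts entities such as &amp; by their raw length. For untrusted or messy HTML, a real parser is probably the safer choice.

```javascript
// Void elements never take a closing tag, so they are not pushed on the stack.
const VOID_TAGS = new Set(['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr']);

// Hypothetical helper: keeps at most `limit` characters of text and closes
// every tag that is still open at the cut point.
function truncateHtml(html, limit) {
  // Split into tags (with or without attributes) and text, keeping the tags.
  const tokens = html.split(/(<\/?[a-zA-Z][^>]*>)/).filter((t) => t !== '');
  const openTags = []; // stack of currently open tag names
  let remaining = limit; // text characters we may still emit
  let out = '';

  for (const token of tokens) {
    if (remaining <= 0) break; // budget used up: stop before opening more tags
    const tag = token.match(/^<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>$/);
    if (!tag) {
      // Text: copy as much as the budget allows.
      const text = token.slice(0, remaining);
      out += text;
      remaining -= text.length;
      continue;
    }
    const isClosing = tag[1] === '/';
    const name = tag[2].toLowerCase();
    if (isClosing) {
      // Emit a closing tag only if it matches the innermost open tag;
      // stray closing tags are dropped.
      if (openTags[openTags.length - 1] === name) {
        openTags.pop();
        out += token;
      }
    } else {
      out += token;
      if (!VOID_TAGS.has(name) && !token.endsWith('/>')) openTags.push(name);
    }
  }

  // Close whatever is still open, innermost first.
  for (let i = openTags.length - 1; i >= 0; i--) out += `</${openTags[i]}>`;
  return out;
}

const post = `<body>
<h1>Test</h1>
<p>This is a long string of text that I may want to cut.. blah blah blah foo bar bar foo bar bar</p>
</body>`;
console.log(truncateHtml(post, 35));
```

With a limit of 35 text characters (line breaks included), the cut falls right after "This is a long string of text", and the open p and body tags are closed behind it. Unlike the class above, this sketch does not append an ellipsis.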
