Capture all content between <pre></pre> tags

I am reading in an .html file:

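const fs = require('fs');

// Read the file and split on the opening tag (note: `.*` is greedy).
const htmlin = String(fs.readFileSync(inputHtml) || '');
const splitted = htmlin.split(/<pre.*>/);
splitted.shift();
// Join what is left and split again on the closing tag.
const justPost = splitted.join('').split('</pre>');
justPost.pop();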

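but I am looking for a way to match all the text inside the <pre> tags of a string such as

aaa <pre> xxx </pre> bbb <pre> foo </pre> ccc

and also the text outside them, so that I end up with two arrays: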

['aaa ', ' bbb ', ' ccc']

and

[' xxx ', ' foo ']

How can I do this with a regex or some other method?

One way is to use the regex replace function together with capture groups.

<pre>(.*?)(?=</pre>)|(?:^|</pre>)(.*?)(?=$|<pre>)

<pre>(.*?)(?=</pre>) - matches the text between the pre tags. (G1)
(?:^|</pre>)(.*?)(?=$|<pre>) - matches the text outside the pre tags. (G2)

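let str = `aaa <pre> xxx </pre> bbb <pre> foo </pre> ccc`
let inner = []
let outer = []
// The callback is used only for its side effects; `/` must be escaped inside a regex literal.
let op = str.replace(/<pre>(.*?)(?=<\/pre>)|(?:^|<\/pre>)(.*?)(?=$|<pre>)/g, function (match, g1, g2) {
  if (g1) {
    inner.push(g1.trim()) // text inside a <pre> block
  }
  if (g2) {
    outer.push(g2.trim()) // text outside the <pre> blocks
  }
  return match // leave the string unchanged
})
console.log(outer) // ['aaa', 'bbb', 'ccc']
console.log(inner) // ['xxx', 'foo']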
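
If the untrimmed pieces are wanted, as in the question, a shorter route may be String.prototype.split with a capture group: split() keeps the captured text in its result, so outer and inner pieces alternate. This is only a minimal sketch, assuming the <pre> blocks are not nested; splitPre, outside and inside are hypothetical names that do not come from the answer above.

```javascript
// Hypothetical helper: separates the text outside and inside <pre> blocks.
// Assumes <pre> elements are not nested.
function splitPre(html) {
  // The capture group makes split() keep the inner text in the result,
  // so the array alternates: outer, inner, outer, inner, ..., outer.
  const parts = html.split(/<pre\b[^>]*>([\s\S]*?)<\/pre>/);
  const outer = parts.filter((_, i) => i % 2 === 0);
  const inner = parts.filter((_, i) => i % 2 === 1);
  return { outer, inner };
}

const { outer: outside, inner: inside } =
  splitPre('aaa <pre> xxx </pre> bbb <pre> foo </pre> ccc');
console.log(outside); // ['aaa ', ' bbb ', ' ccc']
console.log(inside);  // [' xxx ', ' foo ']
```

Because the pattern uses [\s\S] rather than a plain dot, it should also cope with <pre> blocks that span several lines.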

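Instead of using a regex, you can use the DOM or DOMParser.

For example, create a div and set its innerHTML property to the HTML string. Then loop over its childNodes and read the innerHTML or the text content of each node.

For example: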

let htmlString = `aaa <pre> xxx </pre> bbb <pre> foo </pre> ccc`,
  pre = [],
  text = [];
let div = document.createElement('div');
div.innerHTML = htmlString;
div.childNodes.forEach(x => {
  if (x.nodeType === Node.TEXT_NODE) {
    text.push(x.textContent.trim())
  }
  if (x.nodeName === "PRE") {
    pre.push(x.innerHTML.trim());
  }
});
console.log(pre);
console.log(text);

I used re.DOTALL to find the data between <pre> and </pre>, then split it on newlines.

import re

txt = """111 abc<pre>seven
eight
nine
ten
eleven
twelve</pre>
<pre> one 
two 
three 
four 
five 
six </pre>def"""
# re.DOTALL lets "." match newlines, so each match can span several lines.
results = re.findall(r'<pre>(.*?)</pre>', txt, re.DOTALL)
print(results)
word_list = []
for item in results:
    print(item)
    words = item.split("\n")  # split each block into lines
    for word in words:
        word_list.append(word)

print(word_list)
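
Since the question is about JavaScript, the same idea can be sketched there as well: the s (dotAll) flag plays the role of re.DOTALL. The sample text and the names sampleTxt and wordList below are made up for illustration.

```javascript
// JavaScript counterpart of re.DOTALL: the `s` flag lets `.` match newlines.
const sampleTxt = '111 abc<pre>seven\neight</pre>\n<pre>one\ntwo</pre>def';

const wordList = [];
for (const m of sampleTxt.matchAll(/<pre>(.*?)<\/pre>/gs)) {
  // m[1] is the text between the tags; split it into lines.
  wordList.push(...m[1].split('\n'));
}
console.log(wordList); // ['seven', 'eight', 'one', 'two']
```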

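Because you may have HTML tags inside the <pre>, I would personally put a marker that does not occur in the HTML before the closing </pre> tag, like this. Then I would search from the start of the pre tag up to that marker.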

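// Put a marker (assumed not to occur in the HTML) before every closing tag.
const myTextWithMarker = myText.replace(/<\/pre>/g, '¬</pre>');
// Group 1 holds any attributes on the opening tag; group 2 holds the content.
const regResult = myTextWithMarker.match(/<pre( [^>]*)?>([^¬]*)/);
const myContent = regResult ? regResult[2] : null;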
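
To collect every <pre> block rather than only the first, the marker idea could be combined with matchAll. This is a rough sketch under the assumption that the character ¬ never occurs in the input; preContentsWithMarker is a hypothetical helper name.

```javascript
// Hypothetical helper: returns the content of every <pre> block,
// using '¬' as an end marker. Assumes '¬' never occurs in the input.
function preContentsWithMarker(html) {
  if (html.includes('¬')) {
    throw new Error('marker character already present in input');
  }
  // Insert the marker before each closing tag.
  const marked = html.replace(/<\/pre>/g, '¬</pre>');
  // Capture everything after the opening tag up to the marker,
  // including newlines and nested tags such as <b>.
  const matches = marked.matchAll(/<pre(?:\s[^>]*)?>([^¬]*)/g);
  return Array.from(matches, (m) => m[1]);
}

const blocks = preContentsWithMarker(
  'a <pre class="x"> <b>hi</b>\n there </pre> b <pre>two</pre>'
);
console.log(blocks); // [' <b>hi</b>\n there ', 'two']
```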
