Using regular expressions and reconstructing the original string



I have text like this:

This is a test text. <span> with bold </span> and with <span> italic </span> and so on and so forth.

Currently I use the regular expression <[^>]*> to identify all HTML tags, and then I replace every tag with an empty string, so the result looks like this:

This is a test text. with bold and with italic and so on and so forth.
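The stripping step described above can be sketched as follows. This is a minimal illustration, not the asker's actual code: the class name `StripTagsDemo` and the method `stripTags` are hypothetical, and only the pattern `<[^>]*>` comes from the question.

```java
import java.util.regex.Pattern;

class StripTagsDemo {
  // Same pattern as in the question: '<', any run of non-'>' characters, then '>'.
  private static final Pattern TAG = Pattern.compile("<[^>]*>");

  // Removes every match of TAG and leaves all other characters untouched.
  static String stripTags(String html) {
    return TAG.matcher(html).replaceAll("");
  }

  public static void main(String[] args) {
    System.out.println(stripTags("a <span>b</span> c")); // prints "a b c"
  }
}
```

Note that whitespace next to a removed tag stays in place, so stripping `<span> italic </span>` leaves the spaces on both sides of the word behind.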

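In the text above, I want to locate a piece of text, for example "italic", insert a special tag around it, and then rebuild the original text. So the result would be: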

This is a test text. <span> with bold </span> and with <span> <span class='special'>italic</span> </span> and so on and so forth.

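I am writing code that uses matcher.start() and matcher.end() to build a list of all the HTML tags, and I am thinking of reconstructing the text from that list. Is there a better way? How would you solve it?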
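As a rough sketch of the bookkeeping mentioned here, the snippet below records the offsets of every tag using `Matcher.start()` and `Matcher.end()`. The class name `TagPositions` and the method `findTags` are made up for illustration; only the tag pattern is taken from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagPositions {
  private static final Pattern TAG = Pattern.compile("<[^>]*>");

  // Returns one {start, end} pair per tag; end is exclusive, as with Matcher.end().
  static List<int[]> findTags(String html) {
    List<int[]> spans = new ArrayList<>();
    Matcher m = TAG.matcher(html);
    while (m.find()) {
      spans.add(new int[] {m.start(), m.end()});
    }
    return spans;
  }

  public static void main(String[] args) {
    String html = "and with <span> italic </span> and so on";
    for (int[] s : findTags(html)) {
      // Print each tag together with its offsets in the original string.
      System.out.println(s[0] + ".." + s[1] + " " + html.substring(s[0], s[1]));
    }
  }
}
```

With the offsets in hand, the tags can later be put back at positions adjusted for the text that was removed before them.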

EDIT

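The reason I search the text after removing the HTML is that the HTML gets in the way of the text I am looking for. For example, it could look like this: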

This is a test text. <span> with bold </span> and with <span> it</span>al<span>ic </span> and so on and so forth.
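To make the problem concrete, the following small sketch checks the same kind of input twice: once against the raw markup and once against the tag-free text. The names `SplitWordDemo` and `containsIgnoringTags` are hypothetical.

```java
class SplitWordDemo {
  // Strips tags with the pattern from the question, then does a plain substring search.
  static boolean containsIgnoringTags(String html, String word) {
    return html.replaceAll("<[^>]*>", "").contains(word);
  }

  public static void main(String[] args) {
    String html = "and with <span> it</span>al<span>ic </span> and so on";
    System.out.println(html.contains("italic"));               // prints "false"
    System.out.println(containsIgnoringTags(html, "italic"));  // prints "true"
  }
}
```

The word is only visible once the tags are gone, which is why the match positions then have to be translated back into the original markup.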

EDIT2

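This is not a duplicate question, as has been suggested. Imagine a scenario where you need to highlight HTML as you see it on screen, simply by adding a span with a yellow background around the text you selected. Now suppose that text is the word italic, but it appears as <span>ita</span>l<span>ic</span>. My question is: how do you find the word and then wrap a span around it?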

EDIT3: Final edit to simplify the problem statement. I hope this makes the question clear. This is the input:

This is a test text with <span>it<span>al<span>ic</span> and etc.

This is the expected output:

This is a test text with <span class='highlight'><span>it<span>al<span>ic</span></span> and etc.

This will do what you want, but it does not detect or prevent the generation of malformed HTML.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlHighlighter {
  private final String inputWithoutTags;
  private final List<Tag> tags;
  private static class Tag {
    private final String text;
    private final int startPos;
    private Tag(final String text, final int startPos) {
      this.text = text;
      this.startPos = startPos;
    }
  }
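  public HtmlHighlighter(final String input, final String tagRegex) {
    final Pattern p = Pattern.compile(tagRegex);
    tags = new ArrayList<>();
    final Matcher m = p.matcher(input);
    // StringBuffer because Matcher.appendReplacement only accepts a StringBuffer before Java 9.
    final StringBuffer sb = new StringBuffer();
    int cursor = 0;              // offset in the original input just past the previous tag
    int cursorExcludingTags = 0; // the same offset measured in the tag-free text
    while (m.find()) {
      // Advance by the amount of plain text between the previous tag and this one.
      cursorExcludingTags += m.start() - cursor;
      // Remember the tag together with its position in the tag-free text.
      tags.add(new Tag(input.substring(m.start(), m.end()), cursorExcludingTags));
      cursor = m.end();
      // Copy the text before the tag and drop the tag itself.
      m.appendReplacement(sb, "");
    }
    m.appendTail(sb);
    inputWithoutTags = sb.toString();
  }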
  public String highlightText(String regexToFind, String openingTag, String closingTag) {
    final List<Tag> allTags = getAllTags(regexToFind, openingTag, closingTag);
    return combineTags(allTags);
  }
  private List<Tag> getAllTags(final String regexToFind, final String openingTag, final String closingTag) {
    final List<Tag> ret = new ArrayList<>(tags);
    final Pattern p = Pattern.compile(regexToFind);
    final Matcher m = p.matcher(inputWithoutTags);
    while (m.find()) {
      addTag(new Tag(openingTag, m.start()), true, ret);
      addTag(new Tag(closingTag, m.end()), false, ret);
    }
    return ret;
  }
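  // Inserts the tag into allTags, keeping the list ordered by startPos.
  // beforeIgnored == true places it before existing tags at the same position (opening tag);
  // beforeIgnored == false places it after them (closing tag).
  private void addTag(final Tag tag, final boolean beforeIgnored, final List<Tag> allTags) {
    for (int i = 0; i < allTags.size(); i++) {
      final int pos = allTags.get(i).startPos;
      if (beforeIgnored ? pos >= tag.startPos : pos > tag.startPos) {
        allTags.add(i, tag);
        return;
      }
    }
    // No later tag found, so the new tag goes at the end.
    allTags.add(tag);
  }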
  private String combineTags(final List<Tag> allTags) {
    final StringBuilder sb = new StringBuilder(inputWithoutTags);
    for (int i = allTags.size() - 1; i >= 0; i--) {
      final Tag tag = allTags.get(i);
      sb.insert(tag.startPos, tag.text);
    }
    return sb.toString();
  }
  public static void main(String... args) {
    // '<' and '>' need no escaping in a regex; "\<" is an illegal escape in a Java string literal.
    final HtmlHighlighter highlighter = new HtmlHighlighter("This is a test text with <span>it<span>al<span>ic</span> and etc.", "<.*?>");
    System.out.println(highlighter.highlightText("italic", "<span class='highlight'>", "</span>"));
  }
}
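Since the code above does not check whether its output is well formed, one possible follow-up is a simple nesting check on the result. The sketch below is an assumption, not part of the original answer: the class `TagBalanceCheck` and the method `isBalanced` are hypothetical, and the check only understands plain opening and closing tags (void elements such as `<br>`, self-closing tags, and comments are not handled).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagBalanceCheck {
  // Group 1 is "/" for a closing tag and empty otherwise; group 2 is the tag name.
  private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

  // True if every closing tag matches the most recently opened tag and nothing stays open.
  static boolean isBalanced(String html) {
    Deque<String> open = new ArrayDeque<>();
    Matcher m = TAG.matcher(html);
    while (m.find()) {
      String name = m.group(2).toLowerCase();
      if (m.group(1).isEmpty()) {
        open.push(name); // opening tag
      } else if (open.isEmpty() || !open.pop().equals(name)) {
        return false;    // closing tag without a matching opener
      }
    }
    return open.isEmpty();
  }

  public static void main(String[] args) {
    System.out.println(isBalanced("<b>x <span class='highlight'>it</span></b>"));     // prints "true"
    System.out.println(isBalanced("<b>x <span class='highlight'>it</b>alic</span>")); // prints "false"
  }
}
```

The second call shows the kind of overlap that can arise when the highlighted text starts inside one element and ends outside it. The sample input from the question, which opens three spans and closes only one, would also be reported as unbalanced by this check.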
