Java string tokenization: split on a pattern and retain the pattern



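My question is the Scala (Java) variant of this query on Python.

In particular, I have a string val myStr = "Shall we meet at, let's say, 8:45 AM?". I would like to tokenize it and retain the delimiters (all except whitespace). If my delimiters were only single characters, such as . : ? and so on, I could do: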

val strArr = myStr.split("((\\s+)|(?=[,.;:?])|(?<=\\b[,.;:?]))")

which yields

[Shall, we, meet, at, ,, let's, say, ,, 8, :, 45, AM, ?]
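For readers on plain Java, here is a minimal sketch of the same lookaround-based split. The class name SplitKeepDelimiters and the helper split are made up for illustration, and the capturing groups from the expression above are dropped because they do not affect where the string is cut.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; only String.split and the regex idea come from the question.
class SplitKeepDelimiters {
    // Whitespace is consumed by the split. The two lookarounds match the
    // zero-width positions just before and just after a punctuation
    // delimiter, so the delimiter itself survives as its own token.
    static final String SPLIT_REGEX = "\\s+|(?=[,.;:?])|(?<=\\b[,.;:?])";

    static List<String> split(String text) {
        return Arrays.asList(text.split(SPLIT_REGEX));
    }

    public static void main(String[] args) {
        // prints [Shall, we, meet, at, ,, let's, say, ,, 8, :, 45, AM, ?]
        System.out.println(split("Shall we meet at, let's say, 8:45 AM?"));
    }
}
```

Because : is treated as a delimiter everywhere, the time 8:45 is cut into three tokens, which is exactly the limitation described next.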

However, I also want the time signature \d+:\d+ to be treated as a delimiter, and I still want to retain it. So what I would like is

[Shall, we, meet, at, ,, let's, say, ,, 8:45, AM, ?]

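Note:

  1. Adding the disjunction (?=(\d+:\d+)) to the expression in the split statement does not help.
  2. Outside of a time signature, : is a delimiter in its own right.

How can I achieve this?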

I suggest matching all your tokens rather than splitting the string, because that way you have better control over what you get:

 \b\d{1,2}:\d{2}\b|[,.;:?]+|(?:(?!\b\d{1,2}:\d{2}\b)[^\s,.;:?])+

See the regex demo.

The alternatives are ordered from the most specific pattern to the most generic one, which comes last.

Details

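  • \b\d{1,2}:\d{2}\b - 1 to 2 digits, :, then 2 digits, enclosed in word boundaries
  • | - or
  • [,.;:?]+ - 1 or more , . ; : ? chars
  • | - or
  • (?:(?!\b\d{1,2}:\d{2}\b)[^\s,.;:?])+ - 1 or more chars, each of which is neither one of our delimiter chars nor whitespace ([^\s,.;:?]) and does not start a time string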
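To see what the negative lookahead in the last alternative buys, here is a small Java sketch that compares the full pattern with a variant that lacks the lookahead. The class LookaheadGuardDemo, the constant names and the sample input "(8:45)" are assumptions chosen for illustration; they are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing two variants of the token pattern.
class LookaheadGuardDemo {
    static final String TIME = "\\b\\d{1,2}:\\d{2}\\b";
    static final String PUNCT = "[,.;:?]+";

    // Full pattern: the generic branch stops right before a time string.
    static final Pattern GUARDED = Pattern.compile(
            TIME + "|" + PUNCT + "|(?:(?!" + TIME + ")[^\\s,.;:?])+");

    // Comparison variant without the negative lookahead.
    static final Pattern UNGUARDED = Pattern.compile(
            TIME + "|" + PUNCT + "|[^\\s,.;:?]+");

    // Collect every match of the pattern, in order.
    static List<String> tokens(Pattern pattern, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "(8:45)";
        System.out.println(tokens(GUARDED, text));   // prints [(, 8:45, )]
        System.out.println(tokens(UNGUARDED, text)); // prints [(8, :, 45)]
    }
}
```

Without the lookahead, a non-word character such as ( directly in front of the time lets the generic branch swallow the first digit, so the time alternative never gets a chance to match.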

Consider this snippet:

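val str = "Shall we meet at, let's say, 8:45 AM?"
// Triple quotes keep the backslashes literal, so no double escaping is needed
val rx = """\b\d{1,2}:\d{2}\b|[,.;:?]+|(?:(?!\b\d{1,2}:\d{2}\b)[^\s,.;:?])+""".r
// Print every match (token) on its own line
rx.findAllIn(str).foreach(println)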

Output:

Shall
we
meet
at
,
let's
say
,
8:45
AM
?
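
Since the question also mentions Java, the same match-all-tokens approach might look like this there. This is only a sketch: the class name TimeAwareTokenizer is made up, and Matcher.results() assumes Java 9 or later (on Java 8, a while (m.find()) loop does the same job).

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical Java port of the Scala snippet above.
class TimeAwareTokenizer {
    // Same three alternatives: time string, punctuation run, generic token.
    // Backslashes are doubled because this is an ordinary Java string literal.
    static final Pattern TOKEN = Pattern.compile(
            "\\b\\d{1,2}:\\d{2}\\b|[,.;:?]+|(?:(?!\\b\\d{1,2}:\\d{2}\\b)[^\\s,.;:?])+");

    static List<String> tokenize(String text) {
        // Matcher.results() streams every match in order (Java 9+).
        return TOKEN.matcher(text)
                .results()
                .map(MatchResult::group)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [Shall, we, meet, at, ,, let's, say, ,, 8:45, AM, ?]
        System.out.println(tokenize("Shall we meet at, let's say, 8:45 AM?"));
    }
}
```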

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * StringPatternTokenizer is similar to java.util.StringTokenizer,
 * but it uses a regex string as the token separator.
 * See the #testCase method for detailed usage.
 */
public class StringPatternTokenizer {
    Pattern pattern;
    public StringPatternTokenizer(String regex) {
        this.pattern = Pattern.compile(regex);
    }
    public void getTokens(String str, NextToken nextToken) {
        Matcher matcher = pattern.matcher(str);
        int index = 0;
        Result result = null;
        while (matcher.find()) {
            if (matcher.start() > index) {
                result = nextToken.visit(null, str.substring(index, matcher.start()));
            }
            if (result != Result.STOP) {
                index = matcher.end();
                result = nextToken.visit(matcher, null);
            }
            if (result == Result.STOP) {
                return;
            }
        }
        if (index < str.length()) {
            nextToken.visit(null, str.substring(index));
        }
    }
    enum Result {
        CONTINUE,
        STOP,
    }
    public interface NextToken {
        Result visit(Matcher matcher, String str);
    }
    /***********************************/
    /***** test cases FOR IT ***********/
    /***********************************/
    public void testCase() {
        // As a test, visit each part produced by the tokenizer,
        // replace the variable parts with the given values,
        // and collect the result into the output string.
        String strSource = "My name is {{NAME}}, nice to meet you.";
        String strTarget = "My name is TokenTst, nice to meet you.";
        // separator pattern for: variable names in two curly brackets
        String variableRegex = "\\{\\{([A-Za-z]+)\\}\\}";
        // variable values
        org.json.JSONObject data = new org.json.JSONObject(
                java.util.Collections.singletonMap("NAME", "TokenTst")
        );
        StringBuilder sb = new StringBuilder();
        new StringPatternTokenizer(variableRegex)
                .getTokens(strSource, (matcher, str) -> {
                    sb.append(matcher == null ? str
                            : data.optString(matcher.group(1), ""));
                    return StringPatternTokenizer.Result.CONTINUE;
                });
        // check the result as expected
        org.junit.Assert.assertEquals(strTarget, sb.toString());
    }
}
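
The test case above depends on org.json and JUnit. For a quick experiment without those libraries, the same walk (literal text before each match, then the match itself, then the trailing text) can be sketched with a plain Map. TemplateFill and fill are hypothetical names and are not part of the class above.

```java
import java.util.Collections;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical dependency-free helper; it mirrors the loop in getTokens.
class TemplateFill {
    // Variable names wrapped in two curly brackets, e.g. {{NAME}}.
    static final Pattern VARIABLE = Pattern.compile("\\{\\{([A-Za-z]+)\\}\\}");

    static String fill(String template, Map<String, String> values) {
        StringBuilder out = new StringBuilder();
        Matcher m = VARIABLE.matcher(template);
        int index = 0; // end of the previous match
        while (m.find()) {
            // Literal text between the previous match and this one.
            out.append(template, index, m.start());
            // The separator itself, replaced by its value (empty if unknown).
            out.append(values.getOrDefault(m.group(1), ""));
            index = m.end();
        }
        // Trailing literal text after the last match.
        out.append(template, index, template.length());
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = Collections.singletonMap("NAME", "TokenTst");
        // prints My name is TokenTst, nice to meet you.
        System.out.println(fill("My name is {{NAME}}, nice to meet you.", values));
    }
}
```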
