How can I case fold a string in Rust?



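I'm writing a simple full-text search library and need case folding to check whether two words are equal. For this use case, the existing .to_lowercase() and .to_uppercase() methods are not enough.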
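To make the problem concrete, here is a minimal sketch using only the standard library. The helper names eq_lower and eq_upper are made up for this illustration. It relies on two well-known cases: the two lowercase sigmas (U+03C3 and U+03C2) only meet after uppercasing, while U+00C5 and U+212B ANGSTROM SIGN only meet after lowercasing, so neither method works as a comparison key on its own.

```rust
/// Compares two strings after lowercasing both with the standard library.
fn eq_lower(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}

/// Compares two strings after uppercasing both with the standard library.
fn eq_upper(a: &str, b: &str) -> bool {
    a.to_uppercase() == b.to_uppercase()
}
```

For example, eq_lower("\u{03C3}", "\u{03C2}") is false while eq_upper on the same pair is true, and the situation is reversed for "\u{00C5}" and "\u{212B}".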

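From a quick search of crates.io, I can find libraries for normalization and word splitting, but none for case folding. regex-syntax does have case folding code, but it is not exposed in its API.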

For my use case, I've found the caseless crate to be the most useful.

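As far as I know, this is the only library that supports normalization. This matters when you want, for example, "㎒" (U+3392 SQUARE MHZ) and "mhz" to match. See Chapter 3, Default Caseless Matching, in the Unicode Standard for details on how this works.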

Here's some example code that matches strings case-insensitively:

extern crate caseless;
use caseless::Caseless;
let a = "100 ㎒";
let b = "100 mhz";
// These strings don't match with just case folding,
// but do match after compatibility (NFKD) normalization
assert!(!caseless::default_caseless_match_str(a, b));
assert!(caseless::compatibility_caseless_match_str(a, b));
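To show why the normalization step matters without pulling in any crate, here is a toy sketch. It is not how caseless works internally: toy_compat_decompose and toy_compat_match are hypothetical helpers, and the one-entry table stands in for the full NFKD decomposition data. U+3392 has no case mapping of its own, so lowercasing alone leaves it untouched; only after it is expanded to "MHz" can case mapping bring the two strings together.

```rust
/// Toy stand-in for NFKD: expands only U+3392 SQUARE MHZ to "MHz".
/// A real implementation needs the full Unicode decomposition tables.
fn toy_compat_decompose(s: &str) -> String {
    let mut out = String::new();
    for c in s.chars() {
        match c {
            '\u{3392}' => out.push_str("MHz"),
            other => out.push(other),
        }
    }
    out
}

/// Decomposes both inputs with the toy table, then lowercases them
/// with the standard library and compares the results.
fn toy_compat_match(a: &str, b: &str) -> bool {
    toy_compat_decompose(a).to_lowercase() == toy_compat_decompose(b).to_lowercase()
}
```

With these helpers, toy_compat_match("100 \u{3392}", "100 mhz") holds, whereas comparing the plain to_lowercase() results of the same two strings does not.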

To get the case-folded string directly, you can use the default_case_fold_str function:

let s = "Twilight Sparkle ちゃん";
assert_eq!(caseless::default_case_fold_str(s), "twilight sparkle ちゃん");

Caseless doesn't expose a corresponding function that also normalizes, but you can write one using the unicode-normalization crate:

extern crate unicode_normalization;
use caseless::Caseless;
use unicode_normalization::UnicodeNormalization;
fn compatibility_case_fold(s: &str) -> String {
    s.nfd().default_case_fold().nfkd().default_case_fold().nfkd().collect()
}
let a = "100 ㎒";
assert_eq!(compatibility_case_fold(a), "100 mhz");

Note that multiple rounds of normalization and case folding are needed to get a correct result.

(Thanks to BurntSushi5 for pointing me to this library.)

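As of today (2023), the caseless crate looks unmaintained, and the ICU4X project seems to be the way to go. To apply case folding, see the icu_casemapping crate. To compare strings according to language-dependent conventions, see the icu_collator crate. For a good introduction to sorting words correctly in Rust, see here.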

For documentation on Unicode theory and algorithms, see the Unicode Standard. In particular:

  • Case conversion and case folding: sections 3.13 and 5.18.
  • Collation algorithm.

For documentation on the ICU4X project, see here.

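To use ICU4X, you can either add the main icu crate to Cargo.toml and access the individual modules (e.g. icu::collator, icu::datetime, etc.), or add only the individual crates you actually need (e.g. icu_collator, icu_datetime, etc.).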

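To check whether two words are equal regardless of case, you can apply full case folding to both strings and then check for binary equality. For this you need the icu_casemapping::CaseMapping::full_fold method and a data provider such as icu_testdata::unstable. Note that, currently, the data for icu_casemapping is hidden behind the icu_testdata/icu_casemapping feature, so you need to enable it explicitly in Cargo.toml, as follows: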

[dependencies]
icu_casemapping = "0.7.1"
icu_testdata = { version = "1.1.2", features = ["icu_casemapping"] }

In the future, the icu_testdata/icu_casemapping feature will probably be added to the default features of icu_testdata, once icu_casemapping becomes stable.

Here is a simple example that uses the full_fold method of icu_casemapping::CaseMapping:

use icu_casemapping::CaseMapping;
fn main() {
    let str1 = "Hello";
    let str2 = "hello";
    assert_ne!(str1, str2);
    let case_mapping = CaseMapping::try_new(&icu_testdata::unstable()).unwrap();
    assert_eq!(case_mapping.full_fold(str1), case_mapping.full_fold(str2));
}

Note that the icu_casemapping crate does not currently include normalization. This may be added in the future; see the discussion here.

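If you want to compare strings according to language-dependent conventions, you can use the icu_collator crate, which lets you customize options such as strength and locale. You can find several examples here.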

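In case someone wants to stick with the standard library, I wanted some actual data on this. I pulled the full list of two-byte characters that fail with to_lowercase or to_uppercase. I then ran this test: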

fn lowercase(left: char, right: char) -> bool {
   for c in left.to_lowercase() {
      for d in right.to_lowercase() {
         if c == d { return true }
      }
   }
   false
}
fn uppercase(left: char, right: char) -> bool {
   for c in left.to_uppercase() {
      for d in right.to_uppercase() {
         if c == d { return true }
      }
   }
   false
}
fn main() {
   let pairs = &[
      &['\u{00E5}','\u{212B}'],&['\u{00C5}','\u{212B}'],&['\u{0399}','\u{1FBE}'],
      &['\u{03B9}','\u{1FBE}'],&['\u{03B2}','\u{03D0}'],&['\u{03B5}','\u{03F5}'],
      &['\u{03B8}','\u{03D1}'],&['\u{03B8}','\u{03F4}'],&['\u{03D1}','\u{03F4}'],
      &['\u{03B9}','\u{1FBE}'],&['\u{0345}','\u{03B9}'],&['\u{0345}','\u{1FBE}'],
      &['\u{03BA}','\u{03F0}'],&['\u{00B5}','\u{03BC}'],&['\u{03C0}','\u{03D6}'],
      &['\u{03C1}','\u{03F1}'],&['\u{03C2}','\u{03C3}'],&['\u{03C6}','\u{03D5}'],
      &['\u{03C9}','\u{2126}'],&['\u{0392}','\u{03D0}'],&['\u{0395}','\u{03F5}'],
      &['\u{03D1}','\u{03F4}'],&['\u{0398}','\u{03D1}'],&['\u{0398}','\u{03F4}'],
      &['\u{0345}','\u{1FBE}'],&['\u{0345}','\u{0399}'],&['\u{0399}','\u{1FBE}'],
      &['\u{039A}','\u{03F0}'],&['\u{00B5}','\u{039C}'],&['\u{03A0}','\u{03D6}'],
      &['\u{03A1}','\u{03F1}'],&['\u{03A3}','\u{03C2}'],&['\u{03A6}','\u{03D5}'],
      &['\u{03A9}','\u{2126}'],&['\u{0398}','\u{03F4}'],&['\u{03B8}','\u{03F4}'],
      &['\u{03B8}','\u{03D1}'],&['\u{0398}','\u{03D1}'],&['\u{0432}','\u{1C80}'],
      &['\u{0434}','\u{1C81}'],&['\u{043E}','\u{1C82}'],&['\u{0441}','\u{1C83}'],
      &['\u{0442}','\u{1C84}'],&['\u{0442}','\u{1C85}'],&['\u{1C84}','\u{1C85}'],
      &['\u{044A}','\u{1C86}'],&['\u{0412}','\u{1C80}'],&['\u{0414}','\u{1C81}'],
      &['\u{041E}','\u{1C82}'],&['\u{0421}','\u{1C83}'],&['\u{1C84}','\u{1C85}'],
      &['\u{0422}','\u{1C84}'],&['\u{0422}','\u{1C85}'],&['\u{042A}','\u{1C86}'],
      &['\u{0463}','\u{1C87}'],&['\u{0462}','\u{1C87}']
   ];
   let (mut upper, mut lower) = (0, 0);
   for pair in pairs.iter() {
      print!("U+{:04X} ", pair[0] as u32);
      print!("U+{:04X} pass: ", pair[1] as u32);
      if uppercase(pair[0], pair[1]) {
         print!("to_uppercase ");
         upper += 1;
      } else {
         print!("             ");
      }
      if lowercase(pair[0], pair[1]) {
         print!("to_lowercase");
         lower += 1;
      }
      println!();
   }
   println!("upper pass: {}, lower pass: {}", upper, lower);
}

Results below. Interestingly, one of the pairs (U+03D1 and U+03F4) fails both. But based on this, to_uppercase is the better option.

U+00E5 U+212B pass:              to_lowercase
U+00C5 U+212B pass:              to_lowercase
U+0399 U+1FBE pass: to_uppercase
U+03B9 U+1FBE pass: to_uppercase
U+03B2 U+03D0 pass: to_uppercase
U+03B5 U+03F5 pass: to_uppercase
U+03B8 U+03D1 pass: to_uppercase
U+03B8 U+03F4 pass:              to_lowercase
U+03D1 U+03F4 pass:
U+03B9 U+1FBE pass: to_uppercase
U+0345 U+03B9 pass: to_uppercase
U+0345 U+1FBE pass: to_uppercase
U+03BA U+03F0 pass: to_uppercase
U+00B5 U+03BC pass: to_uppercase
U+03C0 U+03D6 pass: to_uppercase
U+03C1 U+03F1 pass: to_uppercase
U+03C2 U+03C3 pass: to_uppercase
U+03C6 U+03D5 pass: to_uppercase
U+03C9 U+2126 pass:              to_lowercase
U+0392 U+03D0 pass: to_uppercase
U+0395 U+03F5 pass: to_uppercase
U+03D1 U+03F4 pass:
U+0398 U+03D1 pass: to_uppercase
U+0398 U+03F4 pass:              to_lowercase
U+0345 U+1FBE pass: to_uppercase
U+0345 U+0399 pass: to_uppercase
U+0399 U+1FBE pass: to_uppercase
U+039A U+03F0 pass: to_uppercase
U+00B5 U+039C pass: to_uppercase
U+03A0 U+03D6 pass: to_uppercase
U+03A1 U+03F1 pass: to_uppercase
U+03A3 U+03C2 pass: to_uppercase
U+03A6 U+03D5 pass: to_uppercase
U+03A9 U+2126 pass:              to_lowercase
U+0398 U+03F4 pass:              to_lowercase
U+03B8 U+03F4 pass:              to_lowercase
U+03B8 U+03D1 pass: to_uppercase
U+0398 U+03D1 pass: to_uppercase
U+0432 U+1C80 pass: to_uppercase
U+0434 U+1C81 pass: to_uppercase
U+043E U+1C82 pass: to_uppercase
U+0441 U+1C83 pass: to_uppercase
U+0442 U+1C84 pass: to_uppercase
U+0442 U+1C85 pass: to_uppercase
U+1C84 U+1C85 pass: to_uppercase
U+044A U+1C86 pass: to_uppercase
U+0412 U+1C80 pass: to_uppercase
U+0414 U+1C81 pass: to_uppercase
U+041E U+1C82 pass: to_uppercase
U+0421 U+1C83 pass: to_uppercase
U+1C84 U+1C85 pass: to_uppercase
U+0422 U+1C84 pass: to_uppercase
U+0422 U+1C85 pass: to_uppercase
U+042A U+1C86 pass: to_uppercase
U+0463 U+1C87 pass: to_uppercase
U+0462 U+1C87 pass: to_uppercase
upper pass: 46, lower pass: 8
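Since U+03D1 and U+03F4 match under neither method alone, one rough workaround, still using only the standard library, is to chain the two mappings: uppercase every character first, then lowercase the result. The sketch below is an approximation and not real Unicode case folding (it performs no normalization and does not use the official CaseFolding data); approx_fold is a name made up for this illustration.

```rust
/// Rough approximation of case folding with the standard library only:
/// uppercase each character, then lowercase every resulting character.
/// Assumption: this is a sketch, not the Unicode case folding algorithm.
fn approx_fold(s: &str) -> String {
    s.chars()
        .flat_map(char::to_uppercase)
        .flat_map(char::to_lowercase)
        .collect()
}
```

With this helper, approx_fold("\u{03D1}") and approx_fold("\u{03F4}") both yield U+03B8, so the pair that failed both tests above compares equal, and "Straße" and "STRASSE" both become "strasse".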

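The unicase crate doesn't expose case folding directly, but it provides a generic wrapper type that implements Eq, Ord, and Hash in a case-insensitive manner. The master branch (unreleased) supports both ASCII case folding (as an optimization) and Unicode case folding (though only invariant case folding is supported).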
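To illustrate the wrapper-type idea, here is a minimal sketch built on the standard library alone. It is not unicase's implementation or API: AsciiCaseless and count_distinct are hypothetical names, only ASCII letters are treated case-insensitively, and Ord is left out for brevity. The key point is that Hash must agree with Eq, so the hash is computed over the lowercased bytes.

```rust
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical wrapper (not unicase's type): equality and hashing
/// ignore the case of ASCII letters.
#[derive(Debug, Clone)]
struct AsciiCaseless(String);

impl PartialEq for AsciiCaseless {
    fn eq(&self, other: &Self) -> bool {
        self.0.eq_ignore_ascii_case(&other.0)
    }
}

impl Eq for AsciiCaseless {}

impl Hash for AsciiCaseless {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Hash the ASCII-lowercased bytes so that equal values hash equally.
        for b in self.0.bytes() {
            state.write_u8(b.to_ascii_lowercase());
        }
        // Terminator byte that never occurs in valid UTF-8.
        state.write_u8(0xff);
    }
}

/// Counts the distinct words in `words`, ignoring ASCII case.
fn count_distinct(words: &[&str]) -> usize {
    words
        .iter()
        .map(|w| AsciiCaseless(w.to_string()))
        .collect::<HashSet<_>>()
        .len()
}
```

For instance, count_distinct(&["Rust", "RUST", "rust", "Go"]) returns 2. Because the comparison logic lives in the type, the same wrapper can serve as a HashMap key without folding strings by hand at every call site.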
