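I'm writing a simple full-text search library and need case folding to check whether two words are equal. For this use case, the existing .to_lowercase() and .to_uppercase() methods are not enough.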
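To make the problem concrete, here is a minimal standard-library-only sketch of why neither method works on its own. The helper names eq_lower and eq_upper are made up for this illustration; the character pairs are ones I'm confident about (ß uppercases to "SS", and U+212B ANGSTROM SIGN lowercases to U+00E5 but is its own uppercase).

```rust
/// Hypothetical helper: compare two strings after lowercasing both.
fn eq_lower(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}

/// Hypothetical helper: compare two strings after uppercasing both.
fn eq_upper(a: &str, b: &str) -> bool {
    a.to_uppercase() == b.to_uppercase()
}

/// Returns (lowercase comparison, uppercase comparison) for a pair.
fn compare_both(a: &str, b: &str) -> (bool, bool) {
    (eq_lower(a, b), eq_upper(a, b))
}
```

With these helpers, compare_both("straße", "STRASSE") matches only via uppercase, while compare_both("\u{00E5}", "\u{212B}") matches only via lowercase, so whichever single method you pick misses some pairs.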
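From a quick search of crates.io, I could find libraries for normalization and word splitting, but none for case folding. regex-syntax does have case folding code, but it isn't exposed in its API.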
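For my use case, I've found the caseless crate to be the most useful.
As far as I know, this is the only library that supports normalization. This matters when you want, for example, "㎒" (U+3392 SQUARE MHZ) and "mhz" to match. See Chapter 3 - Default Caseless Matching in the Unicode Standard for details on how this works.
Here's some example code that matches strings case-insensitively: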
extern crate caseless;
use caseless::Caseless;
let a = "100 ㎒";
let b = "100 mhz";
// These strings don't match with just case folding,
// but do match after compatibility (NFKD) normalization
assert!(!caseless::default_caseless_match_str(a, b));
assert!(caseless::compatibility_caseless_match_str(a, b));
To get the case-folded string directly, you can use the default_case_fold_str function:
let s = "Twilight Sparkle ちゃん";
assert_eq!(caseless::default_case_fold_str(s), "twilight sparkle ちゃん");
Caseless doesn't expose a corresponding function that also normalizes, but you can write one using the unicode-normalization crate:
extern crate unicode_normalization;
use caseless::Caseless;
use unicode_normalization::UnicodeNormalization;
fn compatibility_case_fold(s: &str) -> String {
    s.nfd().default_case_fold().nfkd().default_case_fold().nfkd().collect()
}
let a = "100 ㎒";
assert_eq!(compatibility_case_fold(a), "100 mhz");
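Note that multiple rounds of normalization and case folding are needed to get a correct result.
(Thanks to BurntSushi5 for pointing me to this library.)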
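As of today (2023), the caseless crate looks unmaintained, and the ICU4X project seems to be the way to go. To apply case folding, see the icu_casemapping crate. To compare strings according to language-dependent conventions, see the icu_collator crate. For a good introduction to sorting words correctly in Rust, see here.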
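For documentation on Unicode theory and algorithms, see the Unicode Standard. In particular:
- Case mapping and case folding: sections 3.13 and 5.18.
- Collation algorithm
For documentation on the ICU4X project, see here.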
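To use ICU4X, you can either add the main icu crate to Cargo.toml and access the individual modules (e.g. icu::collator, icu::datetime, etc.), or add only the individual crates you actually need (e.g. icu_collator, icu_datetime, etc.).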
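To check whether two words are equal regardless of case, you can apply full case folding to both strings and then check for binary equality. For this you need the icu_casemapping::full_fold method and a data provider such as icu_testdata::unstable. Note that the data for icu_casemapping is currently hidden behind the icu_testdata/icu_casemapping feature, so you need to enable it explicitly in your Cargo.toml file, as follows: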
[dependencies]
icu_casemapping = "0.7.1"
icu_testdata = { version = "1.1.2", features = ["icu_casemapping"] }
In the future, the icu_testdata/icu_casemapping feature may be added to the default features of icu_testdata, once icu_casemapping becomes stable.
Here is a simple example using the icu_casemapping::full_fold method:
use icu_casemapping::CaseMapping;

fn main() {
    let str1 = "Hello";
    let str2 = "hello";
    assert_ne!(str1, str2);
    let case_mapping = CaseMapping::try_new(&icu_testdata::unstable()).unwrap();
    assert_eq!(case_mapping.full_fold(str1), case_mapping.full_fold(str2));
}
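Note that the icu_casemapping crate currently does not include normalization; this may be added in the future, see the discussion here.
If you want to compare strings according to language-dependent conventions, you can use the icu_collator crate, which lets you customize several options such as strength and locale. You can find several examples here.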
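If someone wants to stick to the standard library, I wanted some actual data on this. I pulled the full list of two-byte characters that fail with to_lowercase or to_uppercase. Then I ran this test: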
// True if the lowercase expansions of the two chars share any char.
fn lowercase(left: char, right: char) -> bool {
    for c in left.to_lowercase() {
        for d in right.to_lowercase() {
            if c == d { return true }
        }
    }
    false
}

// True if the uppercase expansions of the two chars share any char.
fn uppercase(left: char, right: char) -> bool {
    for c in left.to_uppercase() {
        for d in right.to_uppercase() {
            if c == d { return true }
        }
    }
    false
}

fn main() {
    let pairs = &[
        &['\u{00E5}','\u{212B}'],&['\u{00C5}','\u{212B}'],&['\u{0399}','\u{1FBE}'],
        &['\u{03B9}','\u{1FBE}'],&['\u{03B2}','\u{03D0}'],&['\u{03B5}','\u{03F5}'],
        &['\u{03B8}','\u{03D1}'],&['\u{03B8}','\u{03F4}'],&['\u{03D1}','\u{03F4}'],
        &['\u{03B9}','\u{1FBE}'],&['\u{0345}','\u{03B9}'],&['\u{0345}','\u{1FBE}'],
        &['\u{03BA}','\u{03F0}'],&['\u{00B5}','\u{03BC}'],&['\u{03C0}','\u{03D6}'],
        &['\u{03C1}','\u{03F1}'],&['\u{03C2}','\u{03C3}'],&['\u{03C6}','\u{03D5}'],
        &['\u{03C9}','\u{2126}'],&['\u{0392}','\u{03D0}'],&['\u{0395}','\u{03F5}'],
        &['\u{03D1}','\u{03F4}'],&['\u{0398}','\u{03D1}'],&['\u{0398}','\u{03F4}'],
        &['\u{0345}','\u{1FBE}'],&['\u{0345}','\u{0399}'],&['\u{0399}','\u{1FBE}'],
        &['\u{039A}','\u{03F0}'],&['\u{00B5}','\u{039C}'],&['\u{03A0}','\u{03D6}'],
        &['\u{03A1}','\u{03F1}'],&['\u{03A3}','\u{03C2}'],&['\u{03A6}','\u{03D5}'],
        &['\u{03A9}','\u{2126}'],&['\u{0398}','\u{03F4}'],&['\u{03B8}','\u{03F4}'],
        &['\u{03B8}','\u{03D1}'],&['\u{0398}','\u{03D1}'],&['\u{0432}','\u{1C80}'],
        &['\u{0434}','\u{1C81}'],&['\u{043E}','\u{1C82}'],&['\u{0441}','\u{1C83}'],
        &['\u{0442}','\u{1C84}'],&['\u{0442}','\u{1C85}'],&['\u{1C84}','\u{1C85}'],
        &['\u{044A}','\u{1C86}'],&['\u{0412}','\u{1C80}'],&['\u{0414}','\u{1C81}'],
        &['\u{041E}','\u{1C82}'],&['\u{0421}','\u{1C83}'],&['\u{1C84}','\u{1C85}'],
        &['\u{0422}','\u{1C84}'],&['\u{0422}','\u{1C85}'],&['\u{042A}','\u{1C86}'],
        &['\u{0463}','\u{1C87}'],&['\u{0462}','\u{1C87}']
    ];
    let (mut upper, mut lower) = (0, 0);
    for pair in pairs.iter() {
        print!("U+{:04X} ", pair[0] as u32);
        print!("U+{:04X} pass: ", pair[1] as u32);
        if uppercase(pair[0], pair[1]) {
            print!("to_uppercase ");
            upper += 1;
        } else {
            print!(" ");
        }
        if lowercase(pair[0], pair[1]) {
            print!("to_lowercase");
            lower += 1;
        }
        println!();
    }
    println!("upper pass: {}, lower pass: {}", upper, lower);
}
The results are below. Interestingly, one of the pairs passes neither test. But based on this, to_uppercase is the best option.
U+00E5 U+212B pass: to_lowercase
U+00C5 U+212B pass: to_lowercase
U+0399 U+1FBE pass: to_uppercase
U+03B9 U+1FBE pass: to_uppercase
U+03B2 U+03D0 pass: to_uppercase
U+03B5 U+03F5 pass: to_uppercase
U+03B8 U+03D1 pass: to_uppercase
U+03B8 U+03F4 pass: to_lowercase
U+03D1 U+03F4 pass:
U+03B9 U+1FBE pass: to_uppercase
U+0345 U+03B9 pass: to_uppercase
U+0345 U+1FBE pass: to_uppercase
U+03BA U+03F0 pass: to_uppercase
U+00B5 U+03BC pass: to_uppercase
U+03C0 U+03D6 pass: to_uppercase
U+03C1 U+03F1 pass: to_uppercase
U+03C2 U+03C3 pass: to_uppercase
U+03C6 U+03D5 pass: to_uppercase
U+03C9 U+2126 pass: to_lowercase
U+0392 U+03D0 pass: to_uppercase
U+0395 U+03F5 pass: to_uppercase
U+03D1 U+03F4 pass:
U+0398 U+03D1 pass: to_uppercase
U+0398 U+03F4 pass: to_lowercase
U+0345 U+1FBE pass: to_uppercase
U+0345 U+0399 pass: to_uppercase
U+0399 U+1FBE pass: to_uppercase
U+039A U+03F0 pass: to_uppercase
U+00B5 U+039C pass: to_uppercase
U+03A0 U+03D6 pass: to_uppercase
U+03A1 U+03F1 pass: to_uppercase
U+03A3 U+03C2 pass: to_uppercase
U+03A6 U+03D5 pass: to_uppercase
U+03A9 U+2126 pass: to_lowercase
U+0398 U+03F4 pass: to_lowercase
U+03B8 U+03F4 pass: to_lowercase
U+03B8 U+03D1 pass: to_uppercase
U+0398 U+03D1 pass: to_uppercase
U+0432 U+1C80 pass: to_uppercase
U+0434 U+1C81 pass: to_uppercase
U+043E U+1C82 pass: to_uppercase
U+0441 U+1C83 pass: to_uppercase
U+0442 U+1C84 pass: to_uppercase
U+0442 U+1C85 pass: to_uppercase
U+1C84 U+1C85 pass: to_uppercase
U+044A U+1C86 pass: to_uppercase
U+0412 U+1C80 pass: to_uppercase
U+0414 U+1C81 pass: to_uppercase
U+041E U+1C82 pass: to_uppercase
U+0421 U+1C83 pass: to_uppercase
U+1C84 U+1C85 pass: to_uppercase
U+0422 U+1C84 pass: to_uppercase
U+0422 U+1C85 pass: to_uppercase
U+042A U+1C86 pass: to_uppercase
U+0463 U+1C87 pass: to_uppercase
U+0462 U+1C87 pass: to_uppercase
upper pass: 46, lower pass: 8
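Reading the table, every pair that fails to_uppercase passes to_lowercase except U+03D1/U+03F4, and that pair converges if you uppercase first and then lowercase. So one possible std-only approximation is to chain the two mappings per character. This is only a sketch, not real Unicode case folding: the function name approx_fold is made up here, I have only checked it by hand against a few of the pairs above, and it can over-match (for example, dotless ı U+0131 uppercases to I and so ends up equal to i, which default case folding would keep distinct).

```rust
/// Hypothetical approximation of case folding using only std:
/// uppercase every char, then lowercase every resulting char.
/// Works per character, so no context-sensitive rules apply.
fn approx_fold(s: &str) -> String {
    s.chars()
        .flat_map(|c| c.to_uppercase())
        .flat_map(|c| c.to_lowercase())
        .collect()
}

/// Hypothetical helper: caseless comparison built on approx_fold.
fn approx_caseless_eq(a: &str, b: &str) -> bool {
    approx_fold(a) == approx_fold(b)
}
```

For anything that has to be correct across scripts, I would still reach for a crate that implements the Unicode case folding tables.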
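The unicase crate doesn't expose case folding directly, but it provides a generic wrapper type that implements Eq, Ord and Hash in a case-insensitive manner. The master branch (unreleased) supports both ASCII case folding (as an optimization) and Unicode case folding (though only invariant case folding is supported).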
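To illustrate the wrapper-type idea, here is a minimal std-only sketch. It is not unicase's implementation or API: the type AsciiCaseless and the helper lookup_demo are hypothetical names, and it only handles ASCII case. The key point is that Eq and Hash must agree, so both ignore ASCII case.

```rust
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical wrapper: compares and hashes its string ignoring ASCII case.
#[derive(Debug, Clone)]
struct AsciiCaseless(String);

impl PartialEq for AsciiCaseless {
    fn eq(&self, other: &Self) -> bool {
        self.0.eq_ignore_ascii_case(&other.0)
    }
}

impl Eq for AsciiCaseless {}

impl Hash for AsciiCaseless {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Hash the ASCII-lowercased bytes so equal values hash equally.
        for b in self.0.bytes() {
            state.write_u8(b.to_ascii_lowercase());
        }
    }
}

/// Hypothetical demo: store a value under one spelling, look it up under another.
fn lookup_demo(stored: &str, query: &str) -> Option<u32> {
    let mut map = HashMap::new();
    map.insert(AsciiCaseless(stored.to_string()), 1u32);
    map.get(&AsciiCaseless(query.to_string())).copied()
}
```

A wrapper like this lets you use HashMap and HashSet keys caselessly without folding the stored strings, which keeps the original spelling available for display.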