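Is there any way to enumerate all the Unicode properties of a character in Ruby? I can use Ruby 1.9's Regexp class to test whether a given character has a specific property (e.g., some_char =~ /\p{P}/ tests whether some_char is punctuation, and so on), but since a character can have several properties (e.g., ( is both punctuation and ASCII), it would be nice to be able to get a list of all of a character's properties.
I could probably hand-roll this from unicode_data.txt, or whatever it's called, but this seems like the kind of thing that has probably already been done somewhere. UnicodeUtils doesn't seem to offer anything along these lines, and Google didn't turn up anything obvious. Thanks!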
You could call out to my uniprops script.
$ uniprops -p delta greek:delta Greek:Delta
U+1E9F ‹ẟ› \N{ LATIN SMALL LETTER DELTA }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
U+03B4 ‹δ› \N{ GREEK SMALL LETTER DELTA }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
U+0394 ‹Δ› \N{ GREEK CAPITAL LETTER DELTA }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
$ uniprops '#' ç π
U+0023 ‹#› \N{ NUMBER SIGN }:
\pP \p{Po}
All Any ASCII Assigned Common Zyyy Po P Gr_Base
Grapheme_Base Graph GrBase Other_Punctuation Punct Pat_Syn
Pattern_Syntax PatSyn PosixGraph PosixPrint PosixPunct
Print Punctuation
U+00E7 ‹ç› \N{ LATIN SMALL LETTER C WITH CEDILLA }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
All Any Alnum Alpha Alphabetic Assigned InLatin1 Cased
Cased_Letter LC Changes_When_Casemapped CWCM
Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll
L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC
ID_Start IDS Letter L_ Latin Latn Lowercase_Letter Lower
Lowercase Print Word XID_Continue XIDC XID_Start XIDS
U+03C0 ‹π› \N{ GREEK SMALL LETTER PI }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
All Any Alnum Alpha Alphabetic Assigned Greek Is_Greek
InGreek Cased Cased_Letter LC Changes_When_Casemapped CWCM
Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll
L Gr_Base Grapheme_Base Graph GrBase Grek Greek_And_Coptic
ID_Continue IDC ID_Start IDS Letter L_ Lowercase_Letter
Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS
$ uniprops -a 'MICRO SIGN'
U+00B5 ‹µ› \N{MICRO SIGN}
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
All Any Alnum Alpha Alphabetic Assigned InLatin1 Cased Cased_Letter LC Changes_When_Casefolded CWCF Changes_When_Casemapped CWCM
Changes_When_NFKC_Casefolded CWKCF Changes_When_Titlecased CWT Changes_When_Uppercased CWU Common Zyyy Ll L Gr_Base Grapheme_Base
Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin_1 Latin_1_Supplement Lowercase_Letter Lower Lowercase Print Word
XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word
Age=1.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Latin_1 Block=Latin_1_Supplement BLK=Latin1 Canonical_Combining_Class=0
Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=Com
Decomposition_Type=Compat DT=Com Decomposition_Type=Non_Canon Decomposition_Type=Non_Canonical DT=NonCanon East_Asian_Width=Neutral
Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA
Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic
LB=AL Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1
Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0
Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=LO Sentence_Break=Lower SB=LO
Word_Break=ALetter WB=LE Word_Break=LE _X_Begin
$ uniprops -a 2011
U+2011 ‹‑› \N{NON-BREAKING HYPHEN}
\pP \p{Pd}
All Any Assigned InGeneralPunctuation Changes_When_NFKC_Casefolded CWKCF Common Zyyy Dash Dash_Punctuation Pd P General_Punctuation
Gr_Base Grapheme_Base Graph GrBase Punct Pat_Syn Pattern_Syntax PatSyn Print Punctuation X_POSIX_Graph X_POSIX_Print X_POSIX_Punct
Age=1.1 Bidi_Class=ON Bidi_Class=Other_Neutral BC=ON Block=General_Punctuation Canonical_Combining_Class=0
Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=Nb
Decomposition_Type=Nobreak DT=Nb Decomposition_Type=Non_Canon Decomposition_Type=Non_Canonical DT=NonCanon East_Asian_Width=Neutral
Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA
Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=GL Line_Break=Glue LB=GL
Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0
IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1
IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=Other SB=XX Sentence_Break=XX Word_Break=Other
WB=XX Word_Break=XX _X_Begin
$ uniprops -l | grep Greek | sort -dfu
Blk=Greek
Block:Ancient_Greek_Musical_Notation
Block:Ancient_Greek_Numbers
Block:Greek
Block=Greek_And_Coptic
Block:Greek_Extended
Greek
Greek_And_Coptic
InAncientGreekMusicalNotation
InAncientGreekNumbers
InGreek
InGreekExtended
Is_Greek
Script=Greek
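For a rough Ruby-only approximation of what uniprops reports, the regex test from the question can be run against a list of candidate property names. This is only a sketch: the CANDIDATES list and the properties_of helper are my own illustrative names, the list is nowhere near complete, and which names are accepted depends on the Ruby version. Names the regex engine rejects are simply skipped.

```ruby
# encoding: utf-8

# Hypothetical, hand-picked candidate list; extend it as needed.
CANDIDATES = %w[
  Alnum Alpha ASCII Digit Graph Lower Print Punct Space Upper Word
  L Ll Lu M N Nd P Po Pd S Z
  Latin Greek Common
].freeze

# Return the candidate property names that the single character +char+ has.
def properties_of(char, candidates = CANDIDATES)
  candidates.select do |name|
    begin
      # Build /\A\p{Name}\z/ at run time so that unknown names can be rescued.
      !(Regexp.new("\\A\\p{#{name}}\\z") =~ char).nil?
    rescue RegexpError
      false # this Ruby does not know the property name
    end
  end
end

p properties_of("(")
p properties_of("π")
```

Because it can only confirm names it is given, this does not replace a real property database; it just shows that the Regexp approach scales to a list.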
You'll probably want unichars too, so that you can go the other way. Here are examples of calling it:
$ unichars -gns '\p{Cased}' '\p{Number}'
$ unichars '\R'
$ unichars '\S' '[\v\h]'
$ unichars '\S' '\p{space}'
$ unichars '\pL' '\p{Greek}'
$ unichars '\pL' '\p{Greek}' | um
$ unichars '\p{Age=6.0}' | um
$ unichars '\p{Lowercase}' '\P{Lowercase_Letter}'
$ unichars '\p{Lower}' '\P{Ll}' # same but easier to type
$ unichars -a '\p{alphabetic}' '\P{Letter}' | wc -l # 1006 code points
$ unichars -gas '\PL' '\p{Cased}'
$ unichars -gas '\P{MARK}' '\p{diacritic}' # 209 code points
$ unichars -gas '\pM' '\P{BC=NSM}'
$ unichars -gas '\p{Cased}' '[^\p{CWL}\p{CWT}\p{CWU}]'
$ unichars -gas '\p{Dash}'
$ unichars -gas '\p{mark}' '\P{DIACRITIC}' # 1068 code points
$ unichars -gas 'grep { length > 1 } lc, ucfirst, uc'
$ unichars -gas 'uc ne ucfirst'
$ unichars -gasn NUM
Here is an example of the output:
$ unichars -gsn NUM 'int NUM ne NUM'
0 U+0030 GC=Nd 0=NV SC=Common DIGIT ZERO
¼ U+00BC GC=No 1/4=NV SC=Common VULGAR FRACTION ONE QUARTER
½ U+00BD GC=No 1/2=NV SC=Common VULGAR FRACTION ONE HALF
¾ U+00BE GC=No 3/4=NV SC=Common VULGAR FRACTION THREE QUARTERS
٠ U+0660 GC=Nd 0=NV SC=Common ARABIC-INDIC DIGIT ZERO
۰ U+06F0 GC=Nd 0=NV SC=Arabic EXTENDED ARABIC-INDIC DIGIT ZERO
߀ U+07C0 GC=Nd 0=NV SC=Nko NKO DIGIT ZERO
० U+0966 GC=Nd 0=NV SC=Devanagari DEVANAGARI DIGIT ZERO
০ U+09E6 GC=Nd 0=NV SC=Bengali BENGALI DIGIT ZERO
৴ U+09F4 GC=No 1/16=NV SC=Bengali BENGALI CURRENCY NUMERATOR ONE
৵ U+09F5 GC=No 1/8=NV SC=Bengali BENGALI CURRENCY NUMERATOR TWO
৶ U+09F6 GC=No 3/16=NV SC=Bengali BENGALI CURRENCY NUMERATOR THREE
৷ U+09F7 GC=No 1/4=NV SC=Bengali BENGALI CURRENCY NUMERATOR FOUR
৸ U+09F8 GC=No 3/4=NV SC=Bengali BENGALI CURRENCY NUMERATOR ONE LESS THAN THE DENOMINATOR
੦ U+0A66 GC=Nd 0=NV SC=Gurmukhi GURMUKHI DIGIT ZERO
૦ U+0AE6 GC=Nd 0=NV SC=Gujarati GUJARATI DIGIT ZERO
୦ U+0B66 GC=Nd 0=NV SC=Oriya ORIYA DIGIT ZERO
୲ U+0B72 GC=No 1/4=NV SC=Oriya ORIYA FRACTION ONE QUARTER
୳ U+0B73 GC=No 1/2=NV SC=Oriya ORIYA FRACTION ONE HALF
୴ U+0B74 GC=No 3/4=NV SC=Oriya ORIYA FRACTION THREE QUARTERS
୵ U+0B75 GC=No 1/16=NV SC=Oriya ORIYA FRACTION ONE SIXTEENTH
୶ U+0B76 GC=No 1/8=NV SC=Oriya ORIYA FRACTION ONE EIGHTH
୷ U+0B77 GC=No 3/16=NV SC=Oriya ORIYA FRACTION THREE SIXTEENTHS
And so on.
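The same "other direction" lookup can be sketched in Ruby by walking a range of code points and keeping those that match every pattern. The chars_matching helper and its default range are assumptions made for illustration; it is far slower and less capable than unichars, and the properties it understands are limited to what Ruby's Regexp supports.

```ruby
# encoding: utf-8

# Return the characters in +range+ that match every pattern in +patterns+.
# Patterns are regex source strings such as '\p{Greek}'.
def chars_matching(patterns, range = 0x0000..0x07FF)
  regexps = patterns.map { |src| Regexp.new(src) }
  range.each_with_object([]) do |cp, found|
    next if (0xD800..0xDFFF).cover?(cp) # surrogates are not valid characters
    ch = [cp].pack("U")                 # code point -> one-character UTF-8 string
    found << ch if regexps.all? { |re| re =~ ch }
  end
end

# Roughly analogous to: unichars '\pL' '\p{Greek}', limited to a small range.
chars_matching(['\p{L}', '\p{Greek}'], 0x0391..0x0395).each do |ch|
  printf("%s U+%04X\n", ch, ch.ord)
end
```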
I described these in my first talk on Unicode at OSCON. These are just two out of a suite of tools.
runpaint provides an interface to unicode_data.txt that works well, but it describes itself as a "very early draft".
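If you do end up hand-rolling it from the data file, the records are semicolon-separated, so the first few fields are easy to pull out. The sketch below is not runpaint's interface; FIELDS, SAMPLE and parse_unicode_data_line are hypothetical names, and the sample record is typed in by hand, so check it against the real file. Note that this file only carries a handful of properties (general category, combining class, bidi class, decomposition and so on), not the full set that uniprops prints.

```ruby
# encoding: utf-8

# Names for the first six fields of a UnicodeData.txt record.
FIELDS = [:code_point, :name, :general_category,
          :canonical_combining_class, :bidi_class, :decomposition].freeze

# Hand-copied sample record; in practice, read lines from the real file.
SAMPLE = "00B5;MICRO SIGN;Ll;0;L;<compat> 03BC;;;;N;;;039C;;039C"

# Parse one record into a Hash keyed by the FIELDS names.
def parse_unicode_data_line(line)
  values = line.chomp.split(";", -1)   # -1 keeps trailing empty fields
  record = Hash[FIELDS.zip(values)]    # pairs only the first six fields
  record[:code_point] = record[:code_point].to_i(16)
  record
end

record = parse_unicode_data_line(SAMPLE)
printf("U+%04X %s gc=%s bc=%s\n", record[:code_point], record[:name],
       record[:general_category], record[:bidi_class])
```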