资源描述
Click to edit Master title style,Dublin, Ireland * * * * * * *,*,*,Click to edit Master text styles,Second level,Third level,Fourth level,Fifth level,Collation in ICU,Mark Davis,Chief SW Globalization Architect,IBM,9/10/2024,1,What is ICU?,Premier Unicode Enablement Library,Open-source: non-viral license,Full-Featured, Cross-Platform,C, C+, Java APIs,Collation,Charset Conversion, Resources, Boundaries, Calendars, Transforms (case, norm., translit., ), Format/Parse (dates, times, msgs, nums., curr., ), Unicode strings/props,Collation = Sorting Order,How hard can it be?,A B C ,Complications,Languages are complex and varied,Unicode is a,big,set of characters,Performance is crucial,Varies By:,Language,Swedish: z ,German: z,Usage,Dictionary:,f of,Telephone:,of f,Customizations,A a,a A,Versioning,Fixes,New Gov. Stds,New Characters,Strength Levels: L1, L2, L3,Base characters: a b,Accents: as s at,ignored,if there is a L1 character difference,Case: ao Ao a,ignored,if there is a L1 or L2 difference,Punctuation: ab a-b aB,ignored,* if there is a L1, L2, or L3 difference,Tie-breaker: NFD code point order,Context Sensitivity,Contractions,H Z, but,CZ ,CH,Expansions,OE , OF,Both, ,Canonical Equivalence,A + ,x + . + x + + .,u + + . + u + . + u + + .,Oddities,Normal accents,cote c,o,t c,te ct,first,accent difference determines order,French accents,cote ct,e, cot, ct,last,accent difference determines order,Logical Order Exception (Thai, Lao),sorts like,Merging,Database Fields,F1 = LastName, F2 = FirstName,Sequential,Weak 1,st,Merged,F1, then F2,F1 (L1), F2,L1, L2, L3,diSilva,John,diSilva, Freddi Silva,John,di Silva, Freddsilva,John,dsilva, Fred,diSilva, John,dsilva, John,di Silva, John,di Silva, FreddiSilva, Fred,dsilva, Fred,diSilva, Johndi Silva, Johndsilva, JohndiSilva, Freddi Silva, Freddsilva, Fred,Customizations,Parameters that change collation behavior,Choice of language (locale),Runtime choices,Examples to follow,Parametric Customizations,Strength,Base,Base+Accent,Base+Accent+ Case,&c.,Case:,A a,a A,Punctuation:,di Silva diSilva,diSilva di Silva,Punctuation / Spaces (Alternates),Base Character,di silvadi SilvaDi silvaDi Silva,Dickens,disilvadiSilvaDisilvaDiSilva,Ignoreable,Dickens,di silvadisilvadi SilvadiSilvaDi silvaDisilvaDi SilvaDiSilva,Extended Customizations,User-defined,“&” “ampersand”,Merging tailorings,Iranian + French,Script Order,b , , b ,Numbers,A-10 A-2,A-2 A-10,Other Uses: String Searching,Match according to locale conventions:,e.g. w = v for Swedish,Use collation options:,ignore case, accent,other customizations,Other Uses: Selection Bounds,Return all records where:,Zo ,name, Zorma,Ignore case / accents,Zoe / zoe / Zo / zo / ,UCA,UTS #10: Unicode Collation Algorithm,Levels, Expansions, Contractions, Punctuation, Canonical Equivalence, etc.,Default ordering:,all,Unicode code points,Provides for,tailoring,to given languages,Also see:,The Unicode Standard, 5.17:,Sorting and Searching,Aligned with ISO 14651,APIs,String Compare,Sort Keys,String Search,Selection Boundaries,Merged sortkeys,Sort Keys,Transform string into series of bytes which will binary-compare,a:,06 C3,01,20,01,02,00,A:,06 C3,01,20,01,08,00,:,06 C3,01,20 32,01,02 02,00,a,b,:,06 C3 06 D7,01,20 20,01,02 02,00,b:,06 D7,01,20,01,02,00,Level 1,Level 2,Level 3,String Compare vs. Sort Keys,Same results in either case,SC faster for single comparisons,average 5 to 10 times!,SK faster for multiple comparisons,index once,binary compare,many,times,String Search,Nave Approach,key matches in target at ,iff,target.substring(x, y) key,Boundary Complications,Ignorables:,“a” matches in “(a)”?,at & & & ?,Contractions:,“c” matches in “churo”?,Normalization:,“” matches in “a,”?,WARNING 1: Basics,Not,aligned with character set or repertoire,Latin-1: Swedish and German sorting differs,Not,code point (binary) order,Binary:Z a v a,Swedish:v w,Not,a property of strings: same Database,Swedish user: views/select,German user: views/selects,WARNING 2: Operations,Order,not,preserved under concatenation / substringing,x y,x,z, y,z,x y,z,x ,z,y,x,z, y,z,x y,z,x ,z,y,x y,WARNING 3: Dependence,Collation is a relation over strings,Sort keys embody,part,of that relation,Thus, comparing sort keys from,different,tailorings (or parameters) gives undefined results.,WARNING 4: Stability,Stable Sort,Records with equal comparison come out in original order,Property of,algorithm, not comparison,Semi-Stable Comparison,x y,x,y,Property of,comparison, not algorithm,Degrades performance,Doesnt do what people think (or really want)!,ICU/Java Collation Architecture,L1-3, contractions, expansions, ,Locale tailorings,Fully rule-based specification,Arbitrary runtime user customizations,& ? = question mark,& $ = dollar sign,& z george,Java,Sun licensed and includes an early version of ICU collation in Java,ICU version:,Dramatically faster,Much reduced memory consumption,Halved sort-key length,Many additional features,ICU Collation I,Full UCA compliance,Full supplementary character support,Solid performance,Small Sort-Keys,Small Memory Footprint,ICU Collation II,Parametric control,Tailorable to any language,Simultaneous Multiple Versions,Merging Sort Keys,Selection Bounds,Memory-Mappable, Fast Init,Old: separate allocations,New: offsets within mem-map,Delta Tailoring:Minimize Memory Usage,FR,found,UCA:One Copy;,80K,not,found,code,not,synthesize,input,output,Simultaneous Multiple Versions,Programs can link against different versions of ICU, simultaneously.,Preserves,exact,binary order over time.,Application,NewDB,OldDB,ICU2.1,ICU2.0,Performance,Checks for identical prefixes first,Invokes normalization only when needed,Fast paths for common cases,Minimizes comparison time,Minimizes sort key length,Sort Key Compression,Common weights are 1-byte,Primary, secondary, tertiary, quarternary,Sequences are compressed,UTF-16 Values for “Mrk Davis”,(22 bytes),004D 00E4 0072 006B 0020 0044 0061 0076 0069 0073 0000,Sort Key,(L3, ignorable punctuation - 19 bytes),2F 17 39 2B 1D 17 41 27 3B,01,77 96 0A,01,8F 80 8F 07,00,ICU vs. Windows, glibc,Full UCA!,String comparison: comparable speed, -20% . +400%,Sort keys: much shorter, 50%,Warning: speed comparisons are approximate!,Depends on data, parameters, features, CPU,More Information,ICU,Design Document,Latest Version of these slides,Q & A,Fast C or D (FCD),Accepts all NFD, most NFC,without normalization,Backup Slides,Not used in the presentation, except in response to questions,Performance: Coding,Avoided unnecessary function calls.,Example: strlen too expensive!,Avoided use of objects,Rewrote core code in C,C+ API wraps the C core code.,Fast-pathed common cases,Used stack memory buffers,(with expansion if necessary),Made inner loops as tight as possible,WARNING 5: Math. Relation,S = Unicode Strings,Reflexive,a,S: a a,Antisymmetric,a, b,S: a b & b a,a = b,Transitive,a, b,S: a b & b c,a c,Total,a, b,S: a b,b a,Identical Prefixes,Sorting / Searching Databases,Many comparisons to “close” strings,Check initial prefixes with binary compare,Drop into collation loop at first difference,Complication,Initial Prefix Complication,Need to backup if in “bad” position:,Fractional UCA,Fractional weights for compression,Gaps for tailoring, future UCA additions,Only stores differences in tailoring file,Reduces memory footprint,Exceptional Values,Normal weight storage,Special Weight Storage,NOT_FOUND, EXPANSION, CONTRACTION, THAI, ,Minimal Memory,Flat-file (memory mapped),speeds initialization,reduces memory footprint,(next slide),Delta Tailoring,Single copy of UCA (80K),Small delta files per locale,
展开阅读全文