CollationinICU

上传人:xx****x 文档编号:242869058 上传时间:2024-09-10 格式:PPT 页数:45 大小:701.50KB
返回 下载 相关 举报
CollationinICU_第1页
第1页 / 共45页
CollationinICU_第2页
第2页 / 共45页
CollationinICU_第3页
第3页 / 共45页
点击查看更多>>
资源描述
Click to edit Master title style,Dublin, Ireland * * * * * * *,*,*,Click to edit Master text styles,Second level,Third level,Fourth level,Fifth level,Collation in ICU,Mark Davis,Chief SW Globalization Architect,IBM,9/10/2024,1,What is ICU?,Premier Unicode Enablement Library,Open-source: non-viral license,Full-Featured, Cross-Platform,C, C+, Java APIs,Collation,Charset Conversion, Resources, Boundaries, Calendars, Transforms (case, norm., translit., ), Format/Parse (dates, times, msgs, nums., curr., ), Unicode strings/props,Collation = Sorting Order,How hard can it be?,A B C ,Complications,Languages are complex and varied,Unicode is a,big,set of characters,Performance is crucial,Varies By:,Language,Swedish: z ,German: z,Usage,Dictionary:,f of,Telephone:,of f,Customizations,A a,a A,Versioning,Fixes,New Gov. Stds,New Characters,Strength Levels: L1, L2, L3,Base characters: a b,Accents: as s at,ignored,if there is a L1 character difference,Case: ao Ao a,ignored,if there is a L1 or L2 difference,Punctuation: ab a-b aB,ignored,* if there is a L1, L2, or L3 difference,Tie-breaker: NFD code point order,Context Sensitivity,Contractions,H Z, but,CZ ,CH,Expansions,OE , OF,Both, ,Canonical Equivalence,A + ,x + . + x + + .,u + + . + u + . + u + + .,Oddities,Normal accents,cote c,o,t c,te ct,first,accent difference determines order,French accents,cote ct,e, cot, ct,last,accent difference determines order,Logical Order Exception (Thai, Lao),sorts like,Merging,Database Fields,F1 = LastName, F2 = FirstName,Sequential,Weak 1,st,Merged,F1, then F2,F1 (L1), F2,L1, L2, L3,diSilva,John,diSilva, Freddi Silva,John,di Silva, Freddsilva,John,dsilva, Fred,diSilva, John,dsilva, John,di Silva, John,di Silva, FreddiSilva, Fred,dsilva, Fred,diSilva, Johndi Silva, Johndsilva, JohndiSilva, Freddi Silva, Freddsilva, Fred,Customizations,Parameters that change collation behavior,Choice of language (locale),Runtime choices,Examples to follow,Parametric Customizations,Strength,Base,Base+Accent,Base+Accent+ Case,&c.,Case:,A a,a A,Punctuation:,di Silva diSilva,diSilva di Silva,Punctuation / Spaces (Alternates),Base Character,di silvadi SilvaDi silvaDi Silva,Dickens,disilvadiSilvaDisilvaDiSilva,Ignoreable,Dickens,di silvadisilvadi SilvadiSilvaDi silvaDisilvaDi SilvaDiSilva,Extended Customizations,User-defined,“&” “ampersand”,Merging tailorings,Iranian + French,Script Order,b , , b ,Numbers,A-10 A-2,A-2 A-10,Other Uses: String Searching,Match according to locale conventions:,e.g. w = v for Swedish,Use collation options:,ignore case, accent,other customizations,Other Uses: Selection Bounds,Return all records where:,Zo ,name, Zorma,Ignore case / accents,Zoe / zoe / Zo / zo / ,UCA,UTS #10: Unicode Collation Algorithm,Levels, Expansions, Contractions, Punctuation, Canonical Equivalence, etc.,Default ordering:,all,Unicode code points,Provides for,tailoring,to given languages,Also see:,The Unicode Standard, 5.17:,Sorting and Searching,Aligned with ISO 14651,APIs,String Compare,Sort Keys,String Search,Selection Boundaries,Merged sortkeys,Sort Keys,Transform string into series of bytes which will binary-compare,a:,06 C3,01,20,01,02,00,A:,06 C3,01,20,01,08,00,:,06 C3,01,20 32,01,02 02,00,a,b,:,06 C3 06 D7,01,20 20,01,02 02,00,b:,06 D7,01,20,01,02,00,Level 1,Level 2,Level 3,String Compare vs. Sort Keys,Same results in either case,SC faster for single comparisons,average 5 to 10 times!,SK faster for multiple comparisons,index once,binary compare,many,times,String Search,Nave Approach,key matches in target at ,iff,target.substring(x, y) key,Boundary Complications,Ignorables:,“a” matches in “(a)”?,at & & & ?,Contractions:,“c” matches in “churo”?,Normalization:,“” matches in “a,”?,WARNING 1: Basics,Not,aligned with character set or repertoire,Latin-1: Swedish and German sorting differs,Not,code point (binary) order,Binary:Z a v a,Swedish:v w,Not,a property of strings: same Database,Swedish user: views/select,German user: views/selects,WARNING 2: Operations,Order,not,preserved under concatenation / substringing,x y,x,z, y,z,x y,z,x ,z,y,x,z, y,z,x y,z,x ,z,y,x y,WARNING 3: Dependence,Collation is a relation over strings,Sort keys embody,part,of that relation,Thus, comparing sort keys from,different,tailorings (or parameters) gives undefined results.,WARNING 4: Stability,Stable Sort,Records with equal comparison come out in original order,Property of,algorithm, not comparison,Semi-Stable Comparison,x y,x,y,Property of,comparison, not algorithm,Degrades performance,Doesnt do what people think (or really want)!,ICU/Java Collation Architecture,L1-3, contractions, expansions, ,Locale tailorings,Fully rule-based specification,Arbitrary runtime user customizations,& ? = question mark,& $ = dollar sign,& z george,Java,Sun licensed and includes an early version of ICU collation in Java,ICU version:,Dramatically faster,Much reduced memory consumption,Halved sort-key length,Many additional features,ICU Collation I,Full UCA compliance,Full supplementary character support,Solid performance,Small Sort-Keys,Small Memory Footprint,ICU Collation II,Parametric control,Tailorable to any language,Simultaneous Multiple Versions,Merging Sort Keys,Selection Bounds,Memory-Mappable, Fast Init,Old: separate allocations,New: offsets within mem-map,Delta Tailoring:Minimize Memory Usage,FR,found,UCA:One Copy;,80K,not,found,code,not,synthesize,input,output,Simultaneous Multiple Versions,Programs can link against different versions of ICU, simultaneously.,Preserves,exact,binary order over time.,Application,NewDB,OldDB,ICU2.1,ICU2.0,Performance,Checks for identical prefixes first,Invokes normalization only when needed,Fast paths for common cases,Minimizes comparison time,Minimizes sort key length,Sort Key Compression,Common weights are 1-byte,Primary, secondary, tertiary, quarternary,Sequences are compressed,UTF-16 Values for “Mrk Davis”,(22 bytes),004D 00E4 0072 006B 0020 0044 0061 0076 0069 0073 0000,Sort Key,(L3, ignorable punctuation - 19 bytes),2F 17 39 2B 1D 17 41 27 3B,01,77 96 0A,01,8F 80 8F 07,00,ICU vs. Windows, glibc,Full UCA!,String comparison: comparable speed, -20% . +400%,Sort keys: much shorter, 50%,Warning: speed comparisons are approximate!,Depends on data, parameters, features, CPU,More Information,ICU,Design Document,Latest Version of these slides,Q & A,Fast C or D (FCD),Accepts all NFD, most NFC,without normalization,Backup Slides,Not used in the presentation, except in response to questions,Performance: Coding,Avoided unnecessary function calls.,Example: strlen too expensive!,Avoided use of objects,Rewrote core code in C,C+ API wraps the C core code.,Fast-pathed common cases,Used stack memory buffers,(with expansion if necessary),Made inner loops as tight as possible,WARNING 5: Math. Relation,S = Unicode Strings,Reflexive,a,S: a a,Antisymmetric,a, b,S: a b & b a,a = b,Transitive,a, b,S: a b & b c,a c,Total,a, b,S: a b,b a,Identical Prefixes,Sorting / Searching Databases,Many comparisons to “close” strings,Check initial prefixes with binary compare,Drop into collation loop at first difference,Complication,Initial Prefix Complication,Need to backup if in “bad” position:,Fractional UCA,Fractional weights for compression,Gaps for tailoring, future UCA additions,Only stores differences in tailoring file,Reduces memory footprint,Exceptional Values,Normal weight storage,Special Weight Storage,NOT_FOUND, EXPANSION, CONTRACTION, THAI, ,Minimal Memory,Flat-file (memory mapped),speeds initialization,reduces memory footprint,(next slide),Delta Tailoring,Single copy of UCA (80K),Small delta files per locale,
展开阅读全文
相关资源
正为您匹配相似的精品文档
相关搜索

最新文档


当前位置:首页 > 图纸专区 > 大学资料


copyright@ 2023-2025  zhuangpeitu.com 装配图网版权所有   联系电话:18123376007

备案号:ICP2024067431-1 川公网安备51140202000466号


本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。装配图网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知装配图网,我们立即给予删除!