Chapter9TextProcessing

上传人:xx****x 文档编号:242867845 上传时间:2024-09-10 格式:PPT 页数:31 大小:285KB
返回 下载 相关 举报
Chapter9TextProcessing_第1页
第1页 / 共31页
Chapter9TextProcessing_第2页
第2页 / 共31页
Chapter9TextProcessing_第3页
第3页 / 共31页
点击查看更多>>
资源描述
,Click to edit Master title style,Click to edit Master text styles,Second level,Third level,Fourth level,Fifth level,09-08-04,1.,*,Click to edit Master title style,Click to edit Master text styles,Second level,Third level,Fourth level,Fifth level,*,Chapter 9: Text Processing,Pattern MatchingData Compression,1,Outline and Reading,Strings (9.1.1),Pattern matching algorithms,Brute-force algorithm (9.1.2),Knuth-Morris-Pratt algorithm (9.1.4),Regular Expressions and Finite Automata,Data Compression,Huffman Coding,Lempel-Ziv Compression,2,Motivation: Bioinformatics,The application of computer science techniques to genetic data,See,Gene-Finding,notes,Many interesting algorithm problems,Many interesting ethical issues!,3,Strings,A string is a sequence of characters,Examples of strings:,Java program,HTML document,DNA sequence,Digitized image,An alphabet,S,is the set of possible characters for a family of strings,Example of alphabets:,ASCII,Unicode,0, 1,A, C, G, T,Let,P,be a string of size,m,A substring,P,i . j,of,P,is the subsequence of,P,consisting of the characters with ranks between,i,and,j,A prefix of,P,is a substring of the type,P,0,. i,A suffix of,P,is a substring of the type,P,i .m,-,1,Given strings,T,(text) and,P,(pattern), the pattern matching problem consists of finding a substring of,T,equal to,P,Applications:,Regular expressions,Programming languages,Search engines,Biological research,4,Pattern matching,Suppose you want to find repeated ATs followed by a G in GAGATATATATCATATG.,How do you express that pattern to find?,How can you find it efficiently?,How if the strings were billions of characters long?,5,Finite Automata and Regular Expressions,How do I match perl-like,regular expressions,to text?,Important topic: regular expressions and finite automata.,theoretician: regular expressions are grammars that define regular languages,programmer: compact patterns for matching and replacing,6,Regular Expressions,Regular expressions are one of,a literal character,a (regular expression) in parentheses,a concatenation of two REs,the alternation (,“,or,”,) of two REs, denoted + in formal notation,the closure of an RE, denoted * (ie 0 or more occurrences),Possibly additional syntactic sugar,Examples,abracadabra,abra(cadabra)* = abra, abracadabra, abracadabracadabra, ,(a*b + ac)d,(a(a+b)b*)*,t(w+o)?o? means 0 or 1 occurrence in Perl,aa+rdvark+ means 1 or more occurrences in Perl,7,Finite Automata,Regular language,: any language defined by a RE,Finite automata,: machines that recognize regular languages.,Deterministic Finite Automaton,(DFA):,a set of states including a start state and one or more accepting states,a transition function: given current state and input letter, what,s the new state?,Non-deterministic Finite Automaton,(NFA):,like a DFA, but there may be,more than one transition out of a state on the same letter (Pick the right one non-deterministically, i.e. via lucky guess!),epsilon-transitions, i.e. optional transitions on no input letter,8,REs in common use,Syntactic sugar:,a-c,x-z: match one of a, b, c, x, y, z,abc: match a character that is not an a, b, or c,.: match any character,?: match 0 or 1 instances of what preceeded,s: match a whitespace character, $: match the beginning or end of string,(pattern): make pattern available in substitutions as $1, $2, etc,9,Examples,Perl examples (and other languages):,$input = s/two?o/2/;,$input = s|*s*|gs;,$input = s|s*font-faces*.*?|gs;,$input = s|s*mso-*|gis;,$input = s/( +) +( +)/$2 $1/;,$input = m/0-9+.?0-9*|.0-9+$/;,($word1,$word2,$rest) =,($foo = m/ *( +) +( +) +(.*)$/);,$input=s|*s*s*|gis;,10,Multiples of 3?,/(0369|2580369*147|147(0369|1470369*258)*258|2580369*258(0369|1470369*258)*258|147(0369|1470369*258)*1470369*147|2580369*258(0369|1470369*258)*1470369*147)*$/,DFA for (AT)+C,Note that DFA can be represented as a 2D array, DFAstateinputLetter,newstate,DFA:,state letter newstate,0A 1,0 TCG 0,1 T 2,1 ACG 0,2 C 4 accept,2 GT 0,2 A 3,3 T 2,3 AGC 0,4 AGCT 0,12,RE,NFA,Given a Regular Expression, how can I build a DFA?,Work bottom up.,Letter:,Concatenation:,Or:,Closure:,13,RE,NFA Example,Construct an NFA for the RE(A*B + AC)D,A,A*,A*B,A*B + AC,(A*B + AC)D,14,NFA - DFA,Keep track of the set of states you are in.,On each new input letter, compute the,new,set of states you could be in.,The set of states for the DFA is the,power set,of the NFA states.,I.e. up to 2,n,states, where there were n in the DFA.,15,Recognizing Regular Languages,Suppose your language is given by a DFA. How to recognize?,Build a table. One row for every (state, input letter) pair. Give resulting state.,For each letter of input string, compute new state,When done, check whether the last state is an accepting state.,Runtime?,O(n), where n is the number of input letters,Another approach: use a C program to simulate NFA with backtracking. Less space, more time.,16,Data Compression: Intro,Suppose you have a text, abracadabra. Want to compress it.,How many bits required?,at 3 bits per letter, 33 bits.,Can we do better?,How about,variable length codes?,In order to be able to decode the , we would need a,prefix code:,no code is the prefix of another.,How do we make a prefix code that compresses the text?,17,Huffman Coding,Note: Put the letters at the leaves of a binary tree. Left=0, Right=1. Voila! A prefix code.,Huffman coding: an,optimal,prefix code,Algorithm: use a priority queue.,insert all letters according to frequency,if there is only one tree left, done.,else, a=deleteMin(); b=deleteMin();,make tree t out of a and b with weight a.weight() + b.weight();,insert(t),18,Huffman coding example,abracadabra frequencies:,a: 5, b: 2, c: 1, d: 1, r: 2,Huffman code:,a: 0, b: 100, c: 1010, d: 1011, r: 11,bits: 5 * 1 + 2 * 3 + 1 * 4 + 1 * 4 + 2 * 2 = 23,Follow the tree to decode ,Q,(n),Time to encode?,Compute frequencies O(n),Build heap O(1) assuming alphabet has constant size,Encode O(n),19,Huffman coding summary,Huffman coding is very frequently used,(You use it every time you watch HTDV or listen to mp3, for example),Text files often compress to 60% of original size (depending on entropy),In real life, Huffman coding is usually used in conjunction with a,modeling,algorithm,20,Data compression overview,Two stages:,modeling,and,entropy coding,Modeling:,break up input into tokens or chunks (the bigger, the better),Entropy Coding:,use shorter bit strings to represent more frequent tokens,If P is the probability of a code element, the optimal number of bits is lg(P),21,Lempel-Ziv Modeling,Consider compressing text,Certain byte strings are more frequent than others: the, and, tion, es, etc. Model these with single tokens,Build a,dictionary,of the byte strings you see; the second time you see a byte string, use the dictionary entry,22,Lempel-Ziv Compression,Start with a dictionary of 256 entries for the first 256 characters,At each step,Output the code of the longest dictionary match and delete those characters from input,Add previous token plus last letter as new dictionary entry with code 256, 257, 258, ,Note that code lengths grow by one bit as dictionary reaches size 512, 1024, 2048, etc.,23,Lempel-Ziv Example,ABRACADABRA,Output Add to Dictionary,1 (A),2 (B) AB,5 (R) BR,1 (A) RA,3 (C) AC,1 (A) CA,4 (D) AD,6 (AB) DA,8 (RA) ABR,Dictionary:,A,B,C,D,R,AB,BR,RA,AC,CA,AD,DA,ABR,24,Lempel-Ziv Variations,All compression algorithms like zip, gzip use variations on Lempel-Ziv,Possible variations:,Fixed-length vs. variable length codes or adaptive Huffman or arithmetic coding,Don,t add duplicate entries to the dictionary,Limit the number of codes or switch to larger ones as needed,Delete less frequent dictionary entries or give frequent entries shorter codes,25,How about this approach:,Repeat,for each letter pair occurring in the text, try:,replace the pair with a single new token,measure the total entropy (Huffman-compressed size) of the file,if that letter pair resulted in the greatest reduction in entropy so far, remember it,permanently substitute new token for the pair that caused the greatest reduction in entropy,until no more reductions in entropy are possible,Results: compression to about 25% for big books: better than gzip, zip. But not as good as bzip!,26,Compression other data,Modeling for audio?,Modeling for images?,27,Modeling for Images?,Wikipedia,28,JPEG, etc.,Modeling: convert to the frequency domain with DCT,Throw away some high-frequency components,Throw away imperceptible components,Quantize coefficients,Encode the remaining coefficients with Huffman coding,Results: up to 20-1 compressionwith good results, 100-1 with recognizable results,How the DCT changed the world,29,Data compression results,Best algorithms compress text to 25% of original size, but humans can compress to 10%,Humans have far better modeling algorithms because they have better pattern recognition and higher-level patterns to recognize,Intelligence pattern recognition data compression?,Going further:,Data-C,30,Ethical issues on algorithms,Back to an issue from the start of class: Can algorithms be unethical?,
展开阅读全文
相关资源
正为您匹配相似的精品文档
相关搜索

最新文档


当前位置:首页 > 图纸专区 > 大学资料


copyright@ 2023-2025  zhuangpeitu.com 装配图网版权所有   联系电话:18123376007

备案号:ICP2024067431-1 川公网安备51140202000466号


本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。装配图网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知装配图网,我们立即给予删除!