Linguistics 187: Grammar Engineering
Ron Kaplan, Tracy King, Martin Forst

Administrivia
- Schedule: Office hours
- Requirements
- Overview

Applications of Language Engineering
[Figure: chart of language-engineering applications. Axis labels: Functionality, Domain Coverage. Scale labels: Low, High, Narrow, Broad, Shallow, Deep, Synthesis. Plotted items: Semantic Search (Powerset, Hakia); Keyword Search (Google, Yahoo, Microsoft Live); Post-Search Sifting; Autonomous Knowledge Filtering; Natural Dialogue; Microsoft Paperclip; Manually-tagged Keyword Search; Document Base Management; Restricted Dialogue; Useful Summary; Good Translation.]

Grammar engineering for deep processing
- Draws on theoretical linguistics, software engineering
- Theoretical linguistics = papers
  - Generalizations, universality, idealization (competence)
- Software engineering = programs
  - Coverage, interface, QA, maintainability, efficiency, practicality
- Grammar engineering
  - Grammar : Theory = Program : Programming language
  - Reflect linguistic generalizations
  - Respect special cases of ordinary language
  - Deal with large-scale interactions
  - Theory/practice trade-offs

What is a shallow grammar?
- often trained automatically from marked up corpora
- part of speech tagging
- chunking
- trees

POS tagging and Chunking
- Part of speech tagging:
    I/PRP saw/VBD her/PRP duck/VB ./PUNCT
    I/PRP saw/VBD her/PRP$ duck/NN ./PUNCT
- Chunking:
  - general chunking: "I begin with an intuition: when I read a sentence, I read it a chunk at a time." (Abney)
  - NP chunking: [NP President Obama] visited [NP the Hermitage] in [NP Leningrad]

Treebank grammars
- Phrase structure tree (c-structure)
- Annotations for heads, grammatical functions
- Collins parser output

Deep grammars
- Provide detailed syntactic/semantic analyses
  - LFG (ParGram), HPSG (LinGO, Matrix)
  - Grammatical functions, tense, number, etc.
      Mary wants to leave.
      subj(want1, Mary3)
      comp(want1, leave2)
      subj(leave2, Mary3)
      tense(want1, present)
- Usually manually constructed
  - linguistically motivated rules

Why would you want one?
- Meaning sensitive applications
  - overkill for many NLP applications, crucial for others
- Applications which use shallow methods for English may not work for free word order languages
  - can read many functions off of trees in English
      SUBJ: NP sister to VP         [S [NP Mary] [VP left]]
      OBJ: first NP sister to V     [S [NP Mary] [VP saw [NP John]]]
  - need other information in German, Japanese, etc.

Deep analysis matters if you care about the answer
- Example: A delegation led by Vice President Philips, head of the chemical division, flew to Chicago a week after the incident.
- Question: Who flew to Chicago?
- Candidate answers:
  - division: closest noun
  - head: next closest
  - V.P. Philips: next
  - (the three above: shallow but wrong)
  - delegation: furthest away, but subject of "flew" (deep and right)

Search: Keywords to natural language
- Suppose you want to know who Obama criticized.
- With shallow keyword search engines:
  - Keywords: "Obama criticized"
  - Simple to use, but
    - Precision errors: "Hillary and John criticized Barack" is interesting (maybe) but irrelevant
    - Recall errors: What about denounce, condemn?
  - Advanced search: More expressive, but complex and unused
      ("Obama (criticize OR condemn OR )
  - Compensate with web graph and other ranking features

Who did Obama criticize?
- Who did Obama criticize?
- Who criticized Obama?

Semantic search (Powerset)
- Parses each sentence on the page
- Extracts entities & semantic relationships
- Identifies and expands to similar entities, relationships & abstractions
- Indexes multiple facts for each sentence
[Figure: analysis of "Sir Edward Heath died from pneumonia." Node labels: Sir Edward Heath (name), Sir Edward Heath (noun), die (verb), pneumonia (noun). Relation labels: subj, from, by. Expansion labels: UK Prime Minister, politician, disease, killed.]

Mapping Queries to Content
- Queries:
  - Edward Heath's death
  - death of Edward Heath
  - disease that killed Edward Heath
  - diseases that killed politicians
  - politicians who died from disease
  - politicians that died from pneumonia
  - politicians killed by pneumonia
  - who died from pneumonia
  - what politicians died from disease
  - which politician died from pneumonia
  - what disease did Edward Heath die from
  - what killed Sir Edward Heath
  - what was Sir Edward Heath killed by
- Content: Sir Edward Heath died from pneumonia at 19:30 on 17 July 2005

[Figure: search architecture. Labels: Content Acquisition; Open-text content; XLE parse; Semantic map; Content Semantics; Indexing; Large-scale semantic index; User search; NL questions; XLE parse; Semantic map; Question Semantics; Query; Retrieval; Ranking; Result presentation; Knowledge Resources; LFG Grammar; Acquisition: manual + ML. Example question: "Who did IBM acquire in the last 10 years?" Example content: "IBM purchased Lotus in 1998." Example results: Doc1: IBM purchased Lotus in 1998. Doc2: List of IBM purchases.]

Traditional Problems
- Time consuming and expensive to write
- Robustness
  - want output for any input: real-world applications
- Ambiguity
- Efficiency
- Interfaces to other application components

Why deep analysis is difficult
- Languages are hard to describe
  - Meaning depends on complex properties of words and sequences
  - Different languages rely on different properties
  - Errors and disfluencies
- Languages are hard to compute
  - Expensive to recognize complex patterns
  - Sentences are ambiguous
  - Ambiguities multiply: explosion in time and space

How to overcome this
- Engineer the deep grammars
  - theoretical vs. practical
  - what is good enough
- Integrate shallow techniques into deep grammars
- Experience based on broad-coverage LFG grammars (ParGram project)

Robustness: Sources of Brittleness
- missing vocabulary
  - you can't list all the proper names in the world
- missing constructions
  - there are many constructions theoretical linguistics rarely considers (e.g. dates, company names)
  - easy to miss even core constructions
- ungrammatical input
  - real world text is not always perfect
  - sometimes it is really horrendous

Real world Input
- Other weak blue-chip issues included Chevron, which went down 2 to 64 7/8 in Big Board composite trading of 1.3 million shares; Goodyear Tire & Rubber, off 1 1/2 to 46 3/4, and American Express, down 3/4 to 37 1/4. (WSJ, section 13)
- The croakers done gone from the hook (WSJ, section 13)
- (SOLUTION 27000 20) Without tag P-248 the W7F3 fuse is located in the rear of the machine by the charge power supply (PL3 C14 item 15. (Eureka copier repair tip)

Missing vocabulary
- Build vocabulary based on the input of shallow methods
  - fast
  - extensive
  - accurate
- Finite-state morphologies
- Part of Speech Taggers

LFG and XLE: This course
- LFG: a theory of grammar
- XLE: a parsing/generation engine for LFG grammars

Different patterns code same meaning
- Sentence: The small children are chasing the dog.
- English: Group, order
    the (Det), small (Adj), children (N), are (Aux), chasing (V), the (Det), dog (N)
- Japanese: Group, mark
    tiisai (small, Adj), kodomotati (children, N) + ga (Sbj, P), inu (dog, N) + o (Obj, P), oikaketeiru (are chasing, V)
- Warlpiri: Mark only
    witajarra-rlu (small-Sbj), kurdujarra-rlu (children-Sbj), mali-ki (dog-Obj), wajilipinyi (chase, V), kapala (Present, Aux)
- Shared meaning: Chase(small(children), dog)
- Shared f-structure:
    [ Pred   chase
      Tense  Present
      Subj   [ Pred children, Mod small ]
      Obj    [ Pred dog ] ]
- LFG theory: minor adjustments on universal theme

LFG architecture
- C-structures and f-structures in piecewise correspondence.
- C-structure: formal encoding of order and grouping
    [S [NP John] [VP [V likes] [NP Mary]]]
- F-structure: formal encoding of grammatical relations
    [ PRED   like
      TENSE  PRESENT
      SUBJ   [ PRED John, NUM SG ]
      OBJ    [ PRED Mary, NUM SG ] ]
- The correspondence f maps c-structure nodes to f-structure units.
- Modularity

LFG grammar
- Rules
    S  →  NP            VP
          (↑ SUBJ)=↓    ↑=↓
    VP →  V             NP
          ↑=↓           (↑ OBJ)=↓
- Lexical entries
    John:  NP  (↑ PRED)=John  (↑ NUM)=SG
    likes: V   (↑ PRED)=like  (↑ SUBJ NUM)=SG  (↑ SUBJ PERS)=3
- Context-free rules define valid c-structures (trees).
- Annotations are instantiated at tree nodes to give equational constraints that corresponding f-structures must satisfy.
- Satisfiability of constraints determines grammaticality.
- F-structure is solution for equations (if satisfied).

Rules as well-formedness conditions
    S  →  NP            VP
          (↑ SUBJ)=↓    ↑=↓
- A tree containing S over NP - VP is OK if
  - F-unit corresponding to NP node is SUBJ of f-unit corresponding to S node
  - The same f-unit corresponds to both S and VP nodes.

Inconsistent equations = Ungrammatical
- What's wrong with "They walks"?
    S  →  NP            VP
          (↑ SUBJ)=↓    ↑=↓
    they:  (↑ NUM)=PL
    walks: (↑ SUBJ NUM)=SG
- Let f be the f-structure of the sentence
- Let s be the f-structure of the Subject
- Let v be the f-structure of the verb
- Then (substituting equals for equals):
    (f SUBJ) = s and (s NUM)=PL        ⇒  (f SUBJ NUM)=PL
    f = v and (v SUBJ NUM)=SG          ⇒  (f SUBJ NUM)=SG
    (f SUBJ NUM)=PL and (f SUBJ NUM)=SG  ⇒  SG=PL  ⇒  FALSE
- If a valid inference chain yields FALSE, the premises are unsatisfiable.

Pargram project
- Large-scale LFG grammars for several languages
  - English, German, Japanese (Korean), French, Norwegian, Chinese, Turkish, Arabic, Hungarian
  - Cover real uses of language: newspapers, documents, etc.
- Parallelism: test LFG universality claims
  - Common c- to f-structure mapping conventions (unless typologically motivated variation)
  - Invariant underlying f-structures
  - Permits shared disambiguation properties, Glue interpretation premises
- All grammars run on PARC software (XLE)
- International consortium of linguists
  - PARC, Stuttgart, Fuji Xerox, Konstanz, Bergen, Sabanci, Oxford, Oman
  - Sustained effort: full-week meetings twice a year, 10 years!
  - Contributions to linguistics and computational linguistics: books and papers
  - Each group is self-funded, self-managed

History
- Started in 1994
  - English (PARC)
  - French (XRCE, now PARC)
  - German (IMS-Stuttgart)
- Biannual meetings
  - Alternating between Palo Alto and Europe/Japan
- 1998: Japanese started (Fuji Xerox)
- 1999: Norwegian started (University of Bergen)
- 2000: Urdu (Konstanz)
- 2002: Danish started (Copenhagen)
- 2003: Korean (PARC) porting experiment
- 2004: Welsh, Malagasy (Essex, Oxford), Chinese (PARC)
- 2005: Arabic (Oman), Turkish (Sabanci), Hungarian

Goals
- Practical
  - Create a capability/platform for NL applications
    - translation, information retrieval, ...
  - Develop discipline of grammar engineering
    - what tools, techniques, conventions make it easy to develop and maintain broad-coverage grammars?
    - how long does it take?
    - how much does it cost?
- Theoretical
  - Refine and guide LFG theory through broad coverage of multiple languages
  - Refine and guide the algorithms and implementation (XLE)

Parallel f-structures (where possible)
but different c-structures

Pargram grammars
                German    English*   French    Japanese (Korean)
    #Rules      251       388        180       56
    #States     3,239     13,655     1,747     368
    #Disjuncts  13,294    55,725     12,188    2,012
    * English allows for shallow markup: labeled bracketing, named-entities

Engineering results
- Grammars and Lexicons
- Parallel f-structures across languages
- Grammar writer's cookbook
- New practical formal devices
  - Complex categories for efficiency
    - NPnom vs. NP: (↑ CASE) = NOM
  - Optimality marks for robustness
    - enlarge grammar without being overrun by peculiar analyses
  - Lexical priority: merging different lexicons

11/19/2024

Theoretical results
- New theory of agreement features
- Separate representation of morphosyntactic features
- Phonology-syntax interface
- New analysis of nonconstituent coordination
- Distribution instead of generalization over sets

XLE Demo
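
Worked sketch: inconsistent equations
The "They walks" argument in the slides (two constraints forcing different values onto SUBJ NUM) can be imitated with a few lines of Python. This is only a minimal sketch and not how XLE works: f-structures are modeled as nested dictionaries, the rule annotations of S → NP VP are hard-coded in a hypothetical analyze function, and the PRED values for "they" and "walks" are assumptions added for illustration.

```python
class Inconsistent(Exception):
    """Raised when two constraints give one attribute different atomic values."""


def unify(a, b):
    """Merge two f-structures (nested dicts whose leaves are atomic strings)."""
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for attr, value in b.items():
            # Shared attributes must themselves unify; new ones are just added.
            merged[attr] = unify(merged[attr], value) if attr in merged else value
        return merged
    if a == b:
        return a
    raise Inconsistent(f"{a!r} != {b!r}")


def analyze(subject_entry, verb_entry):
    """Apply the annotations of S -> NP VP.

    (^ SUBJ)=v on NP puts the subject's f-structure under SUBJ;
    ^=v on VP identifies the verb's f-structure with the sentence's.
    """
    return unify({"SUBJ": subject_entry}, verb_entry)


# Lexical constraints from the slides; the PRED values for "they" and
# "walks" are illustrative assumptions.
THEY = {"PRED": "they", "NUM": "PL"}
JOHN = {"PRED": "John", "NUM": "SG"}
WALKS = {"PRED": "walk", "SUBJ": {"NUM": "SG"}}

if __name__ == "__main__":
    print(analyze(JOHN, WALKS))
    try:
        analyze(THEY, WALKS)
    except Inconsistent as clash:
        print("ungrammatical:", clash)
```

The clash is found at SUBJ NUM, the same place where the slide's inference chain derives SG=PL. A real LFG solver also handles reentrancy, sets, and disjunction, which this sketch ignores.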
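
Worked sketch: indexing facts with roles
The semantic search slides describe indexing several facts per sentence and expanding entities to abstractions, so that "which politician died from pneumonia" can match "Sir Edward Heath died from pneumonia". The toy index below illustrates that idea under heavy assumptions: it is not Powerset's implementation, the fact is entered by hand rather than produced by a parser, and the names ABSTRACTIONS, index_fact, and search are all hypothetical.

```python
from itertools import product

# Hypothetical hand-written abstraction table, following the slide's example
# (Edward Heath is a politician, pneumonia is a disease).
ABSTRACTIONS = {
    "Edward Heath": ["politician"],
    "pneumonia": ["disease"],
}


def expand(term):
    """Return the term itself plus any abstractions listed for it."""
    return [term] + ABSTRACTIONS.get(term, [])


def index_fact(index, doc_id, pred, subj, from_arg):
    """Store one key per combination of expanded subject and 'from' argument."""
    for s, f in product(expand(subj), expand(from_arg)):
        index.setdefault((pred, s, f), set()).add(doc_id)


def search(index, pred, subj=None, from_arg=None):
    """Return matching document ids; None is a wildcard (the who/what slot)."""
    hits = set()
    for (p, s, f), docs in index.items():
        if p == pred and subj in (None, s) and from_arg in (None, f):
            hits |= docs
    return hits


if __name__ == "__main__":
    idx = {}
    index_fact(idx, "doc1", "die", "Edward Heath", "pneumonia")
    # "which politician died from pneumonia"
    print(search(idx, "die", subj="politician", from_arg="pneumonia"))
    # "who died from disease"
    print(search(idx, "die", from_arg="disease"))
    # Roles matter: pneumonia is not the subject of "die" in the indexed fact.
    print(search(idx, "die", subj="pneumonia"))
```

Because the keys record which role each argument fills, the sketch keeps apart queries that a keyword engine would conflate, as in the contrast between "Who did Obama criticize?" and "Who criticized Obama?".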