Methodologies for Evaluating Dialog Structure Annotation

Ananlada Chotimongkol
Presented at the Dialogs on Dialogs Reading Group, 27 January 2006

Slide 2: Dialog structure annotation evaluation
- How good is the annotated dialog structure?
- Evaluation methodologies:
  - Qualitative evaluation (humans rate how good it is)
  - Comparison against a gold standard (usually created by a human)
  - Evaluation of the end product (task-based evaluation)
  - Evaluation of the principles used
  - Inter-annotator agreement (comparing subjective judgments when there is no single correct answer)

Slide 3: Choosing evaluation methodologies
- The choice depends on the kind of information being annotated:
  - Categorical annotation, e.g. dialog acts
  - Boundary annotation, e.g. discourse segments
  - Structural annotation, e.g. rhetorical structure

Slide 4: Categorical annotation evaluation
- Cochran's Q test: tests whether the number of coders assigning the same label at each position is randomly distributed; does not directly give the degree of agreement
- Percentage of agreement: measures how often the coders agree; does not account for agreement by chance
- Kappa coefficient [Carletta, 1996]: measures pairwise agreement among coders, correcting for expected chance agreement

Slide 5: Kappa statistic
- The kappa coefficient (K) measures pairwise agreement among coders on categorical judgments: K = (P(A) - P(E)) / (1 - P(E)) (see the sketch below)
  - P(A) is the proportion of times the coders agree
  - P(E) is the proportion of times they are expected to agree by chance
- K > 0.8 indicates substantial agreement; 0.67 < K < 0.8 indicates moderate agreement
- It is difficult to calculate chance expected agreement in some cases
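To make the kappa computation above concrete, the following is a minimal sketch, not taken from the original slides. The function name, the dialog-act labels, and the toy data are invented for illustration, and the chance-agreement estimate is the common two-coder (Cohen-style) one, which differs slightly from the multi-coder formulation recommended in [Carletta, 1996].

```python
from collections import Counter

def pairwise_kappa(labels_a, labels_b):
    """Two-coder kappa: K = (P(A) - P(E)) / (1 - P(E)).

    P(A) is the observed proportion of agreement; P(E) is the agreement
    expected by chance, estimated from each coder's label frequencies.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)

    # P(A): proportion of items on which the two coders agree.
    p_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # P(E): chance agreement from the marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)

    return (p_a - p_e) / (1 - p_e)

# Invented toy example: two coders assigning dialog-act labels.
coder1 = ["question", "answer", "answer", "backchannel", "question"]
coder2 = ["question", "answer", "statement", "backchannel", "question"]
print(pairwise_kappa(coder1, coder2))  # ~0.72: moderate agreement by the slide's thresholds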
Slide 6: Boundary annotation evaluation
- Use the kappa coefficient
- Do not compare the segments directly; instead compare the decisions about placing each boundary
  - At each eligible point, make a binary decision whether to annotate it as "boundary" or "non-boundary"
- However, the kappa coefficient does not accommodate near-miss boundaries; either
  - redefine the matching criterion, e.g. also count a near-miss as a match, or
  - use other metrics, e.g. probabilistic error metrics

Slide 7: Probabilistic error metrics (see the sketch below)
- Pk [Beeferman et al., 1999]
  - Measures how likely two time points are to be classified into different segments
  - A small Pk means a high degree of agreement
- WindowDiff (WD) [Pevzner and Hearst, 2002]
  - Measures the number of intervening topic breaks between time points
  - Penalizes differences in the number of segment boundaries between two time points
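As a rough illustration of the two metrics above, here is a sketch of Pk and WindowDiff computed over 0/1 boundary sequences. It is not from the original slides: the toy segmentations are invented, and the window size k follows the common convention of half the average reference segment length; implementations in the literature differ in such details.

```python
def pk(ref_bounds, hyp_bounds, k=None):
    """Pk [Beeferman et al., 1999]: probability that two positions k apart
    are classified inconsistently (same segment in one segmentation,
    different segments in the other).  ref_bounds / hyp_bounds are 0/1
    lists; 1 marks a boundary after that position.  Lower is better."""
    n = len(ref_bounds)
    if k is None:
        # Conventional choice: half the average reference segment length.
        k = max(1, round(n / (sum(ref_bounds) + 1) / 2))
    errors = 0
    for i in range(n - k):
        same_ref = sum(ref_bounds[i:i + k]) == 0
        same_hyp = sum(hyp_bounds[i:i + k]) == 0
        errors += same_ref != same_hyp
    return errors / (n - k)


def windowdiff(ref_bounds, hyp_bounds, k=None):
    """WindowDiff [Pevzner and Hearst, 2002]: penalizes any window in which
    the two segmentations contain a different *number* of boundaries, so
    near-misses and false alarms are treated more evenly than in Pk."""
    n = len(ref_bounds)
    if k is None:
        k = max(1, round(n / (sum(ref_bounds) + 1) / 2))
    errors = 0
    for i in range(n - k):
        errors += sum(ref_bounds[i:i + k]) != sum(hyp_bounds[i:i + k])
    return errors / (n - k)


# Invented toy segmentations of a 12-utterance dialog (1 = boundary).
ref = [0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0]
hyp = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0]  # two near-miss boundaries
print(pk(ref, hyp), windowdiff(ref, hyp))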
Slide 8: Structural annotation evaluation
- Cascaded approach
  - Evaluate one level at a time
  - Evaluate the annotation of the higher level only if the annotation of the lower level is agreed
  - Example: nested game annotation in Map Task [Carletta et al., 1997]
- Redefine the matching criteria for structural annotation [Flammia and Zue, 1995]
  - Segment A matches segment B if A contains B
  - Segment A in annotation i matches segments in annotation j if the segments in annotation j exclude segment A
  - The agreement criterion is not symmetric
- Flatten the hierarchical structure
  - Flatten the hierarchy into overlapping spans
  - Compute agreement on the spans or on the spans' labels
  - Example: RST annotation [Marcu et al., 1999]

Slide 9: Form-based dialog structure
- Describes a dialog structure using a task structure: a hierarchical organization of domain information
  - Task: a subset of dialogs that has a specific goal
  - Sub-task: a decomposition of a task; corresponds to one action (the process that uses related pieces of information together to create a new piece of information or a new dialog state)
  - Concept: a word or a group of words that captures information necessary for performing an action
- The task structure is domain-dependent

Slide 10: An example of form-based structure annotation
- [Annotated dialog excerpt; the task, sub-task, and concept tags did not survive extraction]

Slide 11: Annotation experiment
- Goal: to verify that the form-based dialog structure can be understood and applied by other annotators
- The subjects were asked to identify the task structure of the dialogs in two domains:
  - Air travel planning domain
  - Map reading domain
- A different set of labels is needed for each domain
  - Equivalent to designing domain-specific labels from the definition of the dialog structure components

Slide 12: Annotation procedure
1. The subjects study an annotation guideline: the definition of the task structure, plus examples from other domains (bus schedule and UAV flight simulation)
2. For each domain, the subjects study the transcriptions of 2-3 dialogs
3. They create a set of labels for annotating the task structure
4. They annotate the given dialogs with the set of labels designed in the previous step

Slide 13: Issues in task structure annotation evaluation
- There is more than one acceptable annotation
  - Similar to MT evaluation, but it is difficult to obtain multiple references
- The tag sets used by two annotators may not be the same
  - e.g. the same phrase ("two thirty") may be given different concept labels by different annotators
- It is difficult to define matching criteria
  - Mapping equivalent labels between two tag sets is subjective (and may not be possible)

Slide 14: Cross-annotator correction
- Ask a different annotator (the 2nd annotator) to judge the annotation and correct the parts that do not conform to the guideline
- If the 2nd annotator agrees with the 1st one, he makes no correction
- The 2nd annotator's own annotation may still be different, because there can be more than one annotation that conforms to the rules

Slide 15: Cross-annotator correction (2)
- Pros
  - Easier to evaluate the agreement, since the annotations are based on the same tag set
  - Allows more than one acceptable annotation
- Cons
  - Needs another annotator and takes time
  - Introduces another subjective judgment
  - Requires measuring the amount of change made by the 2nd annotator

Slide 16: Cross-annotators
- Who should the 2nd annotator be?
  - Another subject who also did the annotation: biased toward his own annotation?
  - Another subject who studied the guideline but did not do his/her own annotation: may not think about the structure thoroughly
  - An expert: can also measure annotation accuracy using an expert annotation as a reference

Slide 17: How to quantify the amount of correction
- Edit distance from the original annotation
  - For structural annotation, the edit operations have to be redefined
  - A lower number means higher agreement, but which range of values is acceptable?
- Inter-annotator agreement
  - Structural annotation evaluation can be applied
  - The agreement number is meaningful and can be compared across different domains

Slide 18: Cross-annotation agreement
- Use an approach similar to [Marcu et al., 1999] (see the sketch at the end of this document)
  - Flatten the hierarchy into overlapping spans
  - Compute agreement on the labels of the spans (task, sub-task, and concept labels)
- Issues
  - There are many possible spans with no label (especially for concept annotation)
  - How should P(E) be calculated when new concepts are added?

Slide 19: Objective annotation evaluation
- Makes the result more comparable to other work
- Easier to evaluate; no 2nd annotator is needed
- Label-insensitive: three labels (task, sub-task, concept)
  - The level of sub-tasks may also be considered
- Kappa becomes artificially high
- Add a qualitative analysis of what the annotators do not agree on

Slide 20: References
- J. Carletta, "Assessing agreement on classification tasks: the kappa statistic," Computational Linguistics, vol. 22, pp. 249-254, 1996.
- D. Beeferman, A. Berger, and J. Lafferty, "Statistical models for text segmentation," Machine Learning, vol. 34, pp. 177-210, 1999.
- L. Pevzner and M. A. Hearst, "A critique and improvement of an evaluation metric for text segmentation," Computational Linguistics, vol. 28, pp. 19-36, 2002.
- J. Carletta, S. Isard, G. Doherty-Sneddon, A. Isard, J. C. Kowtko, and A. H. Anderson, "The reliability of a dialogue structure coding scheme," Computational Linguistics, vol. 23, pp. 13-31, 1997.
- G. Flammia and V. Zue, "Empirical evaluation of human performance and agreement in parsing discourse constituents in spoken dialogue," in Proceedings of Eurospeech 1995, Madrid, Spain, 1995.
- D. Marcu, E. Amorrortu, and M. Romera, "Experiments in constructing a corpus of discourse trees," in Proceedings of the ACL Workshop on Standards and Tools for Discourse Tagging, College Park, MD, 1999.

Slide 21: Matching criteria
- Exact match (pairwise)
- Partial match (pairwise)
- Agree with the majority (pool of coders)
- Agree with the consensus (pool of coders)
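Finally, to illustrate the span-flattening idea referred to on the "Structural annotation evaluation" and "Cross-annotation agreement" slides, here is a minimal sketch under invented assumptions: annotations are represented as (label, start, end, children) trees, flattened into labelled spans, and agreement is computed as a simple proportion over the spans either annotator used, treating an unmarked span as "none". The helper names and the toy task/sub-task/concept labels are hypothetical, and a chance-corrected figure such as kappa over all possible spans is not attempted here; handling the many unlabelled spans in P(E) is exactly the issue noted on the cross-annotation agreement slide.

```python
def flatten(node, spans=None):
    """Collect (start, end) -> label for a node and all of its descendants.
    A node is a tuple (label, start, end, children)."""
    if spans is None:
        spans = {}
    label, start, end, children = node
    spans[(start, end)] = label
    for child in children:
        flatten(child, spans)
    return spans

def span_agreement(tree_a, tree_b):
    """Proportion of candidate spans on which two annotators agree.
    Candidate spans are all spans used by either annotator; a span the
    other annotator did not mark counts as the label 'none'."""
    spans_a = flatten(tree_a)
    spans_b = flatten(tree_b)
    candidates = set(spans_a) | set(spans_b)
    agree = sum(spans_a.get(s, "none") == spans_b.get(s, "none")
                for s in candidates)
    return agree / len(candidates)

# Invented toy annotations of a 10-utterance dialog.
annot_1 = ("task:query_flight", 0, 9,
           [("subtask:give_itinerary", 0, 4, [("concept:date", 2, 2, [])]),
            ("subtask:select_flight", 5, 9, [])])
annot_2 = ("task:query_flight", 0, 9,
           [("subtask:give_itinerary", 0, 5, [("concept:date", 2, 2, [])]),
            ("subtask:select_flight", 6, 9, [])])
print(span_agreement(annot_1, annot_2))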