Bioinformatics: Pairwise Sequence Alignment (course slides)

Hello, everyone.

November 22, 2010
Pairwise sequence alignment
Jonathan Pevsner, Ph.D.
Bioinformatics
Johns Hopkins School Med.

Copyright notice
Many of the images in this PowerPoint presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright 2009 by John Wiley & Sons, Inc. These images and materials may not be used without permission from the publisher. We welcome instructors to use these PowerPoints for educational purposes, but please acknowledge the source. The book has a homepage at http://www.bioinfbook.org including hyperlinks to the book chapters.

Announcements
The Moodle quiz from lecture 1 is due one week later, by today at noon. After that the quiz "closes" and won't be available to you.
The quiz from today's lecture ("opens" at 10:30 am) is due one week later at noon. Because of the Thanksgiving break, I'm extending the deadline a day to Tuesday, November 30 (5:00 pm).

Outline: pairwise alignment
- Overview and examples
- Definitions: homologs, paralogs, orthologs
- Assigning scores to aligned amino acids: Dayhoff's PAM matrices
- Alignment algorithms: Needleman-Wunsch, Smith-Waterman

Learning objectives
- Define homologs, paralogs, orthologs
- Perform pairwise alignments (NCBI BLAST)
- Understand how scores are assigned to aligned amino acids using Dayhoff's PAM matrices
- Explain how the Needleman-Wunsch algorithm performs global pairwise alignments

Pairwise alignments in the 1950s
β-corticotropin (sheep)   ala gly glu asp asp glu
Corticotropin A (pig)     asp gly ala glu asp glu

Oxytocin      CYIQNCPLG
Vasopressin   CYFQNCPRG
Page 46

Early example of sequence alignment: globins (1961)
H.C. Watson and J.C. Kendrew, "Comparison Between the Amino-Acid Sequences of Sperm Whale Myoglobin and of Human Haemoglobin." Nature 190:670-672, 1961.
[Figure: alignment of myoglobin with α- and β-globins]

- It is used to decide if two proteins (or genes) are related structurally or functionally
- It is used to identify domains or motifs that are shared between proteins
- It is the basis of BLAST searching (next week)
- It is used in the analysis of genomes

Pairwise sequence
alignment is the most fundamental operation of bioinformatics
Page 47

Pairwise alignment: protein sequences can be more informative than DNA
- protein is more informative (20 vs 4 characters); many amino acids share related biophysical properties
- codons are degenerate: changes in the third position often do not alter the amino acid that is specified
- protein sequences offer a longer "look-back" time
- DNA sequences can be translated into protein, and then used in pairwise alignments
Page 54

Many times, DNA alignments are appropriate
- to confirm the identity of a cDNA
- to study noncoding regions of DNA
- to study DNA polymorphisms
- example: Neanderthal vs modern human DNA

Query: 181 catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac 240
Sbjct: 189 catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac 247

Definition: pairwise alignment
Pairwise alignment: the process of lining up two sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.
Page 53

Definition: homology
Homology: similarity attributed to descent from a common ancestor.
Page 49

[Figures: beta globin (NP_000509), structure 2HHB; myoglobin (NP_005359), structure 2MM1]
Page 49

Definitions: two types of homology
Orthologs: homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function.
Paralogs: homologous sequences within a single species that arose by gene duplication.
Page 49

Orthologs: members of a gene (protein) family in various organisms. This tree shows globin orthologs.
Page 51
You can view these sequences at www.bioinfbook.org (document 3.1)

Paralogs: members of a gene (protein) family within
a species. This tree shows human globin paralogs.
Page 52

Orthologs and paralogs are often viewed in a single tree
Source: NCBI

General approach to pairwise alignment
- Choose two sequences
- Select an algorithm that generates a score
- Allow gaps (insertions, deletions)
- Score reflects degree of similarity
- Alignments can be global or local
- Estimate probability that the alignment occurred by chance

Calculation of an alignment score
Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Alignment_Scores2.html

Find BLAST from the home page of NCBI and select protein BLAST
Page 52

Choose "align two or more sequences"
Enter the two sequences (as accession numbers or in the FASTA format) and click BLAST. Optionally select "Algorithm parameters" and note the matrix option.

Pairwise alignment result of human beta globin and myoglobin
- Myoglobin RefSeq
- Query = HBB, Subject = MB
- Middle row displays identities; + sign for similar matches
- Information about this alignment: score, expect value, identities, positives, gaps
Page 53

Pairwise alignment result of human beta globin and myoglobin: the score is a sum of match, mismatch, gap creation, and gap extension scores
- V matching V earns +4
- T matching L earns -1
These scores come from a "scoring matrix"!
Page 53

Definitions: identity, similarity, conservation
Identity: the extent to which two (nucleotide or amino acid) sequences are invariant.
Similarity: the extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation.
Conservation: changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.
Page 51

Pairwise alignment: the process of lining up two sequences to achieve maximal levels of identity (and conservation, for amino acid sequences) for the
purpose of assessing the degree of similarity and the possibility of homology.
Definition: pairwise alignment
Page 53

Mind the gaps
- First gap position scores -11
- Second gap position scores -1
- Gap creation tends to have a large negative score; gap extension involves a small penalty
Page 55

Gaps
- Positions at which a letter is paired with a null are called gaps.
- Gap scores are typically negative.
- Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap. Thus there are separate penalties for gap creation and gap extension.
- In BLAST, it is rarely necessary to change gap values from the default.

Pairwise alignment of retinol-binding protein and β-lactoglobulin: example of an alignment with internal, terminal gaps

  1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG  50 RBP
  1 .MKCLLLALALTCGAQALIVT.QTMKGLDIQKVAGTWYSLAMAASD.  44 lactoglobulin

 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD.VCADMVGTFTDTE  97 RBP
 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK  93 lactoglobulin

 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV.QYSC 136 RBP
 94 IPAVFKIDALNENKVL.VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin

137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI. 178 lactoglobulin

  1 .MKWVWALLLLA.AWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP  48
  1 MLRICVALCALATCWA.QDCQVSNIQVMQNFDRSRYTGRWYAVAKKDP  47

 49 EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED  98
 48 VGLFLLDNVVAQFSVDESGKMTATAHGRVIILNNWEMCANMFGTFEDTPD  97

 99 PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148
 98 PAKFKMRYWGAASYLQTGNDDHWVIDTDYDNYAIHYSCREVDLDGTCLDG 147

149 YSFVFSRDPNGLPPEAQKIVRQRQEELCLARQYRLIVHNGYCDGRSERNLL 199
148 YSFIFSRHPTGLRPEDQKIVTDKKKEICFLGKYRRVGHTGFCESS. 192

Pairwise alignment of retinol-binding protein from human (top) and rainbow trout (O. mykiss): example of an alignment with few
gaps

Pairwise sequence alignment allows us to look back billions of years ago (BYA)
[Timeline figure, axis 4 3 2 1 0 BYA. Labels: origin of life; origin of eukaryotes; insects; fungi/animal; plant/animal; earliest fossils; eukaryote/archaea]
Page 56
When you do a pairwise alignment of homologous human and plant proteins, you are studying sequences that last shared a common ancestor 1.5 billion years ago!

fly        GAKKVIISAP SAD.APM.F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA
human      GAKRVIISAP SAD.APM.F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA
plant      GAKKVIISAP SAD.APM.F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA
bacterium  GAKKVVMTGP SKDNTPM.F VKGANFDKY.AGQDIVSNAS CTTNCLAPLA
yeast      GAKKVVITAP SS.TAPM.F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA
archaeon   GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA

fly        KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST
human      KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST
plant      KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST
bacterium  KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST
yeast      KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST
archaeon   KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST

fly        GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK
human      GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV
plant      GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA
bacterium  GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA
yeast      GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV
archaeon   GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA

Multiple sequence alignment of glyceraldehyde 3-phosphate dehydrogenases: example of extremely high conservation
Page 57

Emile Zuckerkandl and Linus Pauling (1965) considered substitution frequencies in 18 globins (myoglobins and hemoglobins from human to lamprey).
Page 93
- Black: identity
- Gray: very conservative substitutions (40% occurrence)
- White: fairly
conservative substitutions (21% occurrence)
- Red: no substitutions observed
Example: lys found at 58% of arg sites
Page 93

Where we're heading: to a PAM250 log odds scoring matrix that assigns scores and is forgiving of mismatches (such as +17 for W to W or -5 for W to T)
Page 69
and to a whole series of scoring matrices such as PAM10 that are strict and do not tolerate mismatches (such as +13 for W to W or -19 for W to T)
Page 69

Dayhoff's 34 protein superfamilies

Protein                                   PAMs per 100 million years
Ig kappa chain                            37
Kappa casein                              33
luteinizing hormone β                     30
lactalbumin                               27
complement component 3                    27
epidermal growth factor                   26
proopiomelanocortin                       21
pancreatic ribonuclease                   21
haptoglobin alpha                         20
serum albumin                             19
phospholipase A2, group IB                19
prolactin                                 17
carbonic anhydrase C                      16
Hemoglobin α                              12
Hemoglobin β                              12
apolipoprotein A-II                       10
lysozyme                                  9.8
gastrin                                   9.8
myoglobin                                 8.9
nerve growth factor                       8.5
myelin basic protein                      7.4
thyroid stimulating hormone β             7.4
parathyroid hormone                       7.3
parvalbumin                               7.0
trypsin                                   5.9
insulin                                   4.4
calcitonin                                4.3
arginine vasopressin                      3.6
adenylate kinase 1                        3.2
triosephosphate isomerase 1               2.8
vasoactive intestinal peptide             2.6
glyceraldehyde phosph. dehydrogenase      2.2
cytochrome c                              2.2
collagen                                  1.7
troponin C, skeletal muscle               1.5
alpha crystallin B chain                  1.5
glucagon                                  1.2
glutamate dehydrogenase                   0.9
histone H2B, member Q                     0.9
ubiquitin                                 0
Page 59

Pairwise alignment of human (NP_005203) versus mouse (NP_031812) ubiquitin

Dayhoff's numbers of "accepted point mutations": what amino acid substitutions occur in proteins?
Page 61
Dayhoff (1978) p. 346.

Multiple sequence alignment of glyceraldehyde 3-phosphate dehydrogenases (the fly, human, plant, bacterium, yeast and archaeon alignment shown earlier): columns of residues may have high or low conservation
Page 57

The relative mutability of amino acids

Asn  134      His  66
Ser  120      Arg  65
Asp  106      Lys  56
Glu  102      Pro  56
Ala  100      Gly  49
Thr   97      Tyr  41
Ile   96      Phe  41
Met   94      Leu  40
Gln   93      Cys  20
Val   74      Trp  18
Page 63

Normalized frequencies of amino acids

Gly  8.9%     Arg  4.1%
Ala  8.7%     Asn  4.0%
Leu  8.5%     Phe  4.0%
Lys  8.1%     Gln  3.8%
Ser  7.0%     Ile  3.7%
Val  6.5%     His  3.4%
Thr  5.8%     Cys  3.3%
Pro  5.1%     Tyr  3.0%
Glu  5.0%     Met  1.5%
Asp  4.7%     Trp  1.0%

blue = 6 codons; red = 1 codon
These frequencies fi sum to 1
Page 63

Dayhoff's PAM1 mutation probability matrix
Each element of the matrix shows the probability that an original amino acid (top) will be replaced by another amino acid (side)
Page 66

A substitution matrix contains values
proportional to the probability that amino acid i mutates into amino acid j for all pairs of amino acids.
Substitution matrices are constructed by assembling a large and diverse sample of verified pairwise alignments (or multiple sequence alignments) of amino acids.
Substitution matrices should reflect the true probabilities of mutations occurring through a period of evolution.
The two major types of substitution matrices are PAM and BLOSUM.

PAM matrices: point-accepted mutations
- PAM matrices are based on global alignments of closely related proteins.
- The PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence. At an evolutionary interval of PAM1, one change has occurred over a length of 100 amino acids.
- Other PAM matrices are extrapolated from PAM1. For PAM250, 250 changes have occurred for two proteins over a length of 100 amino acids.
- All the PAM data come from closely related proteins (>85% amino acid identity).
Page 63

Dayhoff's PAM1 mutation probability matrix
Page 66

Dayhoff's PAM0 mutation probability matrix: the rules for extremely slowly evolving proteins
Top: original amino acid; side: replacement amino acid
Page 68

Dayhoff's PAM2000 mutation probability matrix: the rules for very distantly related proteins

PAM   A Ala   R Arg   N Asn   D Asp   C Cys   Q Gln   E Glu   G Gly
A     8.7%    8.7%    8.7%    8.7%    8.7%    8.7%    8.7%    8.7%
R     4.1%    4.1%    4.1%    4.1%    4.1%    4.1%    4.1%    4.1%
N     4.0%    4.0%    4.0%    4.0%    4.0%    4.0%    4.0%    4.0%
D     4.7%    4.7%    4.7%    4.7%    4.7%    4.7%    4.7%    4.7%
C     3.3%    3.3%    3.3%    3.3%    3.3%    3.3%    3.3%    3.3%
Q     3.8%    3.8%    3.8%    3.8%    3.8%    3.8%    3.8%    3.8%
E     5.0%    5.0%    5.0%    5.0%    5.0%    5.0%    5.0%    5.0%
G     8.9%    8.9%    8.9%    8.9%    8.9%    8.9%    8.9%    8.9%

Top: original amino acid; side: replacement amino acid
Page 68

PAM250 mutation probability matrix
Top: original amino acid; side: replacement amino acid
Page 68

PAM250 log odds scoring matrix
Page 69

Why do we go from a mutation probability matrix to a log odds matrix?
We want a scoring matrix so that when we do a pairwise alignment (or a BLAST search) we know what score to assign to two aligned amino acid residues.
Logarithms are easier to use for a scoring
system. They allow us to sum the scores of aligned residues (rather than having to multiply them).
Page 69

How do we go from a mutation probability matrix to a log odds matrix?
The cells in a log odds matrix consist of an "odds ratio":
    (the probability that an alignment is authentic) / (the probability that the alignment was random)
The score S for an alignment of residues a, b is given by:
    S(a,b) = 10 log10(Mab / pb)
where Mab is the mutation probability matrix entry for residues a and b, and pb is the normalized frequency of residue b.
As an example, for tryptophan,
    S(trp,trp) = 10 log10(0.55 / 0.010) = 17.4
Page 69

What do the numbers mean in a log odds matrix?
- A score of +2 indicates that the amino acid replacement occurs 1.6 times as frequently as expected by chance.
- A score of 0 is neutral.
- A score of -10 indicates that the correspondence of two amino acids in an alignment that accurately represents homology (evolutionary descent) is one tenth as frequent as the chance alignment of these amino acids.
Page 58

[Figure labels: rat versus mouse globin (more conserved); rat versus bacterial globin (less conserved)]

BLOSUM matrices
- BLOSUM matrices are based on local alignments.
- BLOSUM stands for blocks substitution matrix.
- BLOSUM62 is a matrix calculated from comparisons of sequences that share less than 62% identity; sequences sharing 62% or more identity are collapsed into a single cluster.
Page 70

[Figure: BLOSUM matrices. Axis: percent amino acid identity (100, 62, 30). Panels for BLOSUM62, BLOSUM30 and BLOSUM80, each marking the identity range that is collapsed.]

Blosum62 scoring matrix
Page 73

[Figure: percent identity versus evolutionary distance in PAMs, with the "twilight zone" marked]
Two randomly diverging protein sequences change in a negatively exponential fashion
Page 74

[Figure: percent identity versus differences per 100 residues, with the "twilight zone" marked]
- At PAM1, two proteins are 99% identical
- At PAM10.7, there are 10 differences per 100 residues
- At PAM80, there are 50 differences per 100 residues
- At PAM250, there are 80 differences per 100 residues
Page 75

PAM: "accepted point mutation"
Two proteins with 50% identity may have 80 changes per 100 residues. (Why? Because any residue can be subject to back mutations.)
Proteins with 20% to 25% identity are in the "twilight zone" and may be statistically
significantly related.
PAM or "accepted point mutation" refers to the "hits" or matches between two sequences (Dayhoff & Eck, 1968)
Page 75

Two kinds of sequence alignment: global and local
We will first consider the global alignment algorithm of Needleman and Wunsch (1970). We will then explore the local alignment algorithm of Smith and Waterman (1981). Finally, we will consider BLAST, a heuristic version of Smith-Waterman. We will cover BLAST in detail on Monday.
Page 76

Global alignment with the algorithm of Needleman and Wunsch (1970)
- Two sequences can be compared in a matrix along x- and y-axes.
- If they are identical, a path along a diagonal can be drawn.
- Find the optimal subpaths, and add them up to achieve the best score. This involves
  - adding gaps when needed
  - allowing for conservative substitutions
  - choosing a scoring system (simple or complicated)
- N-W is guaranteed to find optimal alignment(s)
Page 76

Three steps to global alignment with the Needleman-Wunsch algorithm
1. set up a matrix
2. score the matrix
3. identify the optimal alignment(s)
Page 76

Four possible outcomes in aligning two sequences
1. identity (stay along a diagonal)
2. mismatch (stay along a diagonal)
3. gap in one sequence (move vertically!)
4. gap in the other sequence (move horizontally!)
Page 77

Start Needleman-Wunsch with an identity matrix
Page 77

Fill in the matrix using "dynamic programming"
Page 78

Traceback to find the optimal (best) pairwise alignment
Page 79

N-W is guaranteed to find optimal alignments, although the algorith…
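
The log odds formula from the scoring-matrix slides, S(a,b) = 10 log10(Mab / pb), can be checked with a few lines of Python. This is only a sketch: the function names `log_odds_score` and `odds_ratio` are made up for illustration, and the only numbers used are the ones quoted on the slides (0.55 and 0.010 for tryptophan, and the scores +2 and -10).

```python
import math


def log_odds_score(m_ab: float, p_b: float) -> float:
    """Return S(a,b) = 10 * log10(M_ab / p_b).

    m_ab: mutation probability matrix entry for residues a and b
    p_b:  normalized frequency of residue b
    """
    return 10 * math.log10(m_ab / p_b)


def odds_ratio(score: float) -> float:
    """Invert a log odds score back into an odds ratio (observed / chance)."""
    return 10 ** (score / 10)


# Tryptophan example from the slide: M = 0.55, p = 0.010
trp_score = log_odds_score(0.55, 0.010)
print(round(trp_score, 1))        # 17.4

# A score of +2 is about 1.6 times as frequent as chance
print(round(odds_ratio(2), 1))    # 1.6

# A score of -10 is one tenth as frequent as chance
print(round(odds_ratio(-10), 2))  # 0.1
```

Because the scores are logarithms, the scores of the individual aligned positions can simply be added to score a whole alignment, which is the point made on the "Why do we go from a mutation probability matrix to a log odds matrix?" slide.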
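
The slides state that an alignment score is a sum of match, mismatch, gap creation and gap extension scores, with the first gap position scoring -11 and each following position -1. A minimal sketch of that bookkeeping is shown here. The function `alignment_score` and the toy table `toy_scores` are hypothetical; the table holds only the two pair scores quoted on the slides (V/V = +4, T/L = -1), whereas a real run would use a full scoring matrix such as BLOSUM62. The two aligned rows are invented for the example.

```python
def alignment_score(row_a, row_b, pair_scores, gap_first=-11, gap_next=-1):
    """Score two already-aligned rows of equal length ('-' marks a gap).

    gap_first is charged for the first position of a gap and gap_next for
    every further position of the same gap.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    prev_gap_in = None  # row that held the gap in the previous column
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if x == "-" or y == "-":
            gap_in = "a" if x == "-" else "b"
            total += gap_next if gap_in == prev_gap_in else gap_first
            prev_gap_in = gap_in
        else:
            # Look the pair up in either order (the matrix is symmetric)
            key = (x, y) if (x, y) in pair_scores else (y, x)
            total += pair_scores[key]
            prev_gap_in = None
    return total


# Toy table holding only the two pair scores quoted on the slides
toy_scores = {("V", "V"): 4, ("T", "L"): -1}

# Columns: V/V (+4), T/L (-1), gap (-11), gap (-1), V/V (+4)
print(alignment_score("VT--V", "VLTTV", toy_scores))  # -5
```

Charging the first gap position much more than the later ones reflects the slide's remark that the presence of a gap matters more than its length.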
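
The three Needleman-Wunsch steps listed above (set up a matrix, score the matrix, trace back) can be sketched as follows. This is a simplified illustration under stated assumptions, not the lecture's implementation: the scoring scheme (match +1, mismatch -1, a flat gap penalty of -2 per position) is hypothetical, whereas the slides score residues with a substitution matrix and use separate gap creation and extension penalties. The function name `needleman_wunsch` is likewise chosen only for this example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def pair(i, j):
        # Score for aligning a[i-1] with b[j-1] (identity or mismatch)
        return match if a[i - 1] == b[j - 1] else mismatch

    # Step 1: set up the matrix; the first row and column hold leading gaps
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Step 2: score the matrix by dynamic programming
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # stay along the diagonal
                score[i - 1][j] + gap,             # move vertically (gap in b)
                score[i][j - 1] + gap,             # move horizontally (gap in a)
            )

    # Step 3: trace back from the bottom-right cell to recover one optimal path
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# Oxytocin versus vasopressin, the 1950s example from the opening slides
best, row_a, row_b = needleman_wunsch("CYIQNCPLG", "CYFQNCPRG")
print(best)   # 5  (7 identities, 2 mismatches, no gaps)
print(row_a)
print(row_b)
```

When several paths tie, this traceback reports only one of them, preferring the diagonal move; the slides' "optimal alignment(s)" is a reminder that more than one optimal alignment can exist.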