hg19 (GRCh37) vs. hg38 (GRCh38)
Human Genome Reference Comparison
Zuotian Tatum
Department of Human Genetics, Leiden University Medical Center

Timeline
GRCh37:
- First release: Feb 27, 2009
- Latest patch: Jun 28, 2013 (p13)
GRCh38:
- First release: Dec 24, 2013
- Latest patch: Oct 14, 2014 (p1)
Source: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/data/

Content
                              GRCh37.p13       GRCh38.p2
Total bases                   3.23 billion     3.21 billion
Total bases (without N)       2.99 billion     3.05 billion
N50                           46 million       67 million
Number of alternative loci    9                261
Non-nuclear genome            No               Yes
Source: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/data/

UCSC tracks for GRCh38
- UCSC RefSeq available since April 2014.
- Ensembl regulatory build available since September 2014.
- dbSNP 141 available since October 2014.
- ENCODE and FANTOM5 track hubs are still not available (Nov 2014).

New in GRCh38 release
Three new sequence files, in addition to the standard assembly files:
- GCA_000001405.15_GRCh38_top-level.fna.gz
- GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
- GCA_000001405.15_GRCh38_full_analysis_set.fna.gz
The analysis set files are created to avoid false mapping in NGS alignment pipelines.

GCA_000001405.15_GRCh38_top-level.fna.gz
All the top-level objects in the full assembly:
- Chromosomes
- Unlocalized scaffolds
- Unplaced scaffolds
- Alternate locus scaffolds
- Mitochondrial genome
The sequence identifiers are International Nucleotide Sequence Database Collaboration (INSDC) accession.versions, and the definition lines are GenBank style.
No sequences have been hard-masked.

GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
- Chromosomes from the GRCh38 Primary Assembly unit.
  Note: the two PAR regions on chrY have been hard-masked with Ns. The chromosome Y sequence provided therefore has the same coordinates as the GenBank sequence, but it is not identical to the GenBank sequence. Similarly, duplicate copies of centromeric arrays and WGS on chromosomes 5, 14, 19, 21 and 22 have been hard-masked with Ns.
- Mitochondrial genome from the GRCh38 non-nuclear assembly unit.
- Unlocalized scaffolds from the GRCh38 Primary Assembly unit.
- Unplaced scaffolds from the GRCh38 Primary Assembly unit.
- Epstein-Barr virus (EBV) sequence.
  Note: the EBV sequence is not part of the genome assembly, but it is included in the analysis set as a sink for alignment of reads that are often present in sequencing samples.

GCA_000001405.15_GRCh38_full_analysis_set.fna.gz
= GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
+ alt-scaffolds from the GRCh38 ALT_REF_LOCI_* assembly units

Alt-loci add complexity to RNASeq quantification
[Figure: ideogram of GRCh38.p2]

RNASeq quantification
- Fragments (reads) per kilobase per million (FPKM/RPKM) values are used to quantify gene expression.
- Unique mapping only: analysis tools do not distinguish allelic duplication from paralogous duplication.
- Non-overlapping gene regions.

To understand the effect of alt-loci on RNASeq quantification
Compare alignment of the chromosome 6 MHC region between:
- hg19 full set with 7 alt-loci
- hg38 analysis set without alt-loci
Sequence content is largely unchanged between hg19 and hg38.

Mapping/alignment for RNASeq
                       hg19          hg38
mapped                 14,655,299    14,704,427
mappedDiffChr          4,959         4,017
mappedPairProper       14,639,261    14,690,090
mappedPairProperPct    92.62         92.94
total                  15,805,561    15,805,561
totalSplice            5,060,829     5,078,133
unmapped               1,150,262     1,101,134
hg19: with alt loci. hg38: without alt loci.

Effect of alt loci in RNASeq alignments
[Figure: gene RPKM (hg38) and the distribution of RPKM difference]

Major Histocompatibility complex region on chromosome 6
[Figure: HLA-A, sample D1, aligned to the hg19 full set: chr6, chr6_mann_hap4, chr6_qb1_hap6, chr6_dbb_hap3]
[Figure: HLA-A, samples D1, D2, D3: hg19 full set chr6 vs. hg38 analysis set]
[Figure: HLA-C, samples D1, D2, D3: hg19 full set vs. hg38 analysis set]
[Figure: HLA-DRA, samples D1, D2, D3: hg19 full set vs. hg38 analysis set]

Major Histocompatibility complex region on chromosome 6: Class III
MHC Class III:
- 700 kb stretch, 60 genes.
- The most gene-dense region of the human genome.
- 14% coding, 72% transcribed.
- Highly conserved.
- Only a few have clearly defined and proven function.
[Figure: TNF, samples D1.control and D1.treated: hg19 full set chr6 vs. hg38 analysis set chr6]

Highly variant immune regions retiled
LILRA3 moved to alt-loci in hg38:
- hg19: LILRB2, LILRA3, LILRA5
- hg38: LILRB2, LILRA5
[Figure: phantom LILRA3; LILRA3 in hg19 corresponds to an intergenic region. Other labels: LILRB3, LILRA4, LILRB5]

Gene length calculation
We need gene length for calculating RPKM.
- If alignment uses alt loci: RPKM would be artificially lowered for alt loci genes.
- If alignment does not use alt loci: remove alt loci annotations from the official set.

Need more comprehensive approach to genome variation.
- Assembly model is neither haploid nor diploid.
- Analysis tools penalize reads mapping to more than 1 location and do not distinguish allelic duplication from paralogous duplication.
- A graph structure is a
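
The gene length calculation slide says RPKM depends on gene length, and that RPKM is artificially lowered for alt loci genes when alignment uses alt loci. A minimal sketch of that arithmetic follows. It uses the usual RPKM definition (reads per kilobase of gene per million mapped reads). The function name `rpkm` and every number in the usage lines are hypothetical illustrations, not values from the slides.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 2,000 bp long, in a library of 10 million mapped reads.
GENE_LENGTH_BP = 2_000
TOTAL_MAPPED = 10_000_000

# Without alt loci: assume 1,000 reads map uniquely to the gene.
without_alt = rpkm(1_000, GENE_LENGTH_BP, TOTAL_MAPPED)

# With alt loci: assume 400 of those reads also match an alt scaffold,
# so a unique-mapping-only pipeline discards them (hypothetical count).
with_alt = rpkm(600, GENE_LENGTH_BP, TOTAL_MAPPED)

print(f"RPKM without alt loci: {without_alt:.1f}")
print(f"RPKM with alt loci:    {with_alt:.1f}")
```

The gene length and library size are identical in both calls, so the drop comes only from reads lost to multi-mapping. That is the mechanism the slide describes for alt loci genes.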