BioInformatics (3)

Computational Issues
- Data Warehousing: Organising Biological Information into a Structured Entity (World's Largest Distributed DB)
- Function Analysis (Numerical Analysis):
  - Gene Expression Analysis: Applying sophisticated data mining/visualisation to understand gene activities within an environment (Clustering)
  - Integrated Genomic Study: Relating structural analysis with functional analysis
- Structure Analysis (Symbolic Analysis):
  - Sequence Alignment: Analysing a sequence using comparative methods against existing databases to develop hypotheses concerning relatives (genetics) and functions (Dynamic Programming and HMM)
  - Structure prediction: from a sequence of a protein to predict its 3D structure (Inductive LP)

Data Warehousing: Mapping Biologic into Data Logic

Structure Analysis: Alignments & Scores
- Global (e.g., haplotype)
    ACCACACA
    ::xx::x:
    ACACCATA
    Score = 5(+1) + 3(-1) = 2
- Suffix (shotgun assembly)
    ACCACACA
         :::
         ACACCATA
    Score = 3(+1) = 3
- Local (motif)
    ACCACACA
    ACACCATA
    Score = 4(+1) = 4

A comparison of the homology search and the motif search for functional interpretation of sequence information.
[Figure labels]
- Homology Search: New sequence; Retrieval; Similar sequence; Expert knowledge; Sequence interpretation
- Motif Search: Sequence database (Primary data); Knowledge acquisition; Motif library (Empirical rules); Expert knowledge; New sequence; Inference; Sequence interpretation

Search and learning problems in sequence analysis

(Whole genome) Gene Expression Analysis
- Quantitative Analysis of Gene Activities (Transcription Profiles)

Gene Expression
[Figure labels]
- Biotinylated RNA from experiment
- GeneChip expression analysis probe array
- Image of hybridized probe array
- Each probe cell contains millions of copies of a specific oligonucleotide probe
- Streptavidin-phycoerythrin conjugate

(Sub)cellular inhomogeneity (see figure)
- Cell-cycle differences in expression.
- XIST RNA localized on inactive X-chromosome

Cluster Analysis
- Protein/protein complex
- Genes
- DNA regulatory elements

Functional Analysis via Gene Expression
- Pairwise Measures
- Clustering
- Motif Searching/.

Clustering Algorithms
A clustering algorithm attempts to find natural groups of components (or data) based on some similarity. The clustering algorithm also finds the centroid of each group of data. To determine cluster membership, most algorithms evaluate the distance between a point and the cluster centroids. The output from a clustering algorithm is basically a statistical description of the cluster centroids, with the number of components in each cluster.

Clusters of Two-Dimensional Data

Key Terms in Cluster Analysis
- Distance & Similarity measures
- Hierarchical & non-hierarchical
- Single/complete/average linkage
- Dendrograms & ordering

Distance Measures: Minkowski Metric

Most Common Minkowski Metrics

An Example
[Figure labels: 4, 3, x, y]
- Manhattan distance is called Hamming distance when all features are binary.
- Gene Expression Levels Under 17 Conditions (1-High, 0-Low)

Similarity Measures: Correlation Coefficient
[Figure: three plots of Expression Level against Time for Gene A and Gene B]

Distance-based Clustering
- Assign a distance measure between data
- Find a partition such that:
  - Distance between objects within partition (i.e. same cluster) is minimized
  - Distance between objects from different clusters is maximised
- Issues:
  - Requires defining a distance (similarity) measure in situations where it is unclear how to assign it
  - What relative weighting to give to one attribute vs another?
  - Number of possible partitions is super-exponential

Normalized Expression Data
[Figure label: hierarchical & non-]

Hierarchical Clustering Techniques

Hierarchical Clustering
Given a set of N items to be clustered, and an NxN distance (or similarity) matrix, the basic process of hierarchical clustering is this:
1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one less cluster.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.

The distance between two clusters is defined as the distance between
- Single-Link Method / Nearest Neighbor
- Complete-Link / Furthest Neighbor
- Their Centroids.
- Average of all cross-cluster pairs.

Computing Distances
single-link clustering (also called the connectedness or minimum method): we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, we consider the similarity between one cluster and another cluster to be e