Data Mining in Bioinformatics

Data Mining in Bioinformatics
Peter Bajcsy, PhD
Automated Learning Group
National Center for Supercomputing Applications
University of Illinois
pbajcsy@ncsa.uiuc.edu
January 31, 2002

Outline
- Introduction
- Overview of Microarray Problem
- Image Analysis
- Data Mining
- Validation
- Summary

Introduction: Recommended Literature
1. Bioinformatics: The Machine Learning Approach, by P. Baldi & S. Brunak, 2nd edition, The MIT Press, 2001
2. Data Mining: Concepts and Techniques, by J. Han & M. Kamber, Morgan Kaufmann Publishers, 2001
3. Pattern Classification, by R. Duda, P. Hart and D. Stork, 2nd edition, John Wiley & Sons, 2001

Introduction: Microarray Problem in Bioinformatics Domain
Problems in Bioinformatics Domain
- Data production at the levels of molecules, cells, organs, organisms, populations
- Integration of structure and function data, gene expression data, pathway data, phenotypic and clinical data
- Prediction of molecular function and structure
- Computational biology: synthesis (simulations) and analysis (machine learning)

Microarray Problem: Major Objective
Major Objective: Discover a comprehensive theory of life's organization at the molecular level
- The major actors of molecular biology: the nucleic acids, DeoxyriboNucleic Acid (DNA) and RiboNucleic Acid (RNA)
- The central dogma of molecular biology
- Proteins are very complicated molecules built from 20 different amino acids.

Input and Output of Microarray Data Analysis
Input: Laser image scans (data) and underlying experiment hypotheses or experiment designs (prior knowledge)
Output:
- Conclusions about the input hypotheses or knowledge about statistical behavior of measurements
- The theory of biological systems learnt automatically from data (machine learning perspective)
- Model fitting, inference process

Overview of Microarray Problem
Diagram components: Data Mining, Microarray Experiment, Image Analysis, Biology Application Domain, Experiment Design and Hypothesis, Data Analysis, Artificial Intelligence (AI), Knowledge Discovery in Databases (KDD), Data Warehouse, Validation

Artificial Intelligence (AI) Community
Issues:
- Prior knowledge (e.g., invariance)
- Model deviation from true model
- Sampling distributions
- Computational complexity
- Model complexity (overfitting)
Design Cycle of Predictive Modeling: Collect Data, Choose Features, Choose Model, Train Classifier, Evaluate Classifier

Knowledge Discovery in Databases (KDD) Community
Diagram label: Database

GeneFilter Comparison Report
GeneFilter 1 Name: O2#1 8-20-99adjfinal
GeneFilter 1 Name: N2#1finaladj

Intensities (raw and normalized):
ORF NAME  GENE NAME  CHRM  F  G  R     RAW GF1  RAW GF2  NORMALIZED GF1  NORMALIZED GF2  DIFFERENCE  RATIO
YAL001C   TFC3       1     1  A  1  2  12.03    7.38     403.83          209.79          194.04      1.92
YBL080C   PET112     2     1  A  1  3  53.21    35.62    1,786.11        1,013.13        772.98      1.76
YBR154C   RPB5       2     1  A  1  4  79.26    78.51    2,660.73        2,232.86        427.87      1.19
YCL044C              3     1  A  1  5  53.22    44.66    1,786.53        1,270.12        516.41      1.41

Data Mining and Image Analysis Steps
Image Analysis
- Normalization
- Grid alignment
- Feature construction (selection and extraction)
Data Mining
- Statistics
- Machine learning
- Pattern recognition
- Database techniques
- Optimization techniques
- Visualization
- Prior knowledge
Validation
- Issues
- Cross validation techniques
Figure: the GeneFilter Comparison Report from the previous slide, followed by "?"

IMAGE ANALYSIS

Image Analysis: Normalization
Figure panels: Red Band, Green Band; dynamic range of red band, dynamic range of green band
Solution: Reference points with reference values
Reference spots: Beta Actin, PKG, HPRT, Beta 2 microglobulin, Rubisco, AB binding protein, Major latex protein homologue (MSG)
Cattle and Soy Controls: Array of cattle and soy spiking controls. 50 µg of cattle brain total RNA was labeled with Cy3 (green). 1 µl each of in vitro transcribed soy Rubisco (5 ng), AB binding protein (0.5 ng) and MSG (0.05 ng) were labeled with Cy5. The two labeled samples were cohybridized on superamine slides (Telechem, Inc.). To the right of each set of spots are five negative controls (water).

Image Analysis: Grid Alignment
Solution: Manual, semi-automatic and fully automatic alignment based on fiducials and/or global grid fitting.

Image Analysis: Feature Selection
Features: mean, median, standard deviation, ratios
Area: Sensitive to background noise

Image Analysis: Feature Extraction
Area is determined by image thresholding and used during feature extraction.
Figure labels: Dist: 2004, Box: 902, Plane: 26321102

DATA MINING

Why Data Mining? Sequence Example
Biology: Language and Goals
- A gene can be defined as a region of DNA.
- A genome is one haploid set of chromosomes with the genes they contain.
- Perform competent comparison of gene sequences across species and account for inherently noisy biological sequences due to random variability amplified by evolution.
- Assumption: if a gene has high similarity to another gene, then they perform the same function.
Analysis: Language and Goals
- A feature is an extractable attribute or measurement (e.g., gene expression, location).
- Pattern recognition tries to characterize data patterns (e.g., similar gene expressions, equidistant gene locations).
- Data mining is about uncovering patterns, anomalies and statistically significant structures in data (e.g., find two similar gene expressions with confidence x).

Data Mining Techniques
Data mining techniques draw from:
- Statistics
- Machine learning
- Database techniques
- Pattern recognition
- Optimization techniques
- Visualization

Statistics
- Descriptive statistics: describe data
- Inductive statistics: make forecasts and inferences (e.g., are two sample sets identically distributed?)

Machine Learning
- Supervised: examples
- Unsupervised: "natural groupings"
- Reinforced

Pattern Recognition
- Linear correlation and regression
- Neural networks: NN representation and gradient based optimization; NN representation and genetic algorithm based optimization
- Statistical models
- Decision trees
- Locally weighted learning
- k-nearest neighbors, support vectors

Database Techniques
- Database design and modeling (tables, procedures, functions, constraints)
- Database interface to data mining system
- Efficient import and export of data
- Database data visualization
- Database clustering for access efficiency
- Database performance tuning (memory usage, query encoding)
- Database parallel processing (multiple servers and CPUs)
- Distributed information repositories (data warehouse)

Optimization Techniques
Highly nonlinear search space (global versus local maxima)
- Gradient based optimization
- Genetic algorithm based optimization
- Optimization with sampling
Large search space
- Example: A genome with N genes can encode 2^N states (each gene is either active or inactive; regulated states are not considered). Human genome: 2^30,000 patterns; nematode genome: 2^20,000 patterns.

Visualization
Data: 3D cubes, distribution charts, curves, surfaces, link graphs, image frames and movies, parallel coordinates
Results: pie charts, scatter plots, box plots, association rules, parallel coordinates, dendrograms, temporal evolution
Figure panels: Pie chart, Parallel coordinates, Temporal evolution

Prior Knowledge from Experiment Design
Complexity Levels of Microarray Experiments:
1. Compare a single gene in a control situation versus a treatment situation
   Example: Is the level of expression (up-regulated or down-regulated) significantly different in the two situations? (drug design application)
   Methods: t-test, Bayesian approach
2. Find multiple genes that share common functionalities
   Example: Which related genes are dependent?
   Methods: Clustering (hierarchical, k-means, self-organizing maps, neural network, support vector machines)
3. Infer the underlying gene and protein networks that are responsible for the patterns and functional pathways observed
   Example: What is the gene regulation at system level?
   Directions: mining regulatory regions, modeling regulatory networks on a global scale
Goal of Future Experiment Designs: Understand biology at the system level, e.g., gene networks, protein networks, signaling networks, metabolic networks, immune system and neuronal networks.

Types of Expected Data Mining and Analysis Results
Hypothetical Examples:
- Binary answers using tests of hypotheses: Drug treatment is successful with a confidence level x.
- Statistical behavior (probability distribution functions): A class of genes with functionality X follows a Poisson distribution.
- Expected events: As the amount of treatment increases, the gene expression level decreases.
- Relationships: Expression level of gene A is correlated with expression level of gene B under varying treatment conditions (gene A and B are part of the same pathway).
- Decision trees: Classification of a new gene sequence by a "domain expert".

VALIDATION

Why Validation?
Validation type:
- Within the existing data
- With newly collected data
Errors and uncertainties:
- Systematic or random errors
- Unknown variables: number of classes
- Noise level: statistical confidence due to noise
- Model validity: error measure, model over-fit or under-fit
- Number of data points: measurement replicas
Other issues:
- Experimental support of general theories
- Exhaustive sampling is not feasible

Cross Validation: Example
One-tier cross validation
- Train on different data than test data.
Two-tier cross validation
- The score from one-tier cross validation is used by the bias optimizer to select the best learning algorithm parameters (# of control points). The more you optimize, the more you over-fit.
- The second tier measures the level of over-fit (unbiased measure of accuracy).
- Useful for comparing learning algorithms with control parameters that are optimized.
- Number of folds is not optimized.
Computational complexity: # folds of top tier x # folds of bottom tier x # control points x CPU of algorithm

Summary
Microarray problem
- Computational biology
- Major objective of microarray technology
- Input and output of data analysis
Data mining and image analysis steps
- Image normalization, grid alignment, feature construction
- Data mining techniques
- Prior knowledge
- Expected results of data mining
Validation
- Issues
- Cross validation techniques
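Supplementary sketch for the Cross Validation slide: the two-tier scheme can be illustrated in a few lines of Python. This is a minimal sketch under stated assumptions, not the method used in the talk: the learner (a 1-D k-nearest-neighbor classifier), the synthetic data, and every function name below are hypothetical choices made for illustration, with k playing the role of the "control point" being optimized.

```python
import random

def k_folds(items, n_folds):
    """Split items into n_folds disjoint folds (round-robin assignment)."""
    return [items[i::n_folds] for i in range(n_folds)]

def knn_predict(train, x, k):
    """Majority label (0 or 1) among the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def accuracy(train, test, k):
    """Fraction of test points whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

def one_tier_cv(data, k, n_folds):
    """One tier: each fold is tested by a model trained on the other folds."""
    folds = k_folds(data, n_folds)
    scores = []
    for i, test in enumerate(folds):
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        scores.append(accuracy(train, test, k))
    return sum(scores) / len(scores)

def two_tier_cv(data, control_points, top_folds, bottom_folds):
    """Two tiers: the bottom tier selects k on training data only,
    the top tier scores that choice on data never used for selection."""
    folds = k_folds(data, top_folds)
    scores = []
    for i, test in enumerate(folds):
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        best_k = max(control_points,
                     key=lambda k: one_tier_cv(train, k, bottom_folds))
        scores.append(accuracy(train, test, best_k))
    return sum(scores) / len(scores)

# Synthetic two-class data (hypothetical): class 1 tends to have larger values.
rng = random.Random(0)
data = ([(rng.gauss(0.0, 1.0), 0) for _ in range(60)]
        + [(rng.gauss(2.0, 1.0), 1) for _ in range(60)])
rng.shuffle(data)

estimate = two_tier_cv(data, control_points=[1, 3, 5, 7],
                       top_folds=5, bottom_folds=4)
print(f"two-tier accuracy estimate: {estimate:.2f}")
```

With 5 top-tier folds, 4 bottom-tier folds and 4 control points, the parameter search trains and scores the learner 5 x 4 x 4 = 80 times, which matches the complexity product given on the slide. The top-tier test fold is never seen while k is being chosen, which is why its score can serve as a measure of over-fit.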
