外文翻译--Prediction of Protein Functions from Protein-Protein Interaction Data Based on a New Measure of Network Betweenness

资源描述

Prediction of Protein Functions from Protein-ProteinInteraction Data Based on a New Measure ofNetwork BetweennessNaifang Su,Lin Wang,Yufu Wang,Minping Qianand Minghua DengSchool of Mathematical Sciences,Peking University,Beijing,P.R.ChinaCenter for Theoretical Biology,Peking University,Beijing,P.R.ChinaEmail:AbstractAssigning functions to proteins that have not beenannotated is an important problem in the post-genomic era.Meanwhile,the availability of data on protein-protein interac-tions provides a new way to predict protein functions.Previously,several computational methods have been developed to solve thisproblem.Among them,Deng et al.developed a method based onthe Markov random field(MRF).Lee et al.extended it to thekernel logistic regression model(KLR)based on the diffusionkernel.These two methods were tested on yeast benchmark data,and the results demonstrated that both MRF and KLR had highprecision in function prediction.On that basis,inspired by theidea of a Markov cluster algorithm,we defined a new measure ofnetwork betweenness,and developed a betweenness-based logisticregression model(BLR).Applying it to predict protein functionson the yeast benchmark data,we found that BLR outperformedboth the KLR and the MRF models.It is evidently that BLR isa more proper and efficient approach of function prediction.I.INTRODUCTIONUnderstanding gene function is the main problem of molec-ular biology.However,the functions of most proteins areunknown.Even for the most studied model organism,Sac-charomyces cerevisiae,about one third of proteins are stillun-annotated.With the advances in proteomics,huge amountsof data continue to be accumulated.Among the databases,protein-protein interaction(PPI)is an important source;almostall proteins play their roles by interacting with other proteins.Several algorithms have been developed to predict proteinfunctions,on the basic assumption that proteins with similarfunctions are more likely to interact.Among them,Deng pro-posed the Markov random field(MRF)model,which predictsprotein functions based on the annotated proteins and thestructure of the PPI network 12.It has been proved that theprediction of MRF have high accuracy.Further more,Lee et al.developed a diffusion kernel-based logistic regression(KLR)model,which exploited diffusion kernel to better representthe global structure of the network 3.It has been shown tohave higher accuracy than MRF.Meanwhile,Markov ClusterAlgorithm(MCL)is an effective cluster algorithm 45.Ithas been successfully applied in protein family classification.Motivated by these previous studies,we combined theadvantages of MCL and KLR and defined a new measure ofbetweenness,which can better describe the relationship be-tween proteins in the network.Based on this new betweenness,we developed a betweenness-based logistic regression model(BLR)model to predict protein functions.II.MATERIALS ANDMETHODSA.MaterialsThe data used for function prediction were from the follow-ing two datasets.The first was DIP,a PPI database6.For S.cerevisiae,DIP gave 6456 physical interactions of 2808 proteins.Afterremoving the self interactions,we constructed a graph with2808 nodes and 6212 edges.The second was MIPS,which used a biological approachto give function annotations of the proteins in yeast7.After primary analysis,we found that,for the 2808 proteinsin our network,2099 were annotated in MIPS,and theybelonged to 12 function categories.B.AnnotationsFor completeness,we first formulate the problem math-ematically,and then we introduce our method.Supposethere are M different functions F1,.,FM,and N proteinsP1,.,PN,among which P1,.,Pnare unknown proteinswhile Pn+1,.,PNare annotated.The purpose is to predictfunctions for P1,.,Pnbased on protein-protein interactiondata as well as the annotations of Pn+1,.,PN.This isa multiple labeling problem.Let X=(X1,.,XN)bethe configuration of the functional labeling.In the multiplefunction schemes,Xitakes value of m if Pihas annota-tion of Fm,(i=n+1,.,N).For the single functionschemes,focusing on the specific function F,Xitakes valueof 1 or 0 if Pihas such a specific annotation or not;hereXj(j=1,.,n)are unknown,the purpose is to estimateP(Xj=1)(j=1,.n).We solve the problem using graphic methods,the protein-protein interactions data of these proteins are easy to transformto a graph G,in which the vertices represent proteins and theedges describe the interactions between proteins.The adjacentmatrix H,an N N symmetry matrix with Hij=1 if thereexists an interaction between the i-th and the j-th node(whichis the protein in our problem),Hij=0 otherwise(i,j=1,.,N).978-1-4244-4713-8/10/$25.00 2010 IEEEC.A new measure of network betweennessWe define a new measure of network betweenness fromdiffusion kernel and Markov cluster algorithms.So we give abrief introduction of these methods.The diffusion kernel(DK)8 is widely used in graphanalysis,which is defined asK=eL(1)where L=D H,D is a diagonal matrix with diagonalelement Dii=?jHij,and is a parameter.The elementsof K can be interpreted as the transition probabilities of a lazyrandom walk and reflects its laziness,thus K represents theglobal relationship between the nodes in the graph8910.The computation of DK is very complicated and often obtainedfrom the Taylor expansionK=eL=?k=0kk!Lk(2)and approximated by limited terms.However,since Lkmayno longer be sparse during production,the convergence rateis quite low.MCL is a clustering algorithm based on the flow in thegraph45.The key features of MCL are the following twooperators:Inflation:r:M?rM,here(rM)ij=Mrijk?i=1MijrExpansion:expe:H?He(generally,e=2)The expansion operator is used to obtain the e-th power of theadjacent matrix,which can be interpreted as the e-th transitionprobability matrix,while the inflation operator is used tofurther inflate and distribute the probabilities,by increasingthose in the dense area and decreasing those in the sparsearea.The utilization of the inflation operator can improvethe convergence rate,eliminate some useless information,and reduce the influence of noise.The final matrix in MCLis a transition probability matrix,which can be interpretedas clusters or a global correlation matrix of the graph.Asthe flow of the graph is redistributed through inflation andexpansion,the final matrix can reflect the intrinsic structureof the network.Combining the computation method of DK with that ofMCL,we define a new matrix,K=1+f(1)(H)1!+f(2)(H)22!+.+f(n)(H)nn!+.(3)Where f(n)(A)is the extended inflation operator on matrixA:(f(n)A)ij=Af(n)ijN?j=1Af(n)ij(4)The final betweenness metric B is obtained by followingnormalization,Bij=Kij?j?=iKij,i?=j,(5a)0,i=j.(5b)First we use the extended inflation operator f(n)to in-crease strong interactions and decrease weak interactions foreach n,which leads to noise reduction and an increasedconvergence rate.Generally,f(n)1 and is an increasingfunction of n,which can be selected according to the real data.We assumed that f(n)has the form as naor logbn,a,b aretwo parameters learnt from the data,such as a=0.5 or b=5.Furthermore,the diagonal elements in the final betweennessmatrix are set to be zero because we ignore self transitionprobabilities.It should be noted that the computation of B isquite easy.Since we can truncated the computation ofK to asummation of limited term,and the convergence rate is in anexponential fashion.The final betweenness matrix can be explained as a transi-tion probability matrix.From the above analysis,we see thatthis definition inherits the advantages of both MCL and DK.At the same time,the application of f(n)makes the definitionvery flexible,and reflect the inherent structure of the networkmuch better.D.BLR(single function scheme)Similar to KLR developed by Lee et al.3,we developa logistic regression model based on the new betweenness(BLR),the key formula is,logP(Xi=1|Xi,)1 P(Xi=1|Xi,)=+B(i)0+B(i)1(6)where B(i)0=?i?=j,Xj=0Bij,B(i)1=?i?=j,Xj=1BijHere,are parameters,which can be estimated throughthe logistic regression based on the PPI network of theannotated proteins.Then we used ICM(Iterative ConditionalMode)algorithm 11,to obtain the prediction probability.It should be noted that such a model is a natural extensionof MRF.Replacing B with H in equation(6),we obtain theMRF model of Deng et al.1.Replacing B with the diffusionkernel K,we obtain the KLR of Lee et al.3.As the adjacentmatrix H can only reflect the direct interactions in the network,while DK,as well as our new measure of the betweenness,are able to characterize the integral connections between thenodes.E.BLR(Correlated function scheme)Similar to Lee et al.3,we can extend the BLR model byconsidering the correlations of functions,i.e,logP(Xi=Fk|Xi,)P(Xi=FM|Xi,)=k+M?l=1klB(i)l(7)where B(i)l=?i?=j,Xj=FlBij,l=1,.,KAfter parameter estimation,we use equation(7)directly topredict protein functions,rather than the ICM algorithm,toreduce the computational complexity.Meanwhile,to reducethe number of parameters in above equation,for each functionwe consider the five most related functions,where the relationbetween the functions is measured by the chi-square test3,since the other functions have little contribution for predict thefunction.F.Prediction accuracy measuresWe utilize the leave-one-out cross-validation to comparethe accuracy of the function prediction of the different meth-ods.For correlated function prediction,as the leave-one-outmethod is time consuming,the five-fold cross-validation isused instead.In addition,we use the ROC(receiver operatingcharacteristic)curve and the AUC(area under ROC Curve)score to evaluate the results of cross-validation.III.RESULTSA.cross validation-single functionFirst we predicted a single function,and leave-one-out crossvalidation was used to compare BLR with KLR and MRF.The ROC curve and the AUC score were used to measurethe prediction precision of different methods.In the KLR andBLR methods,the parameters and f(n)were determined bythe AUC scores.For different ,the results did not have distinct differences,while the AUC scores were a little larger when =0.4,and itis robust for KLR,so we chose =0.4 in both KLR and BLR.For the parameter f(n)in BLR,here we tried several parame-ters,such as f(n)=log2n,log3n,log4n,log5n,log6n,.andf(n)=n0.3,n0.4,n0.5,n0.6.,finally f(n)=log5n wastaken according to the optimization of the AUC scores.A comparison of the three methods is shown in figure 1.We found that,in almost all function categories the AUCscores in KLR were markedly higher than in MRF.This isconsistent with the results in Lee et al.3.In most functioncategories the AUC scores in BLR were higher than in KLR,which demonstrated that BLR had a higher accuracy.B.cross validation-correlated functionsThen we took into account the correlations between func-tions and approached multi-function prediction.First we applied the chi-square test to identify the fivefunctions most correlated to each function.Then we appliedFig.1.Comparison of AUC scores in MRF,KLR and BLR:Note that in8 categories the AUC scores in BLR were higher than in KLR(0.006 higheron average),and in 3 categories the AUC scores in the two methods werenearly the same(differences 0.005),while only in one category were theAUC scores in KLR higher(by 0.024).Fig.2.AUC score of correlated function prediction of KLR and BLR:Among the 12 function categories,the AUC scores in BLR were higher inten,and four of these(number 4,7,9,12)were markedly higher.the extended method to predict functions.Here,because leave-one-out was too time-consuming we used five-fold cross-validation to compare the precision of different methods.Weshowed that BLR was more precise than KLR in most functioncategories(Figure 2).The ROC curve of all twelve categories also demonstratedthat BLR performed better(Figure 3).All the above results showed that the BLR was moreaccurate than the other methods.C.Function predictionsThen we applied KLR and BLR to predict 709 unknownproteins in yeast.It should be noted that the parameters in themodels were trained through the known proteins which aresimilar to the unkown ones.Here we used the 2099 knownproteins for training and chose =0.4 and f(n)=log5n.Treated each function individually,we chose the two functionpredictions with the highest probabilities as the prediction.Among the 709 proteins,504 had the same predictions inthe two methods,200 had one overlapped prediction.Only 5proteins had totally different function predictions by the twoFig.3.ROC curves and AUC scores for correlated function prediction ofKLR and BLR.Fig.4.The AUC scores before and after adding noise:the scores did notchange much,only from 0.765563 to 0.741234 on average.methods,compared with gene annotation from other database,we found the annotations of BLR are more proper.Generally,the two methods had similar prediction results,but BLR is more consistent with the real annotation.D.RobustnessIn BLR,the prediction of a protein function depends heavilyon the PPI network.But the PPI network may have some noise,so it is important to consider the robustness of BLR.We testedthe robustness by adding 20%noise to a PPI network.In theinitial network with 6212 edges,we randomly removed 1200edges and added 1200 edges to change the structure of thenetwork.After adding noise,the AUC scores in the cross-validation dropped very slightly,which demonstrated that BLRis robust(Figure 4).IV.CONCLUSIONSeveral methods have been developed to predict proteinfunctions.The widely used methods include MRF and KLR.MRF is based only on the adjacent matrix of the network,while KLR uses DK to get more information from the networkand has a higher accuracy.Here we developed the BLRmethod for function prediction based on a new defined networkbetweenness,which reenforces the connections among themodular structures in the network.We demonstrated that BLRhad better prediction accuracy than KLR and MRF.In addition,BLR is robust to noise introduced to the protein network,andsufficiently flexible to fit to different networks.Furthermore,the new betweenness we defined can be usedin any network based problems,so our method can be appliedto many other fields,such as gene co-expression analysis andprotein complex searching.ACKNOWLEDGMENTWe thank Dr.Iain C.Bruce(Zhejiang University)forhis critical reading of the manuscript.This work is sup-ported by the National Natural Science Foundation of China(No.30570425,No.10721403,No.10871009),the NationalHigh Technology Research and Development of China(No.2006AA02Z331,No.2008AA02Z306),the National Key BasicResearch Project of China(No.2009CB918503),and theScientific Research Foundation for the Returned OverseasChinese Scholars,State Education Ministry.REFERENCES1 M.H.Deng,K.Zhang,S.Mehta,T.Chen,and F.Z.Sun,“Predictionof protein function using protein-protein interaction data,”Journal ofComputational Biology,vol.10,no.6,pp.947960,2003.2 M.H.Deng,T.Chen,and F.Z.Sun,“An integrated probabilistic modelfor functional prediction of proteins,”Journal of Computational Biology,vol.11,no.2-3,pp.463475,2004.3 H.Lee,Z.D.Tu,M.H.Deng,F.Z.Sun,and T.Chen,“Diffusionkernel-based logistic regression models for protein function prediction,”Omics-a Journal of Integrative Biology,vol.10,no.1,pp.4055,2006.4 S.v.Dongen,“Graph clustering by flow simulation,”Ph.D.dissertation,University of Utrecht,2000.5 A.J.Enright,S.Van Dongen,and C.A.Ouzounis,“An efficientalgorithm for large-scale detection of protein families,”Nucleic AcidsRes,vol.30,no.7,pp.157584,2002.6 L.Salwinski,C.S.Miller,A.J.Smith,F.K.Pettit,J.U.Bowie,and D.Eisenberg,“The database of interacting proteins:2004 update,”Nucleic Acids Res,vol.32,no.Database issue,pp.D44951,2004.7 H.W.Mewes,S.Dietmann,D.Frishman,R.Gregory,G.Mannhaupt,K.F.Mayer,M.Munsterkotter,A.Ruepp,M.Spannagl,V.Stumpflen,and T.Rattei,“Mips:analysis and annotation of genome information in2007,”Nucleic Acids Res,vol.36,no.Database issue,pp.D196201,2008.8 K.Risi Imre and D.L.John,“Diffusion kernels on graphs and otherdiscrete input spaces,”2002.9 G.R.G.Lanckriet,M.Deng,N.Cristianini,M.I.Jordan,and W.S.Noble,“Kernel-based data fusion and its application to protein functionprediction in yeast,”Pacific Symposium on Biocomputing 2004,pp.300311,2003.10 L.Sun,S.Ji,and J.Ye,“Adaptive diffusion kernel learning from bio-logical networks for protein function prediction,”BMC Bioinformatics,vol.9,p.162,2008.11 J.Besag,“On the statistical-analysis of dirty pictures,”Journal of theRoyal Statistical Society Series B-Methodological,vol.48,no.3,pp.259302,1986.

展开阅读全文

外文翻译--Prediction of Protein Functions from Protein-Protein Interaction Data Based on a New Measure of Network Betweenness

最新文档