Introduction to Search Engines (Course Slides)

Uploaded by: 沈*** | Document ID: 241381608 | Upload date: 2024-06-22 | Format: PPT | Pages: 31 | Size: 188.50 KB
What is a search engine?
- A web service site that lets Internet users find information in cyberspace.
- The software that provides the web search service.

Uses of search engines
- Search for the URL of a company or website
- Look for contact information for a person or an organization
- Search for information related to a term, e.g., to collect information about 櫻花鉤吻鮭 (Formosan landlocked salmon)
- Look for news regarding XXX
- Treat the search engine as a big dictionary
- ...

Types of search engines
- Directory browse/search
- Web page search
- USENET news search
- FTP search
- People/organization search
- Daily-life information search
- Library search
- Commercial product search

Example search engines
- Yahoo
- Google
- AltaVista
- MSN
- Excite
- Lycos, ...
- YAM, Kimo, PCHome
- GAIS, Openfind, ...
- DejaNews
- Archie, ...

Portal services
- Directory/search
- Daily information: weather, maps, TV, ...
- Free email, free pages, calendar
- Personalized services, channel subscription
- Web chat
- E-commerce
- Content aggregation
- ...

Directory implementation
- Each URL entry is a record
- The URL data is managed by a database system
- A search function is provided for searching the data in the directory tree

Directory implementation (continued)
- The search generally aims to locate a website or a category of websites
- Data is entered through manual registration by the website owner or a surfer
- Managing the directory tree requires intensive labor from people familiar with the relevant domain knowledge

Advantages and disadvantages of directory search engines
- Advantages:
  - The data is manually maintained, so it contains less noise and is more precise
  - Search output can be categorized and better organized
  - Search within a category can be supported
- Disadvantages:
  - Data coverage is limited, so the wanted information sometimes cannot be found
  - Relevance ranking is not supported
  - Labor intensive

Implementation of a web page search engine
1. Feature consideration
2. Data gathering
3. Data preprocessing
4. Data indexing
5. Query processing
6. Interaction
7. Service tools
8. Personalization

Requirements for web page search engines
0. The quality of the search results basically depends on
   a. the quality of the underlying data
   b. the search techniques, such as ranking techniques
1. Data coverage should be large enough
2. Data needs to be filtered, e.g., by removing redundant pages
3. Full-text search capability should be provided
4. A relevance ranking mechanism should be provided
5. Search speed should be fast enough
6. Search features
I.e., the evaluation points are quality, speed, scale, robustness, and features.

Data gatherer
- Also known as a spider, crawler, robot, ...
- Periodically traverses the web to collect web pages
- Needs list management to decide which pages to collect and when
- Needs a link analyzer to generate new URL lists
- Needs to decide what to collect and what not to

Data gatherer (continued)
- The basic function is fetching a file over the HTTP protocol
- A web page parser module extracts link information from a retrieved page
- A URL bank manager module manages the URLs to be fetched
- A robot-controller module manages data collection using multiple clients

Robot issues
- Site-based vs. URL-based
  - Site-based robots are popular, such as wget and Teleport
  - robots.txt is easier to implement in a site-based robot
  - A URL-based robot is more appropriate for large-scale search engines
- Retrieval schedule: BFS is better
- Incremental retrieval

Robot issues (continued)
- What to gather and what not to?
- Hidden web data collection
- Focused crawling
  - targets specialized content of web pages
  - suitable for special search engines
  - evaluated by precision and recall

Data preprocessing
- Remove redundant pages
- Transform each page into the internal data format
- Perform web cross-link analysis to generate a URL databank
- Filter out data that should not be indexed
- Partition the data space

Redundancy removal
- 15% to 20% of web pages are replicated on different websites, e.g., tutorials for Java, Perl, Python, ...
- Can be implemented by partitioned hashing or external sorting

Ranking the URLs
- Link analysis is done to count the mutual references between web pages
- A URL that receives more references gets a higher score
  - weighted links
  - discount internal links, such as "back to home"
- Order the web pages by score so that a page with a higher rank gets a lower ID

Data partition
- The data is partitioned by language type
- The language partition can be done as follows:
  - for each known language, collect a certain number of web pages in that language
  - build a high-frequency term set for each language from an analysis of the sample data
  - determine the language type by term analysis

Indexer
- In general, an inverted file is used to generate the index
- The indexing task needs a large data space
- For each indexed term, an index list is generated to record the files/locations in which the term appears
- The index needs about the same space as the original data, or more

Indexer: implementation issues
- A data filter module is used to cope with different data sources
- The inversion module is the kernel module
- Needs to scale to handle continuously growing data sizes: hundreds of gigabytes to terabytes
- Distributed/concurrent indexing

Indexer: implementation issues (continued)
- Temporary space minimization
- Indexing speed is crucial
- Memory can be used to improve indexing performance
- Hashing and sorting are the key!

Query processing
- Use a dictionary/stop list to preprocess the query string
- Parse the query into expressions of tokens
- Use the index structure to locate the matched documents
- Use a TF*IDF-type technique to score the matched documents
- Combine URL scores to rank the result

Search CGI programs
- Search agent CGI:
  - parse the query and fork a searcher process to do the search (or use IPC to query the searcher)
  - when the searcher returns, analyze and process the result for formatted output
  - process the result and store it in a temporary result store
  - log the query and some status information
- CGI for view-next-page
- showmatch CGI

Output control
- Site grouping: group pages from the same website together
- Title grouping: group pages with similar titles
- Sort the output according to certain criteria

Interaction
- Term suggestion:
  - related terms
  - thesaurus
  - term expansion
  - error correction
    - phonetic
    - spelling

Personalization
- Keep track of a user's interests so that search results can be tuned to improve the user's satisfaction
- Query tracking and classification

Service tools
- Query cache to improve search performance for queries that have already been served
- Use a memory-cached file system to reduce disk access overhead
- Mechanism for special-case handling
- Log analyzer

Research issues
- Hidden web data collection
- Distributed index/search
- Index minimization, incremental indexing
- Smart robot
- Intelligent retrieval
- Automatic classification/clustering of output results
- Data source clustering/classification
  - classifying/clustering the whole web

Conclusion
- Size does matter
- We are still searching for a better engine!
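
Supplementary sketch: breadth-first retrieval. The robot slides state that BFS is the better retrieval schedule and that a URL bank manages the URLs to be fetched. The following is a minimal illustration of that idea, not the deck's actual robot. The names bfs_order and fetch_links are hypothetical, and fetch_links stands in for the HTTP fetch plus link parser so the example runs without network access.

```python
from collections import deque


def bfs_order(seed, fetch_links, max_pages=100):
    """Return the order in which pages are visited under a BFS schedule.

    fetch_links is an assumed callback: given a URL, it returns the list
    of URLs linked from that page (fetching and parsing are out of scope).
    """
    frontier = deque([seed])   # the "URL bank": URLs waiting to be fetched
    seen = {seed}              # every URL ever queued, to avoid refetching
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()        # FIFO order gives breadth-first
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# A tiny in-memory "web" used in place of real HTTP requests.
toy_web = {"a": ["b", "c"], "b": ["d"], "c": ["d", "a"], "d": []}
crawl_order = bfs_order("a", lambda url: toy_web.get(url, []))
```

Swapping the deque for a priority queue would turn the same loop into a score-driven or focused crawl.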
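
Supplementary sketch: redundancy removal by hashing. The redundancy-removal slide mentions partitioned hashing as one way to drop replicated pages. The sketch below assumes that "redundant" means identical text after whitespace and case normalization; near-duplicate detection is not covered. The function names and the partition count are illustrative choices, not taken from the slides.

```python
import hashlib
from collections import defaultdict


def content_digest(text):
    """Hash the page text after collapsing whitespace and lowercasing."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def remove_redundant(pages, num_partitions=4):
    """Keep one URL per distinct content digest.

    pages is a list of (url, text) pairs. Digests are first split into
    partitions so that each partition could be deduplicated on its own
    (for example, one partition at a time when memory is limited).
    """
    partitions = defaultdict(list)
    for url, text in pages:
        digest = content_digest(text)
        partitions[int(digest, 16) % num_partitions].append((digest, url))

    kept = []
    for bucket in partitions.values():
        seen = set()
        for digest, url in bucket:
            if digest not in seen:      # first URL with this content wins
                seen.add(digest)
                kept.append(url)
    return sorted(kept)
```

Identical pages always land in the same partition because the partition is derived from the digest itself, which is what makes the per-partition pass sufficient.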
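
Supplementary sketch: ranking URLs by references. The URL-ranking slide describes counting references between pages, discounting internal links, and giving higher-ranked pages lower IDs. A rough sketch under stated assumptions: a link counts as internal when source and target share a host, and the discount weight of 0.1 is an arbitrary illustrative value. The function name rank_urls is hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def rank_urls(links, internal_weight=0.1):
    """Score pages by incoming links and assign IDs in rank order.

    links is a list of (source_url, target_url) pairs. Links within the
    same host (such as "back to home") count for only internal_weight.
    Returns {url: doc_id}, where a lower ID means a higher score.
    """
    scores = defaultdict(float)
    for src, dst in links:
        scores[src] += 0.0   # make sure pages without in-links are listed
        same_site = urlparse(src).netloc == urlparse(dst).netloc
        scores[dst] += internal_weight if same_site else 1.0

    # Highest score first; ties are broken alphabetically for determinism.
    ordered = sorted(scores, key=lambda url: (-scores[url], url))
    return {url: doc_id for doc_id, url in enumerate(ordered)}
```

This is plain in-link counting; it does not propagate scores between pages the way iterative link-analysis algorithms do.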
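
Supplementary sketch: language partition by term analysis. The data-partition slide outlines three steps: collect sample pages per language, build a high-frequency term set for each, and classify a page by term analysis. The sketch below follows those steps on whitespace-separated text only, which is an assumption that does not hold for languages written without spaces. All names and the top_n cutoff are illustrative.

```python
from collections import Counter


def build_profiles(samples, top_n=3):
    """Build a high-frequency term set per language.

    samples maps a language label to a list of sample texts.
    """
    profiles = {}
    for lang, texts in samples.items():
        counts = Counter(word for text in texts for word in text.lower().split())
        profiles[lang] = {word for word, _ in counts.most_common(top_n)}
    return profiles


def detect_language(text, profiles):
    """Pick the language whose frequent terms occur most often in text.

    Returns None when no profile term occurs at all.
    """
    words = text.lower().split()
    hits = {lang: sum(word in terms for word in words)
            for lang, terms in profiles.items()}
    best = max(sorted(hits), key=hits.get)   # sorted() makes ties stable
    return best if hits[best] > 0 else None
```

In practice the sample set per language would need to be far larger than a few sentences for the frequent-term sets to be reliable.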
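
Supplementary sketch: inverted file and TF*IDF scoring. The indexer and query-processing slides describe an inverted file that records where each term appears, a stop list applied to the query, and a TF*IDF-type score for matched documents. The following is a small in-memory illustration of that pipeline. The stop list, the whitespace tokenizer, and the exact weighting (raw term frequency times log(N/df)) are assumptions; the slides do not specify them.

```python
import math
from collections import Counter, defaultdict

STOP_WORDS = {"a", "the", "of", "for", "is"}   # hypothetical stop list


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


def build_index(docs):
    """Build an inverted file: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, num_docs, query):
    """Return matching doc IDs, best TF*IDF score first."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue                     # term does not occur anywhere
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


docs = {
    1: "the web search engine",
    2: "search the library",
    3: "web crawler for the web",
}
index = build_index(docs)
results = search(index, len(docs), "web crawler")
```

A production index would store postings on disk in sorted order and would combine this text score with the per-URL link score mentioned in the query-processing slide.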