What is a search engine?
- A web service site that lets Internet users find information in cyberspace
- The software that provides the web search service

Use of search engines
- Search for the URL of a company or website
- Look for contact information about a person or an organization
- Search for information related to a term, e.g., to collect information about 櫻花鉤吻鮭 (Formosan landlocked salmon)
- Look for news regarding XXX
- Treat the search engine as a big dictionary
- ...

Types of search engines
- Directory browse/search
- Web page search
- USENET news search
- FTP search
- People/organization search
- Daily-life information search
- Library search
- Commercial product search

Example search engines
- Yahoo
- Google
- AltaVista
- MSN
- Excite
- Lycos, ...
- YAM, Kimo, PCHome
- GAIS, Openfind, ...
- DejaNews
- Archie, ...

Portal Services
- Directory/search
- Daily information: weather, maps, TV, ...
- Free email, free pages, calendar
- Personalized services, channel subscription
- Web chat
- E-commerce
- Content aggregation
- ...

Directory implementation
- Each URL entry is a record
- The URL data is managed by a database system
- A search function is supported for searching the data in the directory tree
- The search is generally for locating a website or a category of websites
- Data is entered through manual registration by the website owner or a surfer
- Managing the directory tree requires intensive labor by people who are familiar with the relevant domain knowledge

The Advantages/Disadvantages of Directory search engine
- Advantages
    - The data is manually maintained, and thus contains less noise and is more precise
    - The search output can be categorized and better organized
    - Search within a category can be supported
- Disadvantages
    - The data coverage is limited, so the wanted item sometimes cannot be found
    - Relevance ranking is not supported
    - Labor intensive

Implementation of Webpage search engine
1. Feature consideration
2. Data Gathering
3. Data Preprocessing
4. Data Indexing
5. Query Processing
6. Interaction
7. Service tools
8. Personalization

Requirements for WebPage search engines
0. The quality of the search result in a search engine basically depends on
    a. the quality of the underlying data
    b. the search techniques, such as ranking techniques
1. Data coverage should be large enough
2. Data needs to be filtered, such as removing redundant pages
3. Full-text search capability should be provided
4. A relevance ranking mechanism should be provided
5. Search speed should be fast enough
6. Search features
I.e., evaluation points: quality, speed, scale, robustness, features

Data Gatherer
- Also known as spider, crawler, robot, ...
- Periodically travels the web space to collect web pages
- Needs list management to decide which pages to collect and when
- Needs a link analyzer to generate the new URL list
- Needs to decide what to collect and what not to
- A get-file function over the HTTP protocol is the basic function
- A web page parser module extracts link information from a retrieved page
- A URL bank manager module manages the URLs to be fetched
- A robot-controller module manages data collection using multiple clients

Issues of Robot
- Site based vs. URL based
    - Site based is popular, e.g., wget, teleport
    - robots.txt is easier to implement in a site-based robot
    - A URL-based robot is more appropriate for large-scale search engines
- Retrieval schedule: BFS is better
- Incremental retrieval
- What to gather and what not to?
- Hidden web data collection
- Focused crawling
    - targets specialized content of web pages
    - suitable for special search engines
    - evaluated by precision and recall

Data Preprocessing
- Remove redundant pages
- Transform the page into the internal data format
- Perform web cross-link analysis to generate a URL databank
- Filter out data that is better not indexed
- Partition the data space

Redundancy removal
- 15% to 20% of web pages are replicated on different websites, e.g., some tutorials such as Java, Perl, Python
- Can be implemented by partitioned hashing or external sorting

Ranking the URLs
- Link analysis is done to count the mutual references between web pages
- A URL receiving a higher number of references gets a higher score
    - weighted links
    - discount internal links, such as "back to home"
- Order the web pages by score so that a page with a higher rank has a lower ID

Data Partition
- The data is partitioned by language type
- The language partition can be done as follows:
    - for each known language, collect a certain amount of web pages in that language
    - build a high-frequency term set for each language from analysis of the sample data
    - determine the language type by term analysis

Indexer
- In general, an inverted file is used to generate the index
- Needs large data space for the indexing task
- For each indexed term, an index list is generated to record the files/locations in which the term appears
- Needs about the same space as the original data, or more

Indexer-implementation issue
- A data filter module is used to cope with different data sources
- The inversion module is the kernel module
- Needs to be scalable to handle continuously growing data size: hundreds of gigabytes to terabytes
- Distributed/concurrent indexing
- Temporary space minimization
- Index speed is crucial
- Memory can be used to improve indexing performance
- Hashing and sorting are the key!

Query Processing
- Use a dictionary/stop-list to preprocess the query string
- Parse the query into expressions of tokens
- Use the index structure to locate the matched documents
- Use a TF*IDF-type technique to score the matched documents
- Combine URL scores to rank the result

Search CGI programs
- Search agent CGI:
    - parse the query and fork a searcher process to do the search (or use IPC to query the searcher)
    - when the searcher returns, analyze and process the result for formatted output
    - process the result and store it in the temporary result store
    - log the query and some status information
- CGI for view-next-page
- showmatch CGI

Output control
- Site grouping: group the pages from the same website together
- Title grouping: group the pages with similar titles
- Sort the output according to certain criteria

Interaction
- Term suggestion
    - related terms
    - thesaurus
    - term expansion
- Error correction
    - phonetic
    - spelling

Personalization
- Keep track of a user's interests so that the search result can be tuned to improve the user's satisfaction
- Query tracking and classification

Service tools
- Query cache to improve search performance for queries that have already been served
- Use a memory-cached file system to reduce the disk access overhead
- Mechanism for special-case handling
- Log analyzer

Research Issues
- Hidden web data collection
- Distributed index/search
- Index minimization, incremental indexing
- Smart robot
- Intelligent retrieval
- Automatic classification/clustering of output results
- Data source clustering/classification
    - classifying/clustering the whole web

Conclusion
- Size does matter
- Still searching for a better engine!
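
Sketch for Data Gatherer: the slides name a web page parser, a URL bank, and a BFS retrieval schedule. The following is a minimal sketch of how those three pieces might fit together, not the design of any particular engine. The names `LinkParser` and `crawl_bfs` are hypothetical, and `fetch` is an assumed caller-supplied function that returns a page's HTML (or None on failure), so the HTTP get-file step, robots.txt handling, and multi-client control are all left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect the href targets of <a> tags from one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_bfs(seed, fetch, max_pages=100):
    """Visit pages breadth-first from seed; fetch(url) returns HTML or None."""
    bank = deque([seed])  # URL bank: URLs waiting to be fetched
    seen = {seed}         # URLs already queued, so nothing is fetched twice
    order = []
    while bank and len(order) < max_pages:
        url = bank.popleft()
        html = fetch(url)
        if html is None:
            continue
        order.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute = urldefrag(urljoin(url, href)).url
            if absolute not in seen:
                seen.add(absolute)
                bank.append(absolute)
    return order
```

Because the bank is a FIFO queue, pages closer to the seed are fetched first, which is the BFS schedule the slides recommend. Passing a dictionary's `get` method as `fetch` is enough to try the function on an in-memory site.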
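
Sketch for Redundancy removal: the slides say replicated pages can be removed by partitioned hashing. One possible reading, shown here as an assumption rather than the original implementation, is to hash each page's content, route pages to partitions by digest, and deduplicate each partition on its own. The function names, the SHA-1 choice, and the whitespace normalization are illustrative; this only catches exact copies, not near-duplicates.

```python
import hashlib
from collections import defaultdict


def content_digest(page_text):
    """Hash whitespace-normalized text so trivially reformatted copies collide."""
    normalized = " ".join(page_text.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def remove_redundant(pages, num_partitions=4):
    """Return the URLs to keep, one per distinct content digest.

    pages: iterable of (url, text) pairs. Identical pages always land in the
    same partition, so each partition can be deduplicated independently
    (in a large system, each could be a separate file or host).
    """
    partitions = defaultdict(list)
    for url, text in pages:
        digest = content_digest(text)
        partitions[int(digest, 16) % num_partitions].append((digest, url))

    kept = []
    for part in partitions.values():
        seen = set()
        for digest, url in part:
            if digest not in seen:  # first URL seen for this content wins
                seen.add(digest)
                kept.append(url)
    return sorted(kept)
```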
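
Sketch for Ranking the URLs: the slides describe counting references between pages, discounting internal links, and giving higher-ranked pages lower IDs. A minimal sketch under stated assumptions follows: `rank_urls` is a hypothetical name, a link counts as "internal" when source and target share a host, and the discount weight of 0.1 is an arbitrary illustrative value.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def rank_urls(links, internal_weight=0.1):
    """Score URLs by weighted in-link count and assign IDs in rank order.

    links: iterable of (source_url, target_url) pairs.
    Links within one host are discounted to internal_weight (assumed value).
    Returns {url: id}, where a higher-scoring URL gets a lower ID.
    """
    scores = defaultdict(float)
    for source, target in links:
        scores[source] += 0.0  # make sure pages without in-links are ranked too
        same_host = urlsplit(source).netloc == urlsplit(target).netloc
        scores[target] += internal_weight if same_host else 1.0
    # Highest score first; ties broken alphabetically for a stable order.
    ordered = sorted(scores, key=lambda u: (-scores[u], u))
    return {url: doc_id for doc_id, url in enumerate(ordered)}
```

Assigning small IDs to high-scoring pages means an index list sorted by ID is already roughly sorted by link score.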
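
Sketch for Data Partition: the three language-partition steps (collect samples, build a high-frequency term set per language, decide by term analysis) might look like the following. This is an illustration only: the function names and `top_n` are hypothetical, and whitespace tokenization is an assumption that does not hold for languages written without spaces, such as Chinese.

```python
from collections import Counter


def build_term_sets(samples, top_n=50):
    """samples: {language: [sample page text, ...]} -> {language: frequent terms}."""
    term_sets = {}
    for language, texts in samples.items():
        counts = Counter(word for text in texts for word in text.lower().split())
        term_sets[language] = {term for term, _ in counts.most_common(top_n)}
    return term_sets


def detect_language(text, term_sets):
    """Return the language whose frequent terms cover the most words, or None."""
    words = text.lower().split()
    hits = {lang: sum(w in terms for w in words) for lang, terms in term_sets.items()}
    best = max(hits, key=hits.get, default=None)
    return best if best is not None and hits[best] > 0 else None
```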
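
Sketch for Indexer and Query Processing: the slides describe an inverted file that records where each term appears, a stop-list applied to the query, and TF*IDF-type scoring. The following small in-memory sketch shows one common form of that idea. The stop-list contents, the function names, and the exact weighting (raw term frequency times log(N/df)) are assumptions; the slides do not specify them, and combining the result with URL scores is omitted.

```python
import math
from collections import Counter, defaultdict

STOP_WORDS = {"a", "an", "the", "of", "is", "to"}  # hypothetical stop-list


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def build_inverted_index(docs):
    """docs: {doc_id: text}. Map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, num_docs, query):
    """Rank documents by summed TF*IDF over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(num_docs / len(postings))  # rarer terms weigh more
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Best score first; ties broken by document ID.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

A term that occurs in every document gets an IDF of zero here, so it matches documents without helping to rank them.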