Anatomy of Search Engine

Google is a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext.

To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms, and they answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advances in technology and the proliferation of the web, creating a web search engine today is very different from creating one three years ago.

Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved in using the additional information present in hypertext to produce better search results. This paper addresses the question of how to build a practical large-scale system which can exploit that additional information.


Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at www.google.stanford.edu.

Creating a search engine which scales even to today's web presents many challenges. Fast crawling technology is needed to gather the web documents and keep them up to date. Storage space must be used efficiently to store indices and, optionally, the documents themselves. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at a rate of hundreds to thousands per second.

The web creates new challenges for information retrieval. The amount of information on the web is growing rapidly, as well as the number of new users inexperienced in the art of web research. People are likely to surf the web using its link graph, often starting with high quality human maintained indices such as Yahoo! Human maintained lists cover popular topics effectively but are subjective, expensive to build and maintain, slow to improve, and cannot cover all esoteric topics. Automated search engines that rely on keyword matching usually return too many low quality matches.

Aside from tremendous growth, the Web has also become increasingly commercial over time. In 1993, 1.5% of web servers were on .com domains; this number grew to over 60% in 1997. At the same time, search engines have migrated from the academic domain to the commercial. Up until now most search engine development has gone on at companies with little publication of technical details. This causes search engine technology to remain largely a black art and to be advertising oriented.

The Eyes - The eyes of the search engine are the spiders (AKA robots or crawlers). These are the 1s and 0s that the search engines send out over the Internet to retrieve documents. In the case of all the major search engines, the spiders crawl from one page to another by following links, much as you would explore various paths along your way.
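To make the spider idea concrete, here is a minimal sketch of a link-following crawler in Python. It is an illustration only, not any real engine's crawler; the seed URL, page limit, and standard-library HTML parsing are assumptions made for the example.

# A minimal sketch of a link-following spider, not any engine's real crawler.
# Uses only the Python standard library; the seed URL and page limit are
# illustrative assumptions.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    """Breadth-first crawl: fetch a page, follow its links, repeat."""
    frontier = deque([seed])          # pages waiting to be fetched
    seen = {seed}                     # avoid re-fetching the same URL
    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue                  # skip pages that fail to load
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
        yield url, parser.links

for page, links in crawl("http://example.com"):
    print(page, "->", len(links), "links found")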

When a user searches AllTheWeb or Lycos, FAST performs some analysis on the query itself. It checks for language settings and looks for linguistic cues to match results to the language of the searcher. Additional phrase-detection processes recognize multiword terms, so it can search for "San Francisco" and "New York" as phrases rather than as unrelated words.
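As a rough sketch of how multiword-term recognition might work (FAST's actual linguistic pipeline is not public), the following greedy longest-match against a small hand-built phrase list treats "San Francisco" and "New York" as single terms:

# A sketch of multiword-phrase recognition, assuming a small hand-built
# phrase list; the real pipeline is proprietary.
PHRASES = {("san", "francisco"), ("new", "york")}  # known multiword terms
MAX_LEN = max(len(p) for p in PHRASES)

def tokenize_with_phrases(query):
    """Greedy longest-match: merge adjacent words that form a known phrase."""
    words = query.lower().split()
    terms, i = [], 0
    while i < len(words):
        for n in range(min(MAX_LEN, len(words) - i), 1, -1):
            if tuple(words[i:i + n]) in PHRASES:
                terms.append(" ".join(words[i:i + n]))  # treat as one term
                i += n
                break
        else:
            terms.append(words[i])  # ordinary single-word term
            i += 1
    return terms

print(tokenize_with_phrases("hotels in San Francisco near New York"))
# -> ['hotels', 'in', 'san francisco', 'near', 'new york']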

The Google search engine has two important features that help it produce high precision results. First, it makes use of the link structure of the Web to calculate a quality ranking for each web page; this ranking is called PageRank. Second, Google utilizes anchor text, associating the text of a link with the page the link points to, to improve search results.

The citation (link) graph of the web is an important resource that has largely gone unused in existing web search engines. Google builds maps of this graph, and these maps allow rapid calculation of a web page's "PageRank", an objective measure of its citation importance that corresponds well with people's subjective idea of importance. Because of this correspondence, PageRank is an excellent way to prioritize the results of web keyword searches. For most popular subjects, a simple text matching search that is restricted to web page titles performs admirably when PageRank prioritizes the results (demo available at google.stanford.edu). For the type of full text searches in the main Google system, PageRank also helps a great deal.
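The PageRank computation itself can be sketched in a few lines. The iteration below follows the formula given in the original paper, PR(p) = (1 - d) + d * sum of PR(q)/C(q) over pages q linking to p, where C(q) is q's outlink count and d is a damping factor (0.85 in the paper). The four-page link graph is an invented example:

# A compact sketch of PageRank by power iteration, following the
# formula from the original paper. The tiny graph is invented.
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}            # initial uniform scores
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # sum contributions from every page q that links to p
            incoming = sum(pr[q] / len(links[q])
                           for q in pages if p in links[q])
            new_pr[p] = (1 - d) + d * incoming
        pr = new_pr
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
for page, score in sorted(pagerank(graph).items(), key=lambda x: -x[1]):
    print(page, round(score, 3))    # C ranks highest: most pages cite it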

The indexer extracts all the information from each and every document and stores it in a database. All high-quality search engines index every word in every document and assign each word a unique word ID. The word occurrences, which some search engines call "hits", are then recorded for each document.
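A toy version of this indexing step might look like the following sketch: each new word gets the next unused word ID, and every occurrence ("hit") is recorded with its document ID and position. The sample documents are invented for illustration.

# A sketch of the indexing step: unique word IDs plus recorded hits.
from collections import defaultdict

word_ids = {}                      # word -> unique word ID
hits = defaultdict(list)           # word ID -> list of (doc ID, position)

def index_document(doc_id, text):
    for position, word in enumerate(text.lower().split()):
        wid = word_ids.setdefault(word, len(word_ids))  # assign ID on first sight
        hits[wid].append((doc_id, position))            # record the hit

index_document(1, "the quick brown fox")
index_document(2, "the lazy dog saw the fox")

fox_id = word_ids["fox"]
print(hits[fox_id])   # -> [(1, 3), (2, 5)]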

The system design is scalable and highly parallel: search is distributed, so each query runs across multiple machines. They chose cost-effective, mid-range commodity PCs running Linux. Of the roughly 10,000 machines, several fail every day, because they are worked much harder than normal desktops, so redundancy is designed into search on the assumption that some machines may fail at any time.
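A small sketch of that redundancy idea: the index is split into shards, each shard is served by several replicas, and a query succeeds as long as one replica per shard responds. The shard layout, replica names, and failure rate below are all invented for illustration.

# A sketch of redundancy-aware query fan-out across index shards.
# Shard contents and the failure simulation are invented examples.
import random

SHARDS = {
    0: ["replica-0a", "replica-0b", "replica-0c"],
    1: ["replica-1a", "replica-1b", "replica-1c"],
}

def query_replica(replica, query):
    """Stand-in for a network call; randomly fails like a flaky machine."""
    if random.random() < 0.3:
        raise ConnectionError(f"{replica} is down")
    return [f"{replica} result for '{query}'"]

def search(query):
    results = []
    for shard, replicas in SHARDS.items():
        for replica in random.sample(replicas, len(replicas)):  # spread load
            try:
                results.extend(query_replica(replica, query))
                break                      # one healthy replica is enough
            except ConnectionError:
                continue                   # failed machine: try the next
        else:
            raise RuntimeError(f"all replicas of shard {shard} are down")
    return results

print(search("anatomy of a search engine"))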

via gistweb
