
This paper presents Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext.

To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago.

Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext.


Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at www.google.stanford.edu.

Creating a search engine which scales even to today's web presents many challenges. Fast crawling technology is needed to gather the web documents and keep them up to date. Storage space must be used efficiently to store indices and, optionally, the documents themselves. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at a rate of hundreds to thousands per second.


The web creates new challenges for information retrieval. The amount of information on the web is growing rapidly, as well as the number of new users inexperienced in the art of web research. People are likely to surf the web using its link graph, often starting with high quality human maintained indices such as Yahoo! Human maintained lists cover popular topics effectively but are subjective, expensive to build and maintain, slow to improve, and cannot cover all esoteric topics. Automated search engines that rely on keyword matching usually return too many low quality matches.

Aside from tremendous growth, the Web has also become increasingly commercial over time. In 1993, 1.5% of web servers were on .com domains; this number grew to over 60% in 1997. At the same time, search engines have migrated from the academic domain to the commercial. Up until now most search engine development has gone on at companies with little publication of technical details. This causes search engine technology to remain largely a black art and to be advertising oriented.

The Eyes - The eyes of the search engine are the spiders (AKA robots or crawlers). These are the 1s and 0s that the search engines send out over the Internet to retrieve documents. In the case of all the major search engines, the spiders crawl from one page to another by following links, much as you might look down various paths along your way.
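To make that crawl loop concrete, here is a minimal Python sketch: fetch a page, pull out its links, and follow them breadth-first. This is an illustration only; the seed URL and page limit are made-up parameters, and production spiders add robots.txt politeness, rate limiting, and distributed work queues.

```python
# Toy breadth-first web spider: fetch, extract links, follow.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    seen, queue, fetched = {seed}, deque([seed]), 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue  # unreachable page: skip it, as a spider would
        fetched += 1
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen

# Example: crawl("http://example.com") returns the URLs discovered.
```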

When a user searches AllTheWeb or Lycos, FAST performs some analysis on the query itself. It checks for language settings and looks for linguistic cues to match results to the language of the searcher. Additional phrasing processes recognize multiword terms, so it can search for "San Francisco" and "New York" as phrases rather than as unrelated words.
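As a rough illustration of that phrasing step, the sketch below groups known multiword terms in a query. The tiny phrase dictionary is an assumption for demonstration; FAST's real linguistic pipeline is proprietary and far more elaborate.

```python
# Group adjacent query words into known multiword phrases.
KNOWN_PHRASES = {("san", "francisco"), ("new", "york")}  # hypothetical dictionary

def group_phrases(query):
    words = query.lower().split()
    terms, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in KNOWN_PHRASES:
            terms.append(words[i] + " " + words[i + 1])  # treat as one unit
            i += 2
        else:
            terms.append(words[i])
            i += 1
    return terms

print(group_phrases("hotels in San Francisco and New York"))
# ['hotels', 'in', 'san francisco', 'and', 'new york']
```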

The Google search engine has two important features that help it produce high precision results. First, it makes use of the link structure of the Web to calculate a quality ranking for each web page; this ranking is called PageRank. Second, Google utilizes anchor text to improve search results.

The citation (link) graph of the web is an important resource that has largely gone unused in existing web search engines. These maps allow rapid calculation of a web page's "PageRank", an objective measure of its citation importance that corresponds well with people's subjective idea of importance. Because of this correspondence, PageRank is an excellent way to prioritize the results of web keyword searches. For most popular subjects, a simple text matching search that is restricted to web page titles performs admirably when PageRank prioritizes the results (demo available at google.stanford.edu). For the type of full text searches in the main Google system, PageRank also helps a great deal.
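The PageRank computation itself can be sketched compactly as power iteration: a page's score is the chance that a "random surfer" following links lands on it. The damping factor of 0.85 matches the value Brin and Page describe as typical; the three-page graph below is a made-up example.

```python
# Power-iteration sketch of PageRank over a small link graph.
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}  # start uniform
    for _ in range(iterations):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:  # dangling page: spread its rank evenly
                for p in pages:
                    new[p] += damping * rank[page] / len(pages)
            else:  # each outlink receives an equal share of this page's rank
                for target in outlinks:
                    new[target] += damping * rank[page] / len(outlinks)
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(graph))  # "c" accumulates the most citation importance
```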

The indexer extracts all the information from each and every document and stores it in a database. All high-quality search engines index each and every word in the documents and give each distinct word a unique word ID. Then the word occurrences, which some search engines call "hits," are recorded; a hit typically notes the word, its position in the document, and properties such as font size and capitalization.
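A toy version of that indexing step might look like the following: assign each distinct word a unique ID and record its occurrences as (document, position) hits. This is a simplification; real hit lists also pack in attributes like font size and capitalization.

```python
# Minimal inverted index: word IDs plus per-document hit lists.
from collections import defaultdict

word_ids = {}                 # word -> unique word ID
postings = defaultdict(list)  # word ID -> [(doc_id, position), ...]

def index_document(doc_id, text):
    for position, word in enumerate(text.lower().split()):
        # setdefault assigns the next sequential ID to unseen words
        wid = word_ids.setdefault(word, len(word_ids))
        postings[wid].append((doc_id, position))

index_document(1, "the quick brown fox")
index_document(2, "the lazy dog")
print(word_ids["the"], postings[word_ids["the"]])
# 0 [(1, 0), (2, 0)]
```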


The system design is scalable and highly parallel: search is distributed, so each query goes across multiple machines. They chose cost-effective, mid-quality commodity PCs running Linux. Of the 10,000 machines, several fail every day because they run so much harder than normal desktops, so redundancy is designed into search on the assumption that some machines may fail at any time.
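That redundancy idea can be sketched as a scatter/gather loop: send the query to one replica of every index shard, and fall back to another replica when a machine fails. The shard layout and failure rate below are illustrative assumptions, not Google's actual topology.

```python
# Scatter/gather query fan-out with per-shard replica failover.
import random

REPLICAS = {                    # shard -> list of replica "machines" (hypothetical)
    "shard-0": ["pc-01", "pc-02"],
    "shard-1": ["pc-03", "pc-04"],
}

def query_machine(machine, query):
    if random.random() < 0.2:   # simulate a commodity-PC failure
        raise ConnectionError(machine)
    return [f"{machine}: result for {query!r}"]

def search(query):
    results = []
    for shard, machines in REPLICAS.items():
        for machine in random.sample(machines, len(machines)):
            try:
                results += query_machine(machine, query)
                break           # one healthy replica per shard suffices
            except ConnectionError:
                continue        # redundancy: try the next replica
    return results

print(search("commodity linux cluster"))
```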

via gistweb

Today Live Search has been updated with a new home page which shows a background image. This was on trial since last month, and many were already getting it. These interactive areas highlight parts of the image and help you explore search results related to the highlighted area. If you missed these hotspots while the page was loading, just move your mouse cursor over the image to discover them. It's indeed a nice place to start a search, as it keeps you engaged.

The home page contains a background image with "hotspots" used to show off some of the search queries available at Live Search. The background images are supposed to change over time and thus generate different query "hotspots" for end users to play with.

The background homepage image is set to change regularly, and with it, the hotspots that click through to interesting search results. When the page first loads, hotspots "gleam" at the user and then fade back into the image; a user can check them out by finding them with his or her mouse cursor. In any case, it is an innovative way to get users interested in Live Search. This is only the beginning of new features, though, according to the Live Search team, which plans to keep tweaking the homepage.

Microsoft has launched a new concept for Live Search. Now, when you visit the search engine, you'll see a background image with "hotspots." Hotspots are small boxes that highlight a portion of the image.

Google now gives searchers more notification of behind-the-scenes ways it customizes results. Live Search's home page gets "hotspots" -- will these generate more searches? Search engine relevancy and more.


via gistweb


Yahoo has made some changes to its algorithm and ranking factors. Yahoo calls an index update a "weather update," and these weather updates are posted on their official blog by a search engineer.

Search turbulence all over the map: while Yahoo!'s index update may not have been universally well-received (it happens), MSN's seems to have produced a weird, temporary bug. SE Roundtable noticed yesterday a Digital Point thread pointing out that the top five search listings on the first two pages of MSN results were identical. One of MSN's forum reps, "MSNdude," popped in today to confirm an index update. No pain, no gain, right?

The Yahoo Search Blog announced the August 2008 weather report today. Sharad Verma of Yahoo Search said, "We'll be rolling out some changes to our crawling, indexing and ranking algorithms over the next few days. As you know, throughout this process you may see some ranking changes and page shuffling in the index, but expect the update will be completed soon."


Google's keyword tool bugs out. Yahoo is as relevant as Google? SEOs split on buying nofollowed links. AdWords will be down on the 9th. Do you do client meetings? Yahoo and Google show off Olympic searches. Google launches music search in China.

via searchengineland | Yahoo Search Blog

Google's Ben Gomes of the search quality team blogged about Google's search interface.
Ben Gomes talks about "the principles that guide our development of the overall search experience and how they are applied to the key aspects of search." He explains that Google's goal is to give you the most relevant results in the shortest time. Google does this by applying several principles:

  • A small page, which Gomes says "is quick to download and generally faster for your browser to display."
  • Complex algorithms with a simple presentation. Google hides the complexity of search behind a clean interface.
  • Features that work everywhere, as in all languages and all countries.
  • Data driven decisions, which are assisted by experimentation.


One forum member at WebmasterWorld thinks that Google might be using different data centers for query refinements. He suspects this because he's performing the same search but finding different results.

A number of readers have noted Google's open sourcing of their internal data interchange format, called Protocol Buffers (here's the code and the doc). Google's elevator statement for Protocol Buffers is "a language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols, data storage, and more." It's the way data is formatted to move around inside of Google.
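For a feel of what that looks like in practice, here is a minimal, hypothetical example of defining and using a message from Python. The SearchResult schema and the generated module name are invented for illustration; the serialize/parse calls are the standard generated-message API.

```python
# A minimal sketch of using Protocol Buffers from Python, assuming a
# schema like the following has been compiled with protoc:
#
#   // search_result.proto (hypothetical)
#   message SearchResult {
#     required string url = 1;
#     optional string title = 2;
#     repeated string snippets = 3;
#   }
#
# protoc --python_out=. search_result.proto  generates search_result_pb2.py

import search_result_pb2  # hypothetical generated module

result = search_result_pb2.SearchResult()
result.url = "http://example.com"
result.title = "Example"
result.snippets.append("an example snippet")

data = result.SerializeToString()  # compact binary wire format
parsed = search_result_pb2.SearchResult.FromString(data)
assert parsed.url == result.url
```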

Google's too many redirects issue. Google Maps adds features. Cre8asite celebrates 6 years. Yahoo released first monkey into the wild. Yahoo Maps adds features. Yahoo Shortcuts gets nasty. Live Search promoting with cash back again.

Google processes many, many links per day. Google tests related searches inline with search results. Yahoo adds search marketing features. Google Maps updates features and fixes the Local Business Center. Webmaster Tools has a 404-error reporting issue. Google Analytics has several bugs. Yahoo still needs to fix publisher reports.

via gistweb | seroundtable

Viewing the page using a mirror makes it somewhat easier to read and would allow someone to find a website. Web site "mirroring" normally involves copying the contents of a site and hosting them on a different server. This can be useful if one server is particularly busy.

With the new approach, Google reasons readers won't have to pore through search results listing the same story posted on different sites. That should in turn make it easier to discover other news stories at other Web sites that might previously have been buried, said Josh Cohen, the business product manager for Google News.

Google News was created as a tool that clusters related stories so you can read different perspectives on the same event. Unlike Yahoo News, Google News doesn't have editors: the homepage and all the other sections are generated algorithmically. Until this month, the site didn't host original content, so you could only find headlines, snippets and thumbnails from articles. To read the entire article, you had to go to a different site. Google was sued by many news organizations, including AFP, for copyright infringement, and some of them won.

Google (NSDQ: GOOG) has taken the next step in the evolution of its Google News product: it has begun hosting material produced by The Associated Press, Agence France-Presse, The Press Association in the UK and The Canadian Press. An AP example is here. It will still be linking to those stories on numerous newspaper/broadcast websites and other online sources that syndicate stories from these agencies, but could potentially diminish traffic to these sites as users stay on Google News' pages, and possibly even affect their online ad revenues, if you want to stretch the logic.

Although Google had already bought the right to display content produced by all four services affected by the change, the search engine's news section had continued to send readers to the sites of other Web publishers to read the stories and look at the photographs. For example, a Google News user who clicked on an AP story about the latest developments in Iraq would be steered to one of the hundreds of Web sites that also have the right to post the same article.

via gistweb


Google Knol is a free online collaborative knowledge database or an experts' wiki but not an encyclopedia. Knol is not a direct competitor of Wikipedia, at least not in its current version. Wikipedia is anonymous -- there is no single editor in charge. In contrast, Knol includes the author name in the URL of the article. Google expects multiple knols on one subject rather than the current Wikipedia model of one article on a subject. The term "knol" ("unit of knowledge") refers to both the project and an article in the project.

More specifically, Google says Knols are authoritative articles about specific topics, written by people who know about those subjects. The company first announced Knol late last year, and instead of populating its databases with articles from everyone, Google kept Knol an invitation-only party -- until now.


Right now, a lot of the existing Knols – Google defines a knol as a "unit of knowledge", and perhaps this will be how people name articles hosted on Knol too – deal with subjects of a more serious or scientific nature, like health. Google's help page says you can write "(Almost) anything you like," adding that you pick the subject "and write it the way you see fit" as they don't edit knols, nor do they "try to enforce any particular viewpoint" subject to the terms of service and their content policy, which disallows, e.g., images containing nudity, and "spam," a rather broad term in this case.

Google's Knol is an attempt to harness the vast forests of knowledge trapped inside people's heads and make it more widely available via the Web. Rather than the often anonymous group effort that makes up a Wikipedia article entry, Knol seeks to pull out that knowledge primarily from one specific head.

Moves by Google into mobile phones with Android and the bid for mobile spectrum in the United States should be welcomed, because they bring new competition into a traditional market; likewise Google's attempts to break into radio and TV advertising. Knol, on the other hand, brings the power of Google into a marketplace that is already rich with competition, and a marketplace where Google can use its might to crush that competition by favoring pages from Knol over others on what is the world's most popular search engine.

Google is also looking to have Knol references put in Wikipedia. If that happens, which is likely if the quality of Knols is high, then it will migrate people away from Wikipedia and towards Knol. If the information is more reliable, then this can't be viewed as a bad migration.

It looks like a fantasy scenario, but it is not that much of a stretch. Danny Sullivan from Search Engine Land has been conducting some tests with Knol pages, and the results are surprising: all of the Knol pages that he created as test pages were ranking on the first page of Google's results after one day.

On the more consumer side, Google could use the atomized bits of knowledge (knols) created by authors to fuel a more semantically rich Web of connections. With tags, ratings, comments and other rich metadata and Semantic Web technologies, or even just the statistical approach Google prefers, knols could provide a framework for more complex and even natural language queries.

Some new publishers decide to license their work via Creative Commons (hoping to be paid back through the links economy), but Google wants no part in that! All outbound links on Knol are nofollow, so even if a person wants to give you credit for your work, Google makes it impossible to do so.

While Wikipedia and Knol share some attributes, Google is a business, so where's the money? Authors can -- at their discretion -- sign up with Google's AdSense program, let Google serve up advertisements next to their Knols, and possibly earn some revenue for sharing their knowledge.

This is not to say that all of Knol will be spam. Indeed, it's likely that the prominence of having content within a Google-hosted service may attract some outstanding authors. Manber certainly expects this, saying that he hopes content is created that will be so good that Google itself will rank it tops in searches.

via gistweb





Cuil, the start-up founded by Tom Costello and two former Google employees, Anna Patterson and Russell Power, unveiled a search engine that claims to have more than 120 billion pages in its index. According to Cuil, that's "three times as many as Google and ten times as many as Microsoft."



Cuil, pronounced "cool," is an old Irish word for knowledge and is the brainchild of husband-and-wife team Tom Costello and Anna Patterson. The duo have an impressive combination of experience in Internet search. CEO Costello has an extensive background in developing and researching search engines at Stanford University and IBM. Patterson is a former Google employee and acting President and COO. The Cuil co-founders added the expertise of Russell Power, a former colleague of Patterson's from Google.

It needs to be mentioned that Anna Patterson is a former employee of Google and that the last search engine created by Anna impressed Google so much that the industry leader decided to buy the technology in 2004. The Cuil founders have stated that around 120 billion web pages have been used to build up Cuil's index, and they are of the opinion that the figure is far bigger than Google's or Microsoft's.

A quick hands-on with Cuil showed that the best thing about the new search engine so far is its interface and design. Searching isn't quite as effective yet as Google on most subjects, but it's still decent if you're searching for the most popular items on a particular topic.

"Cuil's goal is to solve the two great problems of search: how to index the whole Internet - not just part of it - and how to analyze and sort out its pages so you get relevant results." Cuil thinks that today's search engines can't index all the information that is available on the web (more than one trillion pages, according to Google). Even Google admits that it's selective: "many [web pages] are similar to each other, or represent auto-generated content that isn't very useful to searchers".

Cuil isn't the first Google rival to launch this year. Wikia Search, a highly anticipated search engine from Wikipedia founder Jimmy Wales, made its official debut in January. Wikia Search hopes to provide better search results by allowing a community of users to index pages using their Web page rankings and other suggestions, as well as its own indexing of the Web.

resources: gistweb


SEROUNDTABLE

Does the Amount of Content Matter for SEO?

Everyone emphasizes that content is king. But does the amount of content make a difference?
The number of pages is insignificant. Content still is king, but the amount is up for discussion.

When it comes to content, quantity is not the concern. Instead quality is of utmost importance. As forum member Torka puts it, the most valuable sites to search engines (and users, of course) are those that offer useful and original content, provide a useful service, and sell products that people want or need.

[Read Full Entry]

redflymarketing

A Press Release is NOT a PR Campaign

There’s a pretty common misconception (especially on the Web, where press releases are booming) that one press release alone is supposed to bring massive exposure, traffic, and links. Well, sorry folks. In the vast majority of cases, that’s just not how it works! There is a big difference between a press release and a PR campaign, especially online. When asked how effective a press release really is, I hear the following kinds of things a lot, generally from people who issued their first press release without getting the results they were hoping for…

[Read full entry]

SEOMOZ
The Evil Side of Google? Exploring Google's User Data Collection

Google Inc. is first and foremost a data company. In the past, it competed on a level playing field by manipulating publicly available data better than its competition. By doing this, it had unprecedented success. Enter Web 2.0. Hard drives, processors, bandwidth and even workers are now all relatively inexpensive. This has caused the barriers to entry in the search field to drastic...

Read Full Entry

Searchenginewatch

More Usability and SEO AU NATURAL
More and more SEOs are becoming usability experts. Here's why.

Read full entry

Searchenginewatch

No, The Toolbar Does Not Lead Google To Index My Content

Matt Cutts of Google has posted another one of his theory-debunking posts, this time on the Google Toolbar aiding Google in indexing content. Matt said, "if Ken Simpson is implying that the Google Toolbar led to these urls being crawled, then he’s mistaken." Ken Simpson was paraphrased in an.

Read full entry

searchenginejournal

Google Penalties and Building Trust : Ongoing SEO Task?

Google has been known to issue all kinds of penalties over the years; even Search Engine Journal was penalized a while back (or at least our Google Toolbar PageRank was). In addition to Google PageRank penalties, there are all kinds of reasons to get penalized by Google, which range from building too many obvious spam links to publishing duplicate content, hoaxes or untrusted material.

Read Full Entry

searchenginejournal

Google Buying Digg for $200 Million?

Rumors are spreading all over the blogosphere that Google and Digg are talking acquisition again, with Google buying Digg for $200 million and bringing it in under the Google News umbrella, as TechCrunch is reporting (no word on whether the two parties are holed up in a secret hotel room outside of Mountain View or not).

Read Full Entry