Search is an interesting problem to explore. Fifteen years ago, Clifford Stoll wrote an article for Newsweek entitled, “The Internet? Bah! Hype alert: Why cyberspace isn’t, and will never be, nirvana”. Stoll’s article is often referenced as an example of getting it completely wrong. Today, even Stoll can look back and admit, “Wrong? Yep.”
It is easy to play Monday morning quarterback, but rereading Stoll’s article, search stands out as one of his central complaints.
What the Internet hucksters won’t tell you is that the Internet is one big ocean of unedited data, without any pretense of completeness. Lacking editors, reviewers or critics, the Internet has become a wasteland of unfiltered data. You don’t know what to ignore and what’s worth reading. Logged onto the World Wide Web, I hunt for the date of the Battle of Trafalgar. Hundreds of files show up, and it takes 15 minutes to unravel them—one’s a biography written by an eighth grader, the second is a computer game that doesn’t work and the third is an image of a London monument. None answers my question…
About a year later, two PhD students at Stanford, Larry Page and Sergey Brin, would change the world by solving Stoll’s problem with PageRank, an idea that later became the backbone of Google. Google describes PageRank as relying on:
… [the] uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page’s value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves “important” weigh more heavily and help to make other pages “important”.
Stoll’s critical misstep was the assumption that the Internet would never be organized.
Flash forward 15 years and the Internet is still evolving. With this evolution, however, comes a new problem: PageRank is too slow. PageRank works well for websites that are updated two or three times a week, but it takes a long time for Google to index the entire Internet. On average, Google will index a given website only about once a week. Higher-profile websites such as CNN, the NY Times, and other major news outlets are crawled more regularly, sometimes even in real time, but most websites are crawled far less frequently.
Twitter recently had breakout success as a real-time search engine. Unlike Google’s PageRank (which needs to seek out and find stories to index), Twitter is supplied with content by its users. Because Twitter users tell Twitter what is going on at that moment in time, Twitter is able to instantly calculate what is trending and provide real-time results.
Both Facebook and Google have recognized this potential and started creating competing products. Facebook’s new privacy settings and API allow users to discover content within Facebook’s real-time network. Similarly, Google recently announced Buzz, a product designed to allow Google users to share information within a network of friends.
So where are we going? We face a problem today similar to the one Stoll faced in 1995. How do you separate the unwanted information flowing through Twitter, Facebook and Google Buzz from information that is truly valuable? The most important lesson from Stoll is to avoid assuming that we will never be able to filter this information; instead of abandoning real-time search, let’s figure out a way to make it work.
There are many possible solutions. The first is to build upon the idea of PageRank and invert the content flow. Google could build a public API that allows content providers to easily submit content into Google’s system. Instead of forcing Google to seek out new pages, content providers could tell Google when their pages are updated.
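As a rough sketch of what such an inverted flow might look like, here is the kind of “update ping” a content provider could push to a search engine. Everything here is hypothetical: the endpoint, the payload fields, and the function name are invented for illustration, not any real Google API.

```python
import json
import time

# Hypothetical endpoint a search engine might expose for update pings.
SUBMIT_URL = "https://search.example.com/api/content/submit"

def build_update_ping(site_url, page_url, updated_at=None):
    """Build the JSON payload a content provider would push when one
    of its pages changes, instead of waiting to be crawled."""
    return json.dumps({
        "site": site_url,
        "page": page_url,
        "updated_at": updated_at or int(time.time()),
    })

# A provider would POST this payload to SUBMIT_URL, e.g. via
# urllib.request.urlopen(SUBMIT_URL, data=ping.encode()).
ping = build_update_ping(
    "https://example.com",
    "https://example.com/story-42",
    updated_at=1266000000,
)
```

The point of the sketch is the direction of the data: the provider notifies the engine the moment a page changes, so freshness no longer depends on crawl frequency.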
Reversing the PageRank workflow, however, is not a silver bullet. Systems like Google need to rethink the idea of PageRank. Instead of ingesting “pages”, these systems need to ingest an idea or concept from those pages. Future systems of search need to be able to figure out what a story is talking about and assign that idea a ranking, not the webpage or article itself.
By ranking topics instead of pages, search systems would be able to rank real-time information by observing similar topics, summaries, and stories across multiple sources. Essentially, when websites A, B and C all start talking about the same story, that story becomes a trending topic. PageRank then becomes a method of weighting individual sources: if website A has a higher PageRank than website B, A’s coverage of a story counts for more.
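A minimal sketch of that weighting might look like this. The source names and rank values are made up, with a per-source score standing in for PageRank:

```python
# Hypothetical per-source authority scores, standing in for PageRank.
SOURCE_RANK = {"siteA": 0.9, "siteB": 0.4, "siteC": 0.2}

def topic_score(sources_mentioning):
    """Score a topic by summing the rank of every source discussing it,
    so a mention from a high-ranked site counts for more."""
    return sum(SOURCE_RANK.get(source, 0.1) for source in sources_mentioning)

# A story picked up by all three sites, including the high-ranked one,
# outscores a story mentioned only by the two lower-ranked sites.
trending = topic_score(["siteA", "siteB", "siteC"])
niche = topic_score(["siteB", "siteC"])
```

The design choice is that the unit being scored is the story, while PageRank-style authority survives as the weight each source contributes to it.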
Additionally, sources can be ranked within each topic or story. For example, as I mentioned in my previous post, I recently developed the Photoshop World iPhone app. My PageRank is likely not as high as that of some of the people who recently wrote about the app on their blogs. If I were to make an announcement about the app, however, these new systems should recognize me as the highly ranked “source” on the subject, since I am the developer. Similarly, real-time sources should be ranked within their given areas of expertise.
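One way to sketch that topic-scoped authority: look up a source’s weight per topic, falling back to its global rank when no topic-specific score exists. All names and scores below are invented for illustration:

```python
# Hypothetical global ranks and topic-specific expertise scores.
GLOBAL_RANK = {"developer_blog": 0.2, "big_news_site": 0.8}
TOPIC_RANK = {("developer_blog", "photoshop-world-app"): 0.95}

def source_weight(source, topic):
    """Prefer a topic-specific rank when one exists; otherwise fall
    back to the source's global rank (0.1 for unknown sources)."""
    return TOPIC_RANK.get((source, topic), GLOBAL_RANK.get(source, 0.1))

# On its own app, the developer's blog outweighs the bigger outlet;
# on unrelated topics it falls back to its modest global rank.
on_topic = source_weight("developer_blog", "photoshop-world-app")
off_topic = source_weight("developer_blog", "election-results")
```

Under this sketch, the app’s developer outranks a larger outlet on that one story without inheriting any extra authority elsewhere.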
Fifteen Years Later
Who knows, fifteen years from now we might be laughing at articles that talked about how ridiculous tools like Twitter, Facebook and Google Buzz really are: articles complaining about random tweets from someone’s cat or about how loud a stranger’s dog barks; articles that put too much emphasis on the noise of real-time search and placed too little value on the information that remains once that noise is filtered out. Looking forward, I believe that when we find a way to cancel out the noise, real-time search will become vital to all forms of business.
As the Internet and its uses evolve, so will search. As search evolves, so will the Internet. It is an endless cycle, but one that continues to allow new innovation.