If you haven't read the Google Blog post announcing its new project, 'Knol', billed as a potential Wikipedia-killer, or haven't heard the media buzz and speculation that followed, then you've been living under a rock. In short, Google wants to create a user-generated encyclopedia site much like Wikipedia, but with a twist. Instead of allowing anyone to change any content on the site, the Knol project will host articles on just about everything, each with an author who is responsible for the upkeep and quality of the article content. This might give Knol a more authoritative edge over Wikipedia. Google is also going to allow ads on the article pages, and the author can get a cut of the ad revenue.
I won't delve further into the technical details of how Knol differs from Wikipedia or join in the argument about which one is the better model. What I have been wondering is why Google has decided to start up this project when Wikipedia is already doing a decent job at this. Sure, Google's mission is to organize the world's information and make it universally accessible and useful. If it controls (or at least hosts) information, that information is a lot easier to organize and make accessible to the masses. Or maybe Google feels that Wikipedia lacks the authoritative quality whose absence teachers across the country gripe about. Anyone can change anything on Wikipedia, so how can it always be right? Maybe Google wants to provide a more reliable, more truthful (and referenceable) user-generated encyclopedia model.
OR, maybe Google is in it for the money. After all, Google is a for-profit organization in the center of a capitalist society where earning another buck is the number one priority. Why would Google want to replace Wikipedia? Perhaps Google is sending Wikipedia lots of traffic. If Google found a way to retain much of that traffic, then it could greatly increase its page views, and thus its ad revenue.
So, just how much traffic is Google sending Wikipedia's way?
Terms of the Data Collection
I decided to collect data about how Google ranks Wikipedia for a wide range of search terms. I searched for a decent word list and settled on the 2of12 word list from the Official 12Dicts Package ver 5.0. Next, I used the Google AJAX Search API to create a process that automates the retrieval of search results for each word in the list. Because of limitations of the API, I could only retrieve the first eight search results. In the results for each word, the process looked for the first result that linked to Wikipedia and recorded the rank of that result, whether it was first, second, third, etc. Some words did not have any results that linked to Wikipedia. I collected the following data between 12:54 and 18:11 CST on December 17, 2007 using this method.
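The ranking step described above can be sketched in a few lines of Python. This is only an illustration, not the script I actually ran: the function name `first_wikipedia_rank` is made up for this example, and it assumes the search results have already been fetched and reduced to an ordered list of URLs, so no search API calls are shown.

```python
from urllib.parse import urlparse


def first_wikipedia_rank(result_urls, max_results=8):
    """Return the 1-based rank of the first Wikipedia result, or None.

    result_urls is assumed to be the ordered list of result URLs for one
    search term. Only the first max_results entries are inspected, which
    mirrors the eight-result limit mentioned above.
    """
    for rank, url in enumerate(result_urls[:max_results], start=1):
        host = (urlparse(url).hostname or "").lower()
        # Match wikipedia.org and any language subdomain such as en.wikipedia.org.
        if host == "wikipedia.org" or host.endswith(".wikipedia.org"):
            return rank
    return None  # no Wikipedia link in the inspected results
```

A word whose first Wikipedia link is the second result would be recorded with rank 2, and a word with no Wikipedia link in the first eight results would be recorded as "Not in Top 8".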
The Data
Below is a summary of the search results rankings in table and graph format. You can also download the full data set that I collected by following the link under the graphs. It is the word list containing the rank for each word. You can get the word list sorted alphabetically by word or sorted by rank.
Number of words in list: 41238

| Rank of first Wikipedia result | Share of words in list |
| --- | --- |
| Rank 1 | 29.17% (12030 words) |
| Rank 2 | 11.30% (4658 words) |
| Rank 3 | 8.73% (3599 words) |
| Rank 4 | 5.13% (2115 words) |
| Rank 5 | 3.31% (1365 words) |
| Rank 6 | 2.23% (920 words) |
| Rank 7 | 1.66% (683 words) |
| Rank 8 | 1.20% (496 words) |
| Not in Top 8 | 37.28% (15372 words) |



Download the word list with search results rank for the first Wikipedia result for each word:
Sorted Alphabetically by Word OR Sorted by Rank
Quick Observations
* Nearly 1 in every 3 words (29.17%) in the word list has a Wikipedia result as the top result.
* Nearly 1 in every 2 words (49.19%) in the word list has a Wikipedia result in the top 3 results.
* Nearly 2 in every 3 words (62.72%) in the word list have a Wikipedia result in the top 8 results.
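The cumulative percentages in these observations come straight from the per-rank counts in the table. As a quick check, here is a minimal Python sketch that recomputes them; the helper name `cumulative_share` is just for illustration.

```python
# Per-rank word counts and list size, copied from the table above.
counts = {1: 12030, 2: 4658, 3: 3599, 4: 2115, 5: 1365, 6: 920, 7: 683, 8: 496}
total_words = 41238


def cumulative_share(counts, total, top_n):
    """Percentage of words whose first Wikipedia result is ranked top_n or better."""
    hits = sum(n for rank, n in counts.items() if rank <= top_n)
    return 100.0 * hits / total


for top_n in (1, 3, 8):
    print(f"Top {top_n}: {cumulative_share(counts, total_words, top_n):.2f}%")
```

Rounded to two decimal places, this gives the 29.17%, 49.19%, and 62.72% figures quoted above.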
Word List Selection and an Assumption on Search Patterns
I probably didn't spend enough time searching for the most appropriate word list, although I didn't pick the first one I found either. I wanted something that represented the English language well, like a good clean dictionary word list, rather than one of the many techno-jargon word lists out there. I found the 12dicts package and continued searching, then came back to it when I didn't find anything else I liked better. In hindsight, what I probably should have looked for is a word list containing a sample of real Google search terms.
Keep in mind that I have a good word list, but it is not the most appropriate word list for this study. To relate this search results data to actual search patterns, you have to assume that the word list I used appropriately represents the words users search for and that each word is searched for equally often on a given day. This assumption is not true, so conclusions based on it may not fully represent real-world numbers. It is difficult to improve the accuracy of these results without a word list that more accurately represents the words people search for, along with a weight for each word that represents how often it is searched for within a certain period of time.
Further Observations
So, given the above assumption, we can draw further conclusions:
SearchEngineWatch.com says that Google has 91 million US-based searches per day. Applying the percentages above to that figure means that:
* 26,544,700 Google searches per day contain a Wikipedia link as the top result
* 36,827,700 Google searches per day contain a Wikipedia link within the top two results
* 44,762,900 Google searches per day contain a Wikipedia link within the top three results
* 49,431,200 Google searches per day contain a Wikipedia link within the top four results
* 52,443,300 Google searches per day contain a Wikipedia link within the top five results
* 54,472,600 Google searches per day contain a Wikipedia link within the top six results
* 55,983,200 Google searches per day contain a Wikipedia link within the top seven results
* 57,075,200 Google searches per day contain a Wikipedia link within the top eight results
* 33,924,800 Google searches per day do not contain a Wikipedia link within the top eight results
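The arithmetic behind these figures is simple: take the cumulative share of words (rounded to two decimal places, as in the table) and multiply it by 91 million. Here is a minimal Python sketch of that calculation; the function name is illustrative, and it bakes in the same shaky assumption that every word in the list is searched for equally often.

```python
# Per-rank word counts and list size from the data summary.
counts = {1: 12030, 2: 4658, 3: 3599, 4: 2115, 5: 1365, 6: 920, 7: 683, 8: 496}
total_words = 41238


def estimated_daily_searches(counts, total_words, top_n, daily_searches=91_000_000):
    """Estimate daily searches with a Wikipedia link in the top_n results.

    Assumes every word in the list is searched for equally often, which
    is not true of real search traffic.
    """
    hits = sum(n for rank, n in counts.items() if rank <= top_n)
    # Round the share to hundredths of a percent, then use integer math
    # so the result is not affected by floating-point error.
    basis_points = round(10_000 * hits / total_words)
    return daily_searches * basis_points // 10_000


for top_n in range(1, 9):
    print(f"Top {top_n}: {estimated_daily_searches(counts, total_words, top_n):,}")
```

Swapping in a different daily search volume only requires changing the `daily_searches` argument.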
For any given search term, you are more likely than not to have a Wikipedia result within the first four results on the first page. Many people first click links near the top of the first page of results, which means that a lot of traffic is directed to Wikipedia.
I am not suggesting that Google's motivation is money, but I have to wonder if it is playing a part in its decision to create a competing service to Wikipedia, just to displace Wikipedia results with Knol entries in search results and reclaim page views for ad revenue.
