Google and Harvard University have launched a searchable database of over five million books – about four percent of all books that have ever been printed.
The archive can be searched with a new tool called Ngram Viewer, which the developers say will allow researchers to identify and quantify cultural trends.
Google acknowledges that the analysis of literature involves a lot more than numbers, but says its new tool opens up a field it calls ‘culturomics’ – a quantitative analysis of particular phrases and therefore of ideas.
The database itself includes 5.2 million books in Chinese, English, French, German, Russian and Spanish, published between 1800 and 2000. The datasets contain phrases of up to five words with counts of how often they occurred in each year.
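To make that structure concrete, here is a minimal Python sketch of how per-year counts of phrases up to five words long might be built. It is not Google's actual pipeline: the function names, the whitespace tokenizer and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

MAX_N = 5  # the datasets hold phrases of up to five words


def ngrams(text, max_n=MAX_N):
    """Yield every phrase of 1..max_n consecutive words in text.

    Splitting on whitespace is a simplifying assumption; a real
    tokenizer would also have to deal with punctuation.
    """
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def count_by_year(books):
    """Map each year to a Counter of phrase occurrences.

    `books` is an iterable of (year, text) pairs.
    """
    counts = defaultdict(Counter)
    for year, text in books:
        counts[year].update(ngrams(text))
    return counts


# Hypothetical toy corpus, invented for illustration only.
books = [
    (1900, "the piano and the violin"),
    (1900, "the piano"),
    (1950, "the guitar and the piano"),
]
counts = count_by_year(books)
```

Looking up `counts[1900]["the piano"]` then gives the number of times that two-word phrase occurred in the toy books dated 1900.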
It’s possible, therefore, to quantify trends such as the comparative popularity of different musical instruments, or the rise in the popularity of tofu versus hot dogs.
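A comparison of that kind could be sketched in Python as follows. The figures are invented, and dividing each count by the total number of words printed that year is an assumption made here so that years with more books do not dominate; the article does not say how the tool normalizes its counts.

```python
def relative_frequency(phrase_counts, totals):
    """Return year -> share of that year's words taken up by the phrase.

    Years missing from phrase_counts are treated as zero occurrences.
    """
    return {
        year: phrase_counts.get(year, 0) / totals[year]
        for year in sorted(totals)
        if totals[year] > 0
    }


# Hypothetical numbers, invented for illustration; not from the dataset.
totals = {1960: 1000, 1980: 2000, 2000: 4000}
tofu = {1960: 1, 1980: 6, 2000: 20}
hot_dogs = {1960: 5, 1980: 8, 2000: 12}

tofu_share = relative_frequency(tofu, totals)
hot_dog_share = relative_frequency(hot_dogs, totals)

# Years in which "tofu" takes a larger share than "hot dogs".
tofu_ahead = [y for y in tofu_share if tofu_share[y] > hot_dog_share[y]]
```

With these made-up figures, `tofu_ahead` contains only the year 2000, the point at which the rising phrase overtakes the falling one.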
Users can search the entire database, or a specific section such as American English.
“The Ngram Viewer lets you graph and compare phrases from these datasets over time, showing how their usage has waxed and waned over the years,” says engineering manager Jon Orwant.
“One of the advantages of having data online is that it lowers the barrier to serendipity: you can stumble across something in these 500 billion words and be the first person ever to make that discovery.”