
Digitized book project unveils a quantitative "cultural genome"

Online tool developed by Harvard and Google can identify cultural trends across the centuries

Cambridge, Mass. - December 16, 2010 - Researchers have created a powerful new approach to scholarship, using approximately 4 percent of all books ever published as a digital “fossil record” of human culture. By tracking the frequency with which words appear in books over time, scholars can now precisely quantify a wide variety of cultural and historical trends.

The four-year effort, led in part by Erez Lieberman Aiden (S.M. '10, Ph.D. '10), principal investigator of the Laboratory-at-Large at Harvard’s School of Engineering and Applied Sciences (SEAS) and a junior fellow in Harvard’s Society of Fellows, is described this week in the journal Science.

The team, comprising researchers from Harvard, Google, Encyclopaedia Britannica, and the American Heritage Dictionary, has already used their approach—dubbed “culturomics,” by analogy with genomics—to gain insight into topics as diverse as humanity’s collective memory, the adoption of technology, the dynamics of fame, and the effects of censorship and propaganda.

“Interest in computational approaches to the humanities and social sciences dates to the 1950s,” says co-author Jean-Baptiste Michel, a postdoctoral researcher based in Harvard’s Department of Psychology and Program for Evolutionary Dynamics. “But attempts to introduce quantitative methods into the study of culture have been hampered by the lack of suitable data. We now have a massive dataset, available through an interface that is user-friendly and freely available to anyone.”

Google will release a new online tool to accompany the paper: a simple interface that enables users to type in a word or phrase and immediately see how its usage frequency has changed over the past few centuries.
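
As a rough illustration of the measure behind such a tool, the usage frequency of a word in a given year can be sketched as its count divided by the total number of words printed that year. The Python sketch below is not the Harvard-Google pipeline; the function name `yearly_frequency`, the `(year, text)` input format, and the simple whitespace tokenization are all assumptions made for illustration.

```python
from collections import Counter


def yearly_frequency(books, term):
    """Return {year: relative frequency of `term`} for a toy corpus.

    `books` is an iterable of (year, text) pairs (a hypothetical format).
    Words are split on whitespace and lowercased; a real corpus would
    need far more careful tokenization.
    """
    term = term.lower()
    hits = Counter()    # occurrences of the term per year
    totals = Counter()  # all words per year
    for year, text in books:
        words = text.lower().split()
        totals[year] += len(words)
        hits[year] += words.count(term)
    # Skip years with no words to avoid dividing by zero.
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}


# Invented three-book corpus, for illustration only.
toy_books = [
    (1900, "the cat sat"),
    (1900, "a cat"),
    (1950, "the dog ran far"),
]
cat_by_year = yearly_frequency(toy_books, "cat")
```

Dividing by the yearly total matters because the number of books published changes greatly over time; raw counts alone would mostly reflect how much was printed in a given year.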

"The collaboration between the Google Books Search Team and Drs. Lieberman and Michel illustrates the enormous amount we can learn about ourselves and our society—indeed, even our culture—from algorithmic analysis of the Google Books corpus," says Alfred Spector '76, Vice President of Research at Google. "In part because of this excellent work, Google sees an enormous future for the use of analytical techniques in the humanities and social sciences."


“While browsing this cultural record is fascinating for anyone interested in what’s mattered to people over time,” adds Aiden, “we hope that scholars of the humanities and social sciences will find this to be a useful and powerful tool.”

This dataset, which is available for download, is thousands of times larger than any previous historical corpus. It is based on the full text of about 5.2 million books, with more than 500 billion words in total.

About 72 percent of its text is in English, with smaller amounts in French, Spanish, German, Chinese, and Russian. It is the largest data release in the history of the humanities, the authors note, a sequence of letters 1,000 times longer than the human genome.

The paper describes the development of a new, computer-aided approach to cultural analysis and surveys a vast range of applications, focusing on the past two centuries. The team’s findings include:

• Some 8,500 new words enter the English language annually, fueling a 70 percent growth of the lexicon between 1950 and 2000. But many of these million-plus words can’t be found in dictionaries.

“We estimated that 52 percent of the English lexicon, the majority of words used in English books, consists of lexical ‘dark matter’ undocumented in standard references,” the researchers write in Science.

• Humanity is forgetting its past faster with each passing year. The Harvard-Google team tracked the frequency with which each year from 1875 to 1975 appeared, finding that references to the past decrease much more rapidly now than in the 19th century. References to “1880” didn’t fall by half until 1912 (a lag of 32 years), but references to “1973” fell to half their peak just a decade later, in 1983.

• More recent innovations spread more quickly. For instance, inventions at the end of the 19th century spread more than twice as fast as those in the early 1800s.

• Modern celebrities are younger and more famous than their 19th-century predecessors, but their fame is shorter-lived. Celebrities born in 1950 initially achieved fame at an average age of 29, compared to 43 for celebrities born in 1800. But their fame also disappears faster, with a “half-life” that is increasingly short.

• Culturomics is a powerful tool for automatically identifying censorship and propaganda. For example, Jewish artist Marc Chagall was mentioned just once in the entire German corpus from 1936 to 1944, even as his prominence in English-language books grew roughly fivefold. Evidence of similar suppression is seen in Russian with regard to Leon Trotsky; in Chinese with regard to Tiananmen Square; and in the US with regard to the “Hollywood Ten,” a group of entertainers blacklisted in 1947.
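
The “forgetting” measurement in the list above, the lag between the peak of references to a year and the point at which they fall to half that peak, can be sketched in a few lines of Python. This is a hedged illustration only: the function `years_to_half_peak` is hypothetical, it measures the lag from the peak year of the series, and the sample numbers are invented so that they reproduce the 32-year lag quoted for “1880.” They are not data from the study.

```python
def years_to_half_peak(series):
    """Find when a frequency series first drops to half its peak.

    `series` maps year -> frequency. Returns a tuple
    (peak_year, half_year, lag_in_years), or None if the series
    never falls to half its peak after the peak year.
    """
    peak_year = max(series, key=series.get)
    half = series[peak_year] / 2
    for year in sorted(y for y in series if y > peak_year):
        if series[year] <= half:
            return peak_year, year, year - peak_year
    return None


# Invented frequencies (arbitrary units) for mentions of "1880".
toy_1880 = {1880: 10.0, 1890: 8.0, 1900: 7.0, 1912: 5.0, 1920: 4.0}
result_1880 = years_to_half_peak(toy_1880)
```

With the invented series above, the function reports a peak in 1880 and a half-peak year of 1912. Running the same calculation on a series for a later year would show a shorter lag if mentions of that year fade faster, which is the pattern the researchers describe.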

Aiden, a former SEAS graduate student, earned his Ph.D. in the Medical Engineering and Medical Physics (MEMP) program, which is part of the collaborative Harvard-MIT Division of Health Sciences and Technology (HST). Earlier this year, he was awarded the 2010 Lemelson-MIT Student Prize for innovation and inventiveness. In 2009, he was recognized by the editors of Technology Review magazine as among the top innovators under the age of 35.

Spector, Google's Vice President of Research, earned his bachelor's degree at SEAS, in the field of applied mathematics.

Aiden and Michel's co-authors on the Science paper are Aviva Presser Aiden, Adrian Veres, Steven Pinker, and Martin A. Nowak at Harvard; Google’s Jon Orwant, Matthew K. Gray, Dan Clancy, Peter Norvig, and the Google Books Team; Yuan Kui Shen at the Massachusetts Institute of Technology; Joseph P. Pickett, executive editor of the American Heritage Dictionary; and Dale Hoiberg, editor-in-chief of Encyclopaedia Britannica.

The work was funded by Google, a Foundational Questions in Evolutionary Biology Prize Fellowship, Harvard Medical School, the Harvard Society of Fellows, a Fannie and John Hertz Foundation Graduate Fellowship, a National Defense Science and Engineering Graduate Fellowship, a National Science Foundation Graduate Fellowship, the National Space Biomedical Research Institute, the National Human Genome Research Institute, the Templeton Foundation, the National Institutes of Health, and the Bill and Melinda Gates Foundation.

Links to the data and browser are available at www.culturomics.org.

###

This article is based on a press release from the Harvard Faculty of Arts and Sciences (FAS).
