What use is a library without a librarian, or an encyclopedia without an index? Scale that prospect up to the realm of web analytics, astronomy, high-speed finance, or even basketball statistics, and the problem becomes clear.
As research scientist Fernando Pérez put it, “Regardless of the amount of data we have… we still only have two eyeballs and one brain.”
Pérez, of the University of California, Berkeley, spoke January 24 at a symposium titled “Weathering the Data Storm: The Promise and Challenges of Data Science” hosted by the Institute for Applied Computational Science (IACS) at the Harvard School of Engineering and Applied Sciences (SEAS). The annual symposium marks the culmination of two weeks of events at IACS called ComputeFest.
Leaders from academia and a range of industries spoke about the power of computational science and engineering to solve real-world problems.
For example, the Manhattan power grid contains 21,000 miles of underground electrical cable, some of which is 130 years old. Given the human and financial costs of major outages, proactive maintenance becomes more important as the system ages. MIT statistician Cynthia Rudin described how she collected and analyzed data on these cables, manholes, inspections, and “trouble tickets” to generate a robust model that is currently the best predictor of power failures in New York City.
Another symposium presenter described a more desperate problem.
Humanitarians at UNICEF periodically send text messages to 245,000 Ugandans to solicit information about the state of their nation. When they asked, “Have you heard of any children being sacrificed in your community?” they received a chilling array of responses: some “yes,” some “no,” and a flood of cries for help.
UNICEF’s Ureport system of weekly polls gathers essential data on vulnerable populations in order to guide its outreach and direct limited resources to the people who need them most. The incoming text messages sometimes report famines, floods, Ebola outbreaks, evictions, and dried-up water sources—often begging for assistance.
“There just aren’t enough humans to read all of these messages and try to determine: is this something that requires immediate action?” said Bonnie Ray, director for cognitive algorithms at IBM’s T. J. Watson Research Center. Ray’s team worked with UNICEF to optimize the process of sorting and prioritizing the messages. The new system parses spelling errors, uses common word associations to understand synonyms, and incorporates conditional probability techniques to make intelligent assessments that quickly put the most urgent messages in front of the people who need to see them.
The information filtered by this system may not constitute “big data” on the scale of Facebook or Google, Ray noted, but “it’s too much for a human to do, and it is having a real impact on the lives of Ugandans.”
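As a rough illustration only (this is not the IBM or UNICEF system, and the sample messages, synonym table, and names below are invented for the example), the three ingredients Ray described can be sketched in a few lines of Python: fuzzy matching to absorb spelling errors, a small synonym table standing in for word associations, and Bayes' rule with naive independence assumptions to estimate how likely a message is to be urgent.

```python
import difflib
import math
from collections import Counter

# Invented toy examples; a real system would learn from many labeled messages.
TRAINING = [
    ("urgent", "no water in our village please help"),
    ("urgent", "flood destroyed homes we need help"),
    ("urgent", "children are sick with ebola send help"),
    ("routine", "yes"),
    ("routine", "no"),
    ("routine", "no i have not heard of this"),
    ("routine", "yes i heard about it on radio"),
]

# Hypothetical synonym table standing in for learned word associations.
SYNONYMS = {"assistance": "help", "aid": "help", "borehole": "water"}


class UrgencyClassifier:
    """Toy naive Bayes triage with spelling and synonym normalization."""

    def __init__(self, examples, synonyms=SYNONYMS):
        self.synonyms = synonyms
        self.label_counts = Counter()
        self.word_counts = {}
        for label, text in examples:
            self.label_counts[label] += 1
            words = text.lower().split()
            self.word_counts.setdefault(label, Counter()).update(words)
        self.vocab = set().union(*self.word_counts.values())
        # Words we can "snap" a misspelling to: vocabulary plus synonym keys.
        self.known = sorted(self.vocab | set(synonyms))

    def normalize(self, word):
        # Step 1: repair likely misspellings by closest known word.
        if word not in self.vocab and word not in self.synonyms:
            close = difflib.get_close_matches(word, self.known, n=1, cutoff=0.8)
            if close:
                word = close[0]
        # Step 2: map synonyms onto a single canonical word.
        return self.synonyms.get(word, word)

    def urgency(self, text):
        """Return the estimated P(urgent | message) under naive Bayes."""
        words = [self.normalize(w) for w in text.lower().split()]
        words = [w for w in words if w in self.vocab]  # ignore unknown words
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, counts in self.word_counts.items():
            # Laplace (add-one) smoothing avoids zero probabilities.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.label_counts[label] / total)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            log_scores[label] = score
        # Convert log scores to a normalized posterior probability.
        top = max(log_scores.values())
        weights = {lab: math.exp(s - top) for lab, s in log_scores.items()}
        return weights["urgent"] / sum(weights.values())


def rank_by_urgency(classifier, messages):
    """Sort messages so the most likely urgent ones come first."""
    return sorted(messages, key=classifier.urgency, reverse=True)


if __name__ == "__main__":
    clf = UrgencyClassifier(TRAINING)
    inbox = ["yes", "plese send asistance our watr is gone", "no"]
    for msg in rank_by_urgency(clf, inbox):
        print(f"{clf.urgency(msg):.2f}  {msg}")
```

In this sketch the misspelled plea rises to the top of the queue because "plese," "asistance," and "watr" are mapped back to words the model has seen in urgent messages. A production system would need far more training data, proper tokenization, and support for multiple languages.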
As computing power allows non-profits, businesses, and researchers to gather ever-larger troves of information, new challenges arise, in privacy and security, for example. (As Google research scientist Diane Lambert noted, “If you’ve ever put a query into Google, then you’ve been in an experiment.”) Meanwhile, the demand for reliable software that can make sense of ever-larger and more complex data sets continues to grow, as does the need for well-educated analysts who can deftly weave together the worlds of computer science, statistics, and other disciplines. This new breed of “data scientist” can not only guide important decisions, but also provide new tools for scientific inquiry or recognize hidden patterns in human behavior, demographics, and epidemiology.
“The underlying methods can be familiar techniques such as logistic regression or [Bayesian statistics], techniques that have been part of the standard statistics and machine learning curriculum for a long time,” explained Rachel Schutt, senior vice president of data science at the media conglomerate News Corp. Yet the vast scale of the data, the need for real-time analysis and implementation, and the way business decisions rapidly feed back into the data stream are all recent developments and require new types of experts, disrupting the traditional notion of a “quantitative analyst” or “statistician.”
“It’s a challenge for a lot of people working in these fields to welcome data science,” Schutt added, “but it also comes with a lot of promise.”
“It’s exciting to be present at the birth of a new discipline, not quite yet defined,” says SEAS Dean Cherry A. Murray, who established IACS in 2010 in response to several catalysts. “We are experiencing the convergence of ubiquitous computing power and cloud services at the same time that the connectivity of the Internet and the microelectronics revolution are enabling us to collect, store, interact with, and learn from massive streams of raw data,” she says.
At Harvard, rigorous scholarship in machine learning, advanced computational techniques, algorithms, and visualization is converging with studies in statistics, social science, and the humanities. “With knowledge from across these areas, graduates have the opportunity to inform decision making in science, business, or government settings, greatly enhancing our understanding of nature and of society,” says Murray.
Speakers at the symposium presented some tools that live-Tweeters in the crowd called “mind-blowing.”
Pérez wowed the audience with IPython, a comprehensive tool for streamlining the entire analysis process, from data exploration to publication. Jeffrey Heer, associate professor of computer science at the University of Washington, provided a tour of Data Wrangler, a clever cleanup tool for messy data sets. And Ryan Adams, assistant professor of computer science at Harvard SEAS, extolled the virtues of Bayesian optimization.
Adams raised a concern that seemed to resonate with the audience: As computational tools become more sophisticated, the field of data science risks alienating non-experts. Investigative journalists, for instance, have much to gain from accessible research tools.
Likewise, several speakers noted, it is important for practitioners of computational science and engineering to be able to accurately and engagingly communicate the results of an investigation to others outside their field.
“There’s an element we can learn from journalists—hearing how they tell stories and investigate and ask questions, and how they find what’s actually interesting to other people,” explained Schutt. “It’s important in communicating about data [to know] exactly what’s objective and what’s subjective… and [to make] sure you’re transparent about the data collection process and your modeling process.”
“It does require some education,” agreed Heer, “and doing that hand in hand with basic quantitative skills as well is incredibly important.”
At Harvard SEAS, graduate students can pursue a one-year master of science or two-year master of engineering in computational science and engineering (CSE). Doctoral candidates in the Graduate School of Arts and Sciences can also take a secondary field in CSE. Undergraduates can take courses such as “Data Science,” “Visualization,” “Data Structures and Algorithms,” “Introduction to Scientific Computing,” or “Statistics and Inference in Biology” as part of their liberal arts coursework. And graduates with deep and broad skills that go beyond number-crunching are in high demand.
In a changing economy, universities have a responsibility to foster these types of abilities in all students, says Murray, but there is another reason academia, not just business, must influence the evolution of data science.
“It is important to think deeply about and measure how ubiquitous computing and data are affecting society and our everyday lives, and how players in society interact to create social norms, disrupt old systems of social interaction and business models, and affect and interact with legal systems,” she explains. “This is why ‘data science’ cannot become its own narrow discipline, but will need to be intrinsically transdisciplinary—and why it is important for Harvard, in particular, to be focusing on the field.”
“Weathering the Data Storm” drew close to 500 attendees from Harvard, other Boston-area universities, and industry partners, as well as sponsorship from Liberty Mutual Insurance and VMware.
“The annual IACS symposium has become a cornerstone event for SEAS and Harvard,” says Hanspeter Pfister, An Wang Professor of Computer Science and director of IACS. “The impressive audience turnout and their active participation in the engaging panel discussions are compelling indications that there is a real interest in data science at Harvard and beyond.”
Videos from the event will be posted at http://computefest.seas.harvard.edu.