Balancing Big Data and Deep Data
David Brooks has a nice column in the New York Times on what data can’t do. It is a welcome attempt to temper the recent wild enthusiasm about “big data” and perhaps a more long-standing reliance, in business and other fields (like security), on data analysis to impress, comfort, or deceive shareholders and citizens. Brooks notes the things that big data is not very good at: identifying masterpieces rather than trends; dealing with unreplicated events; and achieving a holistic understanding of a complex situation. On the other hand, we have seen, especially recently, the fabulous things that can be done with big data: solving longstanding riddles, finding lost needles (Brooks references Nassim Taleb’s claim that big data gives us many more haystacks, which is true, but we now have much greater power to find the needles, so sometimes the number of haystacks is irrelevant), and exposing much wider audiences to ideas and art that likely would have been obscured in the past (fans or detractors of Psy’s “Gangnam Style” might argue whether this is a good or bad thing).
In Learning from the Octopus, I mostly look at the benefits of massive, decentralized data gathering, whether by organisms in nature or by networked organizations of people. To varying extents, biological organisms, including humans, process massive amounts of observational data and use these data to construct subconscious patterns and scenarios. How our immediate reality coincides with or deviates widely from these patterns leads us to make changes (in how we move, where a quarterback throws a ball, how a dog determines whether to trust someone, etc.) or adapt to a novel situation. But I also note that organisms in nature, for 3.5 billion years, haven’t been wasting their time trying to use data to predict the far and uncertain future.
This far and uncertain future is where holistic pictures of complex situations emerge. For example, our understanding of recent climate change, and our limited predictions of what is to come, could not be developed solely from massive climatological databases. There is, for instance, also the question of how the biosphere has been reacting to these climate changes (itself now a large and growing database of “fingerprints” of climate impact), and how human behaviors and economic decisions in the past and future will shape the climate system. Right now, we have a lot of these data, but not nearly enough, based on the data alone, to make a firm decision about what we should do. But that doesn’t mean we should do nothing.
As I argue in the book, nature has multiple options when confronted with challenges that its own internal database can’t handle. For example, an organism can create a symbiotic partnership with another organism. Such symbioses will be essential in solving climate and other complex challenges, even where we lack all the data we’d want. One small example: colleagues of mine at the University of Arizona have started a project with Navajo Indians to install solar membrane distillation desalination pumps where windmills once pumped water so saline that not even cows would drink it, and tribe members had to drive an average of 40 miles to bring bottled water back to their homes. The project benefits the Navajo Nation whether or not particular data-driven scenarios about climate change in northeastern Arizona are accurate, and it benefits the University, which wants to demonstrate its prowess in renewable energy technology R&D. In other words, symbiosis helps partnering organisms (or organizations) solve complex problems without needing all the data.
This balance between mining data in unprecedented and amazing ways and the need to understand a situation holistically in order to solve complex challenges is driving the transformation of scientific inquiry, as I describe in my other recent book, Observation and Ecology: Broadening the Scope of Science to Understand a Complex World (2012, Island Press). In that book, I show how life scientists are now combining massive, automated, and technological data gathering and analysis capabilities with the old-fashioned practice, long dismissed in scientific circles, of simple natural history, which paleontologist Geerat Vermeij calls “observation with the brain in gear” in his contribution to the book. This combination of broad knowledge and deep knowledge, of technical and human capabilities, gives us our best chance at understanding an unpredictable and rapidly changing world.