Could you explain the goals and focus of DataSpace?
The project's initial main objective is to enable scientists to easily access, aggregate, and reuse data across disciplines. The motivation behind the initiative was the tremendous amount of scientific information that is increasingly being captured and stored, whether that's happening automatically through sensors or through people typing on keyboards. In the past, that information often wasn't saved because storing it was simply too expensive. When I worked for IBM back in the 1960s, their first disk drive for the IBM 360 mainframe was the model 2311, which held about 10 megabytes and was the size of a washing machine. Clearly, thanks in part to companies like EMC, the world has changed dramatically in those few decades.
However, accessibility to scientific data hasn't advanced as much, in many ways. We have more and more information that we know less and less about. For instance, two blocks from my office is the Broad Institute, a genomics research center founded by MIT and Harvard. They now generate about a petabyte of data (about 100 million of those IBM 2311 disk drives) per year, and the growth rate of data has been rapidly increasing. In some scientific fields that rate of growth is even higher. But for researchers, especially those outside a given organization, it is still not easy to locate and reuse that information. We have several centuries of experience in publishing scientific conclusions, but the voluminous data that backs them up is still not published or made available in a systematic way.
Is the DataSpace initiative of interest only to the scientific community?
This is a wide-ranging project that will address many information management issues. For instance, how is information to be identified and shared? What kind of policies do people want to have in terms of sharing and protecting information? What's the best way to manage complexity? Is it possible to harness collective intelligence? The answers to these types of questions will have important implications beyond access to scientific data.
As part of this initiative we formed a DataSpace advisory board. One of the members is Dan Schutzer, executive director of the Financial Services Technology Consortium. The members of that consortium don't "do" science, but they do have a need to store and retrieve tremendous amounts of financial data. We think DataSpace will also be useful for nonscientific data, and the nonscientific applications may actually help underwrite the needed infrastructure, in the same way that the Internet, which started with a solely scientific focus, has become valuable in many other domains.
Why did you and MIT get interested in this problem?
Many scientists argue that important scientific advances are limited by the current diversity of data formats, management and access policies, tools for visualization and use, and preservation strategies, and by the inability to easily extract needed subsets from enormous data collections. The DataSpace infrastructure is intended to let researchers make effective use of all relevant data sources.
Several years ago MIT led the effort to develop DSpace, which was designed to provide MIT faculty and researchers with stable, long-term storage for their digital research—mostly in the form of documents. That open source effort has been so successful it has been adopted globally by more than 500 academic and research organizations. DataSpace is the logical next step to address similar needs for scientific data.
At MIT we believe in openness, though we realize that in practice there may be reasons to limit it. DataSpace will incorporate a rule-based system whereby data can be made available to the extent its originators feel is acceptable: it can be as limited or as open as they want. It will also help define the quality of the data and how it was obtained, so that people will know whether they are comparing apples to apples.
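To make the idea of originator-defined sharing rules concrete, here is a minimal sketch in Python. It is purely illustrative and is not DataSpace's actual design: the names `SharingRule`, `Dataset`, and `is_allowed`, the audience labels, and the action labels are all hypothetical. It assumes a default-deny policy in which the originator lists exactly what each audience may do.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SharingRule:
    """One originator-defined rule (hypothetical): an audience may perform an action."""
    audience: str  # assumed labels: "public" or "collaborators"
    action: str    # assumed labels: "read_metadata" or "read_data"


@dataclass
class Dataset:
    name: str
    owner: str
    collaborators: set = field(default_factory=set)
    rules: list = field(default_factory=list)


def audiences_for(dataset, user):
    """Return every audience label that applies to this user."""
    audiences = {"public"}
    if user in dataset.collaborators:
        audiences.add("collaborators")
    return audiences


def is_allowed(dataset, user, action):
    """Default deny: access is granted only to the owner or by an explicit rule."""
    if user == dataset.owner:
        return True
    audiences = audiences_for(dataset, user)
    return any(r.audience in audiences and r.action == action for r in dataset.rules)


# Example: metadata is open to everyone, raw data only to collaborators.
example = Dataset(
    name="example-dataset",
    owner="alice",
    collaborators={"bob"},
    rules=[
        SharingRule("public", "read_metadata"),
        SharingRule("collaborators", "read_data"),
    ],
)
```

With these rules, an outside user such as `"carol"` could discover the dataset through its metadata but could not read the data itself, while `"bob"` could do both. Making the data fully open would take one more rule, `SharingRule("public", "read_data")`.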
How would DataSpace help its users understand the context in which the data was created?
A key aspect of DataSpace is the use of metadata, that is, data about the data. An important task is building up a context definition that lets us define exactly what we mean and then refine it further. We think this is very important because even with scientific data we tend to take a lot for granted, and when you take the data out of its context (whether that's out of an institution or project, or to another country or another field), those assumptions can get lost or misunderstood.
So we have looked extensively at data semantics. For example, we have found that in geographic data, there are many different coordinate systems in use. Even the U.S. Army has two different coordinate systems in place, one used by the artillery command and the other by the missile command. As you might imagine, these issues can produce significant problems. There are many such issues, ranging from simple ones such as different measuring systems (e.g., meters vs. feet) to much more complex ones.
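The simplest case mentioned above, meters versus feet, can be sketched as follows. This is an illustrative example only, not DataSpace code: `Measurement`, `convert`, and `difference` are hypothetical names, and the only context metadata modeled here is the unit and a free-text source label. The point is that once the unit travels with the value, a comparison can reconcile the two contexts explicitly instead of relying on an unstated assumption.

```python
from dataclasses import dataclass

# Conversion factors to meters. The international foot is exactly 0.3048 m.
TO_METERS = {"m": 1.0, "ft": 0.3048}


@dataclass(frozen=True)
class Measurement:
    value: float
    unit: str    # context metadata: unit of measure ("m" or "ft" in this sketch)
    source: str  # context metadata: where the value came from


def convert(measurement, target_unit):
    """Return the same measurement expressed in target_unit."""
    for unit in (measurement.unit, target_unit):
        if unit not in TO_METERS:
            raise ValueError(f"unknown unit: {unit!r}")
    meters = measurement.value * TO_METERS[measurement.unit]
    return Measurement(meters / TO_METERS[target_unit], target_unit, measurement.source)


def difference(a, b, unit="m"):
    """Subtract b from a after converting both to a common unit."""
    return convert(a, unit).value - convert(b, unit).value


# Two elevations recorded in different contexts.
survey_a = Measurement(100.0, "ft", "survey A")
survey_b = Measurement(30.48, "m", "survey B")
gap = difference(survey_a, survey_b)  # close to zero: the two values agree
```

Subtracting the raw numbers (100.0 minus 30.48) would suggest a large discrepancy, whereas the context-aware comparison shows the two surveys agree. Coordinate systems raise the same issue in a more complex form, and they are not modeled in this sketch.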
Are you going to focus initially on specific disciplines?
We're going to start with life sciences, specifically neuroscience, and energy/environmental sciences. In fact, we're looking at the diverse data used in the study of climate change. But one of the powerful things we anticipate will happen once we start with these specific disciplines is the ability to go across fields and put together connections that might not be obvious. DataSpace will bolster what Tim Berners-Lee, the "father of the Web," calls the serendipitous reuse of data—where a third party could see new value in existing data.
How is the National Science Foundation involved with DataSpace?
The DataSpace initiative is motivated by the National Science Foundation (NSF)'s Sustainable Digital Data Preservation and Access Network Partners (DataNet) initiative. NSF intends to fund five experiments, and we hope that DataSpace is one of them. A key issue for NSF, and something that is an important part of our proposal, is long-term sustainability: The infrastructure we'll develop will be deployable by any organization that manages data and will be scalable over time.
NSF's plan calls for three stages. The first five years will focus on building up the infrastructure and getting at least our two pilot domains well established. The next five years would be transitional, with NSF funding on a declining scale. After 10 years, the NSF would like to walk away from the project and assume that there will be enough people in the scientific community and beyond to support it.
What do you see as your biggest challenge?
There are many challenges, but implementation is a big one—even great ideas do not always work out. The good news is that we're not inventing anything for which there's no precedent, but a project of this scope, which combines all these ideas, has never been done before. We have faith that we can do it, but it's not a trivial thing.
