Charles Curran, a physicist who recently retired as the longtime storage consultant at CERN, remembers the old days of data access, when filling a request from a researcher was often a labor-intensive, daylong misadventure.
In the 1970s, information from CERN’s accelerators and experiments was stored on tapes, held in a huge library in the IT department, originally retrieved manually by operators and then copied to disk for the researcher. Overworked operators fell asleep, went missing for hours at a time, invented trickery to make the machines work faster, and overloaded the conveyor belts, causing tapes to fall off and disappear. Tape-retrieval robots squared off against mice (in one documented case, the mouse was found months later, desiccated) or overheated when they couldn’t reach tapes, melting their wheels in frustration. A request to see a certain tape often took 24 hours to fill.
Now the wait is about two minutes, hardly enough time to get a cup of coffee.
Accessing and processing data is now faster, more flexible, more reliable, and cheaper. A researcher in Croatia can reach and exchange data, in a variety of formats, with a colleague in Argentina almost immediately, 24 hours a day, seven days a week, without leaving her desk or going up against any rogue mice.
In the past decade, the public research community, the European Commission, and governments in the US and other countries have invested heavily in game-changing data infrastructure known as “grid computing.” A grid is a network for sharing computing power and data-storage capacity over the internet. It goes well beyond simple communication between computers, ultimately aiming to turn the global network of computers into one vast resource for solving large-scale compute- and data-intensive problems. Grid computing is often compared to an electric power grid, in which the generators are distributed; in a computational grid, users can draw on computing power without regard for its source or location. A key element of grid computing is that it enables real-time collaboration among geographically dispersed communities in the form of virtual organizations.
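The power-grid analogy can be made concrete with a small sketch. In the toy scheduler below, a user submits a job to the pool and it runs wherever capacity exists; the site names and CPU counts are invented for illustration, not taken from any real grid middleware.

```python
# Hypothetical sketch: a grid treats many institutions' computers as
# one pooled resource. The sites and capacities below are invented.
sites = {
    "CERN": {"free_cpus": 120},
    "Fermilab": {"free_cpus": 80},
    "RAL": {"free_cpus": 45},
}

def submit(job_cpus):
    """Place a job on any site with enough free capacity; the user
    never needs to know (or choose) where it actually runs."""
    candidates = [name for name, s in sites.items() if s["free_cpus"] >= job_cpus]
    if not candidates:
        return None  # no capacity anywhere: the job waits in a queue
    # Pick the site with the most headroom, wherever it happens to be.
    chosen = max(candidates, key=lambda n: sites[n]["free_cpus"])
    sites[chosen]["free_cpus"] -= job_cpus
    return chosen
```

The point of the sketch is the user's-eye view: like plugging an appliance into the wall, `submit` asks for capacity, not for a particular generator.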
In the next decade, we must invest even more heavily in such technology. Data is fundamental to science, and the science we do now requires ever-increasing data sets. We need flexible, powerful computing systems to support this data.
How did we get here? Computing grids were in their infancy in the late 1990s, when the collaborations around the Large Hadron Collider (LHC) turned their attention to its computing needs. Planning for information technology often comes last in projects like this because, while you can trust that computing will be more advanced, you don’t know what form that advancement will take by the time your machine, satellite, or observatory is ready.
However, for the LHC there was another problem. Funding for computing wasn’t included in the original costs. (The logic was that this couldn’t be estimated accurately, so it wasn’t estimated at all.) By 1999 all of the money CERN had received for the LHC was needed to build the machine itself.
With little funding available at CERN or elsewhere, the computing system would have to be distributed. A single organization could never find the money to do it alone. At around that same time, Carl Kesselman and Ian Foster were proposing the fundamental ideas of the grid—connecting distributed processors into a kind of supercomputer. CERN decided to take a closer look.
This led to a truly novel system. Distributed computing had existed before, but it didn’t look like this. The LHC’s grandfather, the Large Electron–Positron Collider, used a distributed computing system in the late 1980s. But that system was custom built, fixed in size, and used only for physics. There were a few sites for each experiment, and once it was finished, one couldn’t add new sites at will. With grid computing, the system is dynamic. If one site goes down, the system switches to a different one. And the same computers are used for Earth sciences, life sciences, and humanities research, as well as physics. The system is managed by the project Enabling Grids for E-science (EGEE).
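The “dynamic” property described above can be sketched in a few lines: if a site is down, work is rerouted rather than lost. The site names and up/down flags here are invented for illustration; real grid middleware does this with resource brokers and information services rather than a simple loop.

```python
# Hypothetical sketch of dynamic failover between grid sites.
# True/False marks whether each (invented) site is currently up.
sites = {"site-a": True, "site-b": True, "site-c": False}

def run_job(preferred):
    """Try the preferred site first, then fail over to any other
    site that is up, so a single outage never stops the work."""
    order = [preferred] + [s for s in sites if s != preferred]
    for site in order:
        if sites[site]:
            return f"job ran at {site}"
    raise RuntimeError("no site available")
```

Submitting to the downed `site-c` simply lands the job at the next available site, which is exactly what the fixed-size systems of the 1980s could not do.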
As politicians and citizens begin to take global warming more seriously, it has never been more important to have powerful and accurate climate information. Elaborate computer models are the primary tool used by climate scientists and bodies like the Intergovernmental Panel on Climate Change (IPCC) to report on the status and probable future of our Earth.
Models like those used by the IPCC need data from the atmosphere, land surface, ocean, and sea ice, all originating from different communities, along with large amounts of accompanying metadata describing that data. The amount of data that climate scientists need to manage is enormous—in the petascale—yet a broad and global community needs to be able to access and analyze it.
A grid solution, where information stored around the world can be woven together without moving databases, is ideal. In fact, no other solution is immediately apparent: The data is too heavy to be centrally located. The Earth System Grid Federation—funded by federal agencies like the US Department of Energy, National Science Foundation, and National Oceanic and Atmospheric Administration—will provide the data-sharing infrastructure that will enable global analysis of the climate models used by the IPCC in its next assessment.
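The idea of weaving data together without moving databases can be illustrated with a small federated-analysis sketch: each site runs the computation locally on its own (large) data, and only tiny summaries travel over the network. The site names and readings below are invented, not actual climate archives.

```python
# Hypothetical sketch of federated analysis: raw data stays put;
# only small per-site summaries cross the network.
site_data = {
    "site-1": [14.1, 14.3, 13.9],        # invented local readings
    "site-2": [15.0, 14.8],
    "site-3": [13.7, 14.2, 14.0, 14.4],
}

def local_summary(readings):
    """Runs *at* the site holding the data; returns only (sum, count)."""
    return sum(readings), len(readings)

# The federation combines the small summaries into a global mean
# without ever shipping the petascale data sets themselves.
totals = [local_summary(r) for r in site_data.values()]
global_mean = sum(t for t, _ in totals) / sum(n for _, n in totals)
```

Moving a handful of numbers instead of petabytes is the whole argument for a federated grid: the analysis goes to the data, not the other way around.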
While the needs of the physics, life science, and Earth science communities originally drove the development of grids like EGEE, now that the technology is in place, disparate communities across the globe are using it to examine questions they could never have addressed before. Nick Malleson, a researcher at Leeds University, forecasts burglary rates using the UK’s National Grid Service. The ASTRA project and its Lost Sounds Orchestra revive long-gone instruments such as the epigonion, barbiton, syrinx, salpinx, and aulos. By using grid-computing techniques to model the instruments, they can approximate sounds not heard for centuries. UNOSAT, a cooperative project between the United Nations Institute for Training and Research Operational Satellite Applications Program and CERN, delivers satellite images to relief and development organizations. With processing from grid technology, some of these maps track pirate activity off the Horn of Africa.
The fundamentals of grid computing, first developed to enable complex physics projects, have led to a related technology known as cloud computing: heavily virtualized distributed computing that has been adopted for many commercial applications. The public may not know they are using a cloud—but they are. Online banking, photo-sharing sites like Flickr, and web-based email are all examples of heavily virtualized services that exist “out in the cloud.” Coming full circle, grid computing itself is adopting aspects of cloud technology, making more use of virtualization and setting up grid sites in the cloud. However, true grid infrastructure still excels at collaborative sharing of resources belonging to different institutions; clouds spread the resources of one domain to the rest of the world for remote access. Collaboration is the basis of all large-scale scientific challenges (CERN, for example, has 20 member states). Projects like the LHC are too big for any one organization or one country to do alone; collaboration is the only option. The same holds for the major challenges facing society across other disciplines (energy, climate change, food production).
Now that we have excellent ways to reach and share data, we have a whole new set of problems, albeit more sophisticated. Who owns freely shared data? How long should it be kept? What besides the data must be kept so we can use it? Who pays for the energy to store data? How can researchers or disciplines resistant to sharing—afraid their ideas will be poached—be encouraged in a “publish or perish” world? A number of security questions also come to the fore: What happens to the data if companies running clouds go bust? Who is allowed to view data? Should all countries have access to e-infrastructures, regardless of their politics? These infrastructures are potential targets—how do we safeguard them?
As e-infrastructures have become the lifeblood of modern science and society, funding agencies, governments, and policy panels have many urgent issues to address. But given the choice, I wouldn’t swap this set of problems for the old.
Bob Jones is project director of the European Commission–funded Enabling Grids for E-science project. Danielle Venton contributed to this piece.
Originally published November 25, 2010