What do scientists read when they don’t think anyone is looking? Is it possible to anticipate emerging areas of research before they exist? If we could take a real-time snapshot of innovation, what would it look like? For the first time, we may now have some answers.
Picture this: the whole of human knowledge as a figurative mind that can selectively focus on certain areas. It’s a profound notion, and visualizing such a construct is an enormous undertaking. But with last week’s release of a new “map of science,” a team of researchers led by Johan Bollen is attempting to do just that — with a high-resolution visualization of how scientific literature is accessed based on users’ downloading and browsing behavior, known as clickstream data. This usage data was collected, aggregated, and normalized across a wide variety of journal publishers and institutions. The result is a network map with color-coded nodes (clusters of research articles from different fields) and interconnected lines (shaped by users’ clickstreams), demonstrating the connections among a comprehensive sample space of scholarly research.
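Bollen's published pipeline is considerably more involved than any short example can show, but the core idea of turning click sequences into a weighted network can be sketched roughly. In the minimal version below, each browsing session is an ordered list of journals a user viewed; consecutive views become transitions, and transition counts are normalized per source journal into edge weights. The session data and journal names are hypothetical, and this is only an illustration of the general technique, not the team's actual method:

```python
from collections import defaultdict

def clickstream_network(sessions):
    """Build a weighted journal-to-journal network from browsing sessions.

    Each session is an ordered list of journal names a user viewed.
    Consecutive views count as one transition; counts are normalized
    per source journal to give transition probabilities (edge weights).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for src, dst in zip(session, session[1:]):
            if src != dst:  # ignore repeat views of the same journal
                counts[src][dst] += 1
    network = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        network[src] = {dst: n / total for dst, n in dsts.items()}
    return network

# Toy sessions from two hypothetical users
sessions = [
    ["Ecology Letters", "Biodiversity", "Design Studies"],
    ["Ecology Letters", "Biodiversity", "Ecology Letters"],
]
net = clickstream_network(sessions)
print(net["Ecology Letters"])  # {'Biodiversity': 1.0}
```

Even this toy network hints at the map's structure: journals that users frequently move between end up strongly linked, regardless of whether they ever cite one another.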
A new map of science based on clickstream data. Click to enlarge. Credit: PLoS ONE
This isn’t the first attempt to extract meaning from the referential loops within scientific literature. In 2006, Columbia University’s W. Bradford Paley released an influential map of science based on data from Thomson Scientific, a firm that tracks article citations across scholarly journals. More recently, Carl Bergstrom, a biologist from the University of Washington, has developed a suite of innovative visualizations based on his own citation data sets for a venture called Eigenfactor. His method draws from network science and information theory to determine how often specific articles cite other articles as part of a relative ranking system for journals.
What makes the new map special is its use of clickstream data. “Bollen has shown clearly that people’s downloading behavior provides a very strong signal in terms of the structure of scientific endeavors,” says Bergstrom, a pioneer in visualizing trends in scientific research. “It was an open question until they did this work.”
Bollen, an experimental psychologist who studies network theory at Los Alamos National Laboratory, maintains that there are several important differences in looking at user behavior versus citation records. Most critical is that citation data takes a long time to become available. “Let’s say a researcher has an initial idea, starts to work on it, conducts a study, writes it up, and then tries to get it published,” he says. “Once it finally does get published, in order for a citation to take place, another article has to be written — the process repeated — and then it has to cite that first article.” A significant lag time for anyone trying to get a leg up on shifting areas of research. “It’s like looking at a galaxy that’s 50 million light-years away,” Bollen says. “You’re looking at a galaxy that existed 50 million years ago by the time its light reaches you.” But clickstream data, he notes, “captures those seminal stages of idea generation, as interest in a particular area develops.”
Although Bergstrom uses citation data for his visualizations, he acknowledges that lag is a reality of such records. It takes a long time, he says, before an idea is recorded in citation patterns. “But if you want a really early indicator of a trend in scientific research,” says Bergstrom, “you’re going to find it in clickstream data.”
Inherent in citation data may also be what psychologists call a social desirability bias. Scientists tend to cite the same articles from the same top journals written by the same big-name authors — and rarely cite outside their specific field. “When scientists cite publicly, they act very differently than when they’re just looking at the literature and following their true interests. As a scientist, I’m guilty of it myself,” Bollen admits. “A good chunk of your citations are intended to demonstrate to your colleagues that you know what’s going on in your domain.” This tendency causes a “narrowing” artifact within the literature and may inflate the appearance of scientific consensus. The trend seems to be continuing, despite the increasing number of research articles overall, as shown last year by James Evans, a sociologist from the University of Chicago.
The worth and importance of a scientist’s research is often determined by the prestige of the journals in which he or she has published. That prestige is typically based on the publication’s impact factor, which is in turn calculated by citation data. Bollen is excited about the possibility of moving beyond simple citation counts and the potential of these new maps to serve as the underpinnings of a better scholarly evaluation system.
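The standard two-year impact factor behind that prestige is a simple ratio: citations received in year Y to a journal's articles from years Y-1 and Y-2, divided by the number of citable articles the journal published in those two years. A minimal sketch, with made-up numbers:

```python
def impact_factor(citations_this_year, items_published):
    """Two-year journal impact factor for year Y.

    citations_this_year: citations received in year Y to articles the
    journal published in years Y-1 and Y-2.
    items_published: number of citable articles the journal published
    in years Y-1 and Y-2.
    """
    return citations_this_year / items_published

# A journal whose 2006-2007 output (200 articles) drew 400 citations
# in 2008 has a 2008 impact factor of 2.0.
print(impact_factor(400, 200))  # 2.0
```

The formula's reliance on raw citation counts is exactly what Bollen hopes richer usage-based measures might improve upon.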
Another striking difference in the new approach is the sample space: The data set used by Bollen’s team is massive. “If you look at what people are reading and the relationships that it indicates, you end up with a sample space that is much bigger, much more varied, much more diverse than anything you could achieve with citation data.” In fact, the total number of article citations since the beginning of science publishing is around 650 million; on Elsevier alone — one of hundreds of journal publishers — articles have been viewed or downloaded more than a billion times.
Both Bergstrom and Bollen are careful to note that although their approaches are different, clickstream and citation data are each useful tools. In fact, the two researchers hope to eventually collaborate to leverage the utility of both types of data sets in an attempt to define the relationship between what researchers are reading and what ultimately leads to successful studies. “When we see a buildup of activity at a novel intersection of research, that activity is seeking an outlet that isn’t there,” Bollen says. Such buildups might serve as useful tools for determining the potential value of establishing a new journal or allocating resources to help focus burgeoning fields. Bollen sees this as one of the map’s most promising applications: “By identifying these areas and mathematically defining the probability that they would provide a good outlet for that buildup of interest, you can predict the emergence of new ideas.”
Bergstrom agrees that understanding the relationship between the two types of data is an exciting next step. “Clickstream data still doesn’t tell you about what areas of research turned out to be fruitful. It just indicates where people are looking, whereas citation data represents what actually panned out,” Bergstrom says, adding that both are needed to truly understand the enterprise of science.
There are many stories one can tell by tracing the epistemological branches of this new map. For instance, it’s apparent that biology slowly merges with the social sciences and humanities; it gradually becomes biodiversity and ecology, before finally connecting to architecture and design. But perhaps the most interesting story is the one not yet told — the unexpected breakthrough inspired by a scientist reaching out beyond his own field. And it’s certainly possible that eavesdropping on what researchers are reading will act as some sort of innovation bellwether, identifying and facilitating aha moments before they happen.
Originally published March 20, 2009