Professor Yuval Shavitt, of Tel Aviv University’s School of Electrical Engineering, is melding math and sociology to describe mass behavior on the Internet. He is the principal investigator of DIMES, a project begun four years ago that aims to map the structure and topology of the Internet. For the past year, he has used data-mining tools to collect and interpret massive amounts of data from file-sharing networks. By applying to the online world a decades-old sociological theory that describes the spread of information in social networks, he has developed a predictive algorithm that identifies musicians who will ascend from local popularity to national stardom.
Shavitt and a team of graduate students developed their algorithm by first collecting half a billion search-query strings from Gnutella, a peer-to-peer file-sharing network. They eliminated non-music-related searches and searches for already-popular musicians, then used IP addresses to tag and sort the remaining queries by the city or region from which they originated. These searches are dubbed “geo-aware query strings.”
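The filtering and tagging steps can be illustrated with a minimal Python sketch. It is not the team's actual pipeline: the lookup tables `REGION_BY_NETWORK`, `KNOWN_ARTISTS`, and `ALREADY_POPULAR` are hypothetical stand-ins for a real geolocation database and music catalog, and matching a query to an artist by exact name is a deliberate simplification.

```python
import ipaddress

# Hypothetical lookup tables (assumptions for illustration only).
# A real system would consult an IP-geolocation database and a music catalog.
REGION_BY_NETWORK = {
    ipaddress.ip_network("192.0.2.0/24"): "Austin",
    ipaddress.ip_network("198.51.100.0/24"): "Boston",
}
KNOWN_ARTISTS = {"local band", "garage trio", "mega star"}
ALREADY_POPULAR = {"mega star"}


def region_for(ip):
    """Return the region whose network contains this IP, or None if unknown."""
    addr = ipaddress.ip_address(ip)
    for network, region in REGION_BY_NETWORK.items():
        if addr in network:
            return region
    return None


def geo_aware_queries(raw_queries):
    """Turn (ip, query_string) pairs into (region, artist) pairs.

    Drops queries that are not about a known artist, queries about
    already-popular artists, and queries whose origin cannot be located.
    """
    for ip, query in raw_queries:
        artist = query.strip().lower()
        if artist not in KNOWN_ARTISTS or artist in ALREADY_POPULAR:
            continue
        region = region_for(ip)
        if region is not None:
            yield region, artist
```

Given a stream of raw searches, `list(geo_aware_queries(raw))` would keep only the located searches for lesser-known artists, each tagged with its region.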
The geographic location of an emerging artist is the key to predicting their success, explains Shavitt. “If an artist has the potential to be successful, people will first start noticing them in the small geographical area where they live and perform.” In fact, a potential pop star will typically enjoy thousands of downloads a day on a local level, while remaining relatively unheard of on a national level. A large gap between local and global popularity, measured by the Kullback-Leibler (K-L) divergence, is a strong indicator of star potential. The algorithm measures the K-L divergence to produce a short list of potentials, of which 15 to 30 percent will go on to reach national popularity within weeks.
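To make the measure concrete, here is a minimal Python sketch of one way a K-L divergence could score how locally concentrated an artist's searches are. The article does not say which distributions the team compares, so the choice here is an assumption: each artist's share of queries across regions is compared with the share of all queries across regions. The function names are hypothetical.

```python
import math
from collections import Counter


def normalize(counts):
    """Convert raw counts into a probability distribution."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def kl_divergence(p, q):
    """D_KL(P || Q) in bits. Every outcome with p > 0 must have q > 0."""
    return sum(pi * math.log2(pi / q[key]) for key, pi in p.items() if pi > 0)


def locality_scores(queries):
    """Score each artist from (region, artist) pairs.

    Assumption: P is the artist's query distribution over regions and
    Q is the distribution of all queries over regions. A high score
    means the artist's searches are concentrated in few regions.
    """
    region_totals = Counter()
    per_artist = {}
    for region, artist in queries:
        region_totals[region] += 1
        per_artist.setdefault(artist, Counter())[region] += 1
    baseline = normalize(region_totals)
    return {
        artist: kl_divergence(normalize(counts), baseline)
        for artist, counts in per_artist.items()
    }
```

Under these assumptions, an artist searched for only in one of two equally active regions scores 1 bit, while an artist searched for everywhere in proportion to overall traffic scores 0. Ranking artists by this score would yield a short list of candidates.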
Shavitt’s paper on the subject, “Spotting Out Emerging Artists Using Geo-Aware Analysis of P2P Query Strings,” was presented at the International Conference on Knowledge Discovery and Data Mining last August in Las Vegas. According to the paper, his predictive algorithm is based on the groundbreaking sociological theories of Mark Granovetter, who first described in the 1970s how micro-level interactions between individuals affect macro-level phenomena. From Granovetter’s work emerged the small-world model, which can predict a product’s success based on its adoption by a small network of people, assuming that the “main driver behind a product’s growth is communication between individuals.”
The use of geo-aware peer-to-peer query strings presents a potentially major shift in music hit-prediction software. Most such software, like Hit Song Science, collects data on the sound of a song, then compares features of a potential hit (its melody, tempo, and lyrics, for example) to a database of established hits. “Our algorithm never hears the actual song; it is based on the Internet mirroring of the social word of mouth of people spreading their interest in the song,” says Shavitt. “It will be interesting to compare the success rates of both approaches.”
But Shavitt’s algorithm may have wider implications. He and his team of researchers have been contemplating using the algorithm to predict, for example, the success potential of a homegrown politician. Text analysis would be needed to data-mine searches on politicians, as their Internet presence is best measured by their popularity as discussion topics in forums and chat rooms. It’s much trickier to data-mine text than numbers, and to determine whether what’s being written about public figures online is positive, “but it’s certainly doable,” says Shavitt. With the growing sophistication and popularity of online social networking sites and file-sharing services, Shavitt demonstrates how math can describe and harness mass behavior in online environments. The applications could be endless.
Originally published December 22, 2008