While many social media users think of services like Twitter and Facebook as just places to interact with friends, skim headlines or follow celebrities, computer scientists are developing new uses for the mountains of data spilling forth from social media.
UW-Madison researchers are finding ways to harness social media for social good through machine learning techniques. Machine learning experts develop mathematical tools to help computers learn from data and detect patterns.
Department of Computer Sciences students Shike Mei, Han Li and Jing Fan have analyzed Sina Weibo—a Twitter-like site that is China’s most popular social media outlet—to uncover real-time information about air-pollution levels in Chinese cities. The trio of second-year graduate students is working with professors Jerry Zhu and Chuck Dyer.
Air pollution is a serious issue in China. Between 350,000 and 500,000 Chinese citizens die prematurely each year because of air pollution, according to a 2013 commentary in the medical journal The Lancet by a former Chinese health minister. Lung cancer rates are rising even as smoking rates decline.
While large Chinese cities have physical monitoring stations to gauge air pollution levels, smaller cities generally do not, owing to the expense of establishing and maintaining them across a country as vast and diverse as China. The stations measure what are known as PM2.5 particulates, particles smaller than 2.5 microns in diameter that can be inhaled deep into the lungs.
In a paper called “Inferring Air Pollution by Sniffing Social Media,” UW researchers showed how social media posts could be used to estimate levels of air pollution in real time and with significant accuracy, all without the use of physical monitoring devices.
The team monitored Weibo posts to see how much people complained about the air. The group considered the text content of the posts, as well as a time-and-space correlation among cities and days, since pollution flare-ups typically cover large amounts of territory and can last for days.
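The idea of exploiting that time-and-space correlation can be sketched in code. The example below is an assumed, simplified illustration (not the paper's actual formulation or data): noisy per-city, per-day pollution estimates are averaged with their neighbors in space (adjacent cities on a hypothetical map) and time (adjacent days), on the premise that pollution episodes cover large regions and persist for days.

```python
import numpy as np

# Toy sketch: smooth noisy per-city, per-day estimates using spatial and
# temporal neighbors. All numbers and the city adjacency map are hypothetical.
rng = np.random.default_rng(0)
days, n_cities = 7, 5

# True pollution rises over the week and is shared across nearby cities.
true_level = np.linspace(100, 250, days)[:, None] * np.ones((1, n_cities))
noisy = true_level + rng.normal(0, 40, size=(days, n_cities))

# neighbors[c] lists cities geographically adjacent to city c (made up here).
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

smoothed = noisy.copy()
for t in range(days):
    for c in range(n_cities):
        vals = [noisy[t, c]]
        vals += [noisy[t, n] for n in neighbors[c]]        # spatial neighbors
        vals += [noisy[t2, c] for t2 in (t - 1, t + 1)     # temporal neighbors
                 if 0 <= t2 < days]
        smoothed[t, c] = np.mean(vals)

# Averaging over correlated neighbors should reduce estimation error.
err_noisy = np.abs(noisy - true_level).mean()
err_smooth = np.abs(smoothed - true_level).mean()
```

Because the noise at each city-day is independent while the underlying pollution is shared across neighbors, the smoothed estimates track the true levels more closely than the raw ones.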
Mei and his fellow researchers took advantage of crowdsourcing principles—gathering information from a wide array of people—but employed a type of involuntary crowdsourcing. Says Professor Dyer, “This is crowdsourcing of a sort where people don’t have to volunteer to support a common goal. There’s an unintentional provision of information.”
Simply by accessing public posts, the researchers had a treasure trove of data at their disposal. The team gathered Weibo posts from 108 cities over a 30-day period.
The group’s mathematical models did not rely on pre-selected keywords to analyze the text of Weibo posts. Rather, their machine-learning model considered the entire vocabulary, assigning a different learned weight to each word. While this approach does not forecast future air quality, it provides accurate real-time estimates of the Air Quality Index (AQI), as validated against cities that do have physical monitoring.
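A weighted-vocabulary model of this kind can be illustrated with a minimal sketch. This is not the paper's actual model or data: the toy "documents" below stand in for a city-day's pooled posts (real posts would be Chinese-language Weibo text), the AQI values are invented, and plain ridge regression stands in for whatever estimator the researchers used. The key point it demonstrates is that no keywords are chosen by hand; every word gets a weight learned from city-days where monitored AQI is known.

```python
import numpy as np

# Hypothetical city-day "documents" with AQI known from physical monitors.
docs = [
    "haze smog cough mask gray sky",   # heavily polluted day
    "smog haze mask cough",            # polluted day
    "blue sky clear sunny park",       # clean day
    "sunny clear blue sky",            # clean day
]
aqi = np.array([310.0, 260.0, 40.0, 55.0])  # observed AQI per city-day

# Bag-of-words matrix: one column per vocabulary word, no pre-selection.
vocab = sorted({w for d in docs for w in d.split()})
X = np.array([[d.split().count(w) for w in vocab] for d in docs], dtype=float)

# Ridge regression (closed form) learns one weight per word.
lam = 0.1
w = np.linalg.solve(X.T @ X + lam * np.eye(len(vocab)), X.T @ aqi)

# Estimate AQI for unmonitored city-days from their posts alone.
feat = lambda d: np.array([d.split().count(v) for v in vocab], dtype=float)
pred_polluted = float(feat("smog mask gray") @ w)
pred_clean = float(feat("blue sunny clear") @ w)
```

Words that co-occur with high monitored AQI ("smog", "haze") end up with large positive weights, so a post stream full of such complaints yields a high AQI estimate even where no monitoring station exists.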
The team’s findings were presented in Beijing in August 2014 at the ASONAM conference and published in the conference proceedings. (ASONAM is the IEEE/ACM International Conference on Advances in Social Network Analysis and Mining.)
For a student like Mei, the project is more than just an intellectual exercise in computer science. He’s personally connected to the topic. In the area of central China where he grew up, there is just one air quality monitoring station for an area where 60 million people live, he says.
“Anhui province, where I was born, is not very wealthy,” says Mei. “There’s not enough information about pollution, and sometimes people suffer from heavier air pollution. We wondered, how can we use a new information source to help people understand how severe the pollution around them is?”
Mei, like his fellow students Li and Fan, was an undergraduate at Beijing’s Tsinghua University. All three experienced air-quality problems during their college years, including a memorable winter break for Mei in which he remained in Beijing during a spike in pollution. “It felt really bad. It caused me to think about this problem and to want to use machine-learning techniques to tackle it,” he says.
While the team’s research has centered on a specific issue, it has much broader applicability. The research is supported in part by the National Science Foundation’s “EAGER: Discovering Spontaneous Social Events” initiative. EAGER stands for Early Concept Grants for Exploratory Research.
“Lots of tasks could be dealt with this way,” says Professor Zhu. And while there are negative uses for publicly available data (for example, some poachers use GPS data in tourists’ social media posts to hunt animals), there are plenty of socially beneficial ones. That same data could be provided to officials to help them outwit poachers.
Closer to home, Professor Zhu's group is using Twitter data to study the prevalence and impact of bullying in the United States.
The UW team will continue its work on air pollution. A next step is to try to incorporate photos, in addition to text, in its predictive model. That’s a highly complex task that requires both machine-learning and computer vision techniques, but it holds great promise for a range of future applications.
[ Photo credit: iStockPhoto.com/user hxdbzxy ]