Abstract
During a mass disaster, social media are a major source of information, providing first-hand accounts of the unfolding situation. Automated ways to discover and collate this information in real time can be of critical value for humanitarian operations. Prior work on this task has largely focused on developing message classifiers restricted to particular types of disasters, such as storms or wildfires. In this paper, we investigate machine-learning methods to detect crisis-related messages where the type of the crisis is not known in advance. Such methods are potentially of much greater practical value, as they can provide the means to deal with a wide range of crisis situations, including those that involve combinations of disaster types and types that were unknown at the training stage. The key challenge in this task is that potentially relevant events are extremely diverse and, correspondingly, both training and test data are highly heterogeneous. This heterogeneity makes it difficult for machine learning algorithms to generalize and to label incoming data accurately. Our main contributions are an investigation of the scope of this problem in the context of disaster management, and novel message classification methods, based on ensemble methods, semi-supervised learning and feature selection, designed to overcome data heterogeneity. We evaluate the proposed methods on an academic benchmark dataset comprising twenty-six different disaster events, as well as in a case study where we assess the performance of the methods on real-world data. The experimental evaluation shows that the methods achieve classification quality superior to that of methods previously used for this task.
Original language | English |
---|---|
Journal | Journal of the Association for Information Science and Technology |
Volume | 71 |
Issue number | 1 |
Publication status | Submitted - 30 Apr 2017 |
ASJC Scopus subject areas
- Artificial Intelligence
- Social Sciences (miscellaneous)