Using class imbalance learning for software defect prediction

Shuo Wang; Xin Yao

doi:10.1109/TR.2013.2259203

Computer Science

Research output: Contribution to journal › Article › peer-review

415 Citations (Scopus)

Abstract

To facilitate software testing and save testing costs, a wide range of machine learning methods have been studied to predict defects in software modules. Unfortunately, the imbalanced nature of this type of data increases the learning difficulty of such a task. Class imbalance learning specializes in tackling classification problems with imbalanced distributions, which could be helpful for defect prediction, but has not been investigated in depth so far. In this paper, we study whether and how class imbalance learning methods can benefit software defect prediction, with the aim of finding better solutions. We investigate different types of class imbalance learning methods, including resampling techniques, threshold moving, and ensemble algorithms. Among the methods we studied, AdaBoost.NC shows the best overall performance in terms of measures including balance, G-mean, and Area Under the Curve (AUC). To further improve the performance of the algorithm and facilitate its use in software defect prediction, we propose a dynamic version of AdaBoost.NC, which adjusts its parameter automatically during training. Without the need to pre-define any parameters, it is shown to be more effective and efficient than the original AdaBoost.NC.
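The abstract names balance and G-mean as evaluation measures but does not define them. The sketch below uses the definitions commonly found in the defect prediction literature, which is an assumption here rather than something this record states: G-mean is the geometric mean of the recall on each class, and balance is one minus the normalized Euclidean distance from the ideal point (PF = 0, PD = 1). The function names and the toy labels are hypothetical and are not taken from the paper.

```python
import math


def detection_rates(y_true, y_pred):
    """Return (pd, pf) for binary labels, where 1 marks a defective module.

    pd is the probability of detection (recall on the defective class);
    pf is the probability of false alarm (false positive rate).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    pd = tp / (tp + fn) if (tp + fn) else 0.0
    pf = fp / (fp + tn) if (fp + tn) else 0.0
    return pd, pf


def g_mean(pd, pf):
    """Geometric mean of recall on the defective and non-defective classes."""
    return math.sqrt(pd * (1.0 - pf))


def balance(pd, pf):
    """One minus the normalized distance from the ideal point (pf=0, pd=1)."""
    distance = math.sqrt((0.0 - pf) ** 2 + (1.0 - pd) ** 2)
    return 1.0 - distance / math.sqrt(2.0)


# Toy imbalanced example (assumed data): 2 defective modules out of 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

pd, pf = detection_rates(y_true, y_pred)
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
print(f"accuracy={accuracy:.3f} pd={pd:.3f} pf={pf:.3f}")
print(f"g_mean={g_mean(pd, pf):.3f} balance={balance(pd, pf):.3f}")
```

On skewed data like this, plain accuracy can look high even when half of the defective modules are missed, which is one reason measures that weigh both classes are preferred for this task.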

Original language	English
Article number	6509481
Pages (from-to)	434-443
Number of pages	10
Journal	IEEE Transactions on Reliability
Volume	62
Issue number	2
DOIs	https://doi.org/10.1109/TR.2013.2259203
Publication status	Published - 2013

Keywords

Class imbalance learning
ensemble learning
negative correlation learning
software defect prediction

ASJC Scopus subject areas

Safety, Risk, Reliability and Quality
Electrical and Electronic Engineering

Cite this

@article{31f3d4dd718846f5b8efce9f3a56a4d4,
  title = "Using class imbalance learning for software defect prediction",
  abstract = "To facilitate software testing, and save testing costs, a wide range of machine learning methods have been studied to predict defects in software modules. Unfortunately, the imbalanced nature of this type of data increases the learning difficulty of such a task. Class imbalance learning specializes in tackling classification problems with imbalanced distributions, which could be helpful for defect prediction, but has not been investigated in depth so far. In this paper, we study the issue of if and how class imbalance learning methods can benefit software defect prediction with the aim of finding better solutions. We investigate different types of class imbalance learning methods, including resampling techniques, threshold moving, and ensemble algorithms. Among those methods we studied, AdaBoost.NC shows the best overall performance in terms of the measures including balance, G-mean, and Area Under the Curve (AUC). To further improve the performance of the algorithm, and facilitate its use in software defect prediction, we propose a dynamic version of AdaBoost.NC, which adjusts its parameter automatically during training. Without the need to pre-define any parameters, it is shown to be more effective and efficient than the original AdaBoost.NC.",
  keywords = "Class imbalance learning, ensemble learning, negative correlation learning, software defect prediction",
  author = "Shuo Wang and Xin Yao",
  year = "2013",
  doi = "10.1109/TR.2013.2259203",
  language = "English",
  volume = "62",
  pages = "434--443",
  journal = "IEEE Transactions on Reliability",
  issn = "0018-9529",
  publisher = "Institute of Electrical and Electronics Engineers (IEEE)",
  number = "2",
}

TY - JOUR
T1 - Using class imbalance learning for software defect prediction
AU - Wang, Shuo
AU - Yao, Xin
PY - 2013
Y1 - 2013
N2 - To facilitate software testing, and save testing costs, a wide range of machine learning methods have been studied to predict defects in software modules. Unfortunately, the imbalanced nature of this type of data increases the learning difficulty of such a task. Class imbalance learning specializes in tackling classification problems with imbalanced distributions, which could be helpful for defect prediction, but has not been investigated in depth so far. In this paper, we study the issue of if and how class imbalance learning methods can benefit software defect prediction with the aim of finding better solutions. We investigate different types of class imbalance learning methods, including resampling techniques, threshold moving, and ensemble algorithms. Among those methods we studied, AdaBoost.NC shows the best overall performance in terms of the measures including balance, G-mean, and Area Under the Curve (AUC). To further improve the performance of the algorithm, and facilitate its use in software defect prediction, we propose a dynamic version of AdaBoost.NC, which adjusts its parameter automatically during training. Without the need to pre-define any parameters, it is shown to be more effective and efficient than the original AdaBoost.NC.

KW - Class imbalance learning
KW - ensemble learning
KW - negative correlation learning
KW - software defect prediction
UR - http://www.scopus.com/inward/record.url?scp=84878691303&partnerID=8YFLogxK
U2 - 10.1109/TR.2013.2259203
DO - 10.1109/TR.2013.2259203
M3 - Article
AN - SCOPUS:84878691303
SN - 0018-9529
VL - 62
SP - 434
EP - 443
JO - IEEE Transactions on Reliability
JF - IEEE Transactions on Reliability
IS - 2
M1 - 6509481
ER -
