On the Performance of Oversampling Techniques for Class Imbalance Problems

Jiawen Kong*, Thiago Rios, Wojtek Kowalczyk, Stefan Menzel, Thomas Bäck

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution

3 Citations (Scopus)

Abstract

Although over 90 oversampling approaches have been developed in the imbalance learning domain, most empirical studies and applications still rely on the “classical” resampling techniques. In this paper, several experiments on 19 benchmark datasets are set up to study the efficiency of six powerful oversampling approaches, including both “classical” and new ones. According to our experimental results, oversampling techniques that consider the minority class distribution (the new ones) perform better in most cases, and RACOG gives the best performance among the six reviewed approaches. We further validate our conclusion on our real-world inspired vehicle datasets and also find that applying oversampling techniques can improve performance by around 10%. In addition, seven data complexity measures are considered, with the initial purpose of investigating the relationship between data complexity measures and the choice of resampling techniques. Although no obvious relationship can be extracted from our experiments, we find that the F1v value, a measure for evaluating the overlap that most researchers ignore, has a strong negative correlation with the potential AUC value (after resampling).

Original language: English
Title of host publication: Advances in Knowledge Discovery and Data Mining - 24th Pacific-Asia Conference, PAKDD 2020, Proceedings
Editors: Hady W. Lauw, Ee-Peng Lim, Raymond Chi-Wing Wong, Alexandros Ntoulas, See-Kiong Ng, Sinno Jialin Pan
Publisher: Springer Vieweg
Pages: 84-96
Number of pages: 13
ISBN (Print): 9783030474355
DOIs
Publication status: Published - 2020
Externally published: Yes
Event: 24th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2020 - Singapore, Singapore
Duration: 11 May 2020 to 14 May 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12085 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 24th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2020
Country/Territory: Singapore
City: Singapore
Period: 11/05/20 to 14/05/20

Bibliographical note

Publisher Copyright:
© Springer Nature Switzerland AG 2020.

Keywords

  • Class imbalance
  • Data complexity measures
  • Minority class distribution

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
