An Investigation of Cross-Project Learning in Online Just-In-Time Software Defect Prediction

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Standard

An Investigation of Cross-Project Learning in Online Just-In-Time Software Defect Prediction. / Tabassum, Sadia; Minku, Leandro L.; Feng, Danyi; Cabral, George G.; Song, Liyan.

42nd International Conference on Software Engineering (ICSE 2020). IEEE Computer Society Press, 2019.

Harvard

Tabassum, S, Minku, LL, Feng, D, Cabral, GG & Song, L 2019, An Investigation of Cross-Project Learning in Online Just-In-Time Software Defect Prediction. in 42nd International Conference on Software Engineering (ICSE 2020). IEEE Computer Society Press, 42nd International Conference on Software Engineering (ICSE 2020), Seoul, Republic of Korea, 23/05/20.

APA

Tabassum, S., Minku, L. L., Feng, D., Cabral, G. G., & Song, L. (Accepted/In press). An Investigation of Cross-Project Learning in Online Just-In-Time Software Defect Prediction. In 42nd International Conference on Software Engineering (ICSE 2020). IEEE Computer Society Press.

Vancouver

Tabassum S, Minku LL, Feng D, Cabral GG, Song L. An Investigation of Cross-Project Learning in Online Just-In-Time Software Defect Prediction. In 42nd International Conference on Software Engineering (ICSE 2020). IEEE Computer Society Press. 2019.

Author

Tabassum, Sadia ; Minku, Leandro L. ; Feng, Danyi ; Cabral, George G. ; Song, Liyan. / An Investigation of Cross-Project Learning in Online Just-In-Time Software Defect Prediction. 42nd International Conference on Software Engineering (ICSE 2020). IEEE Computer Society Press, 2019.

Bibtex

@inproceedings{6e5447cb6c044dd78cc2cd40868140d9,
title = "An Investigation of Cross-Project Learning in Online Just-In-Time Software Defect Prediction",
abstract = "Just-In-Time Software Defect Prediction (JIT-SDP) is concerned with predicting whether software changes are defect-inducing or clean based on machine learning classifiers. Building such classifiers requires a sufficient amount of training data that is not available at the beginning of a software project. Cross-Project (CP) JIT-SDP can overcome this issue by using data from other projects to build the classifier, achieving similar (not better) predictive performance to classifiers trained on Within-Project (WP) data. However, such approaches have never been investigated in realistic online learning scenarios, where WP software changes arrive continuously over time and can be used to update the classifiers. It is unknown to what extent CP data can be helpful in such a situation. In particular, it is unknown whether CP data are only useful during the very initial phase of the project when there is little WP data, or whether they could be helpful for extended periods of time. This work thus provides the first investigation of when and to what extent CP data are useful for JIT-SDP in a realistic online learning scenario. For that, we develop three different CP JIT-SDP approaches that can operate in online mode and be updated with both incoming CP and WP training examples over time. We also collect 2048 commits from three software repositories being developed by a software company over the course of 9 to 10 months, and use 198,468 commits from 10 active open source GitHub projects being developed over the course of 6 to 14 years. The study shows that training classifiers with incoming CP+WP data can lead to improvements in G-mean of up to 53.90% compared to classifiers using only WP data at the initial stage of the projects. For the open source projects, which have been running for longer periods of time, using CP data to supplement WP data also helped the classifiers to reduce or prevent large drops in predictive performance that may occur over time, leading to up to around 40% better G-mean during such periods. Such use of CP data was shown to be beneficial even after a large amount of WP data was received, leading to overall G-means up to 18.5% better than those of WP classifiers.",
keywords = "Software defect prediction, cross-project learning, transfer learning, online learning, verification latency, concept drift, class imbalance",
author = "Sadia Tabassum and Minku, {Leandro L.} and Danyi Feng and Cabral, {George G.} and Liyan Song",
year = "2019",
month = dec,
day = "8",
language = "English",
booktitle = "42nd International Conference on Software Engineering (ICSE 2020)",
publisher = "IEEE Computer Society Press",
note = "42nd International Conference on Software Engineering (ICSE 2020) ; Conference date: 23-05-2020 Through 29-05-2020",

}

RIS

TY - GEN

T1 - An Investigation of Cross-Project Learning in Online Just-In-Time Software Defect Prediction

AU - Tabassum, Sadia

AU - Minku, Leandro L.

AU - Feng, Danyi

AU - Cabral, George G.

AU - Song, Liyan

PY - 2019/12/8

Y1 - 2019/12/8

N2 - Just-In-Time Software Defect Prediction (JIT-SDP) is concerned with predicting whether software changes are defect-inducing or clean based on machine learning classifiers. Building such classifiers requires a sufficient amount of training data that is not available at the beginning of a software project. Cross-Project (CP) JIT-SDP can overcome this issue by using data from other projects to build the classifier, achieving similar (not better) predictive performance to classifiers trained on Within-Project (WP) data. However, such approaches have never been investigated in realistic online learning scenarios, where WP software changes arrive continuously over time and can be used to update the classifiers. It is unknown to what extent CP data can be helpful in such a situation. In particular, it is unknown whether CP data are only useful during the very initial phase of the project when there is little WP data, or whether they could be helpful for extended periods of time. This work thus provides the first investigation of when and to what extent CP data are useful for JIT-SDP in a realistic online learning scenario. For that, we develop three different CP JIT-SDP approaches that can operate in online mode and be updated with both incoming CP and WP training examples over time. We also collect 2048 commits from three software repositories being developed by a software company over the course of 9 to 10 months, and use 198,468 commits from 10 active open source GitHub projects being developed over the course of 6 to 14 years. The study shows that training classifiers with incoming CP+WP data can lead to improvements in G-mean of up to 53.90% compared to classifiers using only WP data at the initial stage of the projects. For the open source projects, which have been running for longer periods of time, using CP data to supplement WP data also helped the classifiers to reduce or prevent large drops in predictive performance that may occur over time, leading to up to around 40% better G-mean during such periods. Such use of CP data was shown to be beneficial even after a large amount of WP data was received, leading to overall G-means up to 18.5% better than those of WP classifiers.

AB - Just-In-Time Software Defect Prediction (JIT-SDP) is concerned with predicting whether software changes are defect-inducing or clean based on machine learning classifiers. Building such classifiers requires a sufficient amount of training data that is not available at the beginning of a software project. Cross-Project (CP) JIT-SDP can overcome this issue by using data from other projects to build the classifier, achieving similar (not better) predictive performance to classifiers trained on Within-Project (WP) data. However, such approaches have never been investigated in realistic online learning scenarios, where WP software changes arrive continuously over time and can be used to update the classifiers. It is unknown to what extent CP data can be helpful in such a situation. In particular, it is unknown whether CP data are only useful during the very initial phase of the project when there is little WP data, or whether they could be helpful for extended periods of time. This work thus provides the first investigation of when and to what extent CP data are useful for JIT-SDP in a realistic online learning scenario. For that, we develop three different CP JIT-SDP approaches that can operate in online mode and be updated with both incoming CP and WP training examples over time. We also collect 2048 commits from three software repositories being developed by a software company over the course of 9 to 10 months, and use 198,468 commits from 10 active open source GitHub projects being developed over the course of 6 to 14 years. The study shows that training classifiers with incoming CP+WP data can lead to improvements in G-mean of up to 53.90% compared to classifiers using only WP data at the initial stage of the projects. For the open source projects, which have been running for longer periods of time, using CP data to supplement WP data also helped the classifiers to reduce or prevent large drops in predictive performance that may occur over time, leading to up to around 40% better G-mean during such periods. Such use of CP data was shown to be beneficial even after a large amount of WP data was received, leading to overall G-means up to 18.5% better than those of WP classifiers.

KW - Software defect prediction

KW - cross-project learning

KW - transfer learning

KW - online learning

KW - verification latency

KW - concept drift

KW - class imbalance

M3 - Conference contribution

BT - 42nd International Conference on Software Engineering (ICSE 2020)

PB - IEEE Computer Society Press

T2 - 42nd International Conference on Software Engineering (ICSE 2020)

Y2 - 23 May 2020 through 29 May 2020

ER -