An Investigation of Cross-Project Learning in Online Just-In-Time Software Defect Prediction

Sadia Tabassum; Leandro L. Minku; Danyi Feng; George G. Cabral; Liyan Song

doi:https://doi.org/10.1145/3377811.3380403

Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Just-In-Time Software Defect Prediction (JIT-SDP) is concerned with predicting whether software changes are defect-inducing or clean based on machine learning classifiers. Building such classifiers requires a sufficient amount of training data that is not available at the beginning of a software project. Cross-Project (CP) JIT-SDP can overcome this issue by using data from other projects to build the classifier, achieving similar (not better) predictive performance to classifiers trained on Within-Project (WP) data. However, such approaches have never been investigated in realistic online learning scenarios, where WP software changes arrive continuously over time and can be used to update the classifiers. It is unknown to what extent CP data can be helpful in such a situation. In particular, it is unknown whether CP data are only useful during the very initial phase of the project, when little WP data is available, or whether they could be helpful for extended periods of time. This work thus provides the first investigation of when and to what extent CP data are useful for JIT-SDP in a realistic online learning scenario. For that, we develop three different CP JIT-SDP approaches that can operate in online mode and be updated with both incoming CP and WP training examples over time. We also collect 2048 commits from three software repositories being developed by a software company over the course of 9 to 10 months, and use 198,468 commits from 10 active open source GitHub projects being developed over the course of 6 to 14 years. The study shows that training classifiers with incoming CP+WP data can lead to improvements in G-mean of up to 53.90% compared to classifiers using only WP data at the initial stage of the projects. For the open source projects, which have been running for longer periods of time, using CP data to supplement WP data also helped the classifiers to reduce or prevent large drops in predictive performance that may occur over time, leading to up to around 40% better G-mean during such periods. Such use of CP data was shown to be beneficial even after a large amount of WP data had been received, leading to overall G-means up to 18.5% better than those of WP classifiers.
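The abstract reports its results in G-mean without defining it. As a minimal sketch, assuming the definition commonly used in the class-imbalance literature (the geometric mean of the recalls on the defect-inducing and clean classes), it could be computed as follows. The function name `g_mean` and the toy labels are illustrative only and are not taken from the paper.

```python
import math


def g_mean(y_true, y_pred):
    """Geometric mean of the recall on class 1 (defect-inducing) and class 0 (clean).

    Assumed definition: sqrt(recall_1 * recall_0). Labels are 0/1 integers.
    """
    pairs = list(zip(y_true, y_pred))
    positives = sum(1 for t, _ in pairs if t == 1)
    negatives = sum(1 for t, _ in pairs if t == 0)
    if positives == 0 or negatives == 0:
        raise ValueError("G-mean needs at least one example of each class")
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    true_neg = sum(1 for t, p in pairs if t == 0 and p == 0)
    return math.sqrt((true_pos / positives) * (true_neg / negatives))


# Hypothetical toy data: 1 defect-inducing change among 10 commits.
y_true = [1] + [0] * 9
always_clean = [0] * 10  # a classifier that never flags a change

accuracy = sum(t == p for t, p in zip(y_true, always_clean)) / len(y_true)
print(accuracy)                      # high accuracy despite missing every defect
print(g_mean(y_true, always_clean))  # G-mean exposes the missed defect class
```

Because the product is zero whenever either class is never predicted correctly, a measure of this kind is not inflated by the majority (clean) class, which is relevant to the class imbalance listed among the keywords.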

Original language	English
Title of host publication	42nd International Conference on Software Engineering (ICSE 2020)
Publisher	IEEE Computer Society Press
Number of pages	12
DOIs	https://doi.org/10.1145/3377811.3380403
Publication status	Published - 27 Jun 2020
Event	42nd International Conference on Software Engineering (ICSE 2020) - Seoul, Korea, Republic of; Duration: 23 May 2020 → 29 May 2020

Conference

Conference	42nd International Conference on Software Engineering (ICSE 2020)
Country/Territory	Korea, Republic of
City	Seoul
Period	23/05/20 → 29/05/20

Keywords

Software defect prediction
cross-project learning
transfer learning
online learning
verification latency
concept drift
class imbalance

Access to Document

https://doi.org/10.1145/3377811.3380403

Cite this

@inproceedings{6e5447cb6c044dd78cc2cd40868140d9,

title = "An Investigation of Cross-Project Learning in Online Just-In-Time Software Defect Prediction",

keywords = "Software defect prediction, cross-project learning, transfer learning, online learning, verification latency, concept drift, class imbalance",

author = "Sadia Tabassum and Minku, {Leandro L.} and Danyi Feng and Cabral, {George G.} and Liyan Song",

year = "2020",

month = jun,

day = "27",

doi = "10.1145/3377811.3380403",

language = "English",

booktitle = "42nd International Conference on Software Engineering (ICSE 2020)",

publisher = "IEEE Computer Society Press",

note = "42nd International Conference on Software Engineering (ICSE 2020) ; Conference date: 23-05-2020 Through 29-05-2020",

}

Tabassum, S, Minku, LL, Feng, D, Cabral, GG & Song, L 2020, An Investigation of Cross-Project Learning in Online Just-In-Time Software Defect Prediction. in 42nd International Conference on Software Engineering (ICSE 2020). IEEE Computer Society Press, 42nd International Conference on Software Engineering (ICSE 2020), Seoul, Korea, Republic of, 23/05/20. https://doi.org/10.1145/3377811.3380403

TY - GEN

T1 - An Investigation of Cross-Project Learning in Online Just-In-Time Software Defect Prediction

AU - Tabassum, Sadia

AU - Minku, Leandro L.

AU - Feng, Danyi

AU - Cabral, George G.

AU - Song, Liyan

PY - 2020/6/27

Y1 - 2020/6/27

KW - Software defect prediction

KW - cross-project learning

KW - transfer learning

KW - online learning

KW - verification latency

KW - concept drift

KW - class imbalance

U2 - https://doi.org/10.1145/3377811.3380403

DO - https://doi.org/10.1145/3377811.3380403

M3 - Conference contribution

BT - 42nd International Conference on Software Engineering (ICSE 2020)

PB - IEEE Computer Society Press

T2 - 42nd International Conference on Software Engineering (ICSE 2020)

Y2 - 23 May 2020 through 29 May 2020

ER -
