An Investigation of Cross-Project Learning in Online Just-In-Time Software Defect Prediction

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Authors

Colleges, School and Institutes

External organisations

  • Xiliu Tech, China
  • Federal Rural University of Pernambuco, Brazil

Abstract

Just-In-Time Software Defect Prediction (JIT-SDP) is concerned with predicting whether software changes are defect-inducing or clean based on machine learning classifiers. Building such classifiers requires a sufficient amount of training data that is not available at the beginning of a software project. Cross-Project (CP) JIT-SDP can overcome this issue by using data from other projects to build the classifier, achieving similar (not better) predictive performance to classifiers trained on Within-Project (WP) data. However, such approaches have never been investigated in realistic online learning scenarios, where WP software changes arrive continuously over time and can be used to update the classifiers. It is unknown to what extent CP data can be helpful in such a situation. In particular, it is unknown whether CP data are only useful during the very initial phase of the project when there is little WP data, or whether they could be helpful for extended periods of time. This work thus provides the first investigation of when and to what extent CP data are useful for JIT-SDP in a realistic online learning scenario. For that, we develop three different CP JIT-SDP approaches that can operate in online mode and be updated with both incoming CP and WP training examples over time. We also collect 2048 commits from three software repositories being developed by a software company over the course of 9 to 10 months, and use 198,468 commits from 10 active open source GitHub projects being developed over the course of 6 to 14 years. The study shows that training classifiers with incoming CP+WP data can lead to improvements in G-mean of up to 53.90% compared to classifiers using only WP data at the initial stage of the projects.
For the open source projects, which have been running for longer periods of time, using CP data to supplement WP data also helped the classifiers to reduce or prevent large drops in predictive performance that may occur over time, leading to up to around 40% better G-mean during such periods. Such use of CP data was shown to be beneficial even after a large amount of WP data had been received, leading to overall G-means up to 18.5% better than those of WP classifiers.
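
The abstract reports all results in G-mean without defining it. As commonly used for class-imbalanced problems, G-mean is the geometric mean of the recall on each class; the sketch below assumes that definition and is not taken from the paper's code. The function name `g_mean` and the label vectors are hypothetical, chosen only to show why the metric suits defect prediction, where defect-inducing changes are the minority class.

```python
import math


def g_mean(y_true, y_pred):
    """Geometric mean of the two per-class recalls.

    Labels are assumed to be 1 (defect-inducing) and 0 (clean).
    """
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("G-mean needs at least one example of each class")
    # Correctly predicted defect-inducing and clean changes.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return math.sqrt((tp / pos) * (tn / neg))


# Hypothetical data: 2 defect-inducing changes out of 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_clean = [0] * 10                     # 80% accuracy, finds no defects
some_defects = [1, 0, 0, 0, 0, 0, 0, 0, 1, 1]

print(g_mean(y_true, always_clean))  # 0.0
print(g_mean(y_true, some_defects))  # sqrt(0.5 * 0.75)
```

A classifier that always predicts "clean" scores 0 on G-mean despite its high accuracy, which is why a metric of this kind is preferred over accuracy when classes are imbalanced.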

Details

Original language: English
Title of host publication: 42nd International Conference on Software Engineering (ICSE 2020)
Publication status: Accepted/In press - 8 Dec 2019
Event: 42nd International Conference on Software Engineering (ICSE 2020) - Seoul, Korea, Republic of
Duration: 23 May 2020 to 29 May 2020

Conference

Conference: 42nd International Conference on Software Engineering (ICSE 2020)
Country: Korea, Republic of
City: Seoul
Period: 23/05/20 to 29/05/20

Keywords

  • Software defect prediction, cross-project learning, transfer learning, online learning, verification latency, concept drift, class imbalance