IMPLEMENTATION OF DATA LEVEL APPROACH TECHNIQUES TO SOLVE UNBALANCED DATA CASE ON SOFTWARE DEFECT CLASSIFICATION

  • Hanif Rahardian ULM
  • Mohammad Reza Faisal ULM
  • Friska Abadi ULM
  • Radityo Adi Nugroho ULM
  • Rudy Herteno ULM

Abstract

Defects can cause significant software rework, delays, and high costs, so the likelihood of defects needs to be predicted in order to prevent them. Software metrics datasets are used for this prediction. NASA MDP is one of the popular software metrics repositories used for software defect prediction; it comprises 13 datasets, which are generally imbalanced. Imbalance in a dataset can degrade defect prediction performance, because the more imbalanced the data, the more the predictions favor the majority class. Data imbalance can be handled with two approaches, namely the data level approach and the algorithm level approach. The data level approach aims to improve the class distribution by using resampling and data synthesis techniques. This research applies a data level approach using the resampling techniques Random Oversampling (ROS), Random Undersampling (RUS), Synthetic Minority Oversampling Technique (SMOTE), Tomek Link (TL), and One-Sided Selection (OSS). The resampled data are classified with Naïve Bayes, validated using 10-fold cross-validation, and evaluated with the Area Under the ROC Curve (AUC). Compared per dataset, the best AUC value was obtained on MC2, with a value of 0.7277 using SMOTE. Compared per data level technique, the best average AUC value was obtained using Tomek Link (TL), with a value of 0.62587.
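As a rough illustration of the two simplest resampling techniques named above (ROS and RUS) and of the AUC metric, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the toy data, and the seeds are assumptions made for this example, and SMOTE, TL, OSS, and the Naïve Bayes classifier are left out. In a cross-validated setup such as the 10-fold scheme described above, resampling would normally be applied to the training folds only; the abstract does not state how this was handled.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """ROS: duplicate randomly chosen rows of smaller classes until balanced."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())  # size of the largest class
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """RUS: keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())  # size of the smallest class
    X_out, y_out = [], []
    for label in counts:
        rows = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(rows, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


def auc(scores, labels, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, lab in zip(scores, labels) if lab == positive]
    neg = [s for s, lab in zip(scores, labels) if lab != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 8 non-defective modules (0) and 2 defective ones (1).
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_ros, y_ros = random_oversample(X, y)
X_rus, y_rus = random_undersample(X, y)
print(Counter(y_ros))  # both classes now have 8 rows
print(Counter(y_rus))  # both classes now have 2 rows

# Hypothetical classifier scores for four modules; 3 of 4 pairs are ordered correctly.
print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

ROS keeps all original rows and grows the minority class, while RUS discards majority rows, which is the trade-off between overfitting to duplicated samples and losing information.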

Published
2020-06-29
How to Cite
Rahardian, H., Mohammad Reza Faisal, Friska Abadi, Radityo Adi Nugroho, & Rudy Herteno. (2020). IMPLEMENTATION OF DATA LEVEL APPROACH TECHNIQUES TO SOLVE UNBALANCED DATA CASE ON SOFTWARE DEFECT CLASSIFICATION. Journal of Data Science and Software Engineering, 1(01), 53-62. https://doi.org/10.20527/jdsse.v1i01.13
Section
Articles