Authors:

Karthik Shivashankar, Mili Orucevic, Maren Maritsdatter Kruke, Antonio Martini

Paper:

https://arxiv.org/abs/2408.09128

Introduction

Technical Debt (TD) is a critical concept in software development, representing the future cost of additional work due to choosing suboptimal solutions or evolving requirements. Managing TD is essential for maintaining code quality, reducing long-term maintenance costs, and ensuring the overall health of software projects. This study, conducted by Karthik Shivashankar, Mili Orucevic, Maren Maritsdatter Kruke, and Antonio Martini, advances TD classification using transformer-based models. The research focuses on enhancing the accuracy and efficiency of TD identification in large-scale software development by employing multiple binary classifiers combined through ensemble learning.

Related Work

Transformer Models

Transformer architectures, such as BERT and GPT, have revolutionized natural language processing (NLP) by capturing contextual relationships between words in a sentence. These models leverage deep learning techniques to understand and generate human-like text, making them suitable for interpreting software documentation and issue trackers.

Comparison of LLMs and BERT-based Models

Large Language Models (LLMs) like GPT-3 and GPT-4 are known for their generative capabilities, while models like DistilRoBERTa offer a more efficient alternative with reduced computational requirements. The choice between these models depends on the specific requirements of the task, including available computational resources and the complexity of the classification problem.

Research Methodology

Data Mining

The study utilized the GitHub Archive (GHArchive) to accumulate a large dataset of software development issues from January 1, 2015, to May 25, 2024. A carefully crafted regular expression (regex) pattern was employed to identify TD-related issues and specific types of TD from the dataset.
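The paper's exact regex pattern is not reproduced here, but a keyword-based filter of this kind might look like the following minimal sketch (the keyword list and the `is_td_issue` helper are hypothetical, for illustration only):

```python
import re

# Hypothetical pattern: matches common TD-related keywords, case-insensitively.
# The study's actual regex and keyword set may differ.
TD_PATTERN = re.compile(
    r"\b(technical[\s_-]?debt|refactor(?:ing)?|code[\s_-]?smell|legacy[\s_-]?code)\b",
    re.IGNORECASE,
)

def is_td_issue(text: str) -> bool:
    """Return True if the issue title/body matches a TD-related keyword."""
    return bool(TD_PATTERN.search(text))
```

In practice such a pattern would be applied to issue titles, bodies, and labels streamed from the GHArchive event dumps.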

Dataset Processing and Cleaning

The raw data underwent extensive preprocessing and cleaning to prepare it for effective model training. This included duplicate removal, text normalization, noise reduction, and content filtering. The dataset was balanced to ensure robust model training, with equal proportions of positive and negative labels for each TD category.
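The steps above can be roughly illustrated as follows; the normalization rules, length threshold, and column names are assumptions for the sketch, not the paper's exact pipeline:

```python
import pandas as pd

def clean_and_balance(df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Sketch of a cleaning pipeline: normalize, dedupe, filter, balance."""
    # Text normalization: lowercase and collapse whitespace.
    df = df.assign(text=df["text"].str.lower().str.split().str.join(" "))
    # Duplicate removal and noise reduction (length threshold is an assumption).
    df = df.drop_duplicates(subset="text")
    df = df[df["text"].str.len() >= 10]
    # Balance: downsample so every label has the same number of examples.
    n = df["label"].value_counts().min()
    parts = [df[df["label"] == lbl].sample(n=n, random_state=seed)
             for lbl in sorted(df["label"].unique())]
    return pd.concat(parts).reset_index(drop=True)
```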

Model Training

Binary classification models were trained on a balanced dataset using 5-fold cross-validation over five epochs. Individual binary classifiers were trained for each specific TD type, employing an ensemble learning technique to improve predictive performance. For multiclass classification, stratified 5-fold cross-validation was used, and class weights were incorporated to mitigate the impact of class imbalance.
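The paper fine-tunes DistilRoBERTa classifiers, which is too heavy to demonstrate inline; as a lightweight, runnable stand-in, the same stratified 5-fold scheme with class weighting can be sketched with scikit-learn (the TF-IDF + logistic regression model and the function name are illustrative substitutes, not the authors' setup):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline

def cross_validate_td(texts, labels, n_splits=5, seed=42):
    """Stratified k-fold CV with class weighting to offset label imbalance.

    The study trains one such binary classifier per TD type and combines
    them as an ensemble; here a single classifier is shown.
    """
    X, y = np.array(texts, dtype=object), np.array(labels)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        model = make_pipeline(
            TfidfVectorizer(),
            LogisticRegression(class_weight="balanced", max_iter=1000),
        )
        model.fit(X[train_idx], y[train_idx])
        scores.append(f1_score(y[val_idx], model.predict(X[val_idx])))
    return float(np.mean(scores))
```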

Model Evaluation and Testing with OOD Dataset

The models were evaluated using standard metrics such as accuracy, precision, recall, MCC, AUC ROC, and F1-score on a reserved test set. Additionally, the models were tested against an out-of-distribution (OOD) dataset to determine their generalization capabilities to new, unseen data contexts.
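All of these metrics are available in scikit-learn; the helper below is an illustration of how they might be gathered per classifier, not the authors' evaluation code:

```python
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score, roc_auc_score)

def evaluate(y_true, y_pred, y_score):
    """Collect the standard binary-classification metrics reported in the paper.

    y_pred holds hard labels; y_score holds predicted probabilities
    (needed for AUC ROC).
    """
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "mcc": matthews_corrcoef(y_true, y_pred),
        "auc_roc": roc_auc_score(y_true, y_score),
    }
```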

Experimental Design

Planning and Designing the Experiment

The experiment involved training and evaluating transformer-based models on a comprehensive dataset from GitHub Archive Issues, supplemented with industrial data validation. The models’ performance was assessed on both in-distribution and out-of-distribution datasets to ensure practical applicability.

Preparing the Data and Conditions

The dataset was curated by mining GitHub issues labeled with keywords corresponding to various TD types. The data was split into training and testing sets using an 85/15 ratio, maintaining the balance of positive and negative labels within each subset. An OOD dataset was created for robust evaluation.
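An 85/15 split that preserves label proportions in both subsets can be expressed with scikit-learn's `train_test_split` and its `stratify` parameter (the toy corpus below is hypothetical):

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for the mined GitHub issues.
texts = [f"issue {i}" for i in range(100)]
labels = [1] * 40 + [0] * 60

# 85/15 split; stratify keeps the positive/negative ratio in each subset.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.15, stratify=labels, random_state=42
)
```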

Results and Analysis

RQ1: Effectiveness of Transformer-Based Models in Classifying TD Issues

The results demonstrated that the DistilRoBERTa-based TD binary classifier performed consistently well on both the test set and the OOD dataset, with precision, recall, and F1 scores all exceeding 0.90. However, a performance drop was observed on the OOD VSCode dataset, highlighting the challenge of domain shift.

RQ1.1: Impact of Fine-Tuning on Model Performance

Fine-tuning the models on project-specific data significantly improved their performance. The fine-tuned models consistently outperformed their non-fine-tuned counterparts across all datasets, with substantial improvements in precision, recall, accuracy, F1-score, MCC, and AUC ROC.

RQ2: Comparison of GPT and DistilRoBERTa Models

The fine-tuned TD DistilRoBERTa model outperformed the out-of-the-box GPT-4o model across all metrics, demonstrating the superior performance of the smaller, more efficient DistilRoBERTa model in TD classification tasks.

RQ2.1: Task-Specific Fine-Tuning of GPT Models

The task-specific fine-tuned TD DistilRoBERTa model generally outperformed the larger GPT models across most metrics, highlighting the potential cost-benefit advantage of using more resource-efficient models without significant performance loss.

RQ3: Effectiveness of Expert Ensemble of Binary Classifiers

The binary classifiers demonstrated higher precision and recall in both test and OOD datasets compared to the multiclass model. The ensemble of binary classifiers provided more precise and reliable identification of specific TD types.
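One plausible way to combine per-type binary "experts" into such an ensemble is thresholded voting: each expert scores the issue for its own TD type, and every type whose score clears the threshold is assigned. The interface and threshold below are assumptions, not the paper's exact mechanism:

```python
def ensemble_predict(issue_text, experts, threshold=0.5):
    """Hypothetical expert ensemble: each expert maps text -> probability
    that the issue belongs to its TD type; an issue may match several types.
    """
    return {td_type: p for td_type, clf in experts.items()
            if (p := clf(issue_text)) >= threshold}
```

Because each expert decides independently, a single issue can legitimately carry several TD labels, which a single multiclass model cannot express.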

RQ3.1: Performance Comparison Across Different Issue Types

The fine-tuned DistilRoBERTa model demonstrated superior performance in most technical categories compared to GPT-4o, highlighting its effectiveness in classifying various TD issue types.

Overall Conclusion

This study advances the field of TD classification by demonstrating the effectiveness of transformer-based models, particularly DistilRoBERTa, in identifying various TD types. Fine-tuning on project-specific data significantly enhances model performance, and the use of expert binary classifiers provides more precise and reliable identification of TD types. The research highlights the importance of targeted fine-tuning over sheer model size and underscores the practical applicability of the proposed approach in diverse software projects. The release of the curated dataset aims to stimulate further advancements in TD classification research, ultimately enhancing software project outcomes and development practices by enabling early TD identification and management.
