Journal Article

Unknown

Code Comment Classification with Data Augmentation and Transformer-Based Models

Authors

Mushfiqur Rahman, Mohammed Latif Siddiq

Author Affiliations

Bangladesh University of Engineering and Technology, University of Notre Dame

Year

2025

Citations

2

DOI

10.1109/nlbse66842.2025.00013

Abstract

Effective classification of code comment sentences into meaningful categories is critical for software comprehension and maintenance. In this work, we present a solution for the NLBSE'25 Code Comment Classification Tool Competition, achieving a 6.7% improvement in accuracy over the baseline STACC models. Our solution employs a multi-step methodology, beginning with translation-retranslation techniques to generate synthetic datasets. By translating the original dataset into multiple languages and back into English, we introduce linguistic diversity that enriches the training data and improves model generalization. We fine-tune transformer-based architectures, including BERT, CodeBERT, RoBERTa, and DistilBERT, on this augmented dataset. After extensive evaluation, the best-performing model is selected for a robust multi-label classification framework tailored to the Java, Python, and Pharo datasets. The framework is designed…
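
The translation-retranslation step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `translate` callable, the `pivot_langs` parameter, the `Sample` layout, and the deduplication rule are all hypothetical, since the abstract does not say which translation system or pivot languages were used.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

# A labelled sample: (comment sentence, multi-hot label vector).
Sample = Tuple[str, Tuple[int, ...]]

# Hypothetical translator signature: (text, source_lang, target_lang) -> text.
# Any machine-translation backend could be wrapped to fit this shape.
Translator = Callable[[str, str, str], str]


def back_translate_augment(
    samples: Iterable[Sample],
    translate: Translator,
    pivot_langs: Sequence[str],
    source_lang: str = "en",
) -> List[Sample]:
    """Return the original samples plus round-trip paraphrases.

    Each sentence is translated into every pivot language and back into
    the source language. A paraphrase is kept only if it differs
    (case-insensitively) from the texts already kept for that sample.
    Labels are copied unchanged, on the assumption that a round trip
    preserves the category of the comment.
    """
    augmented: List[Sample] = []
    for text, labels in samples:
        augmented.append((text, labels))
        seen = {text.strip().lower()}
        for lang in pivot_langs:
            pivot = translate(text, source_lang, lang)
            round_trip = translate(pivot, lang, source_lang).strip()
            key = round_trip.lower()
            if round_trip and key not in seen:
                seen.add(key)
                augmented.append((round_trip, labels))
    return augmented
```

The multi-hot label tuple reflects the multi-label setting mentioned in the abstract, where one comment sentence may belong to several categories at once. The augmented list would then be passed to whichever fine-tuning pipeline is in use.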

Fields & Keywords

Physical Sciences, Computer Science, Artificial Intelligence, Natural Language Processing Techniques, Software Engineering Research, Web Application Security Vulnerabilities, Programming language, Electrical engineering