Journal Article | Unknown
Code Comment Classification with Data Augmentation and Transformer-Based Models
Author Affiliations
Bangladesh University of Engineering and Technology, University of Notre Dame
Year: 2025
Citations: 2
Abstract
Classifying code comment sentences into meaningful categories is critical for software comprehension and maintenance. In this work, we present a solution for the NLBSE'25 Code Comment Classification Tool Competition that achieves a 6.7% improvement in accuracy over the baseline STACC models. Our solution employs a multi-step methodology, beginning with translation-retranslation techniques to generate synthetic datasets. By translating the original dataset into multiple languages and back into English, we introduce linguistic diversity that enriches the training data and improves model generalization. We fine-tune transformer-based architectures, including BERT, CodeBERT, RoBERTa, and DistilBERT, on this augmented dataset. After extensive evaluation, we select the best-performing model for a robust multi-label classification framework tailored to the Java, Python, and Pharo datasets. The framework is designed…
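The translation-retranslation step described in the abstract could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `translate` callable is a hypothetical interface that a real machine translation system would sit behind, the pivot languages are placeholders (the abstract does not name them), and the `"summary"` label is only an example category. `toy_translate` is a stand-in so the sketch runs without any external service.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: (source_lang, target_lang, text) -> translated text.
# In practice this would wrap a real machine translation system.
Translator = Callable[[str, str, str], str]
# One training sample: (comment sentence, category labels).
Sample = Tuple[str, Tuple[str, ...]]


def back_translate(text: str, pivot: str, translate: Translator) -> str:
    """Translate English text into the pivot language and back into English."""
    return translate(pivot, "en", translate("en", pivot, text))


def augment(samples: Sequence[Sample], pivots: Sequence[str],
            translate: Translator) -> List[Sample]:
    """Return the original samples plus label-preserving paraphrases.

    A paraphrase is kept only if it differs (case-insensitively) from every
    sentence already collected, so unchanged round trips add nothing.
    """
    seen = {text.strip().lower() for text, _ in samples}
    augmented = list(samples)
    for text, labels in samples:
        for pivot in pivots:
            paraphrase = back_translate(text, pivot, translate).strip()
            key = paraphrase.lower()
            if paraphrase and key not in seen:
                seen.add(key)
                augmented.append((paraphrase, labels))
    return augmented


# Toy stand-in translator (assumption, for demonstration only): it tags the
# text on the way out and applies a fixed rewording on the way back.
_TOY_REWRITES = {"fr": ("returns", "gives back"), "de": ("returns", "yields")}


def toy_translate(src: str, tgt: str, text: str) -> str:
    if tgt != "en":
        return f"<{tgt}>{text}"
    body = text[len(src) + 2:]  # drop the "<xx>" tag added above
    old, new = _TOY_REWRITES.get(src, ("", ""))
    return body.replace(old, new) if old else body


if __name__ == "__main__":
    data = [("returns the cached value", ("summary",))]
    for sentence, labels in augment(data, ["fr", "de", "es"], toy_translate):
        print(labels, sentence)
```

The labels are copied unchanged to each paraphrase, which relies on the assumption that a round trip through another language preserves the category of the comment. Dropping round trips that come back identical keeps the augmented set from simply duplicating the original sentences.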