JavaBERT: training a transformer-based model for the Java programming language
De Sousa, Nelson Tavares and Hasselbring, Wilhelm (2021) JavaBERT: training a transformer-based model for the Java programming language. In: 2021 36th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW). IEEE, pp. 90-95. (doi:10.1109/ASEW52652.2021.00028).
Record type: Conference or Workshop Item (Paper)
Abstract
Code quality is, and will remain, a crucial factor in developing new software, requiring appropriate tools to ensure functional and reliable code. Machine learning techniques are still rarely used in software engineering tools, missing out on the potential benefits of their application. Natural language processing has shown its potential for processing text data across a variety of tasks. We argue that such models can offer similar benefits for software code processing. In this paper, we investigate how models used for natural language processing can be trained on software code. We introduce a data retrieval pipeline for software code and train a model on Java software code. The resulting model, JavaBERT, shows high accuracy on the masked language modeling task, demonstrating its potential for software engineering tools.
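As an illustration of the masked language modeling task evaluated in the paper, the minimal sketch below queries a BERT-style model through the Hugging Face transformers library in Python. The model identifier "CAUKiel/JavaBERT" is the Hub release commonly associated with this work but is an assumption here, as are the example snippet and mask placement; this is not code from the paper itself.

from transformers import pipeline

# Load a fill-mask pipeline; the model name is an assumption and should be
# verified on the Hugging Face Hub before use.
fill_mask = pipeline("fill-mask", model="CAUKiel/JavaBERT")

# Mask one token in a Java snippet and let the model predict the missing
# token, mirroring the masked language modeling task from the abstract.
code = "public [MASK] void main(String[] args) { }"
for prediction in fill_mask(code):
    print(prediction["token_str"], round(prediction["score"], 3))

A model that has learned Java syntax should rank a plausible keyword such as "static" among its top predictions for the masked position.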
More information
Published date: 2021
Venue - Dates: 36th IEEE/ACM International Conference on Automated Software Engineering Workshops, ASEW 2021, Virtual/Online, Australia, 2021-11-15 - 2021-11-19
Identifiers
Local EPrints ID: 488762
URI: http://eprints.soton.ac.uk/id/eprint/488762
PURE UUID: 9855155e-c462-4803-b21f-abbc8850be16
Catalogue record
Date deposited: 05 Apr 2024 16:37
Last modified: 10 Apr 2024 02:15
Contributors
Author: Nelson Tavares De Sousa
Author: Wilhelm Hasselbring