University of Southampton Institutional Repository

JavaBERT: training a transformer-based model for the Java programming language


De Sousa, Nelson Tavares and Hasselbring, Wilhelm (2021) JavaBERT: training a transformer-based model for the Java programming language. In 2021 36th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW). IEEE. pp. 90-95. (doi:10.1109/ASEW52652.2021.00028).

Record type: Conference or Workshop Item (Paper)

Abstract

Code quality is, and will remain, a crucial factor in the development of new software, requiring appropriate tools to ensure functional and reliable code. Machine learning techniques are still rarely used in software engineering tools, missing out on the potential benefits of their application. Natural language processing has shown its potential for processing text data across a variety of tasks. We argue that such models can show similar benefits for software code processing. In this paper, we investigate how models used for natural language processing can be trained on software code. We introduce a data retrieval pipeline for software code and train a model on Java software code. The resulting model, JavaBERT, achieves high accuracy on the masked language modeling task, demonstrating its potential for software engineering tools.
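As a brief illustration of the masked language modeling task mentioned in the abstract: a BERT-style model is given code in which individual tokens are replaced by a mask, and it must predict the original tokens. The minimal sketch below uses the Hugging Face transformers library; the hub identifier "CAUKiel/JavaBERT" is an assumption for illustration and is not stated in this record, so substitute whichever checkpoint of the model is available.

from transformers import pipeline

# Load a fill-mask pipeline. The model id "CAUKiel/JavaBERT" is an
# assumption for illustration; any BERT-style masked-LM checkpoint works.
fill_mask = pipeline("fill-mask", model="CAUKiel/JavaBERT")

# Mask one token in a Java snippet and ask the model to restore it.
java_snippet = "public [MASK] void main(String[] args) { }"
for prediction in fill_mask(java_snippet, top_k=3):
    print(f"{prediction['token_str']!r}  score={prediction['score']:.3f}")

A model trained on Java source code should rank "static" highest for this snippet.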

This record has no associated files available for download.

More information

Published date: 2021
Venue - Dates: 36th IEEE/ACM International Conference on Automated Software Engineering Workshops, ASEW 2021, Virtual, Online, Australia, 2021-11-15 - 2021-11-19

Identifiers

Local EPrints ID: 488762
URI: http://eprints.soton.ac.uk/id/eprint/488762
PURE UUID: 9855155e-c462-4803-b21f-abbc8850be16
ORCID for Wilhelm Hasselbring: orcid.org/0000-0001-6625-4335

Catalogue record

Date deposited: 05 Apr 2024 16:37
Last modified: 10 Apr 2024 02:15


Contributors

Author: Nelson Tavares De Sousa
Author: Wilhelm Hasselbring


