The University of Southampton
University of Southampton Institutional Repository

Detecting programming language from source code using bayesian learning techniques


Khasnabish, Jyotiska Nath, Sodhi, Mitali, Deshmukh, Jayati and Srinivasaraghavan, G. (2014) Detecting programming language from source code using bayesian learning techniques. In Machine Learning and Data Mining in Pattern Recognition - 10th International Conference, MLDM 2014, Proceedings. vol. 8556 LNAI, Springer. pp. 513-522. (doi:10.1007/978-3-319-08979-9_39).

Record type: Conference or Workshop Item (Paper)

Abstract

With dozens of popular programming languages used worldwide, the number of program source code files available online for public use is massive. However, most blogs, forums and online Q&A websites have poor searchability for source code in a specific programming language. Naive rules of thumb based on the file extension, if any, are invariably used by programming language editors for syntax highlighting, indentation and other ways of improving the readability of code. A more systematic way to identify the language in which a given source file was written would be of immense value. We believe that simple Bayesian models would be adequate for this, given the intrinsic syntactic structure of any programming language. In this paper, we present Bayesian learning models for correctly identifying, with high probability, the programming language in which a given piece of source code was written. We have used 20,000 source code files across 10 programming languages to train and test the following Bayesian classifier models: Naive Bayes, Bayesian Network and Multinomial Naive Bayes. Lastly, we show a performance comparison among the three models in terms of classification accuracy on the test data.
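As a rough illustration of the kind of classifier the abstract names, the following is a minimal sketch of a Multinomial Naive Bayes language identifier. It is not the authors' implementation: this record does not describe their features or training setup, so the tokenizer, the add-one (Laplace) smoothing, the function names and the four toy training snippets are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed tokenizer: identifiers/keywords, plus single punctuation characters.
# Numeric literals and whitespace are ignored.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|[^\sA-Za-z_0-9]")


def tokenize(source):
    return TOKEN_RE.findall(source)


def train(samples):
    """Count tokens per language from (language, source) pairs."""
    token_counts = defaultdict(Counter)  # language -> token frequencies
    doc_counts = Counter()               # language -> number of files
    for language, source in samples:
        doc_counts[language] += 1
        token_counts[language].update(tokenize(source))
    vocabulary = {tok for counts in token_counts.values() for tok in counts}
    return token_counts, doc_counts, vocabulary


def classify(model, source):
    """Return the language with the highest posterior log-probability."""
    token_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    # Tokens never seen in training carry no evidence and are skipped.
    tokens = [tok for tok in tokenize(source) if tok in vocabulary]
    best_language, best_score = None, -math.inf
    for language in doc_counts:
        counts = token_counts[language]
        # Add-one smoothing: every vocabulary token gets one pseudo-count.
        denominator = sum(counts.values()) + len(vocabulary)
        score = math.log(doc_counts[language] / total_docs)  # log prior
        for tok in tokens:
            score += math.log((counts[tok] + 1) / denominator)
        if score > best_score:
            best_language, best_score = language, score
    return best_language


# Hypothetical toy training set; the paper reports using 20,000 files.
SAMPLES = [
    ("python", "def add(a, b):\n    return a + b\n"),
    ("python", "import sys\nfor x in sys.argv:\n    print(x)\n"),
    ("c", '#include <stdio.h>\nint main(void) { printf("hi"); return 0; }\n'),
    ("c", "int add(int a, int b) { return a + b; }\n"),
]

model = train(SAMPLES)
guess = classify(model, "def square(n):\n    return n * n\n")
```

Working in log-probabilities avoids numeric underflow when a file contains many tokens, and the smoothing keeps a single unseen token from driving a language's score to zero.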

This record has no associated files available for download.

More information

Published date: 2014
Venue - Dates: 10th International Conference on Machine Learning and Data Mining in Pattern Recognition, MLDM 2014, St. Petersburg, Russian Federation, 2014-07-21 - 2014-07-24
Keywords: bayesian learning, bayesian network, multinomial naive bayes, naive bayes, programming language identifier, source code detection

Identifiers

Local EPrints ID: 493374
URI: http://eprints.soton.ac.uk/id/eprint/493374
ISSN: 0302-9743
PURE UUID: 733c3289-6be5-47eb-99f1-08afbb7a2159
ORCID for Jayati Deshmukh: orcid.org/0000-0002-1144-2635

Catalogue record

Date deposited: 30 Aug 2024 17:09
Last modified: 31 Aug 2024 02:12

Contributors

Author: Jyotiska Nath Khasnabish
Author: Mitali Sodhi
Author: Jayati Deshmukh
Author: G. Srinivasaraghavan
