Detecting programming language from source code using Bayesian learning techniques
Pages: 513-522
Khasnabish, Jyotiska Nath, Sodhi, Mitali, Deshmukh, Jayati and Srinivasaraghavan, G. (2014) Detecting programming language from source code using Bayesian learning techniques. In Machine Learning and Data Mining in Pattern Recognition - 10th International Conference, MLDM 2014, Proceedings. vol. 8556 LNAI, Springer. (doi:10.1007/978-3-319-08979-9_39).
Record type: Conference or Workshop Item (Paper)
Abstract
With dozens of popular programming languages used worldwide, the number of source code files of programs available online for public use is massive. However, most blogs, forums or online Q&A websites have poor searchability for source code in a specific programming language. Naive rules of thumb based on the file extension, if any, are invariably used by programming language editors for syntax highlighting, indentation and other ways of improving code readability. A more systematic way to identify the language in which a given source file was written would be of immense value. We believe that simple Bayesian models would be adequate for this, given the intrinsic syntactic structure of any programming language. In this paper, we present Bayesian learning models for correctly identifying, with high probability, the programming language in which a given piece of source code was written. We used 20,000 source code files across 10 programming languages to train and test the following Bayesian classifier models: Naive Bayes, Bayesian Network and Multinomial Naive Bayes. Lastly, we show a performance comparison among the three models in terms of classification accuracy on the test data.
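As an illustration of the kind of classifier the abstract names, the following is a minimal sketch of a multinomial Naive Bayes language identifier. It is not the authors' implementation: the record does not describe their features or data, so the token-based features, the Laplace smoothing constant `alpha`, the tiny `TOY_SAMPLES` training set, and all function and class names here are assumptions made purely for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed feature scheme: identifiers/keywords plus single punctuation
# characters. Numeric literals are ignored. The paper may use other features.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|[^\sA-Za-z_0-9]")


def tokenize(source):
    """Split source text into word tokens and single punctuation tokens."""
    return TOKEN_RE.findall(source)


class MultinomialNB:
    """Multinomial Naive Bayes over token counts with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # smoothing constant (assumed)
        self.token_counts = defaultdict(Counter)  # label -> token frequencies
        self.doc_counts = Counter()               # label -> number of files
        self.vocab = set()

    def fit(self, samples):
        """Train on an iterable of (source_text, language_label) pairs."""
        for source, label in samples:
            tokens = tokenize(source)
            self.token_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def log_scores(self, source):
        """Return log P(label) + sum of log P(token | label) for each label."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        # Tokens never seen in training carry no evidence and are skipped.
        tokens = [t for t in tokenize(source) if t in self.vocab]
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_docs / total_docs)
            for t in tokens:
                score += math.log((counts[t] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, source):
        """Return the label with the highest posterior log score."""
        scores = self.log_scores(source)
        return max(scores, key=scores.get)


# Hypothetical toy training data; the paper reports 20,000 files
# across 10 languages.
TOY_SAMPLES = [
    ("def add(a, b):\n    return a + b\n", "python"),
    ("import sys\nfor x in sys.argv:\n    print(x)\n", "python"),
    ("#include <stdio.h>\nint main(void) { printf(\"hi\\n\"); return 0; }\n", "c"),
    ("int add(int a, int b) { return a + b; }\n", "c"),
]

model = MultinomialNB().fit(TOY_SAMPLES)
print(model.predict("def f(x):\n    return x\n"))
print(model.predict("int main(void) { return 0; }"))
```

Working in log space avoids numeric underflow when many token probabilities are multiplied, and the smoothing term keeps a token that is absent from one language's training files from driving that language's probability to zero.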
This record has no associated files available for download.
More information
Published date: 2014
Venue - Dates:
10th International Conference on Machine Learning and Data Mining in Pattern Recognition, MLDM 2014, St. Petersburg, Russian Federation, 2014-07-21 - 2014-07-24
Keywords:
bayesian learning, bayesian network, multinomial naive bayes, naive bayes, programming language identifier, source code detection
Identifiers
Local EPrints ID: 493374
URI: http://eprints.soton.ac.uk/id/eprint/493374
ISSN: 0302-9743
PURE UUID: 733c3289-6be5-47eb-99f1-08afbb7a2159
Catalogue record
Date deposited: 30 Aug 2024 17:09
Last modified: 31 Aug 2024 02:12
Contributors
Author: Jyotiska Nath Khasnabish
Author: Mitali Sodhi
Author: Jayati Deshmukh
Author: G. Srinivasaraghavan