A large-scale dataset of popular open source projects
Online open source software repositories offer a wealth of information about software artifacts and the development process, making them a valuable source of research data. Mining these repositories and retrieving project data from them provide an opportunity to build large-scale datasets of selected, high-quality, real project data. Such datasets can be used to empirically validate assumptions, test hypotheses, and verify anecdotal claims about software development processes and the resulting artifacts. Moreover, publishing them makes studies replicable and verifiable, which in turn can enhance research quality. Thus, in this work, we publish a large-scale dataset of 4349 projects in 11 general-purpose programming languages, gathered from GitHub repositories in which a primary language can be identified. Uses of such a dataset range from empirically validating claims in the software engineering field to serving as training and test sets for machine learning.
240-246
Altherwi, Muna
71c2ae4d-ae5a-48c8-ba74-7e97d95a0837
Gravell, Andrew M.
f3a261c5-f057-4b5f-b6ac-c1ca37d72749
2 April 2019
Altherwi, Muna and Gravell, Andrew M. (2019) A large-scale dataset of popular open source projects. Journal of Computers, 14 (4), 240-246. (doi:10.17706/jcp.14.4.240-246).
This record has no associated files available for download.
More information
Accepted/In Press date: 13 March 2019
Published date: 2 April 2019
Identifiers
Local EPrints ID: 474715
URI: http://eprints.soton.ac.uk/id/eprint/474715
ISSN: 1796-203X
PURE UUID: 29790476-bd6a-4e44-b4d0-ed7006a15d84
Catalogue record
Date deposited: 01 Mar 2023 18:04
Last modified: 16 Mar 2024 17:27
Contributors
Author: Muna Altherwi
Author: Andrew M. Gravell