The University of Southampton
University of Southampton Institutional Repository

A large-scale dataset of popular open source projects

Altherwi, Muna and Gravell, Andrew M. (2019) A large-scale dataset of popular open source projects. Journal of Computers, 14 (4), 240-246. (doi:10.17706/jcp.14.4.240-246).

Record type: Article

Abstract

Online open source software repositories offer a wealth of information about software artifacts and the development process, making them a valuable source of research data. Mining software repositories and retrieving project data from them provide an opportunity to build large-scale datasets of selected, high-quality, real project data. Such datasets could be used to empirically validate assumptions, test hypotheses, and verify anecdotal claims about software development processes and the resulting artifacts. Moreover, publishing them would make it possible to replicate and verify studies, which in turn can enhance research quality. Thus, in this work, we publish a large-scale dataset of 4349 projects in 11 general-purpose programming languages, gathered from GitHub repositories where a primary language can be identified. Uses of such a dataset range from empirically validating claims in the software engineering field to serving as training and test sets for machine learning.

This record has no associated files available for download.

More information

Accepted/In Press date: 13 March 2019
Published date: 2 April 2019

Identifiers

Local EPrints ID: 474715
URI: http://eprints.soton.ac.uk/id/eprint/474715
ISSN: 1796-203X
PURE UUID: 29790476-bd6a-4e44-b4d0-ed7006a15d84

Catalogue record

Date deposited: 01 Mar 2023 18:04
Last modified: 16 Mar 2024 17:27

Contributors

Author: Muna Altherwi
Author: Andrew M. Gravell
