The University of Southampton
University of Southampton Institutional Repository

A methodology on converting 10-K filings into a machine learning dataset and its applications

Sami Kacar, M., Yumusak, S. and Kodaz, H. (2023) A methodology on converting 10-K filings into a machine learning dataset and its applications. IEICE Transactions on Information and Systems: Special Issue on Human Communications, E106D (4), 477-487. (doi:10.1587/TRANSINF.2022IIP0001).

Record type: Article

Abstract

Companies listed on the stock exchange are required to file their annual reports with the U.S. Securities and Exchange Commission (SEC) within the first three months following the end of the fiscal year. These reports, known as 10-K filings, are made publicly available by the SEC through the Electronic Data Gathering, Analysis, and Retrieval (EDGAR) database. 10-K filings publish companies' financial reports in standard file formats (XBRL, HTML, PDF). Although these file formats impose a standard structure, the content and metadata of the financial reports (e.g. tag names) are not strictly bound to a predefined schema. This study proposes a data collection and data preprocessing method to semantify the financial reports and to use the collected data for further analysis (i.e. machine learning). An analysis of the eight datasets created during the study is presented using the proposed data transformation methods. As a use case, five different machine learning algorithms were applied to these datasets to predict whether the corresponding company is included in the S&P 500 index. The strong machine learning results indicate that the dataset generation methodology is successful and that the datasets are ready for further use.

More information

e-pub ahead of print date: 22 October 2022
Published date: 1 April 2023
Additional Information: Publisher Copyright: © 2023 The Institute of Electronics, Information and Communication Engineers.
Keywords: 10-K filings, EDGAR, XBRL, data pre-processing, machine learning

Identifiers

Local EPrints ID: 477504
URI: http://eprints.soton.ac.uk/id/eprint/477504
ISSN: 0916-8532
PURE UUID: b0e0f91f-6f6c-4b1b-83cb-dd5b27e3def2

Catalogue record

Date deposited: 07 Jun 2023 16:55
Last modified: 17 Mar 2024 02:35

Contributors

Author: M. Sami Kacar
Author: S. Yumusak
Author: H. Kodaz
