A methodology on converting 10-K filings into a machine learning dataset and its applications
Companies listed on the stock exchange are required to file their annual reports with the U.S. Securities and Exchange Commission (SEC) within the first three months following the end of the fiscal year. These reports, known as 10-K filings, are made publicly available by the SEC through the Electronic Data Gathering, Analysis, and Retrieval (EDGAR) database. 10-K filings use standard file formats (XBRL, HTML, PDF) to publish the companies' financial reports. Although the file formats impose a standard structure, the content and metadata of the financial reports (e.g. tag names) are not strictly bound to a predefined schema. This study proposes a data collection and preprocessing method to semantify the financial reports and use the collected data for further analysis (i.e. machine learning). An analysis of eight datasets, created during the study using the proposed data transformation methods, is presented. As a use case, five machine learning algorithms were applied to the datasets to predict whether the corresponding company is in the S&P 500 index. The strong machine learning results suggest that the dataset generation methodology is successful and that the datasets are ready for further use.
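The tag-name variability described above can be illustrated with a minimal sketch. It is not the authors' method: the alias table, the function names (`normalize_tag`, `filing_to_row`), and the sample values are all hypothetical, chosen only to show how facts reported under different tag names might be mapped onto one fixed set of columns before machine learning.

```python
from typing import Dict, Optional

# Hypothetical alias table (an assumption for illustration, not from the paper):
# canonical feature name -> tag-name variants that different filings might use.
TAG_ALIASES = {
    "revenue": {"Revenues", "SalesRevenueNet"},
    "net_income": {"NetIncomeLoss", "ProfitLoss"},
    "total_assets": {"Assets"},
}


def normalize_tag(tag: str) -> str:
    """Drop a namespace prefix such as 'us-gaap:' and return the local tag name."""
    return tag.split(":", 1)[-1].strip()


def filing_to_row(facts: Dict[str, float]) -> Dict[str, Optional[float]]:
    """Map one filing's {tag: value} facts onto the fixed feature columns.

    Features with no matching tag stay None, so every row has the same keys.
    The first matching tag wins if several aliases appear in one filing.
    """
    row: Dict[str, Optional[float]] = {feature: None for feature in TAG_ALIASES}
    for tag, value in facts.items():
        local = normalize_tag(tag)
        for feature, aliases in TAG_ALIASES.items():
            if local in aliases and row[feature] is None:
                row[feature] = value
    return row


# Two made-up filings that report the same concepts under different tag names.
filings = [
    {"us-gaap:Revenues": 1200.0, "us-gaap:NetIncomeLoss": 150.0, "us-gaap:Assets": 5000.0},
    {"us-gaap:SalesRevenueNet": 800.0, "us-gaap:ProfitLoss": -20.0},
]
rows = [filing_to_row(f) for f in filings]
```

Each resulting row has the same keys regardless of which tag names the filing used, which is the property a tabular machine learning dataset needs. A real pipeline would also have to handle reporting periods, units, and duplicate facts, which this sketch ignores.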
10-K filings, EDGAR, XBRL, data pre-processing, machine learning
477-487
Sami Kacar, M.
Yumusak, S.
Kodaz, H.
1 April 2023
Sami Kacar, M., Yumusak, S. and Kodaz, H. (2023) A methodology on converting 10-K filings into a machine learning dataset and its applications. IEICE Transactions on Information and Systems: Special Issue on Human Communications, E106D (4), 477-487. (doi:10.1587/TRANSINF.2022IIP0001).
More information
e-pub ahead of print date: 22 October 2022
Published date: 1 April 2023
Additional Information:
Publisher Copyright:
© 2023 The Institute of Electronics, Information and Communication Engineers.
Identifiers
Local EPrints ID: 477504
URI: http://eprints.soton.ac.uk/id/eprint/477504
ISSN: 0916-8532
PURE UUID: b0e0f91f-6f6c-4b1b-83cb-dd5b27e3def2
Catalogue record
Date deposited: 07 Jun 2023 16:55
Last modified: 17 Mar 2024 02:35
Contributors
Author: M. Sami Kacar
Author: S. Yumusak
Author: H. Kodaz