A methodology on converting 10-K filings into a machine learning dataset and its applications

Companies listed on the stock exchange are required to file their annual reports with the U.S. Securities and Exchange Commission (SEC) within the first three months following the end of the fiscal year. These reports, known as 10-K filings, are made publicly available by the SEC through the Electronic Data Gathering, Analysis, and Retrieval (EDGAR) database. 10-K filings use standard file formats (XBRL, HTML, PDF) to publish the financial reports of the companies. Although the file formats impose a standard structure, the content and the metadata of the financial reports (e.g. tag names) are not strictly bound to a pre-defined schema. This study proposes a data collection and data preprocessing method to semantify the financial reports and use the collected data for further analysis (i.e. machine learning). An analysis of eight different datasets, which were created during the study using the proposed data transformation methods, is presented. As a use case based on these datasets, five different machine learning algorithms were used to predict whether the corresponding company is included in the S&P 500 index. The strong machine learning results indicate that the dataset generation methodology is successful and that the datasets are ready for further use.

10-K filings, EDGAR, XBRL, data pre-processing, machine learning

10.1587/TRANSINF.2022IIP0001

0916-8532

477-487

Sami Kacar, M.

2f6ef9ab-ff39-4d7c-b2af-ef6d2fdba42f

Yumusak, S.

5a45f53d-7a3c-4e3d-93b1-bc83f7096f37

Kodaz, H.

23792a05-de24-4c58-bf0e-132af51332cc