The University of Southampton
University of Southampton Institutional Repository

CI-SpliceAI—Improving machine learning predictions of disease causing splicing variants using curated alternative splice sites


Strauch, Yaron, Lord, Jenny, Niranjan, Mahesan and Baralle, Diana, Palazzo, Alexander F. (ed.) (2022) CI-SpliceAI—Improving machine learning predictions of disease causing splicing variants using curated alternative splice sites. PLoS ONE, 17 (6), e0269159. (doi:10.1371/journal.pone.0269159).

Record type: Article

Abstract

Background: It is estimated that up to 50% of all disease-causing variants disrupt splicing. Because splicing is complex, our ability to predict which variants disrupt it is limited, which means missed diagnoses for patients. The emergence of machine learning for targeted medicine holds great potential to improve prediction of splice-disrupting variants. The recently published SpliceAI algorithm utilises deep neural networks and has been reported to have greater accuracy than other commonly used methods.

Methods and findings: The original SpliceAI was trained on splice sites included in primary isoforms combined with novel junctions observed in GTEx data, which might introduce noise and decorrelate the model's input from its output. Restricting the training data to validated, manually annotated primary and alternatively spliced GENCODE sites may improve predictive ability. All of these gene isoforms were collapsed (aggregated into one pseudo-isoform) and the SpliceAI architecture was retrained on them (CI-SpliceAI). The predictive performance of CI-SpliceAI on a newly curated dataset of 1,316 functionally validated variants from the literature was compared with that of the original SpliceAI, MMSplice, MaxEntScan, and SQUIRLS. Both SpliceAI algorithms outperformed the other methods, with the original SpliceAI achieving an accuracy of ∼91% and CI-SpliceAI improving on this at ∼92% overall. Predictive accuracy increased in the majority of curated variants.

Conclusions: We show that including only manually annotated alternatively spliced sites in training data improves prediction of clinically relevant variants, and we highlight avenues for further performance improvements.
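
The collapsing step mentioned in the abstract (aggregating all isoforms of a gene into one pseudo-isoform) can be illustrated with a minimal sketch. This is not the authors' code: the `Isoform` class, the `collapse_isoforms` function, the toy coordinates, the 1-based inclusive exon convention, and the restriction to a plus-strand gene are all assumptions made for illustration. The sketch only shows the idea of taking the union of annotated donor and acceptor sites across isoforms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Isoform:
    """Hypothetical container: exons as (start, end), 1-based inclusive."""
    name: str
    exons: tuple


def collapse_isoforms(isoforms):
    """Union the splice sites of all isoforms of one plus-strand gene.

    Assumption: a donor is the last exonic base before an intron and an
    acceptor is the first exonic base after it. Transcript start and end
    are not splice sites and are only used for the overall span.
    """
    donors, acceptors = set(), set()
    for iso in isoforms:
        exons = sorted(iso.exons)
        # Each pair of consecutive exons flanks exactly one intron.
        for (_, end), (next_start, _) in zip(exons, exons[1:]):
            donors.add(end)
            acceptors.add(next_start)
    span_start = min(start for iso in isoforms for start, _ in iso.exons)
    span_end = max(end for iso in isoforms for _, end in iso.exons)
    return {
        "span": (span_start, span_end),
        "donors": sorted(donors),
        "acceptors": sorted(acceptors),
    }


# Toy gene with a primary isoform, an alternative acceptor, and exon skipping.
primary = Isoform("primary", ((100, 200), (300, 400), (500, 600)))
alternative = Isoform("alternative", ((100, 200), (320, 400), (500, 600)))
skipping = Isoform("skipping", ((100, 200), (500, 600)))

pseudo = collapse_isoforms([primary, alternative, skipping])
print(pseudo)
```

In this toy example the alternative acceptor at position 320 ends up in the pseudo-isoform next to the primary acceptor at 300, so a model trained on the collapsed annotation would see both as positive labels.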

Text
CI-SpliceAI - Accepted Manuscript
Available under License Creative Commons Attribution.
Download (720kB)
Text
journal.pone.0269159 - Version of Record
Available under License Creative Commons Attribution.
Download (1MB)

More information

Accepted/In Press date: 16 May 2022
Published date: 3 June 2022
Additional Information: Funding Information: DB, JL, and YS are all supported by an NIHR (National Institute for Health Research, https://www.nihr.ac.uk/) Research Professorship to DB: RP-2016-07-011. MN received no specific funding for this work. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Publisher Copyright: © 2022 Strauch et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Identifiers

Local EPrints ID: 467297
URI: http://eprints.soton.ac.uk/id/eprint/467297
ISSN: 1932-6203
PURE UUID: 569d9385-c4df-4ab9-b7d3-7f55d31b7ef4
ORCID for Jenny Lord: orcid.org/0000-0002-0539-9343
ORCID for Mahesan Niranjan: orcid.org/0000-0001-7021-140X
ORCID for Diana Baralle: orcid.org/0000-0003-3217-4833

Catalogue record

Date deposited: 05 Jul 2022 16:49
Last modified: 17 Mar 2024 03:54

Contributors

Author: Yaron Strauch
Author: Jenny Lord
Author: Mahesan Niranjan
Author: Diana Baralle
Editor: Alexander F. Palazzo
