Design and implementation of a Pipeline for advanced DNA analysis (NSG) by using Machine Learning Techniques

  1. Zakieh Alizadehsani (1)
  2. Juan M. Corchado (1)

  (1) Departamento de Informática y Automática, Facultad de Ciencias, Universidad de Salamanca
Book:
Avances en Informática y Automática. Decimocuarto workshop
  1. Jorge Vicente Gabriel (coord.)
  2. Iñaki Arberas Vicente (coord.)

Publisher: Universidad de Salamanca

ISBN: 978-84-09-25507-8

Year of publication: 2020

Pages: 165-183

Type: Book chapter

Abstract

DNA sequencing is a laboratory method for determining the sequence of a DNA molecule. It can be used to study human diversity and disease. Current sequencing technologies produce large amounts of DNA sequence data, which are used in a wide range of biological applications, including disease discovery, genome expression analysis, and detection of sequence variants. For disease identification based on NGS data, these technologies rely on variant calling. A variant is a difference in the human DNA sequence; some variants affect the way the human body functions, and some disrupt it. Variant calling finds all types of variants: most are benign, and some are associated with disease. We therefore need approaches that separate benign variants from pathogenic ones. Most studies apply many tools to detect chromosomal abnormalities in order to diagnose disease. We implement a platform that presents a combined predictor model to detect potentially pathogenic variants. This method attempts to use the best and most effective features to identify disease variants. The features fall into two categories: the first is based on well-known variant pathogenicity prediction tools such as SIFT, and the second consists of biological features. To detect pathogenic variants, we analyze DNA base changes and mutation characteristics. We trained our model on ClinVar, a database of clinically significant variants. The predictor model uses a configurable ensemble strategy to achieve a more accurate model and reduce overfitting. The results demonstrate that our model has reclassified uncertain or not-provided variants as pathogenic or benign.
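The idea of a configurable ensemble that combines tool-based and biological features can be illustrated with a minimal sketch. This is not the chapter's implementation: the feature names (`sift`, `conservation`), the hand-set weights, the 0.5 threshold, and the two strategies are all assumptions chosen for illustration, and nothing here is trained on ClinVar. The only domain fact relied on is that SIFT scores lie in [0, 1], with lower values indicating a more damaging change.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Variant = Dict[str, float]           # hypothetical feature record for one variant
Scorer = Callable[[Variant], float]  # maps a variant to a pathogenicity score in [0, 1]


def sift_scorer(v: Variant) -> float:
    # Lower SIFT scores indicate a more damaging change, so invert the score
    # to make higher values mean "more likely pathogenic".
    return 1.0 - v["sift"]


def conservation_scorer(v: Variant) -> float:
    # Hypothetical biological feature, assumed to be pre-scaled to [0, 1].
    return v["conservation"]


@dataclass
class Ensemble:
    members: List[Tuple[Scorer, float]]  # (scorer, weight) pairs
    strategy: str = "soft"               # "soft": weighted mean; "hard": weighted vote
    threshold: float = 0.5

    def score(self, v: Variant) -> float:
        total = sum(w for _, w in self.members)
        if self.strategy == "soft":
            # Weighted average of the member scores.
            return sum(w * s(v) for s, w in self.members) / total
        if self.strategy == "hard":
            # Weighted fraction of members that vote "pathogenic".
            return sum(w for s, w in self.members if s(v) >= self.threshold) / total
        raise ValueError(f"unknown strategy: {self.strategy}")

    def classify(self, v: Variant) -> str:
        return "pathogenic" if self.score(v) >= self.threshold else "benign"


if __name__ == "__main__":
    ensemble = Ensemble([(sift_scorer, 1.0), (conservation_scorer, 1.0)])
    print(ensemble.classify({"sift": 0.01, "conservation": 0.9}))
    print(ensemble.classify({"sift": 0.8, "conservation": 0.2}))
```

In a trained setting the hand-written scorers would be replaced by fitted models and the weights would be learned or tuned. Switching `strategy` shows how the combination rule can be configured without changing the member predictors.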