
Parallel analysis of variants in multiple bear genomes: algorithmic solutions


Created: Wed, 10/14/2020 - 00:00
Completed: Wed, 10/14/2020 - 00:00
Changed: Tue, 08/02/2022 - 15:02

Submitted by: Galen Collier
Language: English

CAREERS
Data-Facing (346), bioinformatics (277)
Complete

Project Leader

Andrey Grigoriev

Project Personnel

Amit Harel

Project Information

Introduction: This interdisciplinary proposal focuses on the application of high-performance computing (HPC) approaches to the analysis of next-generation sequencing (NGS) data from the genomes of brown and polar bears. It aims to improve and speed up the detection of common variants in cohorts of related genomes in order to establish the evolutionary trajectories of the corresponding species. The work will be performed by a graduate student under the supervision of Dr. Andrey Grigoriev, Professor at the Biology Department and the Center for Computational and Integrative Biology at Rutgers-Camden. Remote work is the most likely mode of operation in this project.

Genomes of all organisms and species undergo constant change, and mutations vary in scale. Structural variants (SVs) typically affect much larger genome intervals than single nucleotide variants (SNVs) or short insertions/deletions (indels). Currently, comparative genomics efforts mostly focus on SNVs and indels in protein-coding regions, while the role of SVs (especially outside those regions) generally remains a mystery. There is an unmet need for, and a growing interest in, understanding the effect of SVs in evolution using NGS.

Project details: Current SV-finding pipelines have low accuracy and take a long time to run on many samples. Moreover, none of these pipelines combines multiple-sample evidence for variant detection. This precludes fast detection of commonalities among related species and their integration with gene expression and other large-scale phenotypic datasets. We hypothesize that a parallel search across multiple samples, combining weaker evidence at similar locations in similar subspecies, will improve SV prediction accuracy. Once representative groups of such variants are established, we will also investigate the possibility of applying machine learning methods for these purposes.
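The idea of combining weaker evidence at similar locations can be illustrated with a minimal Python sketch. It is not GROM's algorithm: the `Candidate` and `JointCall` records, the per-sample `score`, the `max_gap` window and both thresholds are hypothetical names and values chosen only for illustration. Candidates from different samples are grouped when both breakpoints fall within `max_gap` of a cluster's first member, and a group is reported if either one sample is strong on its own or several samples together exceed a joint threshold.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Candidate:
    """One candidate SV from one sample (hypothetical record layout)."""
    sample: str
    chrom: str
    start: int
    end: int
    score: float  # hypothetical per-sample evidence score


@dataclass(frozen=True)
class JointCall:
    """An SV reported after pooling evidence across samples."""
    chrom: str
    start: int
    end: int
    samples: Tuple[str, ...]
    score: float


def combine_candidates(
    candidates: Iterable[Candidate],
    max_gap: int = 100,              # assumed breakpoint tolerance in bp
    single_threshold: float = 10.0,  # assumed cutoff for one sample alone
    joint_threshold: float = 10.0,   # assumed cutoff for pooled evidence
    min_samples: int = 2,
) -> List[JointCall]:
    # Group candidates whose start and end both lie near a cluster's first member.
    clusters: List[List[Candidate]] = []
    for cand in sorted(candidates, key=lambda c: (c.chrom, c.start, c.end)):
        for cluster in clusters:
            anchor = cluster[0]
            if (anchor.chrom == cand.chrom
                    and abs(cand.start - anchor.start) <= max_gap
                    and abs(cand.end - anchor.end) <= max_gap):
                cluster.append(cand)
                break
        else:
            clusters.append([cand])

    calls: List[JointCall] = []
    for cluster in clusters:
        # Count each sample once, using its best score, to avoid double counting.
        best: Dict[str, float] = {}
        for c in cluster:
            best[c.sample] = max(best.get(c.sample, 0.0), c.score)
        total = sum(best.values())
        strong_alone = max(best.values()) >= single_threshold
        supported_jointly = len(best) >= min_samples and total >= joint_threshold
        if strong_alone or supported_jointly:
            calls.append(JointCall(
                chrom=cluster[0].chrom,
                start=min(c.start for c in cluster),
                end=max(c.end for c in cluster),
                samples=tuple(sorted(best)),
                score=total,
            ))
    return calls
```

With these assumed defaults, three samples that each score below 10 at nearly the same interval would still yield a joint call, whereas an isolated weak candidate in a single sample would be dropped. Because each sample's candidates can be produced independently before this merging step, the per-sample search is the natural place to parallelize.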

We will build upon our recent success with GROM (Genome Rearrangement Omni-Mapper), a very fast and accurate novel algorithm we developed(1,2) to identify all types of genome variants (from large SVs down to single-base-pair changes) in a single run. It has been used to compare elephant and mammoth genomes(3), and we expect it to perform well in the proposed multi-species comparisons.

References

1. Smith, S.D., Kawash, J.K., Grigoriev, A. (2015) GROM-RD: resolving genomic biases to improve read depth detection of copy number variants. PeerJ 3: e836.
2. Smith, S., Kawash, J., Grigoriev, A. (2017) Lightning-fast genome variant detection with GROM. GigaScience 6(10), 1-7.
3. Smith, S., Kawash, J., Karaiskos, S., Biluck, I., Grigoriev, A. (2017) Evolutionary adaptation revealed by comparative genome analysis of woolly mammoths and elephants. DNA Research 24(4): 359-369.

Project Information Subsection

Grad or undergrad
Interested in computational genomics research
Experienced Linux or Unix user
Comfortable working in a remote Linux environment (HPC cluster)
Experience with Python programming
Experience with C programming
Practical applications
Rutgers University–Camden
303 Cooper St
Camden, New Jersey. 08102
CR-Rutgers
12/01/2020
No
Already behind
3
Start date is flexible
03/10/2021
09/08/2021
  • Milestone Title: Project launch & initial familiarization
    Milestone Description: Become familiar with the GROM pipeline, data and concepts; CAREERS Program “Project Launch” Presentation (completed MAR 10)
    Completion Date Goal: 2021-03-10
    Actual Completion Date: 2021-03-10
  • Milestone Title: Joint analysis
    Milestone Description: Develop and implement an algorithm for joint analysis of multiple samples as part of the GROM package.
    Completion Date Goal: 2021-04-14
    Actual Completion Date: 2021-04-14
  • Milestone Title: Implement parallelization
    Milestone Description: Develop and implement a parallelization method for the package.
    Completion Date Goal: 2021-06-16
  • Milestone Title: Project wrap-up and documentation
    Milestone Description: Provide documentation for the code and results, description of algorithmic/parallelization steps and methods; CAREERS Program “Project Close” Presentation (possibly JUL or AUG)
    Completion Date Goal: 2021-08-18
See above. At least one paper with biological conclusions is planned, and potentially a methodological publication as well.
Intricacies of genomics, the unexpected ways genomes change, puzzles of genome rearrangements, and how genome sequencing helps unravel these.
Effort and requirements involved in recruiting and training junior-level bioinformatics specialists in the genomics field.

Final Report

The student has improved a parallelization method for variant calling. This should help produce better results with our GROM software, which can be used by many biologists in the genomics field.
N/A
N/A
A student with a mathematics background became involved in a computational biology application and became interested in the field. We are continuing our collaboration.
N/A
N/A
N/A
Bear evolution and survival are often seen through the prism of climate change.
Regular communication was key to the progress of this project.
This work has led to an improvement in the parallelization method for variant calling from whole-genome data in our GROM software, which can be used by many biologists in the genomics field.