Talks and presentations

See a map of all the places I've given a talk!

Building communities with Cloud, Containers and CyVerse

June 18, 2018

Talk, Biofrontiers Institute, Colorado State University, Colorado

In this keynote talk, I presented how CyVerse is helping to build communities using academic clouds such as Atmosphere and Jetstream, and containers in the Discovery Environment.

WQ-MAKER: A Flexible and Scalable Genome Annotation Pipeline on Jetstream Cloud

January 15, 2017

Talk, Plant and Animal Genome (PAG) conference, San Diego, California

The National Science Foundation (NSF) funded Jetstream is a self-provisioned, scalable science and engineering cloud environment that allows researchers to analyze their data on customized virtual machines (VMs). Jetstream is freely available to US-based researchers. MAKER is a flexible and scalable genome annotation pipeline used for de novo annotation of newly sequenced genomes, for updating existing genome annotations, or simply for combining annotations, evidence, and quality control statistics. Installing and using MAKER on multiuser HPC systems comes with challenges associated with software version dependencies. Cloud-based systems offer more flexibility in configuration for large-scale MAKER annotations, but they have limitations: there is no shared file system, and work must be balanced between multiple instances. WQ-MAKER, a customized version of MAKER built on the Work Queue distributed computing framework, is designed to run on multiple VMs in the cloud, making it feasible to readily scale annotation tasks without requiring a shared file system. The WQ-MAKER framework also leverages the MPI capability of MAKER, making full use of the available cores on each cloud instance. We have created a Jetstream image of WQ-MAKER that is freely available to community members for annotating their genomes. WQ-MAKER efficiently runs MAKER simultaneously on multiple Jetstream instances, greatly reducing annotation run time.
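The idea of spreading annotation tasks over several workers can be sketched in a few lines of Python. This is only an illustration of the general pattern, not WQ-MAKER's code or the Work Queue API: a local thread pool stands in for remote cloud instances, and the names `annotate_contig` and `distribute`, as well as the toy GC-content "annotation", are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def annotate_contig(contig_id, sequence):
    """Hypothetical stand-in for one annotation task (here: GC fraction)."""
    if not sequence:
        return contig_id, 0.0
    gc = sum(1 for base in sequence.upper() if base in "GC")
    return contig_id, round(gc / len(sequence), 2)


def distribute(contigs, n_workers=4):
    """Submit one task per contig to a pool of workers and gather the results."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [
            pool.submit(annotate_contig, cid, seq) for cid, seq in contigs.items()
        ]
        # Each worker needs only its own contig, so no shared state is required.
        return dict(f.result() for f in futures)


contigs = {"contig1": "ATGCGC", "contig2": "ATATAT", "contig3": "GGCC"}
results = distribute(contigs, n_workers=2)
```

Because every task carries its own input and returns its own output, the same structure would apply when the workers are separate machines without a shared file system.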

Bringing your bioinformatics tools to CyVerse's Discovery Environment using Docker

October 20, 2016

Talks, Houston, Texas, USA

CyVerse (formerly iPlant Collaborative) is a life sciences cyberinfrastructure funded by the National Science Foundation (NSF). The infrastructure’s purpose is to scale science, domain expertise, and knowledge by providing a variety of computational tools, services, and platforms for storing, sharing, and analyzing large and diverse biological datasets. The Discovery Environment (DE) in CyVerse provides a modern web interface for running powerful computing, data, and analysis applications. By providing a consistent user interface for accessing the tools and computing resources needed for specialized scientific analyses, the DE facilitates data exploration and scientific discovery. The DE merges “science gateway” functionality and the bioinformatics “work bench” with high-performance data management to allow seamless access to reusable computational workflows that can run at very large scales. It is common in bioinformatics to build new analysis methods from multiple programs, libraries, and modules. However, each analysis that uses these tools requires specific versions of the operating system and underlying software. Docker is a container virtualization technology that wraps software of interest (e.g., a bioinformatics tool) together with all its software dependencies so that it can run reproducibly regardless of the environment. CyVerse has adopted Docker for integrating software apps that run in the DE’s compute cluster. The user creates a Dockerfile, which is sent to CyVerse and used to build the Docker image containing the tool. After the image has been deployed on the DE’s compute cluster, the user can build a web app in the DE that enables other researchers to easily use the tool.

A Hybrid Approach to Assemble and Annotate the Brassica rapa Transcriptome in the Cloud through the iPlant Collaborative and XSEDE

January 10, 2015

Talks, Plant and Animal Genome (PAG) conference, San Diego, California

Currently, there are two approaches to producing a transcriptome assembly: de novo and reference-based. Each of these methods has been successfully employed to assemble transcripts from reads generated using RNA-Seq technologies. Both methods have advantages and disadvantages. De novo methods can define novel transcripts, as well as non-collinear and trans-spliced transcripts that result from chromosomal rearrangements. However, they perform poorly on low-expressed genes, can produce chimeras and misassemblies, and are computationally intensive. In contrast, reference-based methods are computationally less demanding, tolerate sequencing errors, and detect repeats through alignment. However, reference-based methods depend on a reference genome and assume that transcripts are collinear with the genome, and mismatched genome alignments or genome assembly errors lead to errors in transcriptome prediction. In this study, we report a hybrid approach that combines the transcripts generated by de novo and reference-based strategies into a single transcriptome assembly and subsequently annotates them. In addition to generating a transcriptome assembly, RNA-Seq was also used to improve the existing genome annotation of B. rapa using PASA software. Both transcriptome assembly and genome annotation are often rate-limiting steps requiring complex workflows, specialized software, and access to high-performance computing (HPC) facilities. We show how scalable cloud-computing infrastructures such as iPlant and XSEDE (distributed computing) can enable high-performance bioinformatics analyses of very large next-generation transcriptome sequence data. Specifically, we use iPlant for: (i) uploading, storing (iRODS), and controlled sharing of data and results, (ii) testing and development of bioinformatics pipelines, and (iii) access to high-performance computing resources such as XSEDE.
In the future, we plan to deploy the hybrid transcriptome assembly and annotation pipeline as a virtual machine (VM) in iPlant’s Atmosphere Cloud Service and link it to XSEDE for added processing.
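The core idea of the hybrid approach, pooling transcripts from both strategies into one set, can be illustrated with a small Python sketch. This is a deliberately simplified assumption, not the pipeline used in the study: it only collapses exactly identical sequences, whereas a real workflow would rely on dedicated clustering and annotation tools. The function name `merge_transcripts` and the toy sequences are hypothetical.

```python
def merge_transcripts(de_novo, reference_based):
    """Union two {id: sequence} sets, collapsing identical sequences.

    Returns {sequence: sorted list of "source:id" labels}, so each unique
    transcript records which strategy (or both) produced it.
    """
    merged = {}
    sources = (("de_novo", de_novo), ("reference", reference_based))
    for source, transcripts in sources:
        for tid, seq in transcripts.items():
            # Compare case-insensitively so "atg" and "ATG" count as the same.
            merged.setdefault(seq.upper(), []).append(f"{source}:{tid}")
    return {seq: sorted(labels) for seq, labels in merged.items()}


de_novo = {"dn1": "ATGGCC", "dn2": "ATGTTT"}
reference_based = {"ref1": "atggcc", "ref2": "ATGAAA"}
merged = merge_transcripts(de_novo, reference_based)
```

Keeping the source labels makes it easy to see which transcripts were found by both strategies and which were recovered by only one of them.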

Development and Application of Genomic Resources in Brassica rapa at Brassicas workshop

January 10, 2015

Talk, Plant and Animal Genome (PAG) conference, San Diego, California

Brassica rapa is an economically important vegetable and oilseed crop, and serves as an excellent model for evolutionary research. Even though the whole genome sequence of B. rapa is available, very few genome-based resources currently exist. The advent of high-throughput next-generation sequencing technologies that allow whole transcriptome sequencing (RNA-Seq), along with the development of novel computational approaches, provides an opportunity to address this problem efficiently. Here, we report the deep sequencing of the B. rapa transcriptome in order to provide a more comprehensive set of genomic resources for functional studies. As a proof of concept, we used the developed genomic resources for a variety of applications, including genome annotation, polymorphism detection, gene-based genetic marker detection, genotyping of a mapping population, genetic map construction, and QTL and eQTL mapping. We hope that the large-scale RNA sequencing effort described here, along with the development and application of the resulting resources, will significantly help researchers in the mapping and functional analysis of quantitative traits in Brassica rapa.