Office of Research Information Services

Office of the Chief Information Officer


Data Science at OCIO (under development)

The Smithsonian Institution has a long history of building physical collections. More recently, researchers and staff have also been generating large digital collections, such as images from mass digitization efforts, DNA sequences from genomics, and data from ecological sensors. To support the large volume of data these efforts generate, the Office of Research Information Services (ORIS) has established a new effort in data science.

The Data Science Team:

Rebecca Dikow, Research Data Scientist

Rebecca received her PhD in Evolutionary Biology from the University of Chicago and was previously the Biodiversity Genomics postdoctoral fellow at the Smithsonian.

Paul Frandsen, Research Data Scientist

Paul received his PhD in Entomology from Rutgers University. He is interested in genomics, phylogenetics, and the development of bioinformatics tools for genome analysis.

 

The Data Science Team has three major priorities: research, training, and infrastructure. Examples from each category follow.

Research

We collaborate with researchers across the Smithsonian on genomics, deep learning, and other big data projects. Current collaborations include generating and analyzing genomes:

-Robber fly, mydas fly, horse fly, and flesh fly (Diptera): with Torsten Dikow (NMNH), Mauren Turcatel, and Eliana Buenaventura

-Ironweed (Compositae): with Vicki Funk (NMNH), Vanessa Gonzalez (GGI), and Jennifer Mandel (University of Memphis)

  • Poster presented at BioGenomics 2017 meeting

-Caddisfly (Trichoptera): with Vanessa Gonzalez (GGI)

-Red Siskin: with Mike Braun (NMNH), Brian Coyle, and HC Lim

-Raccoon and Kinkajou: with Jesus Maldonado (NZP) and Mirian Tsuchiya

  • Poster presented at BioGenomics 2017 meeting

Deep Learning:

We are collaborating with the SI Digitization Program Office (DPO) and the NMNH Department of Botany to develop deep learning models for identifying specimens from digitized herbarium sheets. DPO has digitized more than 1,000,000 herbarium sheets.

Phylogenomics:

-Whole-genome phylogeny across the tree of life 

  • Poster presented at Cold Spring Harbor Biological Data Science 2016 meeting

-1000 Insect Transcriptomes Project (1KITE.org)

-Compositae gene capture: with Vicki Funk (NMNH) and Jennifer Mandel

Evolutionary Genomics:

-Heliconius (Lepidoptera) whole-genome phylogeny with Nate Edelman (Harvard) and Owen McMillan (STRI)

Ecological Genomics:

-Orchid mycorrhizal fungal genomics and transcriptomics: with Melissa McCormick (SERC)

-Environmental parasite gene capture: with Katrina Lohan (SERC)

Genome Annotation:

We are partnering with Intel Corporation and Amazon Web Services to find innovative solutions to computational challenges in biodiversity genomics by testing state-of-the-art hardware, designing new genome annotation software, and implementing complex pipelines, all focused on the biodiverse data that are unique to the Smithsonian.

HPC infrastructure

OCIO maintains a high-performance computing cluster with ~3,300 CPUs, 18 TB of RAM, and ~250 TB of data storage. It is used by scientists, graduate and postdoctoral fellows, undergraduate students, and research associates at SAO, NMNH, STRI, SERC, and SCBI. More than 300 users have HPC accounts. OCIO’s Research Computing team and SAO staff maintain the cluster.

Galaxy

We are implementing a Smithsonian Galaxy Instance and developing Galaxy tools for biodiversity genomics.

Training

We have offered more than 10 training sessions over the past 18 months, reaching more than 200 members of the SI community. Our training materials are available at www.github.com/SmithsonianWorkshops