Abstract (M. Bernstein):
The NCBI’s Sequence Read Archive (SRA) is a large, public repository of raw next-generation sequencing data. This resource promises great biological insight if the data could be analyzed in aggregate; however, the data remain largely underutilized, in part because of the poor structure of the metadata associated with each biological sample. The rules governing submissions to the SRA do not dictate a standardized set of terms for describing the biological samples from which the sequencing data are derived. As a result, the metadata include many synonyms, spelling variants, and references to outside sources of information. Furthermore, manual annotation of the data remains intractable due to the large number of samples in the archive. For these reasons, it has been difficult to perform large-scale analyses that study the relationships between biomolecular processes and phenotype across the diverse diseases, tissues, and cell types present in the SRA.
In this talk, I will describe our recent efforts to standardize the metadata in the SRA. These efforts involve mapping samples to terms in biomedical ontologies, labeling each sample with a sample-type category, and extracting real-valued properties. To accomplish these tasks, we developed a novel computational pipeline.