Biological Data Mining (Chapman & Hall/CRC Data Mining and Knowledge Discovery Series)
September 2009, ISBN: 978-1-4200-8684-3

Synopsis
--------
This 733-page book examines the concepts, problems, progress, and trends in developing and applying data mining techniques in genome biology, a rapidly growing field of study. By studying the concepts and case studies presented in the book, readers can gain significant insight and develop practical solutions in future biological data mining projects.

Editors
-------
Prof. Jake Y. Chen
Indiana University School of Informatics
Purdue University School of Science
Department of Computer and Information Science
Indiana Center for Systems Biology and Personalized Medicine
Indianapolis, IN 46202 USA
Email: jakechen@iupui.edu
Web site: http://bio.informatics.iupui.edu/

Prof. Stefano Lonardi
Department of Computer Science and Engineering
Institute for Integrative Genome Biology
Center for Plant Cell Biology
University of California
Riverside, CA 92521 USA
Email: stelo@cs.ucr.edu
Web site: http://www.cs.ucr.edu/~stelo/

Chapters
--------
1. "Consensus Structure Prediction for RNA Alignments" by Junilda Spirollari and Jason T.L. Wang
2. "Invariant Geometric Properties of Secondary Structure Elements in Proteins" by Matteo Comin, Concettina Guerra, and Giuseppe Zanotti
3. "Discovering 3D Motifs in RNA" by Alberto Apostolico, Giovanni Ciriello, Christine E. Heitsch, and Concettina Guerra
4. "Protein Structure Classification Using Machine Learning Methods" by Yazhene Krishnaraj and Chandan Reddy
5. "Protein Surface Representation and Comparison: New Approaches in Structural Proteomics" by Lee Sael and Daisuke Kihara
6. "Advanced Graph Mining Methods for Protein Analysis" by Yi-Ping Phoebe Chen, Jia Rong, and Gang Li
7. "Predicting Local Structure and Function of Proteins" by Huzefa Rangwala and George Karypis
8. "Computational Approaches for Genome Assembly Validation" by Jeong-Hyeon Choi, Haixu Tang, Sun Kim, and Mihai Pop
9. "Mining Patterns of Epistasis in Human Genetics" by Jason H. Moore
10. "Discovery of Regulatory Mechanisms from Gene Expression Variation by eQTL Analysis" by Yang Huang, Jie Zheng, and Teresa M. Przytycka
11. "Statistical Approaches to Gene Expression Microarray Data Preprocessing" by Megan Kong, Elizabeth McClellan, Richard H. Scheuermann, and Monnie McGee
12. "Application of Feature Selection and Classification to Computational Molecular Biology" by Paola Bertolazzi, Giovanni Felici, and Giuseppe Lancia
13. "Statistical Indices for Computational and Data-Driven Class Discovery in Microarray Data" by Raffaele Giancarlo, Davide Scaturro, and Filippo Utro
14. "Computational Approaches to Peptide Retention Time Prediction for Proteomics" by Xiang Zhang, Cheolhwan Oh, Catherine P. Riley, Hyeyoung Cho, and Charles Buck
15. "Inferring Protein Functional Linkage Based on Sequence Information and Beyond" by Li Liao
16. "Computational Methods for Unraveling Transcriptional Regulatory Networks in Prokaryotes" by Dongsheng Che and Guojun Li
17. "Computational Methods for Analyzing and Modeling Biological Networks" by Nataša Pržulj and Tijana Milenković
18. "Statistical Analysis of Biomolecular Networks" by Jing-Dong J. Han and Chris J. Needham
19. "Beyond Information Retrieval: Literature Mining for Biomedical Knowledge Discovery" by Javed Mostafa, Kazuhiro Seki, and Weimao Ke
20. "Mining Biological Interactions from Biomedical Texts for Efficient Query Answering" by Muhammad Abulaish, Lipika Dey, and Jahiruddin
21. "Ontology-Based Knowledge Representation of Experiment Metadata in Biological Data Mining" by Richard H. Scheuermann, Megan Kong, Carl Dahlke, Jennifer Cai, Jamie Lee, Yu Qian, Burke Squires, Patrick Dunn, Jeff Wiser, Herb Hagler, Barry Smith, and David Karp
22. "Redescription Mining and Applications in Bioinformatics" by Naren Ramakrishnan and Mohammed J. Zaki
23. "Data Mining Tools and Techniques for Identification of Biomarkers for Cancer" by Mick Correll, Simon Beaulah, Robin Munro, Jonathan Sheldon, Yike Guo, and Hai Hu
24. "Cancer Biomarker Prioritization: Assessing the in vivo Impact of in vitro Models by in silico Mining of Microarray Database, Literature, and Gene Annotation" by Chia-Ju Lee, Zan Huang, Hongmei Jiang, John Crispino, and Simon Lin
25. "Biomarker Discovery by Mining Glycomic and Lipidomic Data" by Haixu Tang, Mehmet Dalkilic, and Yehia Mechref
26. "Data Mining Chemical Structures and Biological Data" by Glenn J. Myatt and Paul E. Blower

Preface of the Book
-------------------
Modern biology has become an information science. Since the invention of the DNA sequencing method by Sanger in the late seventies, public repositories of genomic sequences have been growing exponentially, doubling in size every sixteen months - a rate often compared to the growth of semiconductor transistor densities in CPUs known as Moore's Law. In the nineties, the public-private race to sequence the human genome further intensified the fervor to generate high-throughput biomolecular data from highly parallel and miniaturized instruments. Today, sequencing data from thousands of genomes, including plant, mammalian, and microbial genomes, are accumulating at an unprecedented rate. The advent of second-generation DNA sequencing instruments, high-density cDNA microarrays, tandem mass spectrometers, and high-power NMRs has fueled the growth of molecular biology into a wide spectrum of disciplines such as personalized genomics, functional genomics, proteomics, metabolomics, and structural genomics. Few experiments in molecular biology and genetics performed today can afford to ignore the vast amount of publicly accessible biological information. Suddenly, molecular biology and genetics have become data rich.
Biological data mining is a data-guzzling turbo engine for post-genomic biology, driving the competitive race towards unprecedented biological discovery opportunities in the 21st century.

Classical bioinformatics emerged from the study of macromolecules in molecular biology, biochemistry, and biophysics. Analysis, comparison, and classification of DNA and protein sequences were the dominant themes of bioinformatics in the early nineties. Machine learning mainly focused on predicting gene and protein functions from their sequences and structures. An understanding of the cellular functions and processes underlying complex diseases was out of reach. Bioinformatics scientists were a rare breed, and their contribution to molecular biology and genetics was considered marginal, because the computational tools then available for biomolecular data analysis were far more primitive than the array of experimental techniques and assays available to life scientists.

Today, we are witnessing the reversal of these past trends. Diverse sets of data types that cover a broad spectrum of genotypes and phenotypes, particularly those related to human health and diseases, have become available. Many interdisciplinary researchers, including applied computer scientists, applied mathematicians, biostatisticians, biomedical researchers, clinical scientists, and biopharmaceutical professionals, have discovered in biology a gold mine of knowledge leading to many exciting possibilities: unraveling the tree of life, harnessing the power of microbial organisms for renewable energy, finding new ways to diagnose disease early, and developing new therapeutic compounds that save lives. Much of the experimental high-throughput biology data is generated and analyzed 'in haste', leaving plenty of opportunities for knowledge discovery even after the original data is released.
Most of the bets on the race to separate the wheat from the chaff have been placed on biological data mining techniques. After all, when easy, straightforward, first-pass data analysis hasn't yielded novel biological insights, data mining techniques must be able to help - or so many presumed.

In reality, biological data mining is still very much an 'art', successfully practiced by a few bioinformatics research groups that occupy themselves with solving real-world biological problems. Unlike data mining in business, where the major concerns are often related to the bottom line (profit), the goals of biological data mining can be as diverse as the spectrum of biological questions that exist. In the business domain, association rules discovered between sales items are immediately actionable; in biology, any unorthodox hypothesis produced by computational models is first red-flagged and is lucky to be validated experimentally. In the internet business domain, classification, clustering, and visualization of blogs, network traffic patterns, and news feeds add significant value for regular internet users who are unaware of high-level patterns that may exist in the data set; in molecular biology and genetics, any clustering or classification of the data presented to biologists may promptly elicit questions like 'great, but how and why did it happen?' or 'how can you explain these results in the context of the biology I know?' The majority of general-purpose data mining techniques do not take into consideration prior domain knowledge of the biological problem, which often leads them to underperform hypothesis-driven biological investigative techniques.
The high level of variability of measurements inherent in many types of biological experiments or samples, the general unavailability of experimental replicates, the large number of hidden variables in the data, and the high correlation of biomolecular expression measurements also constitute significant challenges in the application of classical data mining methods in biology. Many biological data mining projects are attempted and then abandoned, even by experienced data mining scientists. In extreme cases, large-scale biological data mining efforts are jokingly labeled as fishing expeditions and dismissed in national grant proposal review panels.

This book represents a culmination of our past research efforts in biological data mining. Through this book, we wanted to showcase a small but noteworthy sample of successful projects involving data mining and molecular biology. Each chapter of the book is authored by a distinguished team of bioinformatics scientists whom we invited to offer the readers the widest possible range of application domains. To ensure high quality standards, each contributed chapter went through standard peer review and a round of revisions.

Contributed chapters have been grouped into five major sections. The first section, entitled Sequence, Structure, and Function, collects contributions on data mining techniques designed to analyze biological sequences and structures with the objective of discovering novel functional knowledge. The second section, on Genomics, Transcriptomics, and Proteomics, contains studies addressing emerging large-scale data mining challenges in analyzing high-throughput 'omics' data. The chapters in the third section, entitled Functional and Molecular Interaction Networks, address emerging system-scale molecular properties and their relevance to cellular functions.
The fourth section is about Literature, Ontology, and Knowledge Integrations, and it collects chapters related to knowledge representation, information retrieval, and data integration for structured and unstructured biological data. The contributed works in the fifth and last section, entitled Genome Medicine Applications, address emerging biological data mining applications in medicine.

We believe this book can serve as a valuable guide to the field for graduate students, researchers, and practitioners. We hope that the wide range of topics covered will allow readers to appreciate the extent of the impact of data mining in molecular biology and genetics. For us, research in data mining and its applications to biology and genetics is fascinating and rewarding. It may even help save human lives one day. This field offers great opportunities and rewards if one is prepared to learn molecular biology and genetics, design user-friendly software tools under the proper biological assumptions, and validate all discovered hypotheses rigorously using appropriate models.

Jake Y. Chen and Stefano Lonardi
Indianapolis, IN and Riverside, CA, USA
February 2009