KDD Cup 2001

Because of the rapid growth of interest in mining biological databases, KDD Cup 2001 was focused on data from genomics and drug design. Sufficient (yet concise) information was provided so that detailed domain knowledge was not a requirement for entry. A total of 136 groups participated to produce a total of 200 submitted predictions over the 3 tasks: 114 for Thrombin, 41 for Function, and 45 for Localization.

KDD Cup 2001 Winners

The KDD Cup summary presentation from KDD-2001 is available in PowerPoint, PostScript, or PDF format. Winners' presentations are available below.

KDD Cup 2001 Honorable Mention

Chairs

Tasks

KDD Cup 2001 involves 3 tasks, based on two data sets. The two training datasets are available from the links below, as zip files. The first dataset is a little over half a gigabyte when uncompressed and comes as a single text file, with one row per record and fields separated by commas. The second is a little over 7 megabytes uncompressed. It includes a single text file with all the data; again, the format is one row per record with comma-separated fields. But this data set is quite relational in nature, so improved accuracy may be possible by constructing more complex features or using a relational data mining technique (see the README file that comes with it). Nevertheless, we've tried to pre-compute some of the interesting relations as added fields, so that standard feature-vector algorithms can compete well. For both datasets, "names" files also are included that give the names of the fields; the names are "meaningful" only for the second dataset. For both datasets a README file is included that describes the nature of the task. The README files are repeated at the bottom of this page for those who wish to read about the data/task before choosing to download the data.

Training Data

Dataset 1: Prediction of Molecular Bioactivity for Drug Design -- Binding to Thrombin

Dataset 2: Prediction of Gene/Protein Function and Localization

Test Data

Each test set comes in a zip file with the test data and a README file. The README file describes the format and manner in which predictions for that test set should be submitted. Each person may submit only one prediction file per task (for a total of at most 3 submissions per person).

Test Data for Dataset 1: Binding to Thrombin

Test Data for Dataset 2 (both tasks): Function and Localization

Answers

The following are the keys that were used for scoring. Several points are worth noting regarding the Function and Localization keys. First, submissions varied widely in the use of punctuation, case, and spelling for function and localization names. Because of this variation, we decided to have our code remove punctuation and look at only a prefix of each name long enough to distinguish it from all others -- the name was then converted into a shorter standard form. These shorter forms are the ones given in the keys below. We also hand-checked the entries and the converted forms. Second, one gene in the test set had two localizations (contradicting our earlier statement that each gene had only one localization). For this gene, the predicted localization was counted correct if it matched *either* of the correct localizations. Third, one function appeared in a test set gene but in no training set gene. This of course made it impossible to get 100% accuracy, but everyone was subject to this same constraint, and we think it just goes with the territory of a real-world task :-)
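
The name handling described above (strip punctuation, ignore case, match on a distinguishing prefix) can be illustrated with a short Python sketch. This is not the organizers' scoring code, which is not reproduced on this page; the function names and the example localization names are hypothetical, and the sketch assumes the shorter standard form is simply the shortest prefix that no other name shares.

```python
import string


def normalize(name):
    """Lowercase a name and drop punctuation and whitespace."""
    drop = set(string.punctuation) | set(string.whitespace)
    return "".join(ch for ch in name.lower() if ch not in drop)


def shortest_unique_prefixes(canonical_names):
    """Map each normalized name to its shortest prefix shared by no other name.

    Assumption: this prefix plays the role of the "shorter standard form".
    A name that is itself a prefix of another name keeps its full length.
    """
    normalized = [normalize(n) for n in canonical_names]
    prefixes = {}
    for name in normalized:
        others = [o for o in normalized if o != name]
        prefix = name  # fallback when no shorter prefix is unique
        for length in range(1, len(name) + 1):
            if not any(o.startswith(name[:length]) for o in others):
                prefix = name[:length]
                break
        prefixes[name] = prefix
    return prefixes


def to_standard_form(submitted, prefixes):
    """Return the standard form matching a submitted name, or None."""
    cleaned = normalize(submitted)
    matches = [p for p in prefixes.values() if cleaned.startswith(p)]
    # Prefer the longest matching prefix if several apply.
    return max(matches, key=len) if matches else None


# Hypothetical example names, chosen only for illustration.
names = ["nucleus", "cytoplasm", "cytoskeleton", "mitochondria"]
prefixes = shortest_unique_prefixes(names)
standard = to_standard_form("Cyto-Plasm.", prefixes)
```

With these example names, a submission spelled "Cyto-Plasm." and one spelled "cytoplasm" map to the same standard form, while a name that matches no prefix maps to None and would need the kind of hand-checking mentioned above.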

Key for Task 1: Binding to Thrombin

Key for Task 2: Function

Key for Task 3: Localization

Schedule

Answers to Questions of General Interest from Question Period 1

Answers to Questions of General Interest from Question Period 2




Further Details

Description of Dataset 1: Prediction of Molecular Bioactivity for Drug Design -- Binding to Thrombin

Drugs are typically small organic molecules that achieve their desired activity by binding to a target site on a receptor. The first step in the discovery of a new drug is usually to identify and isolate the receptor to which it should bind, followed by testing many small molecules for their ability to bind to the target site. This leaves researchers with the task of determining what separates the active (binding) compounds from the inactive (non-binding) ones. Such a determination can then be used in the design of new compounds that not only bind, but also have all the other properties required for a drug (solubility, oral absorption, lack of side effects, appropriate duration of action, low toxicity, etc.).

The present training data set consists of 1909 compounds tested for their ability to bind to a target site on thrombin, a key receptor in blood clotting. The chemical structures of these compounds are not necessary for our analysis and are not included. Of these compounds, 42 are active (bind well) and the others are inactive. Each compound is described by a single feature vector consisting of a class value (A for active, I for inactive) and 139,351 binary features, which describe three-dimensional properties of the molecule. The definitions of the individual bits are not included - we don't know what each individual bit means, only that they are generated in an internally consistent manner for all 1909 compounds. Biological activity in general, and receptor binding affinity in particular, correlate with various structural and physical properties of small organic molecules. The task is to determine which of these properties are critical in this case and to learn to accurately predict the class value. To simulate the real-world drug design environment, the test set contains 636 additional compounds that were in fact generated based on the assay results recorded for the training set. In evaluating the accuracy, a differential cost model will be used, so that the sum of the costs of the actives will be equal to the sum of the costs of the inactives. In other words, it is just as important to minimize your error rate on the actives as it is to minimize your error rate on the inactives, even though the training set contains more inactives than actives (and the test set might also).
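
One way to read the differential cost model is as the average of the per-class accuracies, so that the actives as a group and the inactives as a group carry equal total weight. The Python sketch below illustrates that reading only; the exact formula used for scoring is not given in this description, and the function name is hypothetical.

```python
def weighted_accuracy(true_labels, predicted_labels, classes=("A", "I")):
    """Average of per-class accuracies, so each class has equal total weight.

    Assumption: this is one interpretation of "the sum of the costs of the
    actives equals the sum of the costs of the inactives".
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    per_class = []
    for cls in classes:
        indices = [i for i, t in enumerate(true_labels) if t == cls]
        if not indices:
            continue  # a class absent from the key contributes nothing
        correct = sum(1 for i in indices if predicted_labels[i] == cls)
        per_class.append(correct / len(indices))
    if not per_class:
        raise ValueError("no examples of the requested classes")
    return sum(per_class) / len(per_class)


# Toy example: 2 actives and 8 inactives, everything predicted inactive.
truth = ["A", "A"] + ["I"] * 8
all_inactive = ["I"] * 10
score = weighted_accuracy(truth, all_inactive)
```

Under this reading, predicting every compound inactive scores 0.5 in the toy example even though 8 of 10 predictions are right. The same holds for the training set: labeling all 1909 compounds inactive is correct for 1867 of them (about 97.8%), yet misses every one of the 42 actives.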

We thank DuPont Pharmaceuticals for graciously providing this data set for the KDD Cup 2001 competition. All publications referring to analysis of this data set should acknowledge DuPont Pharmaceuticals Research Laboratories and KDD Cup 2001.

Description of Dataset 2: Prediction of Gene/Protein Function and Localization

The genomes of several organisms have now been completely sequenced, including the human genome -- depending on one's definition of "completely" :-). Interest within bioinformatics is therefore shifting somewhat away from sequencing, to learning about the genes encoded in the sequence. Genes code for proteins, and these proteins tend to localize in various parts of cells and interact with one another, in order to perform crucial functions. The present data set consists of a variety of details about the various genes of one particular type of organism. Gene names have been anonymized and a subset of the genes have been withheld for testing. The two tasks are to predict the functions and localizations of the proteins encoded by the genes. A gene/protein can have more than one function, but (at least in this data set) only one localization. The other information from which function and localization can be predicted includes the class of the gene/protein, the phenotype (observable characteristics) of individuals with a mutation in the gene (and hence in the protein), and the other proteins with which each protein is known to interact.

The full data set is in Full_File.data. But please notice that the task is quite "relational." For example, one might wish to learn a rule that says a gene G has function F if G interacts with another gene G1 that has function F. We have made an effort to build such features into Full_File.data. (For example, for each gene we give the number of interacting genes with a given function -- these features are probably useful for predicting at least one or two of the functions). But participants may wish to construct their own additional features or to use a relational data mining algorithm. While this certainly can be done from Full_File.data, it may be easier to do this from the relational tables that we used to build Full_File.data. These are in Genes_relation.data and Interactions_relation.data. Each of the data files has a corresponding names file as well.
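
The kind of relational feature mentioned above (for each gene, the number of interacting genes with a given function) can be sketched in Python as follows. The in-memory tables, gene identifiers, and function names here are hypothetical toy data, not the actual layout of Genes_relation.data or Interactions_relation.data, and the sketch assumes that interactions are symmetric.

```python
from collections import Counter, defaultdict


def neighbor_function_counts(interactions, gene_functions):
    """For each gene, count its interaction partners having each function.

    interactions: iterable of (gene, gene) pairs, treated as undirected.
    gene_functions: dict mapping a gene to the set of its known functions;
    genes with withheld labels are simply missing from the dict.
    """
    neighbors = defaultdict(set)
    for g1, g2 in interactions:
        if g1 == g2:
            continue  # skip self-interactions so a gene never counts itself
        neighbors[g1].add(g2)
        neighbors[g2].add(g1)

    counts = {}
    for gene, partners in neighbors.items():
        counter = Counter()
        for partner in partners:
            counter.update(gene_functions.get(partner, ()))
        counts[gene] = counter
    return counts


# Hypothetical toy tables; G1 stands in for a gene whose function is withheld.
interactions = [("G1", "G2"), ("G1", "G3"), ("G2", "G3"), ("G4", "G1")]
gene_functions = {
    "G2": {"transport"},
    "G3": {"transport", "metabolism"},
    "G4": {"metabolism"},
}
features = neighbor_function_counts(interactions, gene_functions)
```

Each resulting counter can be flattened into one numeric column per function, which is how a relational fact such as "G interacts with a gene that has function F" becomes usable by a standard feature-vector learner.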

Detailed knowledge of the biology should not be necessary for this application. Indeed, we nearly anonymized all the other fields as well as the gene field, but in the end we decided to leave the other fields alone, since this might make the data set more interesting. One word of caution: your predictor for function should not use localization, and your predictor for localization should not use function, since *both* fields will be withheld from the test genes when they are provided. Also note that, because a gene may have more than one function, we will test for correct prediction of every (gene, function) pair. By the time we provide the test data, we will provide full specification of the format for submission of your predictions.
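
Testing every (gene, function) pair can be pictured as a yes/no decision for each combination of a test gene and a possible function. The Python sketch below shows one plausible way to compute an accuracy over such pairs; it is an assumption for illustration, not the official scoring procedure, and the function name and toy labels are hypothetical.

```python
def pair_accuracy(true_functions, predicted_functions, all_functions):
    """Fraction of (gene, function) pairs whose yes/no label is predicted correctly.

    true_functions, predicted_functions: dicts mapping gene -> set of functions.
    all_functions: every function that can be assigned to a gene.
    A gene missing from the predictions is treated as having no predicted function.
    """
    correct = 0
    total = 0
    for gene, truth in true_functions.items():
        predicted = predicted_functions.get(gene, set())
        for function in all_functions:
            total += 1
            if (function in truth) == (function in predicted):
                correct += 1
    if total == 0:
        raise ValueError("no (gene, function) pairs to score")
    return correct / total


# Hypothetical toy labels.
all_functions = ["f1", "f2", "f3"]
truth = {"G1": {"f1", "f2"}, "G2": {"f3"}}
predictions = {"G1": {"f1"}, "G2": {"f2", "f3"}}
accuracy = pair_accuracy(truth, predictions, all_functions)
```

In the toy example there are six pairs and two of them are wrong: a missed f2 for G1 and a spurious f2 for G2. Counting pairs this way penalizes both omitted and extra functions.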


Last Changed: September 19, 2001 by dpage@cs.wisc.edu