| .names file created by George John, October 1994 | Processing: | Joel Tesler converted this back to ACTG. | * A,C,T,G -> 001,010,100,000 Seems biased against systems that can handle | categorical attributes | | | |1. TITLE: | DNA Dataset (STATLOG version) - Primate splice-junction gene sequences (DNA) | with associated imperfect domain theory | | PROBLEM DESCRIPTION | Splice junctions are points on a DNA sequence at which `superfluous' DNA is | removed during the process of protein creation in higher organisms. The | problem posed in this dataset is to recognize, given a sequence of DNA, the | boundaries between exons (the parts of the DNA sequence retained after | splicing) and introns (the parts of the DNA sequence that are spliced | out). | | PURPOSE | This problem consists of two subtasks: recognizing exon/intron | boundaries (referred to as EI sites), and recognizing intron/exon boundaries | (IE sites). (In the biological community, IE borders are referred to | a ``acceptors'' while EI borders are referred to as ``donors''.) | |2. USE IN STATLOG | | 2.1- Testing Mode | Train & Test | | 2.2- Special Preprocessing | Yes | | 2.3- Test Results | | Algorithm Sucess Rate | ========= =========== | Radial 95.90 | Dipol92 95.200 | Alloc80 94.300 | QuaDisc 94.100 | Discrim 94.100 | LogDisc 93.900 | Bayes 93.200 | Castle 92.800 | IndCart 92.700 | C4.5 92.400 | Cart 91.500 | BackProp 91.200 | BayTree 90.500 | Cn2 90.500 | Ac2 90.000 | NewId 90.000 | Cal5 86.900 | Itrule 86.500 | Smart 85.900 | KNN 84.500 | Kohonen 66.10 | Default 52.000 | LVQ 0.000 | Cascade 0.000 | |3. SOURCES and PAST USAGE | 3.1 SOURCES | (a) Creators: | - all examples taken from Genbank 64.1 (ftp site: genbank.bio.net) | - categories "ei" and "ie" include every "split-gene" | for primates in Genbank 64.1 | - non-splice examples taken from sequences known not to include | a splicing site | (b) Donor: G. Towell, M. Noordewier, and J. Shavlik, | {towell,shavlik}@cs.wisc.edu, noordewi@cs.rutgers.edu | (c) Date received: 1/1/92 | | The StaLog dna dataset is a processed vesrion of the Irvine | database described below. The main difference is that the | symbolic variables representing the nucleotides (only A,G,T,C) | were replaced by 3 binary indicator variables. Thus the original | 60 symbolic attributes were changed into 180 binary attributes. | The names of the examples were removed. The examples with | ambiguities were removed (there was very few of them, 4). | The StatLog version of this dataset was produced by Ross King | at Strathclyde University. For original details see the Irvine | database documantation. | | The nucleotides A,C,G,T were given indicator values as follows | | A -> 1 0 0 | C -> 0 1 0 | G -> 0 0 1 | T -> 0 0 0 | | The class values are | ei -> 1 | ie -> 2 | n -> 3 | 3.2 PAST USAGE | | (a) machine learning: | -- M. O. Noordewier and G. G. Towell and J. W. Shavlik, 1991; | "Training Knowledge-Based Neural Networks to Recognize Genes in | DNA Sequences". Advances in Neural Information Processing Systems, | volume 3, Morgan Kaufmann. | | -- G. G. Towell and J. W. Shavlik and M. W. Craven, 1991; | "Constructive Induction in Knowledge-Based Neural Networks", | In Proceedings of the Eighth International Machine Learning | Workshop, Morgan Kaufmann. | | -- G. G. Towell, 1991; | "Symbolic Knowledge and Neural Networks: Insertion, Refinement, and | Extraction", PhD Thesis, University of Wisconsin - Madison. | | -- G. G. Towell and J. W. Shavlik, 1992; | "Interpretation of Artificial Neural Networks: Mapping | Knowledge-based Neural Networks into Rules", In Advances in Neural | Information Processing Systems, volume 4, Morgan Kaufmann. | | (b) attributes predicted: given a position in the middle of a window | 60 DNA sequence elements (called "nucleotides" or "base-pairs"), | decide if this is a | a) "intron -> exon" boundary (ie) [These are sometimes called "donors"] | b) "exon -> intron" boundary (ei) [These are sometimes called "acceptors"] | c) neither (n) | (c) Results of study indicated that machine learning techniques (neural | networks, nearest neighbor, contributors' KBANN system) performed as | well/better than classification based on canonical pattern matching | (method used in biological literature). | | HISTORY | This dataset has been developed to help evaluate a "hybrid" learning | algorithm (KBANN) that uses examples to inductively refine preexisting | knowledge. Using a "ten-fold cross-validation" methodology on 1000 | examples randomly selected from the complete set of 3190, the following | error rates were produced by various ML algorithms (all experiments | run at the Univ of Wisconsin, sometimes with local implementations | of published algorithms). | | System Neither EI IE | ---------- ------- ----- ----- | KBANN 4.62 7.56 8.47 | BACKPROP 5.29 5.74 10.75 | PEBLS 6.86 8.18 7.55 | PERCEPTRON 3.99 16.32 17.41 | ID3 8.84 10.58 13.99 | COBWEB 11.80 15.04 9.46 | Near. Neighbor 31.11 11.65 9.09 | | Type of domain: non-numeric, nominal (one of A, G, T, C) | |************************************************************* | |4. DATASET DISCRIPTION | NUMBER OF EXAMPLES: | 3186 | | Train 2000 | Test 1186 | | NUMBER OF CLASSES: | 3 (one of 1,2,3) | | Distribution of classes | Class Train Test | ------------------------------------ | 1 464 (23.20%) 303 (25.55%) | 2 485 (24.25%) 280 (23.61%) | 3 1051 (52.55%) 603 (50.84%) | | NUMBER OF ATTRIBUTES: | 180 binary indicator variables | | Hint. Much better performance is generally observed if attributes | closest to the junction are used. | In the StatLog version, this means using | attributes A61 to A120 only. | | |CONTACTS | statlog-adm@ncc.up.pt | bob@stams.strathclyde.ac.uk | | |================================================================================ |;little lisp function to generate names: |(defun atts () | (let ((i 1)) | (while (<= i 180) | (insert (format "A%s: continuous.\n" i)) | (setq i (+ 1 i))))) exon/intron, intron/exon, none. | classes left-30: A,C,G,T. left-29: A,C,G,T. left-28: A,C,G,T. left-27: A,C,G,T. left-26: A,C,G,T. left-25: A,C,G,T. left-24: A,C,G,T. left-23: A,C,G,T. left-22: A,C,G,T. left-21: A,C,G,T. left-20: A,C,G,T. left-19: A,C,G,T. left-18: A,C,G,T. left-17: A,C,G,T. left-16: A,C,G,T. left-15: A,C,G,T. left-14: A,C,G,T. left-13: A,C,G,T. left-12: A,C,G,T. left-11: A,C,G,T. left-10: A,C,G,T. left-09: A,C,G,T. left-08: A,C,G,T. left-07: A,C,G,T. left-06: A,C,G,T. left-05: A,C,G,T. left-04: A,C,G,T. left-03: A,C,G,T. left-02: A,C,G,T. left-01: A,C,G,T. right-01: A,C,G,T. right-02: A,C,G,T. right-03: A,C,G,T. right-04: A,C,G,T. right-05: A,C,G,T. right-06: A,C,G,T. right-07: A,C,G,T. right-08: A,C,G,T. right-09: A,C,G,T. right-10: A,C,G,T. right-11: A,C,G,T. right-12: A,C,G,T. right-13: A,C,G,T. right-14: A,C,G,T. right-15: A,C,G,T. right-16: A,C,G,T. right-17: A,C,G,T. right-18: A,C,G,T. right-19: A,C,G,T. right-20: A,C,G,T. right-21: A,C,G,T. right-22: A,C,G,T. right-23: A,C,G,T. right-24: A,C,G,T. right-25: A,C,G,T. right-26: A,C,G,T. right-27: A,C,G,T. right-28: A,C,G,T. right-29: A,C,G,T. right-30: A,C,G,T.