| .names file created by George John, October 1994
| Processing:
| Joel Tesler converted this back to ACTG.
| * A,C,T,G -> 001,010,100,000  Seems biased against systems that can handle
|   categorical attributes
| 
| 
|
|1. TITLE:
|	DNA Dataset (STATLOG version) - Primate splice-junction gene sequences (DNA)
|        with associated imperfect domain theory
|
|	PROBLEM DESCRIPTION
|	Splice junctions are points on a DNA sequence at which `superfluous' DNA is
|	removed during the process of protein creation in higher organisms.  The
|	problem posed in this dataset is to recognize, given a sequence of DNA, the
|	boundaries between exons (the parts of the DNA sequence retained after
|	splicing) and introns (the parts of the DNA sequence that are spliced
|	out). 
|
|	PURPOSE
|	This problem consists of two subtasks: recognizing exon/intron
|	boundaries (referred to as EI sites), and recognizing intron/exon boundaries
|	(IE sites). (In the biological community, IE borders are referred to
|	as ``acceptors'' while EI borders are referred to as ``donors''.)
| 
|2. USE IN STATLOG
|
|	2.1- Testing Mode		
|		Train & Test
|
|	2.2- Special Preprocessing	
|		Yes
|
|	2.3- Test Results
|		
|		Algorithm	Success Rate
|		=========	===========
|		Radial		95.900
|		Dipol92		95.200
|		Alloc80		94.300
|		QuaDisc		94.100
|		Discrim		94.100
|		LogDisc		93.900
|		Bayes		93.200
|		Castle		92.800
|		IndCart		92.700
|		C4.5		92.400
|		Cart		91.500
|		BackProp	91.200
|		BayTree		90.500
|		Cn2		90.500
|		Ac2		90.000
|		NewId		90.000
|		Cal5		86.900
|		Itrule		86.500
|		Smart		85.900
|		KNN		84.500
|		Kohonen		66.100
|		Default		52.000
|		LVQ		0.000
|		Cascade		0.000
|
|3. SOURCES and PAST USAGE
|   3.1 SOURCES
|   	(a) Creators: 
|       		- all examples taken from Genbank 64.1 (ftp site: genbank.bio.net)
|       		- categories "ei" and "ie" include every "split-gene" 
|         	for primates in Genbank 64.1
|       		- non-splice examples taken from sequences known not to include
|         	a splicing site 
|   	(b) Donor: G. Towell, M. Noordewier, and J. Shavlik, 
|              {towell,shavlik}@cs.wisc.edu, noordewi@cs.rutgers.edu
|   	(c) Date received: 1/1/92
|
|	The StatLog DNA dataset is a processed version of the Irvine 
|	database described below.  The main difference is that the 
|	symbolic variables representing the nucleotides (only A,G,T,C) 
|	were replaced by 3 binary indicator variables.  Thus the original 
|	60 symbolic attributes were changed into 180 binary attributes.  
|	The names of the examples were removed.  The examples with 
|	ambiguities were removed (there were very few of them: 4).
|	The StatLog version of this dataset was produced by Ross King
|	at Strathclyde University.   For original details see the Irvine 
|	database documentation.
|
|	The nucleotides A,C,G,T were given indicator values as follows
|
|		A -> 1 0 0
|    		C -> 0 1 0
|    		G -> 0 0 1
|    		T -> 0 0 0
|
|	The class values are 
|		ei -> 1
|                ie -> 2
|                n  -> 3
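|
|	The two mappings above can be sketched in code.  This is a minimal
|	illustration, not part of the StatLog distribution.  It follows the
|	indicator table directly above and ASSUMES that the 180 attributes are
|	laid out as 60 consecutive triples in sequence order (the layout is not
|	stated in this file).  The names encode, decode, TRIPLE_TO_BASE and
|	CLASS_TO_LABEL are hypothetical.
|

```python
# Indicator triples exactly as listed in the table above.
TRIPLE_TO_BASE = {
    (1, 0, 0): "A",
    (0, 1, 0): "C",
    (0, 0, 1): "G",
    (0, 0, 0): "T",
}
BASE_TO_TRIPLE = {base: triple for triple, base in TRIPLE_TO_BASE.items()}

# Numeric class codes as listed above.
CLASS_TO_LABEL = {1: "ei", 2: "ie", 3: "n"}


def encode(sequence):
    """Turn a string of A/C/G/T into a flat list of 0/1 indicators."""
    bits = []
    for base in sequence:
        bits.extend(BASE_TO_TRIPLE[base])
    return bits


def decode(bits):
    """Turn a flat list of 0/1 indicators back into a nucleotide string.

    Assumption: consecutive groups of three indicators describe one
    nucleotide each, in sequence order.
    """
    if len(bits) % 3 != 0:
        raise ValueError("number of indicators must be a multiple of 3")
    bases = []
    for start in range(0, len(bits), 3):
        triple = tuple(bits[start:start + 3])
        if triple not in TRIPLE_TO_BASE:
            raise ValueError(f"invalid indicator triple: {triple}")
        bases.append(TRIPLE_TO_BASE[triple])
    return "".join(bases)
```

|	Under that assumption, a 60-nucleotide window encodes to 180
|	indicators, and decoding an encoded window returns the original string.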
|   3.2 PAST USAGE
|
|	(a) machine learning:
|       	-- M. O. Noordewier and G. G. Towell and J. W. Shavlik, 1991; 
|           "Training Knowledge-Based Neural Networks to Recognize Genes in 
|           DNA Sequences".  Advances in Neural Information Processing Systems,
|           volume 3, Morgan Kaufmann.
|
|	-- G. G. Towell and J. W. Shavlik and M. W. Craven, 1991;  
|           "Constructive Induction in Knowledge-Based Neural Networks",  
|           In Proceedings of the Eighth International Machine Learning
|	   Workshop, Morgan Kaufmann.
|
|        -- G. G. Towell, 1991;
|           "Symbolic Knowledge and Neural Networks: Insertion, Refinement, and
|           Extraction", PhD Thesis, University of Wisconsin - Madison.
|
|        -- G. G. Towell and J. W. Shavlik, 1992;
|           "Interpretation of Artificial Neural Networks: Mapping 
|           Knowledge-based Neural Networks into Rules", In Advances in Neural
|           Information Processing Systems, volume 4, Morgan Kaufmann.
|
|   	(b) attributes predicted: given a position in the middle of a window
|       		of 60 DNA sequence elements (called "nucleotides" or "base-pairs"),
|       		decide if this is a
|		a) "intron -> exon" boundary (ie) [These are sometimes called "acceptors"]
|		b) "exon -> intron" boundary (ei) [These are sometimes called "donors"]
|		c) neither                      (n)
|   	(c) Results of study indicated that machine learning techniques (neural
|       		networks, nearest neighbor, contributors' KBANN system) performed as well as
|       		or better than classification based on canonical pattern matching
|       		(method used in biological literature).
|
|	HISTORY
|	This dataset has been developed to help evaluate a "hybrid" learning
|   	algorithm (KBANN) that uses examples to inductively refine preexisting
|   	knowledge.  Using a "ten-fold cross-validation" methodology on 1000
|   	examples randomly selected from the complete set of 3190, the following 
|   	error rates were produced by various ML algorithms (all experiments
|   	run at the Univ of Wisconsin, sometimes with local implementations
|   	of published algorithms). 
|
|                System	       Neither    EI      IE
|                ----------     -------  -----   -----
|		KBANN    	 4.62    7.56    8.47
|		BACKPROP    	 5.29    5.74   10.75
|		PEBLS    	 6.86    8.18    7.55
|		PERCEPTRON    	 3.99   16.32   17.41
|		ID3    		 8.84   10.58   13.99
|		COBWEB   	11.80   15.04    9.46
|		Near. Neighbor	31.11   11.65    9.09
|	     	
|   	Type of domain: non-numeric, nominal (one of A, G, T, C)
|
|*************************************************************
|
|4. DATASET DESCRIPTION
|	NUMBER OF EXAMPLES: 
|		3186
|
|		Train	2000
|		Test	1186
|
|	NUMBER OF CLASSES: 
|		3 (one of 1,2,3)
|
|		Distribution of classes
|			Class	Train		Test
|			------------------------------------	
|			1	 464 (23.20%)	303 (25.55%)
|		 	2	 485 (24.25%)	280 (23.61%)
|		 	3	1051 (52.55%)	603 (50.84%)
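|
|	The percentages in the table above follow directly from the counts.
|	A small sketch that recomputes them is given here as a consistency
|	check.  The counts are copied from the table; the names TRAIN, TEST
|	and percentages are hypothetical.
|

```python
# Class counts copied from the distribution table above.
TRAIN = {1: 464, 2: 485, 3: 1051}
TEST = {1: 303, 2: 280, 3: 603}


def percentages(counts):
    """Return each class's share of the total, in percent, to 2 decimals."""
    total = sum(counts.values())
    return {cls: round(100.0 * n / total, 2) for cls, n in counts.items()}


if __name__ == "__main__":
    for name, counts in (("Train", TRAIN), ("Test", TEST)):
        print(name, sum(counts.values()), percentages(counts))
```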
|	
|	NUMBER OF ATTRIBUTES:
|		180 binary indicator variables
|
|	Hint.   Much better performance is generally observed if attributes
|		closest to the junction are used.
|		In the StatLog version, this means using
|		attributes A61 to A120 only.
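|
|	The hint can be sketched as a simple slice over one example.  This is
|	an illustration only: it ASSUMES 1-based attribute numbering (A1..A180)
|	and that each nucleotide occupies three consecutive attributes in
|	sequence order.  Under those assumptions, A61 to A120 cover the twenty
|	nucleotides from left-10 to right-10.  The function names are
|	hypothetical.
|

```python
def junction_attributes(row, first=61, last=120):
    """Return attributes A<first>..A<last> (1-based, inclusive) of one example."""
    if len(row) != 180:
        raise ValueError("expected 180 binary indicator values")
    return row[first - 1:last]


def attribute_to_position(attr_index):
    """Map a 1-based attribute number (1..180) to a name such as 'left-10'.

    Assumption: attributes 3k+1..3k+3 describe nucleotide k+1, and
    nucleotides 1..60 run from left-30 to right-30.
    """
    if not 1 <= attr_index <= 180:
        raise ValueError("attribute index must be in 1..180")
    nucleotide = (attr_index - 1) // 3 + 1  # 1..60
    if nucleotide <= 30:
        return f"left-{31 - nucleotide:02d}"
    return f"right-{nucleotide - 30:02d}"
```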
|
|
|CONTACTS
|	statlog-adm@ncc.up.pt
|	bob@stams.strathclyde.ac.uk
|	
|
|================================================================================
|; Little Emacs Lisp function that inserts, at point, the 180 attribute
|; declarations (A1 ... A180) used by the binary StatLog encoding:
|(defun atts ()
|  (let ((i 1))
|    (while (<= i 180)
|      (insert (format "A%s: continuous.\n" i))
|      (setq i (+ 1 i)))))
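|
|	For comparison, the categorical declarations that make up the rest of
|	this file (one class line, then left-30 ... right-30) could be
|	generated along the following lines.  This is a sketch, not the script
|	actually used to produce this file; names_lines and its parameters are
|	hypothetical.
|

```python
def names_lines(flank=30, values="A,C,G,T"):
    """Build the class line and the attribute lines of a .names file."""
    lines = ["exon/intron, intron/exon, none. | classes"]
    # Positions to the left of the junction, farthest first.
    for offset in range(flank, 0, -1):
        lines.append(f"left-{offset:02d}: {values}.")
    # Positions to the right of the junction, nearest first.
    for offset in range(1, flank + 1):
        lines.append(f"right-{offset:02d}: {values}.")
    return lines


if __name__ == "__main__":
    print("\n".join(names_lines()))
```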

exon/intron, intron/exon, none. | classes
left-30: A,C,G,T.
left-29: A,C,G,T.
left-28: A,C,G,T.
left-27: A,C,G,T.
left-26: A,C,G,T.
left-25: A,C,G,T.
left-24: A,C,G,T.
left-23: A,C,G,T.
left-22: A,C,G,T.
left-21: A,C,G,T.
left-20: A,C,G,T.
left-19: A,C,G,T.
left-18: A,C,G,T.
left-17: A,C,G,T.
left-16: A,C,G,T.
left-15: A,C,G,T.
left-14: A,C,G,T.
left-13: A,C,G,T.
left-12: A,C,G,T.
left-11: A,C,G,T.
left-10: A,C,G,T.
left-09: A,C,G,T.
left-08: A,C,G,T.
left-07: A,C,G,T.
left-06: A,C,G,T.
left-05: A,C,G,T.
left-04: A,C,G,T.
left-03: A,C,G,T.
left-02: A,C,G,T.
left-01: A,C,G,T.
right-01: A,C,G,T.
right-02: A,C,G,T.
right-03: A,C,G,T.
right-04: A,C,G,T.
right-05: A,C,G,T.
right-06: A,C,G,T.
right-07: A,C,G,T.
right-08: A,C,G,T.
right-09: A,C,G,T.
right-10: A,C,G,T.
right-11: A,C,G,T.
right-12: A,C,G,T.
right-13: A,C,G,T.
right-14: A,C,G,T.
right-15: A,C,G,T.
right-16: A,C,G,T.
right-17: A,C,G,T.
right-18: A,C,G,T.
right-19: A,C,G,T.
right-20: A,C,G,T.
right-21: A,C,G,T.
right-22: A,C,G,T.
right-23: A,C,G,T.
right-24: A,C,G,T.
right-25: A,C,G,T.
right-26: A,C,G,T.
right-27: A,C,G,T.
right-28: A,C,G,T.
right-29: A,C,G,T.
right-30: A,C,G,T.