This directory contains the files for GALA, human April 2003 release.
   aka hg15, build33, gs.16

Directory structure:
	README -- This file
        align_3way/
                log.chr*.dat.gz
        alignments/
		concise/
		mm3/
                	gapchr*.dat.gz
                	localchr*.dat.gz
			seqfile.dat.gz
		rn3/  *not loaded, and may not get loaded
        alternate_genes/
                *_Genes.dat.gz
                *_Genes_exons.dat.gz
        chrom_info.dat.gz
        conserved70.dat.gz
        conserved80.dat.gz
        conserved90.dat.gz
        conserved_tfbs/
                chr*.dat.gz
	cpgIsland.dat.gz          
        enzyme_info.dat.gz
        gcPercent.dat.gz
        genes/
                disorders.dat.gz
                gene_dbids.dat.gz
                mprot.dat.gz
                exons.dat.gz
                gene_ontology.dat.gz
                prod.dat.gz
                expression.dat.gz
                genename.dat.gz
                tissue_info.dat.gz
                gene_alias.dat.gz
                genes.dat.gz
                tissues.dat.gz
                gene_cdd.dat.gz
                go_defs_info.dat.gz
		gene_orthologs.dat.gz
        hbvar_range.dat.gz
        lscores/
                chr*_Lscore.dat.gz
        microRNA.dat.gz
        multiple_aligns.dat
        mult_align_segments.dat
        promoter_func.dat.gz
        recombRate.dat.gz
	regulatory.dat.gz
	repeats/
		repeatschr*.dat.gz
	restriction_sites/
		chr*.dat.gz
	rp/
		chr*.regpotent.dat.gz
	rp_multiple/
		chr*3wayRP.dat.gz
        snpall.dat.gz
        snps.dat.gz
	tables.txt
	tfbinding_sites.dat.gz

NOTES on general file format
.gz	The files were compressed using gzip to save space and download time.
	
.dat	The files are comma delimited with text fields enclosed in single 
	quotes.  Quotes within text are escaped by doubling them.
	A Unix newline separates the table rows.
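
	The format described above can be read with Python's standard csv
	module.  The following is a minimal sketch, not the loader GALA
	itself uses: the function names and the sample row are hypothetical,
	and the files are assumed to be plain text once decompressed.

```python
import csv
import gzip
import io


def parse_dat(lines):
    """Yield each row of .dat text as a list of strings.

    Fields are comma delimited, text fields are wrapped in single quotes,
    and a quote inside a text field is written as two single quotes.
    """
    reader = csv.reader(lines, delimiter=",", quotechar="'", doublequote=True)
    for row in reader:
        yield row


def read_dat_gz(path):
    """Yield rows from a gzip-compressed .dat file (hypothetical helper)."""
    # newline="" lets the csv module handle line endings itself.
    with gzip.open(path, mode="rt", newline="") as handle:
        yield from parse_dat(handle)


# Hypothetical sample row; real column layouts are given in tables.txt.
sample = io.StringIO("1,'chr1','it''s quoted'\n")
rows = list(parse_dat(sample))
print(rows)  # [['1', 'chr1', "it's quoted"]]
```

	All values come back as strings, so numeric columns need to be
	converted according to the types listed in tables.txt.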

NOTES on individual files/tables.
align_3way -
	This is a multiple alignment between hg15, mm3, rn3 with information
	on whether the region is a repeat.
	source: http://bio.cse.psu.edu
alignments -
	This is a pairwise alignment between hg15 and mm3.  The scoring is 
	done using multiz.  These files can be used to generate lav files.
	The subdirectory concise contains the alignments in concise format
	used by tools such as strong-hits.
	source: http://bio.cse.psu.edu
	UPDATE: these tables were updated Feb 2 so that the second species
	  uses genomic coordinates.  Also chrom2 was added to gap_free_align.
alternate_genes -
	The alternate gene models are represented by pairs of files named
	after the track name.  These are all on freeze hg15.
	download source: http://genome.ucsc.edu/
        source: tracks by different sources, indicated by name
chrom_info -
	The chromosome name, start, stop, and species.
     	download source: http://genome.ucsc.edu/
conserved*.dat -
	The regions that are found conserved in the pairwise alignments 
	using strong-hits minus exons from GALA's default set of genes.
	Three levels are measured 70, 80, 90 percent identity.
        source: http://bio.cse.psu.edu
	UPDATE: this table had species added, and was recalculated Feb 2.
	  A dump is made after the load so deleted rows aren't shown.
conserved_tfbs -
	Transcription factor binding sites that are conserved in all 3 species
	(hg15, mm3, rn3), found using the same matrices as tfbinding_sites.
        source: http://bio.cse.psu.edu
cpgIsland.dat -
	CpG islands in the hg15 sequence.
	source: http://genome.ucsc.edu/
enzyme_info -
	The enzymes and patterns used for the restriction sites.
	source: http://www.neb.com/neb/products/res_enzymes/re_update_frame.html
gcPercent.dat -
	The GC Percent track from UCSC for the hg15 freeze.
        download source: http://genome.ucsc.edu/
genes - 
	The default set of genes is the Locus Link genes from the genbank
	summary files.  The Locus Link ID is then used to tie the gene 
	coordinates to more annotations on the genes.  Alternatively spliced
	genes have each splice form listed as a separate gene, so we assigned
	a unique ID to each gene rather than using the Locus Link ID as the
	primary key.  Any genes in the RefSeq track at UCSC that are not in 
	the set from the genbank files are added.
	sources:
	-evid code from LL map data found at
	ftp.ncbi.nih.gov/genomes/H_sapiens/maps/mapview/elements/LOCUS_objects.gz
	-genbank files *.gbs
	ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_N/*.gbs
	-refseq genes
	http://genome.ucsc.edu/
	-annotations for genes
	ftp.ncbi.nih.gov/refseq/LocusLink/LL_tmpl.gz
			/repository/UniGene/Hs.data.gz
			/repository/OMIM/genemap
	http://expression.gnf.org/cgi-bin/index.cgi
	-added ensembl gene ids to gene_dbids table 3-31-04
	-added gene_orthologs table and data 3-31-04
hbvar_range.dat -
	This file contains the database IDs and coordinates for the variants
	in the HbVar database to aid in connecting the databases.
	source: http://globin.cse.psu.edu/hbvar/
lscores -
	Log likelihood scores on conservation.
	source: http://genome.ucsc.edu/
microRNA.dat -
	MicroRNAs.
	reference: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12624257
multiple_aligns.dat -
mult_align_segments.dat -
	Alignments between hg15, mm3, and rn3 using humor.
	source: http://bio.cse.psu.edu
	UPDATE: error was found in second species coordinates Jan 30, 2004
promoter_func.dat -
	Functional promoters.
	reference: http://www-shgc.stanford.edu/myerslab/
recombRate.dat -
	Recombination rates as calculated by deCODE Genetics, Genethon,
	and Marshfield.
	download source: http://genome.ucsc.edu/
regulatory.dat -
	Known regulatory regions
	source: http://bio.cse.psu.edu
repeats -
	Repetitive elements.
	download source: http://genome.ucsc.edu/
restriction_sites -
	Matches to the enzyme pattern for 128 enzymes on the hg15 sequence.
	source: http://bio.cse.psu.edu
rp -
	Regulatory potential calculated using pairwise alignments above.
	source: http://bio.cse.psu.edu
rp_multiple -
	Regulatory potential calculated on the multiple alignments from the
	table above (alignment of hg15, mm3, and rn3).
	source: http://bio.cse.psu.edu
snpall.dat -
snps.dat -
	The Single Nucleotide Polymorphisms.  
	sources: ftp.ncbi.nih.gov/snp/mssql/data/SubPopAllele.gz, 
                                   "            /ANSI1_flat/ds_flat_chN.flat.gz
	 	 http://genome.ucsc.edu/  files:snpNih.txt.gz, snpTsc.txt.gz
tables.txt -
	The table definitions.  The field types are: integer, which ranges
	from -2,147,483,648 to 2,147,483,647; varchar(N), which is a
	character field with up to N characters; and smallint, which ranges
	from -32,768 to 32,767.  Fields generated from other fields list the
	equation used to generate their values and are used for indexing.
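
	The ranges above can be used to sanity-check values before loading
	them into another database.  This is a minimal sketch; the helper
	names are hypothetical and are not part of tables.txt.

```python
# Ranges as stated in the tables.txt notes.
INTEGER_MIN, INTEGER_MAX = -2_147_483_648, 2_147_483_647
SMALLINT_MIN, SMALLINT_MAX = -32_768, 32_767


def fits_integer(value):
    """Return True if value fits the integer field type."""
    return INTEGER_MIN <= value <= INTEGER_MAX


def fits_smallint(value):
    """Return True if value fits the smallint field type."""
    return SMALLINT_MIN <= value <= SMALLINT_MAX


def fits_varchar(text, n):
    """Return True if text has at most n characters, as varchar(N) requires."""
    return len(text) <= n
```

	For example, fits_smallint(40000) is False, so such a value would
	need an integer column instead.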
tfbinding_sites -
        Matches to 164 transcription factor matrices.  The matrices come
        from Transfac (registration required).  These are matches on the
        hg15 sequence of 75% or better.
        source: http://bio.cse.psu.edu