This directory contains the files for GALA, Rat June 2003 release (rn3).

Directory structure:

Readme.txt -- This file
align_3way/
    log.chr*.dat.gz
alignments/
    gapchr*.dat.gz
    localchr*.dat.gz
    seqfile.dat.gz
alternate_genes/
    *_Genes.dat.gz
    *_Genes_exons.dat.gz
tfbinding_sites.new.gz
chromInfo_rat.dat.gz
conserved70_rn3hg15.dat.gz
conserved80_rn3hg15.dat.gz
conserved90_rn3hg15.dat.gz
conserved70_rn3mm3.dat.gz
conserved80_rn3mm3.dat.gz
conserved90_rn3mm3.dat.gz
conserved_tfbs/
    chr*.gala.gz
cpgIsland.dat.gz
enzyme_info.dat.gz
gcPercent.dat.gz
isochore.dat.gz
genes/
    gene_dbids.dat.gz
    mprot.dat.gz
    exons.dat.gz
    gene_prod.dat.gz
    genename.dat.gz
    gene_alias.dat.gz
    genes.dat.gz
    gene_expr.dat.gz
    gene_cdd.dat.gz
    gene_ontology.dat.gz
    go_defs_info.dat.gz
    strain.dat.gz
    qtl.dat.gz
    rn3.orthology.dat.gz
multiple_aligns.dat.gz
mult_align_segments.dat.gz   *not available
promoter_func.dat.gz
recombRate.dat.gz            *not available
regulatory.dat.gz
repeats/
    repeatschr*.dat.gz
restriction_sites/
    chr*.dat.gz
rp_multiple/
    chr*.3wayRP.dat.gz
snpall.dat.gz
snps.dat.gz
tables.txt


NOTES on general file format

.gz   The files were compressed using gzip to save space and download time.
.dat  The files are comma delimited with text fields enclosed in single
      quotes. Quotes within text are escaped by doubling them. A Unix
      newline separates the table rows.


NOTES on individual files/tables.

align_3way - This is a multiple alignment between rn3, hg15, mm3 with
    information on whether the region is a repeat.
    source: http://bio.cse.psu.edu

alignments - This is a pairwise alignment between rn3 and hg15. The scoring
    is done using multiz. These files can be used to generate lav files.
    source: http://bio.cse.psu.edu

alternate genes - The alternate gene models are represented by pairs of
    files named after the track name. These are all on freeze rn3.
    download source: http://genome.ucsc.edu/
    source: tracks by different sources, indicated by name

binding sites - Matches to 164 transcription factor matrices.
    The matrices come from Transfac (registration required). These are
    matches on the rn3 sequence of 75% or better.
    source: http://bio.cse.psu.edu

chrom_info - The chromosome name, start, and stop, and species.
    download source: http://genome.ucsc.edu/

conserved*.dat - The regions that are found conserved in the pairwise
    alignments using strong-hits minus exons from GALA's default set of
    genes. Three levels are measured: 70, 80, and 90 percent identity.
    source: http://bio.cse.psu.edu

conserved_tfbs - Transcription factor binding sites that are conserved in
    all 3 species: rn3, hg15, mm3. These use the same matrices as binding
    sites.
    source: http://bio.cse.psu.edu

cpgIsland.dat - CpG islands in the rn3 sequence.
    source: http://genome.ucsc.edu/

enzyme_info - The enzymes and patterns used for the restriction sites.
    source: http://www.neb.com/neb/products/res_enzymes/re_update_frame.html

gc_percent.dat - The GC Percent track from UCSC for the rn3 freeze.
    download source: http://genome.ucsc.edu/

genes - The default set of genes are the Locus Link genes from the genbank
    summary files. The Locus Link ID is then used to tie the gene
    coordinates to more annotations on the genes. Alternately spliced genes
    are each listed as a separate gene; therefore we assigned a unique ID
    to each gene rather than using the Locus Link ID as the primary key.
    Any genes in the RefSeq track at UCSC that are not in the set from the
    genbank files are added.
    sources:
        -genbank files *.gbs
            ftp.ncbi.nih.gov/genomes/R_norvegicus/CHR_N/*.gbs
        -refseq genes
            http://genome.ucsc.edu/
        -annotations for genes
            ftp.ncbi.nih.gov/refseq/LocusLink/LL_tmpl.gz
                            /repository/UniGene/Rn.data.gz
                            /repository/OMIM/genemap
            http://expression.gnf.org/cgi-bin/index.cgi

multiple_aligns.dat -
mult_align_segments.dat - Alignments between rn3, hg15, and mm3 using humor.
    source: http://bio.cse.psu.edu

promoter_func.dat - Functional promoters.
    reference: http://www-shgc.stanford.edu/myerslab/

recombRate.dat - Recombination rates as calculated by deCODE Genetics,
    Genethon, and Marshfield.
    download source: http://genome.ucsc.edu/

regulatory.dat - Known regulatory regions.
    source: http://bio.cse.psu.edu

repeats - Repetitive elements.
    download source: http://genome.ucsc.edu/

restriction_sites - Matches to the enzyme pattern for 128 enzymes on the
    rn3 sequence.
    source: http://bio.cse.psu.edu

rp_multiple - Regulatory potential calculated on multiple alignments from
    table above.
    align rn3,hg15,mm3
    source: http://bio.cse.psu.edu

snpall.dat -
snps.dat - The Single Nucleotide Polymorphisms.
    sources: ftp.ncbi.nih.gov/snp/mssql/data/SubPopAllele.gz,
                    "            /ANSI1_flat/ds_flat_chN.flat.gz
             http://genome.ucsc.edu/ files: snpNih.txt.gz, snpTsc.txt.gz

tables.txt - The table definitions. The field types are listed as:
    integer     a whole number from -2,147,483,648 to 2,147,483,647
    varchar(N)  a character field with up to N characters
    smallint    a whole number from -32,768 to 32,767
    Fields generated from other fields have the equation used to generate
    the values and are used for indexing.
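
EXAMPLE: reading a .dat.gz file

The general file format notes describe the .dat files as comma delimited,
with text fields in single quotes and embedded quotes doubled. The sketch
below is one way to read such a file with the Python standard library; it is
not part of the GALA distribution. The function names, the latin-1 encoding,
and the sample row are assumptions made for illustration. Column names and
types are not inferred here; they would come from tables.txt.

```python
import csv
import gzip
import io


def parse_gala_text(lines):
    """Yield rows (lists of strings) from an iterable of .dat text lines.

    Fields are comma delimited; text fields are wrapped in single quotes
    and an embedded quote is written as two single quotes.
    """
    reader = csv.reader(lines, delimiter=",", quotechar="'", doublequote=True)
    for row in reader:
        yield row


def read_gala_rows(path):
    """Yield rows from a gzip-compressed .dat file at `path`.

    The latin-1 encoding is an assumption; the README does not state one.
    """
    # newline="" leaves line-ending handling to the csv module.
    with gzip.open(path, mode="rt", encoding="latin-1", newline="") as handle:
        yield from parse_gala_text(handle)


# Hypothetical sample row, not taken from any GALA table.
sample = io.StringIO("1,'chr1',100,'5'' UTR'\n")
rows = list(parse_gala_text(sample))
print(rows)  # [['1', 'chr1', '100', "5' UTR"]]
```

Every value comes back as a string, so integer and smallint columns still
need an explicit int() conversion according to the definitions in tables.txt.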