This directory has the data for the May 2004 freeze of GALA.
mm5, build33

Directory structure:

Readme.txt
alignments/
    mm5hg17/ (axtBest)
        gapchr*.dat.gz
        localchr*.dat.gz
        seqfile.dat.gz
alternate_genes/
    *_Genes.dat.gz
    *_Genes_exons.dat.gz
chromInfo.dat.gz
conserved70.dat.gz
conserved80.dat.gz
conserved90.dat.gz
conserved_tfbs/
    mm5Hg17Rn3/
        chr*.gala.gz
    mm5Hg17Rn3Canfam1/
        chr*.gala.gz
cpgIsland.dat.gz
gcPercent.dat.gz
genes/
    exons.dat.gz
    expression.dat
    gene_alias.dat.gz
    gene_cdd.dat.gz
    gene_dbids.dat.gz
    gene_expr.dat.gz
    gene_strain.dat.gz
    gene_model_prots.dat.gz
    gene_ontology.dat.gz
    gene_product.dat.gz
    genename.dat.gz
    genes.dat.gz
    go_defs_info.dat.gz
    tissue.dat.gz
isochore.dat.gz
mRNA/
    mouse_est.dat.gz
    mouse_est_block.dat.gz
    mouse_mrna.dat.gz
    mouse_mrna_block.dat.gz
    nonmouse_est.dat.gz
    nonmouse_est_block.dat.gz
    nonmouse_mrna.dat.gz
    nonmouse_mrna_block.dat.gz
    spliced_est.dat.gz
    spliced_est_block.dat.gz
net_aligns/
    net_aligns_canfam1.dat.gz
    net_aligns_galgal2.dat.gz
    net_aligns_hg17.dat.gz
    net_aligns_rn3.dat.gz
repeats/
    repeatschr*.dat.gz
restriction_sites/
    chr*.dat.gz
rp_multiple/
    chr*.rp.gz
tables.txt

NOTES on general file format

.gz   The files were compressed using gzip to save space and download
      time.
.dat  The files are comma delimited with text fields enclosed in single
      quotes. Quotes within text are escaped by doubling them. A Unix
      newline separates the table rows.

NOTES on individual files/tables.

alignments - This is a pairwise alignment between mm5 and the release
    indicated in the subfolder, for all but the concise directory.
    There is a human alignment. The scoring is done using multiz.
    These files can be used to generate lav files. The concise
    directory has the alignments in the concise format used by tools
    such as strong-hits.
    source: http://bio.cse.psu.edu

alternate genes - The alternate gene models are represented by pairs of
    files named after the track name. These are all on freeze mm5.
    download source: http://genome.ucsc.edu/
    source: tracks by different sources, indicated by name

chrom_info - The chromosome name, start, stop, and species.
    download source: http://genome.ucsc.edu/

conserved_regions.dat - The regions that are found conserved in the
    pairwise alignments using strong-hits, minus exons from GALA's
    default set of genes. Three levels are measured: 70, 80, and 90
    percent identity. The alignments are mm5/hg17.
    source: http://bio.cse.psu.edu

cpgIsland.dat - CpG islands in the mm5 sequence. Taken from
    cpgIslandsExt in the genome browser on Aug 24, 2004.
    source: http://genome.ucsc.edu/

genes - The genes come from the UCSC Known Genes and annotations.
    The new data for the expressionTissue_info table was moved to this
    directory as well.
    See http://globin.cse.psu.edu/gala/restrictions.html#PS for
    restrictions on use of these genes.
    sources:
      -gene coordinates
          http://genome.ucsc.edu/
      -evid code from LL map data found at
          ftp.ncbi.nih.gov/genomes/M_musculus/maps/mapview/elements/LOCUS_objects.gz
      -annotations for genes
          ftp.ncbi.nih.gov/refseq/LocusLink/LL_tmpl.gz
          /repository/UniGene/Mm.data.gz
          /repository/OMIM/genemap
          http://expression.gnf.org/cgi-bin/index.cgi

enzyme_info.dat - The enzymes and patterns used for the restriction
    sites.
    source: http://www.neb.com/neb/products/res_enzymes/re_update_frame.html

isochore.dat - This file contains the regions of arbitrary length that
    have relatively uniform GC content. The table has the chromosome,
    start and stop points, and the GC%.
    source: http://nekrut.bx.psu.edu/

mRNA - Data from the UCSC Genome Browser mRNA and EST tracks. There are
    2 tables per track, with table names reflecting the track name.
    These are mostly the psl tables in the tables.txt file.
    download source: http://genome.ucsc.edu/

repeats - Repetitive elements.
    download source: http://genome.ucsc.edu/

restriction_sites - Matches to the enzyme pattern for 128 enzymes on
    the hg16 sequence.
    source: http://bio.cse.psu.edu

tables.txt - The table definitions.
    The field types are listed as:
        integer     which is from -2,147,483,648 to 2,147,483,647.
        varchar(N)  which is a character field with up to N characters.
        smallint    which is from -32,768 to 32,767.
    Fields generated from other fields are listed with the equation
    used to generate their values; these fields are used for indexing.
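
EXAMPLE of reading a .dat.gz file

The .dat format described in the general notes (comma delimited, text
fields in single quotes, embedded quotes doubled, one row per Unix
newline) can probably be read with a standard CSV parser. The Python
sketch below is only an illustration and is not part of the GALA
distribution. The function names, the sample rows, and the latin-1
encoding are assumptions; column names and types should be taken from
tables.txt.

```python
import csv
import gzip
import io


def read_dat_rows(stream):
    """Yield each table row as a list of strings from an open text stream.

    Assumes the .dat conventions from the notes above: comma delimiter,
    single-quoted text fields, and quotes escaped by doubling them.
    """
    reader = csv.reader(stream, delimiter=",", quotechar="'", doublequote=True)
    for row in reader:
        yield row


def read_dat_gz(path, encoding="latin-1"):
    """Yield rows from a gzip-compressed .dat file.

    The encoding is an assumption; the readme does not state one.
    """
    with gzip.open(path, "rt", encoding=encoding, newline="") as handle:
        yield from read_dat_rows(handle)


# Made-up sample rows (not real GALA data) showing a doubled quote.
sample = "'chr1',100,200,'5'' UTR'\n'chr2',300,400,'exon'\n"
rows = list(read_dat_rows(io.StringIO(sample)))
print(rows)
```

All values come back as strings, so numeric columns (integer, smallint)
need an explicit int() conversion according to the definitions in
tables.txt.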