This directory has the data for the July 2003 freeze of GALA.
aka hg16, build34, gs.17

Directory structure:

Readme.txt
align_3way/
    log.chr*.dat.gz
alignments/
    concise/
    gg0/  *these were the chicken alignments before March 4, 2004
        gapchr*.dat.gz
        localchr*.dat.gz
        seqfile.dat.gz
    fr1/  (axtBest)
        gapchr*.dat.gz
        localchr*.dat.gz
        seqfile.dat.gz
    galGal2/  (axtBest)
        gapchr*.dat.gz
        localchr*.dat.gz
        seqfile.dat.gz
    mm4/  (axtBest)
        gapchr*.dat.gz
        localchr*.dat.gz
        seqfile.dat.gz
    pt0/  *these were the chimp alignments before March 4, 2004
        gapchr*.dat.gz
        localchr*.dat.gz
        seqfile.dat.gz
    panTro1/  (coming soon)
        gapchr*.dat.gz
        localchr*.dat.gz
        seqfile.dat.gz
    rn3/  (axtNet)
        gapchr*.dat.gz
        localchr*.dat.gz
        seqfile.dat.gz
alternate_genes/
    *_Genes.dat.gz
    *_Genes_exons.dat.gz
chromInfo.dat
conserved_regions.dat.gz
conserved_tfbs/
cpgIsland.dat.gz
dberge_ranges.dat.gz
default_genes/
    exons.dat.gz
    expression.dat
    gene_alias.dat.gz
    gene_cdd.dat.gz
    gene_dbids.dat.gz
    gene_disorders.dat.gz
    gene_expr.dat.gz
    gene_model_prots.dat.gz
    gene_ontology.dat.gz
    gene_orthologs.dat.gz
    gene_product.dat.gz
    genename.dat.gz
    genes.dat.gz
    tissue.dat.gz
functional_promoters.dat.gz
fuguBlat/
    fuguBlat.dat.gz
    fuguBlat_block.dat.gz
gcPercent.dat.gz
genes/
    exons.dat.gz
    gene_alias.dat.gz
    gene_cdd.dat.gz
    gene_dbids.dat.gz
    gene_disorders.dat.gz
    gene_expr.dat.gz
    gene_model_prots.dat.gz
    gene_ontology.dat.gz
    gene_orthologs.dat.gz
    gene_product.dat.gz
    genename.dat.gz
    genes.dat.gz
hbvar_ranges.dat.gz
hbvar_ranges_Aug04.dat.gz
isochore.dat.gz
microRNA.dat.gz
mrna/
    human_est.dat.gz
    human_est_block.dat.gz
    human_mrna.dat.gz
    human_mrna_block.dat.gz
    nonhuman_est.dat.gz
    nonhuman_est_block.dat.gz
    nonhuman_mrna.dat.gz
    nonhuman_mrna_block.dat.gz
    spliced_est.dat.gz
    spliced_est_block.dat.gz
    tigr.dat.gz
    tigr_block.dat.gz
    unigene.dat.gz
    unigene_block.dat.gz
multiple_aligns/
phylo-hmm/
    chr*.dat.gz
predicted_promoters.dat.gz
recombRate.dat.gz
known_regulatory.dat.gz
repeats/
    chr*_repeats.dat.gz
restriction_sites/
    chr*.dat.gz
rp_multiple/
    chr*.3wayRP.dat.gz
snp_allele.dat.gz
snps.dat.gz
tables.txt
tissues.dat.gz
tfbinding_sites/
    chr*.total.gz


NOTES on general file format

.gz   The files were compressed using gzip to save space and download time.
.dat  The files are comma delimited, with text fields enclosed in single
      quotes. Quotes within text are escaped by doubling them. A Unix
      newline separates the table rows.


NOTES on individual files/tables.

alignments - Each subfolder except concise holds a pairwise alignment
    between hg16 and the release named by the subfolder. There are
    mouse, rat, fugu, and chicken alignments. The scoring is done using
    multiz. These files can be used to generate lav files. The concise
    directory has the alignments in the concise format used by tools
    such as strong-hits.
    source: http://bio.cse.psu.edu
    UPDATE (Feb 2, 2004): these tables were refilled with the second
    species coordinates given as genomic rather than alignment
    coordinates.

align_3way - This is a multiple alignment between hg16, mm3, and rn3,
    with information on whether the region is a repeat.
    source: http://bio.cse.psu.edu

alternate genes - The alternate gene models are represented by pairs of
    files named after the track name. These are all on freeze hg16.
    download source: http://genome.ucsc.edu/
    source: tracks by different sources, indicated by name

chrom_info - The chromosome name, start, stop, and species.
    download source: http://genome.ucsc.edu/

conserved_regions.dat - Regions found to be conserved in the pairwise
    alignments using strong-hits, minus exons from GALA's default set of
    genes. Three levels are measured: 70, 80, and 90 percent identity.
    The alignments are hg16/mm4, hg16/galGal2, and hg16/rn3.
    source: http://bio.cse.psu.edu

cpgIsland.dat - CpG islands in the hg16 sequence. Changed data to
    cpgIslandsGgfAndy, from genome-test.cse.ucsc.edu, on July 22, 2004.
    source: http://genome.ucsc.edu/

dberge_ranges.dat - The hg16 coordinates for regions studied in the
    Database of Experimental Results on Gene Expression.
    This table is likely to become obsolete as the 2 databases become
    more closely tied together.
    source: http://bio.cse.psu.edu

default_genes - These files were used to replace the old GALA default
    genes (under genes) on June 24, 2004. The genes come from the UCSC
    Known Genes, and annotations are added as before. The new data for
    the expressionTissue_info table was moved to this directory as well.
    See http://globin.cse.psu.edu/gala/restrictions.html#PS for
    restrictions on use of these genes.
    sources:
        -gene coordinates
            http://genome.ucsc.edu/
        -evid code from LL map data found at
            ftp.ncbi.nih.gov/genomes/H_sapiens/maps/mapview/elements/LOCUS_objects.gz
        -annotations for genes
            ftp.ncbi.nih.gov/refseq/LocusLink/LL_tmpl.gz
            /repository/UniGene/Hs.data.gz
            /repository/OMIM/genemap
            http://expression.gnf.org/cgi-bin/index.cgi

enzyme_info.dat - The enzymes and patterns used for the restriction
    sites.
    source: http://www.neb.com/neb/products/res_enzymes/re_update_frame.html

fuguBlat - The human-fugu alignments done using Blat. There are 2 tables.

gc_percent.dat - The GC percent in set-size windows genome wide.
    download source: http://genome.ucsc.edu/

genes - The default set of genes are the Locus Link genes from the
    genbank summary files. The Locus Link ID is then used to tie the
    gene coordinates to more annotations on the genes. Each
    alternatively spliced form is listed as a separate gene; therefore
    we assigned a unique ID to each gene rather than using the Locus
    Link ID as the primary key. Any genes in the RefSeq track at UCSC
    that are not in the set from the genbank files are added.
    sources:
        -evid code from LL map data found at
            ftp.ncbi.nih.gov/genomes/H_sapiens/maps/mapview/elements/LOCUS_objects.gz
        -genbank files *.gbs
            ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_N/*.gbs
        -refseq genes
            http://genome.ucsc.edu/
        -annotations for genes
            ftp.ncbi.nih.gov/refseq/LocusLink/LL_tmpl.gz
            /repository/UniGene/Hs.data.gz
            /repository/OMIM/genemap
            http://expression.gnf.org/cgi-bin/index.cgi

hbvar_ranges.dat - This file contains the database IDs and coordinates
    for the variants in the HbVar database, to aid in connecting the
    databases. Updated the table with additions and corrections that
    were made since loading on Aug 4, 2004.
    new file: hbvar_ranges_Aug04.dat.gz
    source: http://globin.cse.psu.edu/hbvar/

isochore.dat - This file contains the regions of arbitrary length that
    have relatively uniform GC content. The table has the chromosome,
    start and stop points, and the GC%.
    source: http://nekrut.bx.psu.edu/

microRNA.dat - MicroRNAs.
    source: http://www.sanger.ac.uk/Software/Rfam/mirna/ftp_temp/index.shtml

mrna - Data from the UCSC Genome Browser mRNA and EST tracks. There are
    2 tables per track, with table names reflecting the track name.
    These are mostly the psl tables in the tables.txt file.
    download source: http://genome.ucsc.edu/

multiple_aligns - Alignments between hg16, mm3, and rn3 using humor.
    All species ranges are given with genomic coordinates.
    source: http://bio.cse.psu.edu

phylo-hmm - Conservation scores on 5-way alignments from UCSC. Scores
    have been averaged in 5 bp windows to keep the table size down.
    source: http://genome.ucsc.edu/

promoter_func.dat - Functional promoters.
    reference: http://www-shgc.stanford.edu/myerslab/
    coordinates moved by: http://bio.cse.psu.edu

recombRate.dat - Recombination rates as calculated by deCODE Genetics,
    Genethon, and Marshfield.
    download source: http://genome.ucsc.edu/

known_regulatory.dat - Known regulatory regions, updated March 24, 2004.
    Updated coordinates for HBG2 promoter June 14, 2004.
    Removed 2 duplicate entries Aug 4, 2004.
    source: http://bio.cse.psu.edu

repeats - Repetitive elements.
    download source: http://genome.ucsc.edu/

restriction_sites - Matches to the enzyme pattern for 128 enzymes on the
    hg16 sequence.
    source: http://bio.cse.psu.edu

rp_multiple - Regulatory potential calculated on the multiple alignments
    from the table above.
    align: hg16, mm3, rn3
    source: http://bio.cse.psu.edu

snpall.dat -
snps.dat - The Single Nucleotide Polymorphisms. dbSNP build 118
    sources:
        ftp.ncbi.nih.gov/snp/mssql/data/
            files: SNP.bcp.gz, Allele.bcp.gz, SNPAlleleFreq.bcp.gz,
                   SnpClassCode.bcp.gz
        http://genome.ucsc.edu/
            files: snpNih.txt.gz, snpTsc.txt.gz

tables.txt - The table definitions. The field types are listed as:
        integer     a whole number from -2,147,483,648 to 2,147,483,647
        varchar(N)  a character field with up to N characters
        smallint    a whole number from -32,768 to 32,767
    Fields generated from other fields list the equation used to
    generate their values; these fields are used for indexing.

tfbinding_sites - Matches to 164 transcription factor matrices. The
    matrices come from Transfac (registration required). These are
    matches on the hg16 sequence of 75% or better.
    source: http://bio.cse.psu.edu
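

EXAMPLE of reading a .dat.gz file

The general file format notes above (gzip compression, comma-delimited
rows, single-quoted text fields with doubled quotes, Unix newlines) can
be handled with a short script. The following is a minimal Python
sketch and is not part of the original GALA distribution. The function
names read_dat_rows and parse_dat, the latin-1 encoding, and the sample
rows are illustrative assumptions; see tables.txt for the actual columns
of each table.

```python
import csv
import gzip
import io


def parse_dat(lines):
    """Yield rows from comma-delimited text with single-quoted text fields.

    A quote inside a text field is escaped by doubling it, which the csv
    module handles when quotechar="'" and doublequote=True.
    """
    reader = csv.reader(lines, delimiter=",", quotechar="'", doublequote=True)
    for row in reader:
        yield row


def read_dat_rows(path):
    """Yield rows from a gzip-compressed .dat file (hypothetical helper).

    Assumption: latin-1 is used because it decodes any byte sequence;
    the README does not state the encoding of the files.
    """
    # newline="" leaves line-ending handling to the csv module.
    with gzip.open(path, mode="rt", encoding="latin-1", newline="") as handle:
        yield from parse_dat(handle)


# Demonstration on made-up rows (not real GALA data).
sample = "1,'chr1',100,200,'5'' UTR'\n2,'chr2',300,400,'it''s quoted'\n"
demo_rows = list(parse_dat(io.StringIO(sample)))
for row in demo_rows:
    print(row)
```

Every field comes back as a string, so numeric columns still need to be
converted according to the field types listed in tables.txt.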