Database Schema

Information Information Software Ensembl Compara

Database Schema

This document refers to the schema of the EnsEMBL Compara version 39. This is the revision 0.4 of this document.

Introduction

EnsEMBL Compara DB is divided into two parts:

Genomic Alignments - This part is dedicated to DNA-DNA alignments. It includes alignments made with different alignment tools. It also includes a list of syntenic regions.
Homologues and Families (or Protein Clusters) - This part is dedicated to protein-protein alignments and it includes protein families, protein trees and homologues defined from the protein trees.

The protein families are clusters of proteins. They are defined using the full set of Ensembl proteins plus all the metazoan proteins from SwissProt and SPTrEMBL.

The protein trees are built using the longest transcript of each protein coding Ensembl gene. Homologies are derived from them.

List of Tables

General Tables
Genomic Alignment Tables
Orthologues and Protein Cluster Tables

General Tables

meta table description
Field	Type	Key	Default	Description
meta_id	int unsigned	PRI	NULL	internal unique ID
meta_key	varchar(40)	MUL
meta_value	varchar(255)	MUL

ncbi_taxa_node

Contains a description of the taxonomic relathionships between all the taxa used in this database.

ncbi_taxa_node table description
Field	Type	Null	Key	Default	Description
taxon_id	int(10) unsigned		MUL	0	the NCBI taxonomy ID
parent_id	int(10) unsigned		MUL	0	the parent taxonomy ID for this node (refers to ncbi_taxa_node.taxon_id)
rank	char(32)	YES	MUL	NULL	e.g. kingdom, family, genus, etc.
genbank_hidden_flag	tinyint(1)			0	boolean value which defines whether this rank is used or not in the abbreviated lineage
left_index	int(10)			0	sub-set left index. All sub-nodes have left_index and right_index values larger than this left_index
right_index	int(10)			0	sub-set right index. All sub-nodes have left_index and right_index values smaller than this right_index
root_id	int(10)			1	the root taxonomy ID for this node (refers to ncbi_taxa_node.taxon_id)

(c.f. ncbi_taxa_name for examples)

ncbi_taxa_name

Contains descriptions the taxonimc nodes defined in the ncbi_taxa_node table.

ncbi_taxa_name table description
Field	Type	Null	Key	Default	Description
taxon_id	int(10) unsigned		MUL	0	external reference to ncbi_taxa_node.taxon_id
name	varchar(255)	YES	MUL	NULL	information assigned to this taxon_id
name_class	varchar(255)	YES	MUL	NULL	type of information. e.g. common name, genbank_synonym, scientif name, etc.

Here is an example on how to get the taxonomic ID for a species:

mysql> SELECT * FROM ncbi_taxa_name WHERE name_class = "scientific name"
AND name = "Homo sapiens";
+----------+--------------+-----------------+
| taxon_id | name | name_class |
+----------+--------------+-----------------+
| 9606 | Homo sapiens | scientific name |
+----------+--------------+-----------------+

Here is another example on how to get the lineage for the hominidae family:

mysql> SELECT * FROM ncbi_taxa_name WHERE name_class = "scientific name"
AND name = "Hominidae";
+----------+-----------+-----------------+
| taxon_id | name | name_class |
+----------+-----------+-----------------+
| 9604 | Hominidae | scientific name |
+----------+-----------+-----------------+

mysql> SELECT * FROM ncbi_taxa_node WHERE taxon_id = 9604;
+----------+-----------+--------+---------------------+------------+-------------+---------+
| taxon_id | parent_id | rank | genbank_hidden_flag | left_index | right_index | root_id |
+----------+-----------+--------+---------------------+------------+-------------+---------+
| 9604 | 314295 | family | 0 | 351172 | 351217 | 1 |
+----------+-----------+--------+---------------------+------------+-------------+---------+

mysql> SELECT * FROM ncbi_taxa_node WHERE left_index > 351172 AND left_index < 351217;
+----------+-----------+------------+---------------------+------------+-------------+---------+
| taxon_id | parent_id | rank | genbank_hidden_flag | left_index | right_index | root_id |
+----------+-----------+------------+---------------------+------------+-------------+---------+
| 9592 | 207598 | genus | 0 | 351184 | 351195 | 1 |
| 9593 | 9592 | species | 1 | 351185 | 351194 | 1 |
| 9594 | 9593 | subspecies | 1 | 351188 | 351189 | 1 |
| 9595 | 9593 | subspecies | 1 | 351192 | 351193 | 1 |
| 9596 | 207598 | genus | 0 | 351202 | 351215 | 1 |
| 9597 | 9596 | species | 1 | 351213 | 351214 | 1 |
| 9598 | 9596 | species | 1 | 351203 | 351212 | 1 |
| 9599 | 9604 | genus | 0 | 351173 | 351182 | 1 |
| 9600 | 9599 | species | 1 | 351174 | 351179 | 1 |
| 9601 | 9600 | subspecies | 1 | 351177 | 351178 | 1 |
| 9602 | 9600 | subspecies | 1 | 351175 | 351176 | 1 |
| 9603 | 9599 | species | 1 | 351180 | 351181 | 1 |
| 9605 | 207598 | genus | 0 | 351196 | 351201 | 1 |
| 9606 | 9605 | species | 1 | 351197 | 351200 | 1 |
| 37010 | 9598 | subspecies | 1 | 351208 | 351209 | 1 |
| 37011 | 9598 | subspecies | 1 | 351210 | 351211 | 1 |
| 37012 | 9598 | subspecies | 1 | 351204 | 351205 | 1 |
| 46359 | 9593 | subspecies | 1 | 351186 | 351187 | 1 |
| 63221 | 9606 | subspecies | 1 | 351198 | 351199 | 1 |
| 91950 | 9598 | subspecies | 1 | 351206 | 351207 | 1 |
| 183511 | 9593 | subspecies | 1 | 351190 | 351191 | 1 |
| 207598 | 9604 | no rank | 1 | 351183 | 351216 | 1 |
+----------+-----------+------------+---------------------+------------+-------------+---------+

genome_db

Contains information about the version of the genome assemblies used in this database.

genome_db table description
Field	Type	Null	Key	Default	Extra	Description
genome_db_id	int(10) unsigned		PRI	NULL	auto_increment	internal unique ID
taxon_id	int(10) unsigned			0		external reference to ncbi_taxa_node.taxon_id
name	varchar(40)		MUL			name of the species
assembly	varchar(100)					assembly version of the genome
assembly_default	tinyint(1)	YES		1		boolean value describing if this assembly is the default one or not, so that we can handle more than one assempbly version for a given species.
genebuild	varchar(100)					version of the genebuild
locator	varchar(255)	YES		NULL		used for production purposes or for user configuration in in-house installation.

Eg the rows:

mysql> SELECT * FROM genome_db WHERE name IN ("Homo sapiens", "Gallus gallus");
+--------------+----------+---------------+----------+------------------+-------------+---------+
| genome_db_id | taxon_id | name | assembly | assembly_default | genebuild | locator |
+--------------+----------+---------------+----------+------------------+-------------+---------+
| 11 | 9031 | Gallus gallus | WASHUC1 | 1 | 0512Ensembl | NULL |
| 22 | 9606 | Homo sapiens | NCBI36 | 1 | 0510Ensembl | NULL |
+--------------+----------+---------------+----------+------------------+-------------+---------+

correspond to the Human and Chicken genomes

species_set

Contains groups or sets of species which are used in the method_link_species_set table.

species_set table description
Field	Type	Null	Key	Default	Extra	Description
species_set_id	int(10) unsigned		PRI	NULL	auto_increment	internal (non-unique) ID
genome_db_id	int(10) unsigned		PRI	0		external reference to genome_db.genome_db_id

E.g. the rows

mysql> SELECT * FROM species_set WHERE species_set_id = 181;
+----------------+--------------+
| species_set_id | genome_db_id |
+----------------+--------------+
| 181 | 11 |
| 181 | 22 |
+----------------+--------------+

correspond to the species_set for the Human and Chicken genomes.

method_link

Contains the list of alignment methods used to find links (homologies) between entities in compara.

method_link table description
Field	Type	Key	Default	Extra	Description
method_link_id	int(10) unsigned	PRI	NULL	auto_increment	internal unique ID
type	varchar(50)	MUL			the common name of the linking method between species.
class	varchar(50)		Description of data associated with the \"type\" field

Current values are:

mysql> SELECT * FROM method_link;
+----------------+--------------------------+---------------------------------------+
| method_link_id | type                     | class                                 |
+----------------+--------------------------+---------------------------------------+
|              1 | BLASTZ_NET               | GenomicAlignBlock.pairwise_alignment  |
|              2 | BLASTZ_NET_TIGHT         | GenomicAlignBlock.pairwise_alignment  |
|              3 | BLASTZ_RECIP_NET         | GenomicAlignBlock.pairwise_alignment  |
|              4 | PHUSION_BLASTN           | GenomicAlignBlock.pairwise_alignment  |
|              5 | PHUSION_BLASTN_TIGHT     | GenomicAlignBlock.pairwise_alignment  |
|              6 | TRANSLATED_BLAT          | GenomicAlignBlock.pairwise_alignment  |
|              7 | BLASTZ_GROUP             | GenomicAlignBlock.pairwise_alignment  |
|              8 | BLASTZ_GROUP_TIGHT       | GenomicAlignBlock.pairwise_alignment  |
|            101 | SYNTENY                  | SyntenyRegion.synteny                 |
|            201 | ENSEMBL_ORTHOLOGUES      | Homology.homology                     |
|            202 | ENSEMBL_PARALOGUES       | Homology.homology                     |
|            301 | FAMILY                   | Family.family                         |
|              9 | MLAGAN                   | GenomicAlignBlock.multiple_alignment  |
|            401 | PROTEIN_TREES            | ProteinTree.protein_tree_node         |
|            204 | ENSEMBL_HOMOLOGUES       | Homology.homology                     |
|             10 | PECAN                    | GenomicAlignBlock.multiple_alignment  |
|             11 | GERP_CONSTRAINED_ELEMENT | GenomicAlignBlock.constrained_element |
|            501 | GERP_CONSERVATION_SCORE  | ConservationScore.conservation_score  |
+----------------+--------------------------+---------------------------------------+

We use method_link_ids between 1 and 100 for DNA-DNA alignments, between 101 and 200 for genomic syntenies, between 201 and 300 for protein homologies, between 301 and 400 for protein families and between 401 and 500 for protein trees. Each category corresponds to data stored in different tables

method_link_species_set

This table contains information about the comparisons stored in the database. A given method_link_species_set_id exist for each comparison made and relates a method_link.method_link_id with a set of species (species_set.species_set_id.

method_link_species table description
Field	Type	Null	Key	Default	Extra	Description
method_link_species_set_id	int(10) unsigned		MUL	NULL	auto_increment	internal unique ID.
method_link_id	int(10) unsigned	YES	MUL	NULL		external reference to method_link.method_link_id
species_set_id	int(10) unsigned			0		external reference to species_set.species_set_id
name	varchar(255)			NULL		human-readable description for this method_link_species_set
source	varchar(255)			ensembl		source of the data. Currently either ensembl or ucsc if data were imported from UCSC
url	varchar(255)			NULL		A URL where you can find the orignal data if they were imported.

E.g. the rows

mysql> SELECT * FROM method_link_species_set WHERE method_link_id = 9;
+----------------------------+----------------+----------------+------------------------------+---------+-----+
| method_link_species_set_id | method_link_id | species_set_id | name | source | url |
+----------------------------+----------------+----------------+------------------------------+---------+-----+
| 192 | 9 | 30003 | 5 eutherian mammals MLAGAN | ensembl | |
| 193 | 9 | 30004 | 7 amniota vertebrates MLAGAN | ensembl | |
| 214 | 9 | 30079 | 3 primates MLAGAN | ensembl | |
+----------------------------+----------------+----------------+------------------------------+---------+-----+

correspond to all the MLAGAN alignments in this database.

Genomic Alignment Tables

dnafrag

This table defines the genomic sequences used in the comparative genomics analyisis. It is used by the genomic_align_block table to define aligned sequences. It is also used by the dnafrag_region table to define syntenic regions.

dnafrag table description
Field	Type	Null	Key	Default	Extra	Description
dnafrag_id	int(10) unsigned		PRI	NULL	auto_increment	internal unique ID
length	int(11)			0
name	varchar(40)		MUL			name of the DNA sequence (e.g., the name of the chromosome)
genome_db_id	int(10) unsigned			0		external reference to genome_db.genome_db_id
coord_system_name	varchar(40)	YES		NULL		refers to the coord system in which this dnafrag has been defined

E.g. the row

mysql> SELECT dnafrag.* FROM dnafrag LEFT JOIN genome_db USING (genome_db_id) WHERE
dnafrag.name = "14" AND genome_db.name = "Homo sapiens";
+------------+-----------+------+--------------+-------------------+
| dnafrag_id | length | name | genome_db_id | coord_system_name |
+------------+-----------+------+--------------+-------------------+
| 905410 | 106368585 | 14 | 22 | chromosome |
+------------+-----------+------+--------------+-------------------+

refer to the chromosome 14 of the Human genome (genome_db.genome_db_id = 22 refers to Human genome in this example) which is 106368585 nucleotides long.

genomic_align_block

This table is the key table for the genomic alignments. The software used to align the genomic blocks is refered as an external key to the method_link table. Nevertheless, actual aligned sequences are defined in the genomic_align table.

genomic_align_block table description
Field	Type	Null	Key	Default	Extra	Description
genomic_align_block_id	bigint(20) unsigned		PRI	NULL	auto_increment	internal unique ID
method_link_species_set_id	int(10) unsigned		MUL	0		external reference to method_link_species_set.method_link_species_set_id
score	double	YES		NULL		score returned by the homology search program
perc_id	tinyint(3) unsigned	YES		NULL		Used for pairwise comparison. Defines the percentage of identity between both sequences
length	int(10)	YES		NULL		total length of the alignment

E.g. the row

mysql> SELECT * FROM genomic_align_block WHERE genomic_align_block_id = 2130000103409;
+------------------------+----------------------------+-------+---------+--------+
| genomic_align_block_id | method_link_species_set_id | score | perc_id | length |
+------------------------+----------------------------+-------+---------+--------+
| 2130000103409 | 213 | 992 | 53 | 71 |
+------------------------+----------------------------+-------+---------+--------+

will refer to a BLASTZ_NET alignment between human and chicken genomes (method_link_species_set.method_link_species_set_id = 213) with a score of 992, an identity of 53% and a length of 71 nucleotides. The actual sequences corresponding to this aligment are defined in the genomic_align table.

genomic_align

This table contains the coordinates and all the information needed to rebuild genomic alignments. Every entry corresponds to one of the aligned sequences. It also contains an external key to the method_link_species_set which refers to the software and set of species used for getting the corresponding alignment. The aligned sequence is defined by an external reference to the dnafrag table, the starting and ending position within this dnafrag, the strand and a cigar_line.

The original aligned sequence is not stored but it can be retrieved using the cigar_line field and the original sequence. The cigar line defines the sequence of matches/mismatches and deletions (or gaps). For example, this cigar line 2MD3M2D2M will mean that the alignment contains 2 matches/mismatches, 1 deletion (number 1 is omitted in order to save some space), 3 matches/mismatches, 2 deletions and 2 matches/mismatches. If the original sequence is:

Original sequence: AACGCTT

The aligned sequence will be:

cigar line: 2MD3M2D2M
M	M	D	M	M	M	D	D	M	M
A	A	-	C	G	C	-	-	T	T

genomic_align_block table description
Field	Type	Null	Key	Default	Extra	Description
genomic_align_id	bigint(20) unsigned		PRI	NULL	auto_increment	unique internal id
genomic_align_block_id	bigint(10) unsigned		MUL	0		external reference to genomic_align_block.genomic_align_block_id
method_link_species_set_id	int(10) unsigned			0		external reference to method_link_species_set.method_link_species_set_id. This information is redundant because it also appears in the genomic_align_block table but it is used to speed up the queries
dnafrag_id	int(10) unsigned		MUL	0		external reference to dnafrag.dnafrag_id
dnafrag_start	int(10)			0		starting position within the dnafrag defined by dnafrag_id
dnafrag_end	int(10)			0		ending position within the dnafrag defined by dnafrag_id
dnafrag_strand	tinyint(4)			0		strand in the dnafrag defined by dnafrag_id
cigar_line	mediumtext	YES		NULL		internal description of the aligned sequence
level_id	tinyint(2)			0		level of orhologous layer. 1 corresponds to the first layer of orthologous sequences found, 2 and over are addiotional layers. Use for building the syntenies (based on level_id = 1 only)

E.g. the rows

mysql> SELECT * FROM genomic_align WHERE genomic_align_block_id = 2130000103409;
+------------------+------------------------+----------------------------+------------+-...
| genomic_align_id | genomic_align_block_id | method_link_species_set_id | dnafrag_id |
+------------------+------------------------+----------------------------+------------+-...
| 2130000206817 | 2130000103409 | 213 | 905409 |
| 2130000206818 | 2130000103409 | 213 | 23206 |
+------------------+------------------------+----------------------------+------------+-...

 ...-+---------------+-------------+----------------+------------+----------+
 | dnafrag_start | dnafrag_end | dnafrag_strand | cigar_line | level_id |
 ...-+---------------+-------------+----------------+------------+----------+
 | 10347506 | 10347576 | 1 | 71M | 2 |
 | 462910 | 462978 | 1 | 41M2D28M | 2 |
 ...-+---------------+-------------+----------------+------------+----------+

correspond to both pieces of sequences included in the alignment described above (see genomic_align_block table description). The first sequence includes the nucleotides from 10347506 to 10347576 in the forward strand of the chromosome 17 of the Human genome (dnafrag.dnafrag_id = 905409). The second sequence includes the nucleotides from 462910 to 462978 in the forward strand of the chromosome 18 of the Chicken genome (dnafrag.dnafrag_id = 23206).

Here is a better way to get this by joining the dnafrag and genome_db tables:

mysql> SELECT genome_db.name, dnafrag.name, dnafrag_start, dnafrag_end, dnafrag_strand, cigar_line
FROM genomic_align LEFT JOIN dnafrag USING (dnafrag_id) LEFT JOIN genome_db USING (genome_db_id)
WHERE genomic_align_block_id = 2130000103409;
+---------------+------+---------------+-------------+----------------+------------+
| name | name | dnafrag_start | dnafrag_end | dnafrag_strand | cigar_line |
+---------------+------+---------------+-------------+----------------+------------+
| Homo sapiens | 17 | 10347506 | 10347576 | 1 | 71M |
| Gallus gallus | 18 | 462910 | 462978 | 1 | 41M2D28M |
+---------------+------+---------------+-------------+----------------+------------+

The aligned sequences can be rebuild using the original sequences fetched from the corresponding core databases and the cigar lines as explained before.

genomic_align_group

This table is used to group alignments.

genomic_align_group table description
Field	Type	Key	Default	Extra	Description
group_id	int(10) unsigned	MUL	NULL	auto_increment	internal ID
type	varchar(40)				This field allow us to group genomic_aligns in several ways (types)
genomic_align_id	bigint(20) unsigned	MUL	0		external reference to genomic_align.genomic_align_id

E.g. the rows

mysql> SELECT * FROM genomic_align_group WHERE group_id = 2130000000314;
+---------------+---------+------------------+
| group_id | type | genomic_align_id |
+---------------+---------+------------------+
| 2130000000314 | default | 2130000205251 |
| 2130000000314 | default | 2130000205252 |
| 2130000000314 | default | 2130000205255 |
| 2130000000314 | default | 2130000205253 |
| 2130000000314 | default | 2130000206402 |
| 2130000000314 | default | 2130000206401 |
| 2130000000314 | default | 2130000206578 |
| 2130000000314 | default | 2130000206577 |
| 2130000000314 | default | 2130000206818 |
| 2130000000314 | default | 2130000206817 |
| 2130000000314 | default | 2130000207114 |
| 2130000000314 | default | 2130000207113 |
| 2130000000314 | default | 2130000207148 |
| 2130000000314 | default | 2130000207147 |
| 2130000000314 | default | 2130000207149 |
| 2130000000314 | default | 2130000207150 |
| 2130000000314 | default | 2130000207243 |
| 2130000000314 | default | 2130000207244 |
+---------------+---------+------------------+

correspond to the group of several genomic_align in which the alignment described before is included.

synteny_region

Contains all the syntenic relationships found and the relative orientation of both syntenic regions.

synteny_region table description
Field	Type	Null	Key	Default	Extra	Description
synteny_region_id	int(10) unsigned		PRI	NULL	auto_increment	internal unique ID
method_link_species_set_id	int(10) unsigned			0		external reference to method_link_species_set.method_link_species_set_id.

E.g. the row

mysql> SELECT * FROM synteny_region WHERE synteny_region_id = 29099;
+-------------------+----------------------------+
| synteny_region_id | method_link_species_set_id |
+-------------------+----------------------------+
| 29099 | 10007 |
+-------------------+----------------------------+

means that the syntenic region 29099 corresponds to a syntenic relationship between the Human and Chicken genomes. See dnafrag_region table for more details.

dnafrag_region

Contains the genomic regions corresponding to every syntenic relationship found. There are two genomic regions for every syntenic relationship.

dnafrag_region table description
Field	Type	Key	Description
synteny_region_id	int(10) unsigned	PRI	external reference to synteny_region.synteny_region_id
dnafrag_id	int(10) unsigned	PRI	external reference to dnafrag.dnafrag_id
dnafrag_start	int(10) unsigned		first nucleotide from this dnafrag which is in synteny
dnafrag_end	int(10) unsigned		last nucleotide from this dnafrag which is in synteny
dnafrag_strand	tinyint(4)		strand of this region

E.g. the rows

mysql> SELECT * FROM dnafrag_region WHERE synteny_region_id = 29099;
+-------------------+------------+---------------+-------------+----------------+
| synteny_region_id | dnafrag_id | dnafrag_start | dnafrag_end | dnafrag_strand |
+-------------------+------------+---------------+-------------+----------------+
| 29099 | 23206 | 281326 | 2819393 | -1 |
| 29099 | 905409 | 8236729 | 15175628 | 1 |
+-------------------+------------+---------------+-------------+----------------+

correspond to both genomic regions of the syntenic region 29099. In this case, the first genomic region corresponds to the negative strand of the sequence from 281326 to 2819393 of the chromosome 18 of the Chicken genome (dnafrag_id = 23206 for this chromosome) and the second one corresponds to the sequence from 8236729 to 15175628 of the chromosome 17 of the Human genome (dnafrag_id = 905409 for this chromosome).

Here is a better way to get this by joining the dnafrag and genome_db tables:

A
mysql> SELECT genome_db.name, dnafrag.name, dnafrag_start, dnafrag_end, dnafrag_strand FROM dnafrag_region LEFT JOIN dnafrag USING (dnafrag_id) LEFT JOIN genome_db USING (genome_db_id) WHERE synteny_region_id = 29099;
+---------------+------+---------------+-------------+----------------+
| name | name | dnafrag_start | dnafrag_end | dnafrag_strand |
+---------------+------+---------------+-------------+----------------+
| Gallus gallus | 18 | 281326 | 2819393 | -1 |
| Homo sapiens | 17 | 8236729 | 15175628 | 1 |
+---------------+------+---------------+-------------+----------------+

Tables For Orthologues and Protein Clusters

member

This table links sequences to the EnsEMBL core DB or to external DBs.

member table description
Field	Type	Null	Key	Default	Extra	Description
member_id	int(10) unsigned		PRI	NULL	auto_increment	internal unique ID
stable_id	varchar(40)					EnsEMBL stable ID or external ID (for Uniprot/SWISSPROT and Uniprot/SPTREMBL)
version	int(10)	YES		0		version of the stable ID (see EnsEMBL core DB)
source_name	varchar(40)					describe the source of the member (Uniprot/SWISSPROT, Uniprot/SPTREMBL, ENSEMBLGENE, ENSEMBLPEP)
taxon_id	int(10) unsigned			0		external reference to taxon.taxon_id
genome_db_id	int(10) unsigned	YES		NULL		external reference to genome_db.genome_db_id
sequence_id	int(10) unsigned		MUL	NULL		external reference to sequence.sequence_id. May be 0 when the sequence is not available in the sequence table, e.g. for a gene instance.
gene_member_id	int(10) unsigned		MUL	NULL		external reference to member.memebr_id to allow linkage from peptides to genes.
description	text	YES		NULL		the description of the gene/protein as described in the core database or from the Uniprot entry
chr_name	varchar(40)	YES		NULL		chromosome where this sequence is located
chr_start	int(10)	YES		NULL		first nucleotide of this chromosome which corresponds to this member
chr_end	int(10)	YES		NULL		last nucleotide of this chromosome which corresponds to this member
chr_strand	tinyint(1)			0		strand of the chromosome in which the member is
display_label	varchar(128)	YES		NULL

E.g. the row

mysql> SELECT * FROM member WHERE stable_id = "ENSP00000309431";
+-----------+-----------------+---------+-------------+----------+--------------+-------------+-...
| member_id | stable_id | version | source_name | taxon_id | genome_db_id | sequence_id |
+-----------+-----------------+---------+-------------+----------+--------------+-------------+-...
| 156982 | ENSP00000309431 | 31 | ENSEMBLPEP | 9606 | 22 | 145811 |
+-----------+-----------------+---------+-------------+----------+--------------+-------------+-...

 ...-+----------------+-...
 | gene_member_id |
 ...-+----------------+-...
 | 156981 |
 ...-+----------------+-...

 ...-+---------------------------------------------------------------------------------------+-...
 | description |
 ...-+---------------------------------------------------------------------------------------+-...
 | Transcript:ENST00000308911 Gene:ENSG00000173213 Chr:18 Start:37358 End:39626 |
 ...-+---------------------------------------------------------------------------------------+-...

 ...-+----------+-----------+---------+------------+
 | chr_name | chr_start | chr_end | chr_strand |
 ...-+----------+-----------+---------+------------+
 | 18 | 37390 | 39557 | -1 |
 ...-+----------+-----------+---------+------------+

refers to the human (taxon_id = 9606 or genome_db_id = 22) peptide ENSP00000309431 which is located in the chromosome 18 (from 37390 to 39557, in the reverse strand). This peptide is described as "Transcript:ENST00000308911 Gene:ENSG00000173213 Chr:18 Start:37358 End:39626" and the sequence can be found in the sequence table.

sequence

This table contains the protein sequences present in the member table used in the protein alignment part of the EnsEMBL Compara DB.

sequence table description
Field	Type	Null	Key	Default	Extra	Description
sequence_id	int(10) unsigned		PRI	NULL	auto_increment	internal unique ID
length	int(10)	YES		NULL		the length of the sequence
sequence	longtext	YES		NULL		the sequence

E.g. the row

mysql> select * from sequence where sequence_id = 139425;
+-------------+--------+---------------------------------------------------------------+
| sequence_id | length | sequence |
+-------------+--------+---------------------------------------------------------------+
| 145811 | 444 | MREIVLTQTGQCGNQIGAKFWEVISDEHAIDSAGTYHGDSHLQLERINVHHHEASG[...] |
+-------------+--------+---------------------------------------------------------------+

contains a 41 aminoacids long sequence.

analysis

This table is mainly used for production purposes.

analysis table description
Field	Type	Null	Key	Default	Extra	Description
analysis_id	int(10) unsigned		PRI	NULL	auto_increment	internal unique ID
created	datetime			0000-00-00 00:00:00		date to distinguish newer and older versions off the same analysis. Not well maintained so far.
logic_name	varchar(40)		UNI			string to identify the analysis. Used mainly inside pipeline.
db	varchar(120)	YES		NULL		db should be a database name, db version the version of that db
db_version	varchar(40)	YES		NULL
db_file	varchar(120)	YES		NULL		the file system location of that database, probably wiser to generate from just db and configurations
program	varchar(80)	YES		NULL		The binary used to create a feature. Similar semantic to above
program_version	varchar(40)	YES		NULL
program_file	varchar(80)	YES		NULL
parameters	varchar(255)	YES		NULL		a parameter string which is processed by the perl module
module	varchar(80)	YES		NULL		Perl module names (RunnableDBS usually) executing this analysis
module_version	varchar(40)	YES		NULL
gff_source	varchar(40)	YES		NULL		how to make a gff dump from features with this analysis
gff_feature	varchar(40)	YES		NULL		how to make a gff dump from features with this analysis

analysis_description

This table has been added to comply with core Bio::EnsEMBL::DBSQL::AnalysisAdaptor requirements.

analysis_description table description
Field	Type	Null	Key	Default	Description
analysis_id	int(10) unsigned		MUL	0	external reference to analysis.analysis_id
description	text	YES		NULL
display_label	varchar(255)	YES		NULL

peptide_align_feature

This tables stores the raw HSP local alignment results of peptide to peptide alignments returned by a BLAST run it is translated from a FeaturePair object.

peptide_align_feature table description
Field	Type	Null	Key	Default	Extra	Description
peptide_align_feature_id	int(10) unsigned		PRI	NULL	auto_increment	internal unique ID
qmember_id	int(10) unsigned		MUL	0		external reference to member.member_id for the query peptide
hmember_id	int(10) unsigned		MUL	0		external reference to member.member_id for the hit peptide
qgenome_db_id	int(10) unsigned			0		external reference to genome_db.genome_db_id for the query peptide (for query optimization)
hgenome_db_id	int(10) unsigned			0		external reference to genome_db.genome_db_id for the hit peptide (for query optimization)
analysis_id	int(10) unsigned			0		external reference to analysis.analyis_id
qstart	int(10)			0		starting position in the query peptide sequence
qend	int(10)			0		ending position in the query peptide sequence
hstart	int(10)			0		starting position in the hit peptide sequence
hend	int(10)			0		ending position in the hit peptide sequence
score	double(16,4)			0.0000		blast score for this HSP
evalue	varchar(20)	YES		NULL		blast evalue for this HSP
align_length	int(10)	YES		NULL		alignment length of HSP
identical_matches	int(10)	YES		NULL		blast HSP match score
perc_ident	int(10)	YES		NULL		percent identical matches in the HSP length
positive_matches	int(10)	YES		NULL		blast HSP positive score
perc_pos	int(10)	YES		NULL		precent positive matches in the HSP length
hit_rank	int(10)	YES		NULL		rank in blast result
cigar_line	mediumtext	YES		NULL		cigar string coding the actual alignment

E.g. the rows:

mysql> SELECT * FROM peptide_align_feature WHERE qmember_id = 442105;
+--------------------------+------------+------------+---------------+---------------+-...
| peptide_align_feature_id | qmember_id | hmember_id | qgenome_db_id | hgenome_db_id |
+--------------------------+------------+------------+---------------+---------------+-...
| 1 | 442105 | 248885 | 3 | 12 |
| 2 | 442105 | 86297 | 3 | 12 |
| 3 | 442105 | 215369 | 3 | 12 |
| 4 | 442105 | 67917 | 3 | 12 |
| 5 | 442105 | 182642 | 3 | 12 |
| 6 | 442105 | 212528 | 3 | 12 |
| 7 | 442105 | 215342 | 3 | 12 |
| 8 | 442105 | 260556 | 3 | 5 |
+--------------------------+------------+------------+---------------+---------------+-...

 ...-+-------------+--------+------+--------+------+----------+---------+--------------+-...
 | analysis_id | qstart | qend | hstart | hend | score | evalue | align_length |
 ...-+-------------+--------+------+--------+------+----------+---------+--------------+-...
 | 14 | 3 | 240 | 38 | 276 | 276.0000 | 1.1e-25 | 243 |
 | 14 | 32 | 244 | 8 | 220 | 224.0000 | 3.5e-20 | 219 |
 | 14 | 30 | 213 | 8 | 198 | 212.0000 | 6.6e-19 | 194 |
 | 14 | 32 | 244 | 2 | 219 | 176.0000 | 2.2e-14 | 224 |
 | 14 | 36 | 227 | 1 | 201 | 161.0000 | 5.2e-13 | 205 |
 | 14 | 30 | 219 | 2 | 208 | 158.0000 | 8.5e-12 | 214 |
 | 14 | 29 | 214 | 7 | 201 | 158.0000 | 9.8e-12 | 201 |
 | 15 | 32 | 242 | 8 | 224 | 249.0000 | 1.2e-22 | 220 |
 ...-+-------------+--------+------+--------+------+----------+---------+--------------+-...

 ...-+-------------------+------------+------------------+----------+----------+-...
 | identical_matches | perc_ident | positive_matches | perc_pos | hit_rank |
 ...-+-------------------+------------+------------------+----------+----------+-...
 | 80 | 32 | 124 | 51 | 1 |
 | 66 | 30 | 112 | 51 | 2 |
 | 63 | 32 | 104 | 53 | 3 |
 | 60 | 26 | 107 | 47 | 4 |
 | 61 | 29 | 99 | 48 | 5 |
 | 61 | 28 | 98 | 45 | 6 |
 | 57 | 28 | 97 | 48 | 7 |
 | 73 | 33 | 109 | 49 | 1 |
 ...-+-------------------+------------+------------------+----------+----------+-...

 ...-+---------------------------------------------+
 | cigar_line |
 ...-+---------------------------------------------+
 | 23M2I34MD42MI13MI51MD43M3D28M |
 | 31M2D9M3I27M2D15MI56MD39MI7MD20MI3M |
 | 27M2D7MD23M2D20M2I16MI33M2D9MD8M2D38M |
 | 40M3D8M3D3M2D24M2I8MI33MD21MD43MD16M3I11M |
 | 33MD5M2D8M5D20MI24MI29M4D17MD42M2I10M |
 | 25M3D17M2D21M6I57M13D18MD3M4DMI36MD5M |
 | 30MD7M3D9MI31M2I5M2D17M3I22M3D13M4D29M2D17M |
 | 33M2D7M2D20M4D9MI16MI55MD65MI3M |
 ...-+---------------------------------------------+

corresponds to all the hits found for the rat peptide defined by the member.member_id 442105.

homology

Contains all the genomic homologies found. There are two homology_member entries for each homology entry for now, but both the schema and the API can handle more than just pairwise relations.

homology table description
Field	Type	Null	Key	Default	Extra	Description
homology_id	int(10) unsigned		PRI	NULL	auto_increment	internal unique ID
stable_id	varchar(40)	YES		NULL		stable ID of the pairwise homology relationship
method_link_species_set_id	int(10) unsigned		MUL	0		external reference to method_link_species_set.method_link_species_set_id
description	varchar(40)	YES		NULL		describes the type of homology found: ortholog_one2one apparent_ortholog_one2one ortholog_one2many ortholog_many2many between_species_paralog within_species_paralog
subtype	varchar(40)					gives the taxonomy name of the latest common taxonomic ancestor for the considered homology e.g. Homo sapiens Amniota Diptera Etc...
dn	float(10,5)	YES		NULL		number of nonsynonymous substitutions per nonsynonymous site
ds	float(10,5)	YES		NULL		number of synonymous substitutions per synonymous site
n	float(10,1)	YES		NULL		number of nonsynonymous substitutions
s	float(10,1)	YES		NULL		number of synonymous substitutions
lnl	float(10,3)	YES		NULL		maximum likelihood test value
threshold_on_ds	float(10,5)	YES		NULL		used by the EnsEMBL Web Browser to decide whether or not to display dN/DS ratio
node_id	int(10) unsigned	YES		NULL		external reference to protein_tree_node.node_id that represents the ancestor node between the 2 genes involved in the homology relation

dN, dS, N, S and lnL are statistical values given by the codeml program of the Phylogenetic Analysis by Maximum Likelihood (PAML) package. We are not providing these data since we moved to the new gene tree based homology pipeline although we expect to provide them again on 2007

E.g. the row

mysql> select * from homology where homology_id = 296648;
+-------------+------------------------+----------------------------+-------------+-...
| homology_id | stable_id | method_link_species_set_id | description |
+-------------+------------------------+----------------------------+-------------+-...
| 296648 | 9606_10116_01092698700 | 28 | UBRH |
+-------------+------------------------+----------------------------+-------------+-...

 ...-+---------+---------+---------+-------+------+----------+-----------------+
 | subtype | dn | ds | n | s | lnl | threshold_on_ds |
 ...-+---------+---------+---------+-------+------+----------+-----------------+
 | | 0.12053 | 0.57107 | 166.2 | 64.8 | -440.544 | 1.27166 |
 ...-+---------+---------+---------+-------+------+----------+-----------------+

defines a UBRH homology found which stable ID is 9606_10116_01092698700.

N.B. Those stable IDs are not stable for now. Proper stable IDs (e.g. ENSORTHO1234567890) will be added in the near future.

homology_member

Contains the sequences corresponding to every genomic homology relationship found. There are two homology_member entries for each pairwise homology entry. As written in the homology table section, both schema and API can deal with more than pairwise relationships.

The original alignment is not stored but it can be retrieved using the cigar_line field and the original sequences. The cigar line defines the sequence of matches or mismatches and deletions in the alignment.

First peptide sequence: SERCQVVVISIGPISVLSMILDFY

Second peptide sequence: SDRCQVLVISILSMIGLDFY

```
First corresponding cigar line: 20MD4M
```

Second corresponding cigar line: 11M5D9M

The alignment will be:

Example of alignment reconstruction
First peptide cigar line	M	M	M	M	M	M	M	M	M	M	M	M	M	M	M	M	M	M	M	M	D	M	M	M	M
First aligned peptide	S	E	R	C	Q	V	V	V	I	S	I	G	P	I	S	V	L	S	M	I	-	L	D	F	Y
Second aligned peptide	S	D	R	C	Q	V	L	V	I	S	I	-	-	-	-	-	L	S	M	I	G	L	D	F	Y
Second peptide cigar line	M	M	M	M	M	M	M	M	M	M	M	D	D	D	D	D	M	M	M	M	M	M	M	M	M

homology_member table description
Field	Type	Null	Key	Default	Description
homology_id	int(10) unsigned		PRI	0	external reference to homology.homology_id
member_id	int(10) unsigned		PRI	0	external reference to member.member_id. It refers to the corresponding gene (ENSEMBL_GENE).
peptide_member_id	int(10) unsigned	YES		NULL	external reference to member.member_id. It refers to the peptide/protein (ENSEMBL_PEP).
peptide_align_feature_id	int(10) unsigned	YES		NULL	external reference to peptide_align_feature.peptide_align_feature_id
cigar_line	mediumtext	YES		NULL	an internal description of the alignment. It contains mathces/mismatches (M) and delations (D)and refers to the corresponding peptide_member_id sequence.
cigar_start	int(10)	YES		NULL	defines the first aligned aminoacid
cigar_end	int(10)	YES		NULL	defines the last aligned aminoacid
perc_cov	int(10)	YES		NULL	defines the percentage of the peptide which has been aligned
perc_id	int(10)	YES		NULL	defines the percentage of identity between both homologues
perc_pos	int(10)	YES		NULL	defines the percentage of positivity (similarity) between both homologues

E.g. the rows

mysql> select * from homology_member where homology_id = 296648;
+-------------+-----------+-------------------+--------------------------+------------+-...
| homology_id | member_id | peptide_member_id | peptide_align_feature_id | cigar_line |
+-------------+-----------+-------------------+--------------------------+------------+-...
| 296648 | 685 | 722 | 13942767 | 77M |
| 296648 | 406456 | 406463 | 3428665 | 77M |
+-------------+-----------+-------------------+--------------------------+------------+-...

 ...-+-------------+-----------+----------+---------+----------+
 | cigar_start | cigar_end | perc_cov | perc_id | perc_pos |
 ...-+-------------+-----------+----------+---------+----------+
 | 1 | 77 | 30 | 79 | 87 |
 | 91 | 167 | 38 | 79 | 87 |
 ...-+-------------+-----------+----------+---------+----------+

refer to the two homologue sequences defined by the homology.homology_id 296648. The gene corresponding to the first sequence can be retrieved using the member.member_id 685 and the corresponding peptide using the member.member_id 722. Gene and peptide sequence of the second homologue can retrieved in the same way.

family

Contains all the group homologies found. There are several family_member entries for each family entry.

family table description
Field	Type	Null	Key	Default	Extra	Description
family_id	int(10) unsigned		PRI	NULL	auto_increment	internal unique ID
stable_id	varchar(40)		UNI			stable family ID
method_link_species_set_id	int(10) unsigned			0		external reference to method_link_species_set.method_link_species_set_id
description	varchar(255)	YES	MUL	NULL		description of the family as found using the Longest Common String (LCS) of the descriptions of the member proteins.
description_score	double	YES		NULL		Scores the accuracy of the annotation (max. 100)

E.g. the row

mysql> select * from family WHERE family_id = 13252;
+-----------+-----------------+----------------------------+-...
| family_id | stable_id | method_link_species_set_id |
+-----------+-----------------+----------------------------+-...
| 13252 | ENSF00000013252 | 68 |
+-----------+-----------------+----------------------------+-...

 ...-+---------------------------------+-------------------+
 | description | description_score |
 ...-+---------------------------------+-------------------+
 | NADH PYROPHOSPHATASE EC_3.6.1.- | 100 |
 ..._+---------------------------------+-------------------+

defines a family homology found which stable ID is ENSF00000013252 and the description of this family is "NADH PYROPHOSPHATASE EC_3.6.1.-" scored with a 100.

family_member

Contains the proteins corresponding to protein family relationship found. There are several family_member entries for each family entry.

family_member table description
Field	Type	Null	Key	Default	Description
family_id	int(10) unsigned		PRI	0	external reference to family.family_id
member_id	int(10) unsigned		PRI	0	external reference to member.member_id
cigar_line	mediumtext	YES		NULL	internal description of the multiple alignment (see homology_member table)

E.g. the rows

mysql> SELECT * FROM family_member WHERE family_id = 13252;
+-----------+-----------+------------+
| family_id | member_id | cigar_line |
+-----------+-----------+------------+
| 13252 | 69013 | NULL |
| 13252 | 69028 | 26D348M |
| 13252 | 217503 | NULL |
| 13252 | 217511 | 374M |
| 13252 | 823691 | 26D348M |
+-----------+-----------+------------+

refer to the five members of the protein family 13252. The proteins can be retieved using the member_ids. The multiple alignment can be restored using the cigar_lines.

domain

Not used by now

domain table description
Field	Type	Null	Key	Default	Extra	Description
domain_id	int(10) unsigned		PRI	NULL	auto_increment	internal unique ID
stable_id	varchar(40)
method_link_species_set	int(10) unsigned			0		external reference to method_link_species_set.method_link_species_set_id
description	varchar(255)	YES		NULL

domain_member

Not used by now

domain_member table description
Field	Type	Null	Key	Default
domain_id	int(10) unsigned		MUL	0
member_id	int(10) unsigned		MUL	0
member_start	int(10)	YES		NULL
member_end	int(10)	YES		NULL

protein_tree_node

Contains the data structure of each tree.

protein_tree_node table description
Field	Type	Key	Default	Extra
node_id	int(10) unsigned	PRI	NULL	auto_increment
parent_id	int(10) unsigned	MUL	0
root_id	int(10) unsigned	MUL	0
left_index	int(10)	MUL	0
right_index	int(10)	MUL	0
distance_to_parent	double		1

When parent_id=0, then node_id is the root. In our data structure, all trees are arbitrarily connected to the same root.
This allows storing in the same database the data from independant tree building analysis. Hence the "biological root"
of each tree are the children nodes of the main root. For instance the following SQL query

> select count(*) from protein_tree_node where parent_id=root_id;
+----------+
| count(*) |
+----------+
| 25673 |
+----------+
1 row in set (1.03 sec)

will return the number of independant trees stored in the database.

protein_tree_member

Contains the data about the leaf (joining the table with a member_id from the member table, all ENSEMBLPEP members)
present in each tree. The method_link_species_set should link to an entry that has a method_link.type="PROTEIN_TREES". For the aligment, the cigar_* column hold the needed information to rebuild the alignment using the Perl API, as it is done in the family/family_member tables. As we store global multiple aligment you will notice that cigar_start and cigar_end are always NULL. The table however is able to local multiple alignment, then cigar_start/cigar_end should be set to their correct value.

protein_tree_member table description
Field	Type	Null	Key	Default	Extra	Description
node_id	int(10) unsigned		PRI	0	auto_increment	external reference to protein_tree_node.node_id
member_id	int(10) unsigned		MUL	0		external reference to member.member_id. Should always link to a protein (usually member.source_name='ENSEMBLPEP')
method_link_species_set_id	int(10) unsigned			0		external reference to method_link_species_set.method_link_species_set_id
cigar_line	mediumtext	YES		NULL
cigar_start	int(10)	YES		NULL
cigar_end	int(10)	YES		NULL

protein_tree_tag

Contains several tag/value data attached to node_ids

protein_tree_tag table description
Field	Type	Null	Key	Default	Description
node_id	int(10) unsigned		MUL	0	external reference to protein_tree_node.node_id
tag	varchar(50)	YES	MUL	NULL
value	mediumtext	YES		NULL

.

Database Schema

Introduction

List of Tables

General Tables

meta

ncbi_taxa_node

ncbi_taxa_name

genome_db

species_set

method_link

method_link_species_set

Genomic Alignment Tables

dnafrag

genomic_align_block

genomic_align

genomic_align_group

synteny_region

dnafrag_region

Tables For Orthologues and Protein Clusters

member

sequence

analysis

analysis_description

peptide_align_feature

homology

homology_member

family

family_member

domain

domain_member

protein_tree_node

protein_tree_member

protein_tree_tag

GermOnline