This tutorial is an introduction to the Ensembl Variation API. Knowledge of the Ensembl Core API and of the concepts and conventions in the Ensembl Core API tutorial is assumed. Documentation about the Variation database schema is available at http://cvsweb.sanger.ac.uk/cgi-bin/cvsweb.cgi/ensembl-variation/schema/. An understanding of the database tables is not necessary for this tutorial, but it may help, as many of the adaptor modules are table specific.
Refer to the Ensembl core tutorial for a good description of the coding conventions normally used in Ensembl. Please note that there may be exceptions to these rules in variation.
There are two ways to connect to the EnsEMBL Variation database. The old way uses the Bio::EnsEMBL::Variation::DBSQL::DBAdaptor explicitly. The new one uses the Bio::EnsEMBL::Registry module, which can read either a global or a specific configuration file.
Like Ensembl core data, Ensembl variation data is stored in a MySQL relational database. To access a variation database, you first need to connect to it. This is done in exactly the same way as connecting to an Ensembl core database, but using a Variation-specific DBAdaptor.
use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;

my $host   = 'ensembldb.ensembl.org';
my $user   = 'anonymous';
my $dbname = 'mus_musculus_variation_38_35';

my $dbVariation = Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new(
    -host   => $host,
    -user   => $user,
    -dbname => $dbname,
);
As with an Ensembl core connection, in addition to the parameters provided above, the optional port, driver and pass parameters can be used to specify the TCP connection port, the type of database driver and the password respectively. These values have sensible defaults and can often be omitted.
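The optional parameters can be passed in the same argument list as the required ones. The following is a minimal sketch of one way to keep them in one place. The helper connection_args is hypothetical (it is not part of the Ensembl API), and the default port 3306 and driver 'mysql' are assumptions made for illustration, not values stated in this tutorial.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the Ensembl API): merges caller-supplied
# connection settings over a set of assumed defaults and returns a flat
# argument list in the -key => value style used by the DBAdaptor constructors.
sub connection_args {
    my (%opts) = @_;
    my %args = (
        -host   => 'ensembldb.ensembl.org',
        -user   => 'anonymous',
        -port   => 3306,       # assumed default MySQL port
        -driver => 'mysql',    # assumed default driver
        %opts,                 # caller-supplied values override the defaults
    );
    # A missing or empty password is left out of the argument list.
    delete $args{-pass} unless defined $args{-pass} && length $args{-pass};
    return %args;
}

my %args = connection_args(-dbname => 'mus_musculus_variation_38_35');
print "$_ => $args{$_}\n" for sort keys %args;

# With the Ensembl modules installed, the same list can be handed to the
# constructor:
#   my $dbVariation = Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new(%args);
```

Keeping the settings in one hash makes it easy to reuse the same host, user and port for both the core and the variation connection, changing only -dbname.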
To use the registry, you will need to have a registry configuration file set up. By default, the registry reads the file defined by the ENSEMBL_REGISTRY environment variable or, if that variable is not set, the file named .ensembl_init in your home directory. It is also possible to use a specific file (see perldoc Bio::EnsEMBL::Registry or later in this document for some examples of how to use a different file). An example of such a file can be found in ensembl/modules/Bio/EnsEMBL/Utils/ensembl_init.example, and a slightly modified copy of it is shown below.
# Example of configuration file used by Bio::EnsEMBL::Registry::load_all
# method to store/register all kind of Adaptors.

use strict;
use Bio::EnsEMBL::Utils::ConfigRegistry;
use Bio::EnsEMBL::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;

my @aliases;

Bio::EnsEMBL::DBSQL::DBAdaptor->new(
    -host    => 'ensembldb.ensembl.org',
    -user    => 'anonymous',
    -port    => 3306,
    -species => 'Homo sapiens',
    -group   => 'core',
    -dbname  => 'homo_sapiens_core_38_36',
);

@aliases = ('H_Sapiens', 'homo sapiens', 'Homo_Sapiens', 'Homo_sapiens', 'Homo', 'homo', 'human');

Bio::EnsEMBL::Utils::ConfigRegistry->add_alias(
    -species => "Homo sapiens",
    -alias   => \@aliases,
);

Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new(
    -host    => 'ensembldb.ensembl.org',
    -user    => 'anonymous',
    -port    => 3306,
    -species => 'human',
    -dbname  => 'homo_sapiens_variation_38_36',
);

1;
In this configuration file, you can list all the parameters needed to connect to a variation database. A variation database contains information about one particular species. However, to have full functionality, you will need to connect to the core database as well. The registry configuration file gives you the freedom to list connection parameters for all the Ensembl core databases you might need to access in relation to Ensembl variation data (in our example only one species, human, is listed). All this information is then stored in a single central place that is easy to maintain (modify and update). A database adaptor is accessed using either the main species alias (specified by the -species parameter) or one of the aliases specified in the @aliases array. There is no need to remember the complete database name; one of the aliases is enough.
Another way to use the registry, without any configuration file, is available if you only want the latest databases and do not remember their names. Use the following method:

use Bio::EnsEMBL::Registry;

Bio::EnsEMBL::Registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous',
);

This loads into the registry the Ensembl databases found on that database instance that match the software release you are using, and also adds a set of standard aliases.
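Once the registry is loaded, adaptors can be requested from it by species alias instead of through an explicit DBAdaptor. The sketch below assumes the Ensembl Core and Variation modules are installed and that the public server is reachable; it uses Bio::EnsEMBL::Registry->get_adaptor, whose arguments are the species alias, the database group and the adaptor type. The wrapper fetch_variation_adaptors is a hypothetical name introduced only for this illustration.

```perl
use strict;
use warnings;

# Load the registry only if the Ensembl API is installed, so that the script
# ends with a readable message rather than a compile error.
my $have_api = eval { require Bio::EnsEMBL::Registry; 1 };

# Hypothetical wrapper: loads the registry from the public server and returns
# the Variation and VariationFeature adaptors for the given species alias.
sub fetch_variation_adaptors {
    my ($species) = @_;
    my $registry = 'Bio::EnsEMBL::Registry';
    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous',
    );
    # Arguments are: species alias, database group, adaptor type.
    my $va  = $registry->get_adaptor($species, 'variation', 'variation');
    my $vfa = $registry->get_adaptor($species, 'variation', 'variationfeature');
    return ($va, $vfa);
}

if ($have_api) {
    my ($va, $vfa) = eval { fetch_variation_adaptors('human') };
    if ($@) {
        warn "Could not load the registry: $@";
    }
    else {
        print "Got adaptors: ", ref($va), " and ", ref($vfa), "\n";
    }
}
else {
    print "Bio::EnsEMBL::Registry is not installed; nothing to do.\n";
}
```

The adaptors obtained this way can be used in place of the ones returned by the get_...Adaptor calls on an explicit DBAdaptor in the scripts of this tutorial.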
Below is a non-exhaustive list of the Ensembl variation adaptors that are most often used:
IndividualAdaptor to fetch Bio::EnsEMBL::Variation::Individual objects
LDFeatureContainerAdaptor to fetch Bio::EnsEMBL::Variation::LDFeatureContainer objects
PopulationAdaptor to fetch Bio::EnsEMBL::Variation::Population objects
ReadCoverageAdaptor to fetch Bio::EnsEMBL::Variation::ReadCoverage objects
TranscriptVariationAdaptor to fetch Bio::EnsEMBL::Variation::TranscriptVariation objects
VariationAdaptor to fetch Bio::EnsEMBL::Variation::Variation objects
VariationFeatureAdaptor to fetch Bio::EnsEMBL::Variation::VariationFeature objects
Only some of these adaptors will be used for illustration in this tutorial, through commented Perl scripts.
One of the most important uses of the variation database is to get all variations in a certain region of the genome. Below is a simple commented Perl script that illustrates how to get all variations on chromosome 25 in zebrafish.
use strict;
use warnings;

use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;
use Bio::EnsEMBL::DBSQL::DBAdaptor;

# connect to Variation database
my $dbVariation = Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new(
    -host    => 'ensembldb.ensembl.org',
    -dbname  => 'danio_rerio_variation_37_5d',
    -species => 'zebrafish',
    -group   => 'variation',
    -user    => 'anonymous',
);

# connect to Core database
my $dbCore = Bio::EnsEMBL::DBSQL::DBAdaptor->new(
    -host    => 'ensembldb.ensembl.org',
    -dbname  => 'danio_rerio_core_37_5d',
    -species => 'zebrafish',
    -group   => 'core',
    -user    => 'anonymous',
);

my $slice_adaptor = $dbCore->get_SliceAdaptor();                # get the database adaptor for Slice objects
my $slice = $slice_adaptor->fetch_by_region('chromosome', 25);  # get chromosome 25 in zebrafish

my $vf_adaptor = $dbVariation->get_VariationFeatureAdaptor();   # get adaptor to VariationFeature object
my $vfs = $vf_adaptor->fetch_all_by_Slice($slice);              # return ALL variations defined in $slice

foreach my $vf (@{$vfs}) {
    print "Variation: ", $vf->variation_name,
          " with alleles ", $vf->allele_string,
          " in chromosome ", $slice->seq_region_name,
          " and position ", $vf->start, "-", $vf->end, "\n";
}

exit 0;
Another common use of the variation database is to get the effects that variations have on a transcript. The example below shows how to get all variations in a particular chicken transcript and the effect of each variation on that transcript.
use strict;
use warnings;

use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;
use Bio::EnsEMBL::DBSQL::DBAdaptor;

# connect to Variation database
my $dbVariation = Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new(
    -host    => 'ensembldb.ensembl.org',
    -dbname  => 'gallus_gallus_variation_37_1m',
    -species => 'chicken',
    -group   => 'variation',
    -user    => 'anonymous',
);

# connect to Core database
my $dbCore = Bio::EnsEMBL::DBSQL::DBAdaptor->new(
    -host    => 'ensembldb.ensembl.org',
    -dbname  => 'gallus_gallus_core_37_1m',
    -species => 'chicken',
    -group   => 'core',
    -user    => 'anonymous',
);

my $stable_id = 'ENSGALT00000007843';    # this is the stable_id of a chicken transcript

my $transcript_adaptor = $dbCore->get_TranscriptAdaptor();               # get the adaptor to get the Transcript from the database
my $transcript = $transcript_adaptor->fetch_by_stable_id($stable_id);   # get the Transcript object

my $trv_adaptor = $dbVariation->get_TranscriptVariationAdaptor;         # get the adaptor to get TranscriptVariation objects
my $trvs = $trv_adaptor->fetch_all_by_Transcripts([$transcript]);       # get ALL effects of Variations in the Transcript

foreach my $tv (@{$trvs}) {
    # print the name of the variation and its effect (consequence_type) in the Transcript
    print "SNP: ", $tv->variation_feature->variation_name,
          " has consequence(s) ", join(",", @{$tv->consequence_type}),
          " in transcript ", $stable_id, "\n";
}

exit 0;
Below is a complete example of how to use the variation API to retrieve different data from the database. In this example we want to get, for a list of variation names, information about alleles, flanking sequences, locations, effects of the variations on transcripts, the amino acid change (when the variation has a coding effect) and the genes containing the transcripts.
use strict;
use warnings;

use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;
use Bio::EnsEMBL::DBSQL::DBAdaptor;

# connect to Variation database
my $dbVar = Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new(
    -host    => 'ensembldb.ensembl.org',
    -dbname  => 'homo_sapiens_variation_37_35j',
    -species => 'human',
    -group   => 'variation',
    -user    => 'anonymous',
);

# connect to Core database
my $dbCore = Bio::EnsEMBL::DBSQL::DBAdaptor->new(
    -host    => 'ensembldb.ensembl.org',
    -dbname  => 'homo_sapiens_core_37_35j',
    -species => 'human',
    -group   => 'core',
    -user    => 'anonymous',
);

# get the different adaptors for the different objects needed
my $va_adaptor   = $dbVar->get_VariationAdaptor;
my $vf_adaptor   = $dbVar->get_VariationFeatureAdaptor;
my $gene_adaptor = $dbCore->get_GeneAdaptor;

my @rsIds = qw(rs1367827 rs1367830);

foreach my $name (@rsIds) {
    # get the Variation object from the database using its name
    my $var = $va_adaptor->fetch_by_name($name);
    get_VariationFeatures($var);
}

sub get_VariationFeatures {
    my $var = shift;

    # get all VariationFeature objects: might be more than 1 !!!
    foreach my $vf (@{$vf_adaptor->fetch_all_by_Variation($var)}) {
        print $vf->variation_name(), ",";                         # print rsID
        print $vf->allele_string(), ",";                          # print alleles
        print join(",", @{$vf->get_consequence_type()}), ",";     # print consequence type
        print substr($var->five_prime_flanking_seq, -10),
              "[", $vf->allele_string, "]";                       # last 10 bases of the 5' flank, then the alleles
        print substr($var->three_prime_flanking_seq, 0, 10), ","; # first 10 bases of the 3' flank
        print $vf->seq_region_name, ":", $vf->start, "-", $vf->end; # print position in format Chr:start-end
        get_TranscriptVariations($vf);                            # get Transcript information
    }
}

sub get_TranscriptVariations {
    my $vf = shift;

    # get all TranscriptVariation objects: might be more than 1 !!!
    my $transcript_variations = $vf->get_all_TranscriptVariations;   # get ALL the effects of the variation in different Transcripts
    if (defined $transcript_variations) {
        foreach my $tv (@{$transcript_variations}) {
            # the amino acid change, but only if it is in a coding region
            print ",", $tv->pep_allele_string if (defined $tv->pep_allele_string);
            my $gene = $gene_adaptor->fetch_by_transcript_id($tv->transcript->dbID);
            # the stable ID of the gene, printed only when the gene has an external name
            print ",", $gene->stable_id if (defined $gene->external_name);
        }
    }
    print "\n";
}

exit 0;
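The flanking-sequence output in the script above is built with two substr calls. The sketch below isolates that step so it can be tried without a database connection. The helper format_flanks and the sequences passed to it are made up for illustration; only the substr logic mirrors the script.

```perl
use strict;
use warnings;

# Hypothetical helper: returns "<end of 5' flank>[alleles]<start of 3' flank>".
# $width is the number of flanking bases kept on each side (10 by default,
# as in the script above).
sub format_flanks {
    my ($five_prime, $alleles, $three_prime, $width) = @_;
    $width = 10 unless defined $width;

    # Keep the whole 5' flank when it is shorter than $width, so that the
    # negative offset never reaches past the start of the string.
    my $left = length($five_prime) > $width
             ? substr($five_prime, -$width)
             : $five_prime;
    my $right = substr($three_prime, 0, $width);

    return $left . '[' . $alleles . ']' . $right;
}

# Made-up sequences, not real flanking data.
print format_flanks('TTTTGGGGACGTACGTAC', 'A/G', 'CCATGGCCATAAAA'), "\n";
# prints ACGTACGTAC[A/G]CCATGGCCAT
```

In the real script the two flanks come from $var->five_prime_flanking_seq and $var->three_prime_flanking_seq, and the alleles from $vf->allele_string.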
To use the LD calculation, you need to compile the C source code and install a module called IPC::Run. More information on how to do this is available in the "Use LD calculation" document. The example below calculates the LD in a region of human chromosome 6 for a HapMap population, but only prints pairs of variations that are in high LD.
use strict;
use warnings;

use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;
use Bio::EnsEMBL::DBSQL::DBAdaptor;

# connect to Variation database
my $dbVariation = Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new(
    -host    => 'ensembldb.ensembl.org',
    -dbname  => 'homo_sapiens_variation_37_35j',
    -species => 'human',
    -group   => 'variation',
    -user    => 'anonymous',
);

# connect to Core database
my $dbCore = Bio::EnsEMBL::DBSQL::DBAdaptor->new(
    -host    => 'ensembldb.ensembl.org',
    -dbname  => 'homo_sapiens_core_37_35j',
    -species => 'human',
    -group   => 'core',
    -user    => 'anonymous',
);

# define the region in chromosome 6
my $chr   = 6;
my $start = 25_834_000;
my $end   = 25_854_000;

my $population_name = 'CSHL-HAPMAP:HapMap-CEU';    # we only want LD in this population

my $slice_adaptor = $dbCore->get_SliceAdaptor();                                 # get adaptor for Slice object
my $slice = $slice_adaptor->fetch_by_region('chromosome', $chr, $start, $end);   # get slice of the region

my $population_adaptor = $dbVariation->get_PopulationAdaptor;                    # get adaptor for Population object
my $population = $population_adaptor->fetch_by_name($population_name);           # get population object from database

my $ldFeatureContainerAdaptor = $dbVariation->get_LDFeatureContainerAdaptor;     # get adaptor for LDFeatureContainer object
my $ldFeatureContainer = $ldFeatureContainerAdaptor->fetch_by_Slice($slice, $population);   # retrieve all LD values in the region

foreach my $r_square (@{$ldFeatureContainer->get_all_r_square_values}) {
    if ($r_square->{r2} > 0.8) {    # only print high LD, where high is defined as r2 > 0.8
        print "High LD between variations ",
              $r_square->{variation1}->variation_name, "-",
              $r_square->{variation2}->variation_name, "\n";
    }
}

exit 0;
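The filtering step in the script above can be separated from the database access, which makes it easier to change the threshold or to sort the results. The following is a minimal sketch on mock records. It assumes each record is a hash reference with the keys r2, variation1 and variation2, as used in the script; here the two variations are plain placeholder strings rather than VariationFeature objects, and the helper high_ld_pairs is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: keeps the records whose r2 is above $threshold
# (0.8 by default, as in the script above) and returns them sorted from
# the highest r2 to the lowest.
sub high_ld_pairs {
    my ($records, $threshold) = @_;
    $threshold = 0.8 unless defined $threshold;
    my @high = grep { $_->{r2} > $threshold } @{$records};
    return [ sort { $b->{r2} <=> $a->{r2} } @high ];
}

# Mock records with placeholder names and made-up r2 values.
my $mock = [
    { variation1 => 'var_A', variation2 => 'var_B', r2 => 0.85 },
    { variation1 => 'var_A', variation2 => 'var_C', r2 => 0.40 },
    { variation1 => 'var_B', variation2 => 'var_C', r2 => 0.97 },
];

for my $pair (@{ high_ld_pairs($mock) }) {
    printf "High LD between variations %s-%s (r2 = %.2f)\n",
        @{$pair}{qw(variation1 variation2 r2)};
}
```

With real data, the array reference returned by get_all_r_square_values could be passed to such a helper, calling ->variation_name on variation1 and variation2 when printing.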
With the arrival of new technologies, one of the newer features of the variation API is the ability to work with a specific strain as if it were the reference, and to compare it against others. In the example below, we create a StrainSlice object for a mouse exon and compare it against the reference exon.
use strict;
use warnings;

use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;
use Bio::EnsEMBL::DBSQL::DBAdaptor;

# connect to Variation database
my $dbVariation = Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new(
    -host    => 'ensembldb.ensembl.org',
    -dbname  => 'mus_musculus_variation_37_34e',
    -species => 'mouse',
    -group   => 'variation',
    -user    => 'anonymous',
);

# connect to Core database
my $dbCore = Bio::EnsEMBL::DBSQL::DBAdaptor->new(
    -host    => 'ensembldb.ensembl.org',
    -dbname  => 'mus_musculus_core_37_34e',
    -species => 'mouse',
    -group   => 'core',
    -user    => 'anonymous',
);

my $exon_stable_id = 'ENSMUSE00000554526';    # mouse exon stable_id
my $strain_name    = "129X1/SvJ";             # mouse strain name

my $exon_adaptor = $dbCore->get_ExonAdaptor;                     # get adaptor for Exon objects
my $exon = $exon_adaptor->fetch_by_stable_id($exon_stable_id);   # get exon object

# print parts of the reference exon sequence
print "Reference sequence: ",
      substr($exon->seq->seq, 0, 5), "...",
      substr($exon->seq->seq, 82, 5), "...",
      substr($exon->seq->seq, 90, 5), "...", "\n";

my $strainSlice = $exon->feature_Slice->get_by_strain($strain_name);   # get strainSlice for the exon

# print the same parts of the strain sequence for that exon
print "Strain sequence: ",
      substr($strainSlice->seq, 0, 5), "...",
      substr($strainSlice->seq, 82, 5), "...",
      substr($strainSlice->seq, 90, 5), "...", "\n";

my $afs = $strainSlice->get_all_AlleleFeatures_Slice();   # get AlleleFeatures between reference and strain sequence in the exon
foreach my $af (@{$afs}) {
    print "Allele Feature start-end-allele_string: ",
          $af->start, "-", $af->end, "-", $af->allele_string, "\n";
}

exit 0;
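The script above prints three short windows of each sequence so that differences can be spotted by eye. As a complement, the sketch below compares two sequences base by base and lists every position where they differ. It works on plain strings and assumes both sequences have the same length (substitutions only); the helper differing_positions and the two sequences are made up for illustration and are not part of the Ensembl API.

```perl
use strict;
use warnings;

# Hypothetical helper: returns an array reference with one hash per position
# (1-based) at which the two equal-length sequences differ.
sub differing_positions {
    my ($reference, $strain) = @_;
    die "sequences differ in length\n"
        unless length($reference) == length($strain);

    my @diffs;
    for my $i (0 .. length($reference) - 1) {
        my $ref_base    = substr($reference, $i, 1);
        my $strain_base = substr($strain,    $i, 1);
        next if $ref_base eq $strain_base;
        push @diffs, {
            position  => $i + 1,
            reference => $ref_base,
            strain    => $strain_base,
        };
    }
    return \@diffs;
}

# Made-up sequences, not real exon data.
my $diffs = differing_positions('ACGTACGTAC', 'ACGAACGTCC');
for my $d (@{$diffs}) {
    print "Position $d->{position}: $d->{reference} -> $d->{strain}\n";
}
# prints:
# Position 4: T -> A
# Position 9: A -> C
```

With the objects from the script, the two strings would be $exon->seq->seq and $strainSlice->seq, provided the equal-length assumption holds for the strain in question.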
For additional information or help, mail the ensembl-dev mailing list. You will need to subscribe to this mailing list to use it. More information on subscribing to any Ensembl mailing list is available from the Ensembl Contacts page.
© 2024 Inserm. Hosted by genouest.org. This product includes software developed by Ensembl.