Perl API Tutorial

Introduction

This tutorial is an introduction to the Ensembl Variation API. Knowledge of the Ensembl Core API and of the concepts and conventions in the Ensembl Core API tutorial is assumed. Documentation about the Variation database schema is available at http://cvsweb.sanger.ac.uk/cgi-bin/cvsweb.cgi/ensembl-variation/schema/ and, while not necessary for this tutorial, an understanding of the database tables may help, as many of the adaptor modules are table-specific.

Code Conventions (and unconventions)

Refer to the Ensembl core tutorial for a good description of the coding conventions normally used in Ensembl. Please note that there may be exceptions to these rules in variation.

Connecting to an Ensembl variation database

There are two ways to connect to the EnsEMBL Variation database. The old way uses the Bio::EnsEMBL::Variation::DBSQL::DBAdaptor explicitly. The new one uses the Bio::EnsEMBL::Registry module, which can read either a global or a specific configuration file.

Explicitly, using the Bio::EnsEMBL::Variation::DBSQL::DBAdaptor

Ensembl variation data, like Ensembl core data, is stored in a MySQL relational database. To access a variation database you need to connect to it. This is done in exactly the same way as connecting to an Ensembl core database, but using a Variation-specific DBAdaptor.

use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;

my $host = 'ensembldb.ensembl.org';
my $user = 'anonymous';
my $dbname = 'mus_musculus_variation_38_35';

my $dbVariation = Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new(
    -host   => $host,
    -user   => $user,
    -dbname => $dbname);

As with an Ensembl core connection, in addition to the parameters provided above, the optional port, driver and pass parameters can be used to specify the TCP connection port, the type of database driver and the password respectively. These values have sensible defaults and can often be omitted.
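As a minimal sketch of the optional parameters described above, the connection below spells out the port and driver explicitly. The values 3306 and 'mysql' are assumptions about the public server (3306 is the port used in the registry configuration example in this tutorial). The anonymous account is assumed to need no password, so -pass is left commented out with a hypothetical value.

```perl
use strict;
use warnings;

use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;

# Same connection as above, with the optional parameters written out.
my $dbVariation = Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new(
    -host   => 'ensembldb.ensembl.org',
    -user   => 'anonymous',
    -dbname => 'mus_musculus_variation_38_35',
    -port   => 3306,       # assumed TCP port of the server
    -driver => 'mysql',    # assumed database driver
    # -pass => 'secret',   # hypothetical; only needed for password-protected accounts
);
```

Running it requires the Ensembl API on your Perl library path and network access to the server.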

Implicitly, using the Bio::EnsEMBL::Registry configuration file (recommended)

You will need to have a registry configuration file set up. By default, the registry reads the file named by the ENSEMBL_REGISTRY environment variable or, if that is not found, the file named .ensembl_init in your home directory. Additionally, it is possible to use a specific file (see perldoc Bio::EnsEMBL::Registry or later in this document for some examples on how to use a different file). An example of such a file can be found in ensembl/modules/Bio/EnsEMBL/Utils/ensembl_init.example, and a slightly modified copy of it is shown below.

# Example of configuration file used by Bio::EnsEMBL::Registry::load_all
# method to store/register all kind of Adaptors.

use strict;
use Bio::EnsEMBL::Utils::ConfigRegistry;
use Bio::EnsEMBL::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;

my @aliases;

new Bio::EnsEMBL::DBSQL::DBAdaptor(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous',
    -port => 3306,
    -species => 'Homo sapiens',
    -group => 'core',
    -dbname => 'homo_sapiens_core_38_36');

@aliases = ('H_Sapiens', 'homo sapiens', 'Homo_Sapiens','Homo_sapiens', 'Homo', 'homo', 'human');

Bio::EnsEMBL::Utils::ConfigRegistry->add_alias(
    -species => "Homo sapiens",
    -alias => \@aliases);

new Bio::EnsEMBL::Variation::DBSQL::DBAdaptor(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous',
    -port => 3306,
    -species => 'human',
    -dbname => 'homo_sapiens_variation_38_36');

1;

In this configuration file you can list all the parameters needed to connect to a variation database. A variation database contains information about one particular species. However, to have full functionality you will need to connect to the corresponding core database as well. The registry configuration file gives you the freedom to list connection parameters for all the Ensembl core databases you might need to access alongside Ensembl variation data (in our example only one species, human, is listed). All this information is then stored in a single central place that is easy to maintain. A database adaptor is accessed using either the main species name (specified by the -species parameter) or one of the aliases listed in the @aliases array. There is no need to remember the complete database name; any one of the aliases is enough.
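A minimal sketch of that alias-based access, assuming the configuration above has been saved to a file (the file name below is hypothetical) and that the registry accepts the group and adaptor type names shown:

```perl
use strict;
use warnings;

use Bio::EnsEMBL::Registry;

# Hypothetical name of the file holding the configuration shown above.
my $config_file = 'my_registry.conf';

# Read the configuration file and register every adaptor it defines.
Bio::EnsEMBL::Registry->load_all($config_file);

# Any registered alias ('human', 'Homo', ...) can stand in for the database name.
my $va_adaptor    = Bio::EnsEMBL::Registry->get_adaptor('human', 'variation', 'variation');
my $slice_adaptor = Bio::EnsEMBL::Registry->get_adaptor('human', 'core', 'slice');

print "Got a ", ref($va_adaptor), " and a ", ref($slice_adaptor), "\n";
```

The three arguments to get_adaptor are the species (or alias), the database group, and the kind of object the adaptor fetches.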

Another way to use the registry, without any configuration file, is the load_registry_from_db method. It is useful when you want the latest databases and do not remember their names:

Bio::EnsEMBL::Registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous');

This loads into the registry the versions of the Ensembl databases found on that database instance that match your software release, and also adds a set of standard aliases.
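Because the database names are chosen for you in this mode, it can be useful to check which database was actually registered. The sketch below assumes that 'human' is among the standard aliases added and that the server hosts a human variation database matching the installed API release.

```perl
use strict;
use warnings;

use Bio::EnsEMBL::Registry;

Bio::EnsEMBL::Registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous');

# Look up the database adaptor by alias and group, then report its database name.
my $dbVariation = Bio::EnsEMBL::Registry->get_DBAdaptor('human', 'variation');
die "No human variation database was registered\n" unless defined $dbVariation;

print "Using database ", $dbVariation->dbc->dbname, "\n";
```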

Below is a non-exhaustive list of the Ensembl variation adaptors that are most often used:

IndividualAdaptor to fetch Bio::EnsEMBL::Variation::Individual objects
LDFeatureContainerAdaptor to fetch Bio::EnsEMBL::Variation::LDFeatureContainer objects
PopulationAdaptor to fetch Bio::EnsEMBL::Variation::Population objects
ReadCoverageAdaptor to fetch Bio::EnsEMBL::Variation::ReadCoverage objects
TranscriptVariationAdaptor to fetch Bio::EnsEMBL::Variation::TranscriptVariation objects
VariationAdaptor to fetch Bio::EnsEMBL::Variation::Variation objects
VariationFeatureAdaptor to fetch Bio::EnsEMBL::Variation::VariationFeature objects

Only some of these adaptors are used in this tutorial, where they are illustrated through commented Perl scripts.

Variations in the genome

One of the most important uses of the variation database is to get all variations in a certain region of the genome. Below is a simple commented Perl script that illustrates how to get all variations on chromosome 25 in zebrafish.

use strict;
use warnings;

use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;
use Bio::EnsEMBL::DBSQL::DBAdaptor;

# connect to Variation database
my $dbVariation = Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new
  (-host   => 'ensembldb.ensembl.org',
   -dbname => 'danio_rerio_variation_37_5d',
   -species => 'zebrafish',
   -group   => 'variation',
   -user   => 'anonymous');

# connect to Core database
my $dbCore = Bio::EnsEMBL::DBSQL::DBAdaptor->new
  (-host   => 'ensembldb.ensembl.org',
   -dbname => 'danio_rerio_core_37_5d',
   -species => 'zebrafish',
   -group   => 'core',
   -user   => 'anonymous');


my $slice_adaptor = $dbCore->get_SliceAdaptor(); #get the database adaptor for Slice objects
my $slice = $slice_adaptor->fetch_by_region('chromosome',25); #get chromosome 25 in zebrafish
my $vf_adaptor = $dbVariation->get_VariationFeatureAdaptor(); #get adaptor to VariationFeature object
my $vfs = $vf_adaptor->fetch_all_by_Slice($slice); #return ALL variations defined in $slice
foreach my $vf (@{$vfs}){
    print "Variation: ", $vf->variation_name, " with alleles ", $vf->allele_string, " in chromosome ", $slice->seq_region_name, " and position ", $vf->start,"-",$vf->end,"\n";
}

exit 0;

Consequence type of variations

Another common use of the variation database is to get the effects that variations have on a transcript. The example below shows how to get all variations in a particular chicken transcript and the effect each variation has on that transcript.

use strict;
use warnings;

use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;
use Bio::EnsEMBL::DBSQL::DBAdaptor;

# connect to Variation database
my $dbVariation = Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new
  (-host   => 'ensembldb.ensembl.org',
   -dbname => 'gallus_gallus_variation_37_1m',
   -species => 'chicken',
   -group   => 'variation',
   -user   => 'anonymous');

# connect to Core database
my $dbCore = Bio::EnsEMBL::DBSQL::DBAdaptor->new
  (-host   => 'ensembldb.ensembl.org',
   -dbname => 'gallus_gallus_core_37_1m',
   -species => 'chicken',
   -group   => 'core',
   -user   => 'anonymous');

my $stable_id = 'ENSGALT00000007843'; #this is the stable_id of a chicken transcript
my $transcript_adaptor = $dbCore->get_TranscriptAdaptor(); #get the adaptor to get the Transcript from the database
my $transcript = $transcript_adaptor->fetch_by_stable_id($stable_id); #get the Transcript object
my $trv_adaptor = $dbVariation->get_TranscriptVariationAdaptor; #get the adaptor to get TranscriptVariation objects

my $trvs = $trv_adaptor->fetch_all_by_Transcripts([$transcript]); #get ALL effects of Variations in the Transcript

foreach my $tv (@{$trvs}){
    print "SNP :",$tv->variation_feature->variation_name, " has a consequence/s ", join(",",@{$tv->consequence_type}), " in transcript ", $stable_id, "\n";
    #print the name of the variation and the effect (consequence_type) of the variation in the Transcript
}
exit 0;

Variations, Flanking sequences and Genes

Below is a complete example of how to use the variation API to retrieve different data from the database. In this example we want to get, for a list of variation names, information about alleles, flanking sequences, locations, effects of variations in transcripts, position in the transcript (in case it has a coding effect) and genes containing the transcripts.

use strict;
use warnings;

use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;
use Bio::EnsEMBL::DBSQL::DBAdaptor;

# connect to Variation database
my $dbVar = Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new
  (-host   => 'ensembldb.ensembl.org',
   -dbname => 'homo_sapiens_variation_37_35j',
   -species => 'human',
   -group   => 'variation',
   -user   => 'anonymous');

# connect to Core database
my $dbCore = Bio::EnsEMBL::DBSQL::DBAdaptor->new
  (-host   => 'ensembldb.ensembl.org',
   -dbname => 'homo_sapiens_core_37_35j',
   -species => 'human',
   -group   => 'core',
   -user   => 'anonymous');

my $va_adaptor = $dbVar->get_VariationAdaptor; #get the different adaptors for the different objects needed
my $vf_adaptor = $dbVar->get_VariationFeatureAdaptor;
my $gene_adaptor = $dbCore->get_GeneAdaptor;  
my @rsIds = qw(rs1367827 rs1367830);
foreach my $rsId (@rsIds){
  # get Variation object
  my $var = $va_adaptor->fetch_by_name($rsId); #get the Variation from the database using the name
  &get_VariationFeatures($var);
}


sub get_VariationFeatures{
    my $var = shift;
    # get all VariationFeature objects: might be more than 1 !!!
    foreach my $vf (@{$vf_adaptor->fetch_all_by_Variation($var)}){
        print $vf->variation_name(),","; # print rsID
        print $vf->allele_string(),","; # print alleles
        print join(",",@{$vf->get_consequence_type()}),","; # print consequenceType
        print substr($var->five_prime_flanking_seq,-10) , "[",$vf->allele_string,"]"; # print the last 10 bases of the 5' flank, then the alleles
        print substr($var->three_prime_flanking_seq,0,10), ","; # print the first 10 bases of the 3' flank
        print $vf->seq_region_name, ":", $vf->start,"-",$vf->end; # print position in Ref in format Chr:start-end
        &get_TranscriptVariations($vf); # get Transcript information
    }
}

sub get_TranscriptVariations{
    my $vf = shift;
    # get all TranscriptVariation objects: might be more than 1 !!!
    my $transcript_variations = $vf->get_all_TranscriptVariations; #get ALL the effects of the variation in different Transcripts
    if (defined $transcript_variations){
        foreach my $tv (@{$transcript_variations}){
            print ",", $tv->pep_allele_string if (defined $tv->pep_allele_string); # the AA change, but only if it is in a coding region
            my $gene = $gene_adaptor->fetch_by_transcript_id($tv->transcript->dbID);
            print ",",$gene->stable_id if (defined $gene->external_name); # the gene stable ID, printed only when the gene has an external name
        }
    }
    print "\n";
}
exit 0;
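The flanking-sequence output in get_VariationFeatures relies on two substr calls: a negative offset takes bases from the end of the 5' flank, and an offset of 0 takes bases from the start of the 3' flank. The stand-alone sketch below shows the same formatting on made-up sequences. The helper name format_flanked_alleles and the example data are hypothetical and not part of the Ensembl API.

```perl
use strict;
use warnings;

# Build "<last $width bases of 5' flank>[alleles]<first $width bases of 3' flank>".
sub format_flanked_alleles {
    my ($five_prime, $allele_string, $three_prime, $width) = @_;
    $width = 10 unless defined $width;

    # Clamp so that a flank shorter than $width is returned whole.
    my $left  = length($five_prime) > $width ? substr($five_prime, -$width) : $five_prime;
    my $right = substr($three_prime, 0, $width);

    return $left . '[' . $allele_string . ']' . $right;
}

# Made-up flanks and alleles, for illustration only.
print format_flanked_alleles('ACGTACGTACGTAAAC', 'A/G', 'TTGCATGCATGCAT'), "\n";
# prints GTACGTAAAC[A/G]TTGCATGCAT
```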

LD calculation

To use the LD calculation you need to compile the C source code and install the IPC::Run module. There is more information on how to do this in the "Use LD calculation" documentation. The example below calculates the LD in a region of human chromosome 6 for a HapMap population, but only prints pairs of variations that are in high LD.

use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;
use Bio::EnsEMBL::DBSQL::DBAdaptor;

# connect to Variation database
my $dbVariation = Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new
  (-host   => 'ensembldb.ensembl.org',
   -dbname => 'homo_sapiens_variation_37_35j',
   -species => 'human',
   -group   => 'variation',
   -user   => 'anonymous');

# connect to Core database
my $dbCore = Bio::EnsEMBL::DBSQL::DBAdaptor->new
  (-host   => 'ensembldb.ensembl.org',
   -dbname => 'homo_sapiens_core_37_35j',
   -species => 'human',
   -group   => 'core',
   -user   => 'anonymous');

my $chr = 6;  #defining the region in chromosome 6
my $start = 25_834_000;
my $end = 25_854_000;
my $population_name = 'CSHL-HAPMAP:HapMap-CEU'; #we only want LD in this population

my $slice_adaptor = $dbCore->get_SliceAdaptor(); #get adaptor for Slice object
my $slice = $slice_adaptor->fetch_by_region('chromosome',$chr,$start,$end); #get slice of the region


my $population_adaptor = $dbVariation->get_PopulationAdaptor; #get adaptor for Population object
my $population = $population_adaptor->fetch_by_name($population_name); #get population object from database

my $ldFeatureContainerAdaptor = $dbVariation->get_LDFeatureContainerAdaptor; #get adaptor for LDFeatureContainer object
my $ldFeatureContainer = $ldFeatureContainerAdaptor->fetch_by_Slice($slice,$population); #retrieve all LD values in the region

foreach my $r_square (@{$ldFeatureContainer->get_all_r_square_values}){
    if ($r_square->{r2} > 0.8){ #only print high LD, where high is defined as r2 > 0.8
        print "High LD between variations ", $r_square->{variation1}->variation_name,"-",$r_square->{variation2}->variation_name, "\n";
    }
}
exit 0;
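For reference, r2 is conventionally defined from haplotype and allele frequencies as D squared divided by pA(1-pA)pB(1-pB), where D = pAB - pA*pB. The stand-alone sketch below implements that textbook definition only. It is not how the API computes its values (that is done by the compiled C code mentioned above), and the function name r_squared is hypothetical.

```perl
use strict;
use warnings;

# Textbook r2 between two biallelic loci.
#   $p_ab : frequency of the haplotype carrying allele A at locus 1 and allele B at locus 2
#   $p_a  : frequency of allele A at locus 1
#   $p_b  : frequency of allele B at locus 2
sub r_squared {
    my ($p_ab, $p_a, $p_b) = @_;
    my $d     = $p_ab - $p_a * $p_b;                      # disequilibrium coefficient D
    my $denom = $p_a * (1 - $p_a) * $p_b * (1 - $p_b);
    return undef if $denom == 0;                          # undefined when a locus is monomorphic
    return ($d * $d) / $denom;
}

printf "%.2f\n", r_squared(0.50, 0.5, 0.5);   # prints 1.00 (complete LD)
printf "%.2f\n", r_squared(0.25, 0.5, 0.5);   # prints 0.00 (no association)
printf "%.2f\n", r_squared(0.40, 0.5, 0.5);   # prints 0.36 (below the 0.8 cut-off used above)
```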

Specific strain information

With the arrival of new technologies, one of the newer features of the variation API is the possibility of working with a specific strain as if it were the reference, and comparing it against others. In the example below we create a StrainSlice object for a mouse exon and compare it against the reference exon.

use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;
use Bio::EnsEMBL::DBSQL::DBAdaptor;

# connect to Variation database
my $dbVariation = Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new
  (-host   => 'ensembldb.ensembl.org',
   -dbname => 'mus_musculus_variation_37_34e',
   -species => 'mouse',
   -group   => 'variation',
   -user   => 'anonymous');

# connect to Core database
my $dbCore = Bio::EnsEMBL::DBSQL::DBAdaptor->new
  (-host   => 'ensembldb.ensembl.org',
   -dbname => 'mus_musculus_core_37_34e',
   -species => 'mouse',
   -group   => 'core',
   -user   => 'anonymous');

my $exon_stable_id = 'ENSMUSE00000554526'; #mouse exon stable_id
my $strain_name = "129X1/SvJ"; #mouse strain name

my $exon_adaptor = $dbCore->get_ExonAdaptor; #get adaptor for Exon objects

my $exon = $exon_adaptor->fetch_by_stable_id($exon_stable_id); #get exon object
print "Reference sequence: ", substr($exon->seq->seq,0,5), "...", substr($exon->seq->seq,82,5),"...",substr($exon->seq->seq,90,5),"...", "\n"; #print exon sequence

my $strainSlice = $exon->feature_Slice->get_by_strain($strain_name); #get strainSlice for the exon
#print the strain sequence for that exon
print "Strain sequence:    ", substr($strainSlice->seq,0,5), "...", substr($strainSlice->seq,82,5),"...",substr($strainSlice->seq,90,5),"...", "\n";
my $afs = $strainSlice->get_all_AlleleFeatures_Slice(); #get AlleleFeature between reference and strain sequence in the exon
foreach my $af (@{$afs}){
  print "Allele Feature start-end-allele_string: ",$af->start,"-",$af->end,"-",$af->allele_string,"\n";
}
exit 0;
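To make the comparison concrete, the stand-alone sketch below lists the positions at which two equal-length sequences differ, which is roughly the information an AlleleFeature carries for a substitution. It is an illustration on made-up sequences only: the API reads allele features from the variation database rather than deriving them by comparing strings, and the helper name sequence_differences is hypothetical.

```perl
use strict;
use warnings;

# Return a reference to a list of hashes, one per position (1-based)
# at which the strain sequence differs from the reference sequence.
sub sequence_differences {
    my ($reference, $strain) = @_;
    die "sequences differ in length\n" unless length($reference) == length($strain);

    my @diffs;
    for my $i (0 .. length($reference) - 1) {
        my $ref_base    = substr($reference, $i, 1);
        my $strain_base = substr($strain,    $i, 1);
        next if $ref_base eq $strain_base;
        push @diffs, { position => $i + 1, reference => $ref_base, strain => $strain_base };
    }
    return \@diffs;
}

# Made-up sequences, for illustration only.
my $diffs = sequence_differences('ACGTACGT', 'ACGAACTT');
foreach my $diff (@{$diffs}) {
    print "$diff->{position}: $diff->{reference} -> $diff->{strain}\n";
}
# prints:
# 4: T -> A
# 7: G -> T
```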

Further help

For additional information or help, mail the ensembl-dev mailing list. You will need to subscribe to this mailing list to use it. More information on subscribing to any Ensembl mailing list is available from the Ensembl Contacts page.
