Once you have followed the installation instructions for the Ensembl core and functional genomics APIs (http://www.ensembl.org/info/software/api_installation.html), the next step is to set up the eFG-specific requirements. The list is exhaustive, and some items may not be necessary if you do not intend to use the full functionality of eFG. Install the following as required:
The eFG system uses a shell environment to set global variables and help perform common tasks. You will need to edit the .efg file accordingly:
    efg@bc-9-1-02>more ensembl-functgenomics/scripts/.efg
    #!/usr/local/bin/bash

    echo "Setting up the Ensembl Function Genomics environment..."

    ### ENV VARS ###

    #Prompt
    export PS1='efg@$PS1HOST>'

    #Code/Data Directories
    export SRC=~/src                                #Root source code directory. EDIT
    export EFG_SRC=$SRC/ensembl-functgenomics       #eFG source directory
    export EFG_SQL=$EFG_SRC/sql                     #eFG SQL
    export EFG_DATA=/your/data/dir/efg              #Data directory. EDIT
    export PATH=$PATH:$EFG_SRC/scripts              #eFG scripts directory
    export PERL5LIB=$EFG_SRC/modules:$PERL5LIB      #Update PERL5LIB. EDIT add ensembl(core) etc. if required

    #Your efg DB connection params
    export WRITE_USER='write_user'                  #EDIT
    export READ_USER='read_user'                    #EDIT
    export HOST='efg-host'                          #EDIT
    export PORT=3306                                #EDIT
    export MYSQL_ARGS="-h${HOST} -P${PORT}"

    #Your ensembl core DB connection params, read only
    export CORE_USER='anonymous'                    #EDIT if required
    export CORE_HOST='ensembldb.ensembl.org'        #EDIT if required
    export CORE_PORT=3306                           #EDIT if required

    #Default norm and analysis methods
    export NORM_METHOD='VSN_GLOG'                   #EDIT if required e.g. T.Biweight, Loess
    export PEAK_METHOD='Nessie'                     #EDIT if required e.g. TileMap, MPeak, Chipotle

    #R config
    export R_LIBS=${R_LIBS:=$SRC/R-modules}         #EDIT if required
    export R_PATH=/software/bin/R                   #Location of local version of R. EDIT
    export R_FARM_PATH=/software/R-2.4.0/bin/R      #Location of farm installed R. EDIT
    export R_BSUB_OPTIONS="-R'select[type==LINUX64 && mem>6000] rusage[mem=6000]' -q bigmem" #EDIT
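After editing the .efg file it can be useful to confirm that the variables it exports are actually set in your shell. The following is a minimal sketch only: `check_efg_env` is a hypothetical helper, not part of eFG, and the list of variables it checks is an assumption taken from the file shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of eFG): report which of the variables
# exported by the .efg file are unset or empty in the current shell.
check_efg_env() {
    missing=0
    for var in SRC EFG_SRC EFG_SQL EFG_DATA WRITE_USER READ_USER HOST PORT; do
        # Indirect lookup: copy the value of the variable named in $var
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "MISSING: $var"
            missing=$((missing + 1))
        fi
    done
    # Succeed only if every variable was set
    [ "$missing" -eq 0 ]
}
```

For example, `check_efg_env || echo "Edit .efg and source it again"` prints one line per missing variable followed by the reminder.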
As is indicated at the head of the .efg file, to enable easy access to the eFG environment it is useful to add the following to your .*rc login file:
alias efg='. ~/src/ensembl-efg/.efg'
Once this is done, simply type 'efg' to enter the environment, which gives you access to helper functions such as CreateDB:
    efg@bc-9-1-02>CreateDB my_homo_sapiens_funcgen_47_36i password
    Creating DB my_homo_sapiens_funcgen_47_36i
It is desirable to maintain the standard Ensembl nomenclature for a database and simply prefix it with some descriptive tag. Failure to do so may cause problems in dynamically detecting the correct core DB to use. The CreateDB function also supports overwriting of a particular instance of an eFG DB by specifying a third 'drop' argument:
    efg@bc-9-1-02>CreateDB my_homo_sapiens_funcgen_47_36i password drop
    Dropping DB my_homo_sapiens_funcgen_47_36i
    Creating DB my_homo_sapiens_funcgen_47_36i
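To illustrate why the standard nomenclature matters, the sketch below shows one way a core DB name could be derived from a prefixed eFG DB name. This is an assumption for illustration only: `core_db_name` is a hypothetical function, the actual eFG detection logic may differ, and the pattern assumes a two-part species name such as homo_sapiens.

```shell
#!/bin/sh
# Hypothetical helper (not part of eFG): map a funcgen DB name such as
# my_homo_sapiens_funcgen_47_36i to the matching core DB name.
core_db_name() {
    # Drop any descriptive prefix, keep <genus>_<species> and the
    # <release>_<assembly> suffix, and swap "funcgen" for "core".
    core=$(printf '%s\n' "$1" |
        sed -n -E 's/^(.*_)?([a-z]+_[a-z]+)_funcgen_([0-9]+_[0-9]+[a-z]*)$/\2_core_\3/p')
    # Fail (print nothing) when the name does not follow the nomenclature
    [ -n "$core" ] && printf '%s\n' "$core"
}
```

With this sketch, `core_db_name my_homo_sapiens_funcgen_47_36i` prints `homo_sapiens_core_47_36i`, whereas a name that does not follow the nomenclature prints nothing and returns a non-zero status.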
Once you have set up the environment, you are ready to import data or to query either the central Ensembl eFG DB or a local copy.
Note: It is not necessary to set up the environment if you simply want to query a remote eFG DB, i.e. the eFG DBs available at ensembldb.ensembl.org. However, you may find that some of the tool scripts require some of the above environment variables to be defined explicitly on the command line.
Various types of data import, export and transformation can be performed using the scripts available in the scripts directory. These range from simple cell and feature type imports through to array design and full experiment imports. Most of the more common tasks have template shell scripts with the required parameters set and others left for editing. Here follows a list of the main types of tool script:
Prior to running your first experiment import, you will likely need to import the necessary feature types.
    efg@bc-9-1-02>more run_import_type.sh
    #!/bin/sh

    PASS=$1
    shift

    $EFG_SRC/scripts/import_type.pl\
        -type FeatureType\
        -name H3K4me3\
        -dbname your_homo_sapiens_funcgen_48_36j\
        -description 'Histone 3 Lysine 4 Tri-methyl'\
        -class HISTONE\
        -pass $PASS
Feature type names should correspond to a recognised ontology or nomenclature where appropriate e.g. Brno nomenclature for histones. The class parameter is not required for CellType imports.
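If several feature types are needed, the template above can be repeated for each of them. The following sketch only prints the commands rather than running them, so they can be reviewed first. `print_import_commands` and its `name|class|description` input format are hypothetical; only the import_type.pl options shown in the template above are used.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of eFG): print one import_type.pl command
# per input line of the form  name|class|description
print_import_commands() {
    dbname=$1
    while IFS='|' read -r name class description; do
        # Skip blank lines
        [ -n "$name" ] || continue
        # $EFG_SRC and $PASS are printed literally, to be expanded when the
        # printed command is later run inside the efg environment
        printf '%s\n' "\$EFG_SRC/scripts/import_type.pl -type FeatureType -name $name -dbname $dbname -description '$description' -class $class -pass \$PASS"
    done
}

# Usage: feed the feature types on standard input
print_import_commands your_homo_sapiens_funcgen_48_36j <<'EOF'
H3K4me3|HISTONE|Histone 3 Lysine 4 Tri-methyl
H3K27me3|HISTONE|Histone 3 Lysine 27 Tri-methyl
EOF
```

Printing instead of executing is a deliberate choice here: the database name and descriptions can be checked before anything is written to the DB.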
To import an experiment you must first create an input directory for the array vendor and your experiment e.g.
    mkdir $EFG_DATA/input/NIMBLEGEN/EXPERIMENT_NAME
The eFG system currently expects only one experiment per input directory. If your DVD contains more than one experiment, you will need to split the files up, recreating any meta files accordingly e.g. DesignNotes.txt, SampleKey.txt. A Nimblegen experiment import can be done using the appropriate run script:
    efg@bc-9-1-02>more run_NIMBLEGEN.sh
    #!/bin/sh

    PASS=$1
    shift

    $EFG_SRC/scripts/parse_and_import.pl\
        -name 'DVD_OR_EXPERIMENT_NAME'\       #Name of the data directory
        -format tiled\                        #Array format
        -vendor NIMBLEGEN\
        -location Hinxton\                    #Your group location
        -contact 'your@email.com'\
        -species homo_sapiens\
        -fasta\                               #Flag to dump the array as a fasta file, useful for remapping
        -port 3306\
        -host dbhost\
        -dbname 'your_homo_sapiens_funcgen_47_36i'\
        -array_set\                           #Flag to treat every chip/slide as part of one array
        -array_name "DESIGN_NAME"\
        -cell_type e.g. U2OS\
        -feature_type e.g. H3K4me3\
        -group efg\                           #Your groupname
        -data_version 41_36c\                 #The Ensembl data version corresponding to your data
        -verbose\
        -tee\
        -pass $PASS\
        -recover                              #Enables recovery mode for failed/partial imports
Running the above script will perform a preliminary import. This involves validation checks and import of some basic meta data. The meta data gleaned from the import parameters and available files is automatically populated within a tab2mage file located in the output directory, at which point the import stops to allow manual curation of the tab2mage file. Due to the lack of comprehensive meta data associated with an experiment DVD, it is necessary to inspect, correct and annotate the tab2mage file where possible. Failure to do so may result in permanent loss of meta data, an inability to submit to ArrayExpress and a corrupted import which ultimately may prevent any further analysis.
There are three main areas to be addressed, most of which may have been automatically populated. Fields which need attention are marked with three question marks e.g. ???
tab2mage field | Value |
---|---|
BioSource | CellType or specific source sample name if known. |
Sample | Biological replicate name. |
Extract | Technical replicate name. |
LabeledExtract | Control/Experimental channel sample. |
Immunoprecipitate | Description of IP, blank for control channel. |
Hybridization | Description of Hybridisation. |
BioSourceMaterial | e.g. cell, tissue MGED? |
Dye | e.g. Cy3/5 |
BioMaterialCharacteristics[StrainOrLine] | |
BioMaterialCharacteristics[CellType] | |
FactorValue[StrainOrLine] | |
FactorValue[Immunoprecipitate] | e.g. anti-H3ac antibody |
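Since fields needing attention are marked with `???`, a quick search of the tab2mage file shows what is still left to curate. This is a minimal sketch: `unresolved_fields` is a hypothetical helper, and the file path passed to it is whatever tab2mage file the import wrote to your output directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of eFG): list the lines of a tab2mage file
# that still contain the "???" placeholder, prefixed with line numbers.
unresolved_fields() {
    # -F treats ??? as a fixed string rather than a pattern; -n adds line numbers.
    # grep exits non-zero when no placeholder is left.
    grep -nF '???' "$1"
}
```

For example, `if unresolved_fields my_experiment.tab2mage; then echo "Curation still required"; fi` lists the outstanding lines, where `my_experiment.tab2mage` is a placeholder file name.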
Some standard naming formats have been put in place to aid validation. Primarily these are the replicate names, which have been given the BRN and TRN denominations for biological replicates and technical replicates respectively e.g.
    EXPERIMENT_BR1_TR1
    EXPERIMENT_BR1_TR2
    EXPERIMENT_BR2_TR1
Here we see two biological replicates for "EXPERIMENT", the first having two technical replicates and the second having just one. The other naming convention adopted is that of the "FactorValue[Immunoprecipitate]" field. This must follow the format of the example above i.e. anti-"FeatureType Name" antibody
This is parsed during validation and used to store chip/slide level information. Replicates with mismatching feature type names in this field will fail validation.
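The two naming conventions can be checked by hand before re-running the import. The sketch below is illustrative only: both functions are hypothetical, and the patterns are assumptions based on the formats described above rather than the validation code eFG itself uses.

```shell
#!/bin/sh
# Hypothetical check (not part of eFG): does a replicate name follow
# the <experiment>_BR<n>_TR<n> convention?
valid_replicate_name() {
    printf '%s\n' "$1" | grep -Eq '^.+_BR[0-9]+_TR[0-9]+$'
}

# Hypothetical check (not part of eFG): print the feature type name embedded
# in a FactorValue[Immunoprecipitate] value of the form
# "anti-<FeatureType> antibody"; fail if the value has another form.
ip_feature_type() {
    ft=$(printf '%s\n' "$1" | sed -n -E 's/^anti-([^ ]+) antibody$/\1/p')
    [ -n "$ft" ] && printf '%s\n' "$ft"
}
```

For example, `valid_replicate_name EXPERIMENT_BR1_TR1` succeeds, and `ip_feature_type 'anti-H3ac antibody'` prints `H3ac`, which can then be compared with the feature type name given for the import.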
© 2025 Inserm. Hosted by genouest.org. This product includes software developed by Ensembl.