Genoscapist

How to deploy on my own data?


Requirement: a Genoscapist instance up and running

Before this step, follow the Genoscapist Git deployment instructions or use the provided Genoscapist Virtual Machine.

Database creation

  1. Create a database (for example: seb_demo)
    CREATE DATABASE seb_demo;
  2. Create users (for example: read_user and admin_user)
    CREATE USER read_user WITH PASSWORD 'sebuser2020';
    CREATE USER admin_user WITH PASSWORD 'sebadmin2020';
  3. Create schema and tables with the create_database.sql file available in the scripts directory
    $ psql -f create_database.sql

Data integration into the database

The following example uses a small Bacillus subtilis dataset. This dataset contains:

  • part of the Bacillus subtilis genome (only the first 150,000 base pairs),
  • 145 genes (feature type CDS),
  • 269 reannotation samples and 6 rho-mutant samples.

Integration scripts are available in the scripts directory, and the dataset files are in the scripts/data directory.

  1. Create a Conda environment from the environment.yml file
    $ conda env create -f environment.yml
  2. Activate the Conda environment
    $ conda activate genoscapist-database
  3. Integrate GenBank annotation
    $ python add_annotation_gb.py
  4. Integrate samples and expression profiles
    $ python add_samples.py

To integrate other data, note that all annotation features are stored in the features table. The mandatory fields are:

  • sequence_id, a reference sequence
  • type, a feature type like CDS, regulator, promoter, terminator
  • complement, a feature strand
  • start, a feature start position on the reference sequence
  • stop, a feature end position on the reference sequence
  • id_feat, a feature name

All GenBank qualifiers are stored in the qualifiers table with a type and a value (the link to the feature is made with a unique feature identifier feature_id).
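
As an illustration of the mandatory fields listed above, the following Python sketch assembles a parameterized INSERT statement for a single feature. It rests on several assumptions: the features table is assumed to live in the bacteries schema (as features_manager does), the value types (integer positions, 0/1 for complement) and the example values are hypothetical, and build_feature_insert is an illustrative helper that is not part of the Genoscapist scripts. The %s placeholders follow the style used by PostgreSQL drivers such as psycopg2.

```python
# Mandatory columns of the features table, as listed in this guide.
MANDATORY_FEATURE_FIELDS = (
    "sequence_id", "type", "complement", "start", "stop", "id_feat",
)


def build_feature_insert(feature):
    """Return (sql, params) for inserting one feature (hypothetical helper)."""
    missing = [name for name in MANDATORY_FEATURE_FIELDS if name not in feature]
    if missing:
        raise ValueError("missing mandatory fields: " + ", ".join(missing))
    columns = ", ".join(MANDATORY_FEATURE_FIELDS)
    placeholders = ", ".join(["%s"] * len(MANDATORY_FEATURE_FIELDS))
    # Assumption: the features table is in the bacteries schema.
    sql = f"INSERT INTO bacteries.features ({columns}) VALUES ({placeholders});"
    params = tuple(feature[name] for name in MANDATORY_FEATURE_FIELDS)
    return sql, params


# Illustrative values only; adapt them to your own annotation.
example_feature = {
    "sequence_id": "AL009126",
    "type": "CDS",
    "complement": 0,
    "start": 410,
    "stop": 1750,
    "id_feat": "dnaA",
}
sql, params = build_feature_insert(example_feature)
print(sql)
print(params)
```

Keeping the values separate from the SQL text lets the database driver handle quoting, which is generally safer than formatting values into the statement by hand.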

Sample data are stored in the exp_seq table and are organized by group in the exp_group table. A sample group is defined by a name and a description. Sample data are defined by:

  • sequence_id, a reference sequence
  • experience, a sample name
  • project, a project name (Reannotation or Rho for example)
  • normalization, a normalization type (CustomCDS or Median for example)
  • strand, a sample strand
  • color, a sample color
  • info, 1 if the sample is displayed by default, 0 otherwise

Samples are linked to a group through the unique identifier exp_group_id.
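
Before loading your own samples, it can help to check that each record carries the fields listed above. The Python sketch below is one possible check, not part of the Genoscapist scripts: check_sample is a hypothetical helper, info is assumed to be stored as an integer, and the example values for sequence_id, experience, strand and color are placeholders.

```python
# Fields that define a sample in the exp_seq table, as listed in this guide.
MANDATORY_SAMPLE_FIELDS = (
    "sequence_id", "experience", "project", "normalization",
    "strand", "color", "info",
)


def check_sample(sample):
    """Return a list of problems found in one sample record (empty if none)."""
    problems = [
        f"missing field: {name}"
        for name in MANDATORY_SAMPLE_FIELDS
        if name not in sample
    ]
    # info is 1 if the sample is displayed by default, 0 otherwise.
    if "info" in sample and sample["info"] not in (0, 1):
        problems.append("info must be 0 or 1")
    return problems


# Placeholder values; replace them with your own sample description.
example_sample = {
    "sequence_id": "AL009126",
    "experience": "my_sample_1",
    "project": "Reannotation",
    "normalization": "CustomCDS",
    "strand": "+",
    "color": "#1f77b4",
    "info": 1,
}
print(check_sample(example_sample))
print(check_sample({"experience": "incomplete", "info": 2}))
```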

Track order configuration

In the features_manager table, you have to define the order in which tracks appear.

To insert data, you can use the pgAdmin tool or the SQL INSERT command:
INSERT INTO bacteries.features_manager (type, position) VALUES ('[feature_type]', [position]);

For the seb_demo database, we chose to display the CDS track first, then the profiles track.
seb_demo=# select * from bacteries.features_manager;
 id |   type   | position 
----+----------+----------
  1 | CDS      |        1 
  2 | profiles |        2 
(2 rows)
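
When there are more than a few tracks, the INSERT commands can be generated from an ordered list. The Python sketch below is a minimal illustration: track_order_statements is a hypothetical helper, and it only builds the SQL text in the form shown above, which you would then run with psql or pgAdmin.

```python
def track_order_statements(track_types):
    """Build one INSERT per track; position follows list order, starting at 1."""
    statements = []
    for position, track_type in enumerate(track_types, start=1):
        # Values are inlined for readability; single quotes are doubled
        # so that a track type containing a quote stays valid SQL.
        safe_type = track_type.replace("'", "''")
        statements.append(
            "INSERT INTO bacteries.features_manager (type, position) "
            f"VALUES ('{safe_type}', {position});"
        )
    return statements


# Same order as the seb_demo example: CDS first, then profiles.
for statement in track_order_statements(["CDS", "profiles"]):
    print(statement)
```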

Configuration file

Before deploying the application, it is necessary to modify the configuration file config.py.

Details about parameters:

  • DEPLOY, deployment environment (local, dev, prod or demo)
  • SPECIE, species accession number (for example: AL009126 for Bacillus subtilis or CP000253 for Staphylococcus aureus)
  • DB_USER, DB_PWD, DB_NAME, DB_HOST and DB_PORT, respectively the database user, the password, the database name, the host server and port
  • NAME, the name of the deployment (for example: B. subtilis Expression Data Browser or S. aureus Expression Data Browser)
  • MIN_CUSTOM, MAX_CUSTOM, MIN_MEDIAN and MAX_MEDIAN, minimum and maximum values of the expression signal for each of the two normalizations
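
For the seb_demo example, the parameters above could be filled in roughly as follows. This is a sketch only: the actual layout and value types of config.py may differ, the choice of read_user for DB_USER is an assumption, DB_HOST and DB_PORT assume a local PostgreSQL server on its default port, and the four signal bounds are placeholders to be replaced with values suited to your data.

```python
# Hypothetical config.py values for the seb_demo example.
DEPLOY = "local"        # one of: local, dev, prod, demo
SPECIE = "AL009126"     # accession number for Bacillus subtilis

DB_USER = "read_user"   # assumption: the application connects as read_user
DB_PWD = "sebuser2020"
DB_NAME = "seb_demo"
DB_HOST = "localhost"   # assumption: database on the same machine
DB_PORT = 5432          # assumption: PostgreSQL default port

NAME = "B. subtilis Expression Data Browser"

# Placeholder bounds for the expression signal; not recommended values.
MIN_CUSTOM = 0.0
MAX_CUSTOM = 20.0
MIN_MEDIAN = -10.0
MAX_MEDIAN = 10.0
```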