Genoscapist

How to deploy on my own data?

Requirement: a Genoscapist instance up and running

Before this step, follow the Genoscapist Git deployment instructions or use the provided Genoscapist Virtual Machine.

Database creation

Requirement: download the latest version of PostgreSQL here.

Create a database (for example: seb_demo)
CREATE DATABASE seb_demo;
Create users (for example: read_user and admin_user)
CREATE USER read_user WITH PASSWORD 'sebuser2020';
CREATE USER admin_user WITH PASSWORD 'sebadmin2020';
Create the schema and tables with the create_database.sql file available in the scripts directory
$ psql -f create_database.sql

Data integration into the database

The following example uses a small Bacillus subtilis dataset. This dataset contains:

part of the Bacillus subtilis genome (only the first 150,000 base pairs),
145 genes (feature type CDS),
269 reannotation samples and 6 rho-mutant samples.

The integration scripts are available in the scripts directory and the dataset files in the scripts/data directory.

Create a Conda environment with the environment.yml file
$ conda env create -f environment.yml
Activate the Conda environment
$ conda activate genoscapist-database
Integrate GenBank annotation
$ python add_annotation_gb.py
Integrate samples and expression profiles
$ python add_samples.py

Note: change the database connection variables (dbname, dbuser, dbpasswd, dbhost, dbport) and, if you do not use the test files, the location of the input files.

To integrate other data, note that all annotation features are stored in the features table. The mandatory fields are:

sequence_id, a reference sequence
type, a feature type like CDS, regulator, promoter, terminator
complement, a feature strand
start, a feature start position on the reference sequence
stop, a feature end position on the reference sequence
id_feat, a feature name
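
Before inserting your own annotation, it can help to check that each record carries the mandatory fields listed above. The following is a minimal sketch, not part of the Genoscapist scripts: the function name, the dictionary representation and the example values are assumptions, and the actual column types (for instance how complement is encoded) should be checked in create_database.sql.

```python
# Mandatory columns of the features table, as listed in this guide.
MANDATORY_FEATURE_FIELDS = (
    "sequence_id",
    "type",
    "complement",
    "start",
    "stop",
    "id_feat",
)


def missing_feature_fields(feature):
    """Return the mandatory fields that are absent or None in a feature dict.

    The test is `is None` rather than truthiness, so that legitimate
    falsy values such as complement=0 are not reported as missing.
    """
    return [name for name in MANDATORY_FEATURE_FIELDS if feature.get(name) is None]


if __name__ == "__main__":
    # Illustrative record; every value here is made up for the example.
    feature = {
        "sequence_id": 1,
        "type": "CDS",
        "complement": 0,
        "start": 100,
        "stop": 400,
        "id_feat": "example_gene",
    }
    print(missing_feature_fields(feature))
    print(missing_feature_fields({"type": "promoter", "start": 50}))
```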

All GenBank qualifiers are stored in the qualifiers table with a type and a value (the link to the feature is made with a unique feature identifier feature_id).

Sample data are stored in the exp_seq table and are organized into groups defined in the exp_group table. A sample group is defined by a name and a description. Each sample is defined by:

sequence_id, a reference sequence
experience, a sample name
project, a project name (Reannotation or Rho for example)
normalization, a normalization type (CustomCDS or Median for example)
strand, a sample strand
color, a sample color
info, 1 if the sample is displayed by default, 0 otherwise

Samples are linked to a group through the unique identifier exp_group_id.
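
The sample description can be sketched as a small Python data class that is turned into a parameterized INSERT statement. This is only an illustration under several assumptions: the field types, the example values, the presence of exp_group_id as a column of exp_seq and the use of the bacteries schema for this table are not confirmed here and should be checked against create_database.sql.

```python
from dataclasses import astuple, dataclass, fields


@dataclass
class Sample:
    """One row of the exp_seq table (field types are assumptions)."""

    sequence_id: int
    experience: str      # sample name
    project: str         # e.g. "Reannotation" or "Rho"
    normalization: str   # e.g. "CustomCDS" or "Median"
    strand: str
    color: str
    info: int            # 1 if displayed by default, 0 otherwise
    exp_group_id: int    # link to the sample group (assumed column)


def sample_insert(sample):
    """Return (sql, params) for a parameterized INSERT of one sample."""
    columns = [f.name for f in fields(Sample)]
    placeholders = ", ".join(["%s"] * len(columns))
    sql = (
        f"INSERT INTO bacteries.exp_seq ({', '.join(columns)}) "
        f"VALUES ({placeholders});"
    )
    # astuple() yields the values in the same order as the field names.
    return sql, astuple(sample)


if __name__ == "__main__":
    # Illustrative values only.
    sample = Sample(1, "sample_A", "Rho", "Median", "+", "#1f77b4", 1, 2)
    sql, params = sample_insert(sample)
    print(sql)
    print(params)
```

The %s placeholders follow the parameter style used by psycopg2, so the pair could be passed to cursor.execute(sql, params).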

All elements (features and samples) are positioned on a sequence defined in the sequences table, which is itself linked to a species defined in the species table.

Track order configuration

In the features_manager table, you must define the order in which the tracks appear.

To insert data, you can use the pgAdmin tool or the SQL INSERT command:
INSERT INTO bacteries.features_manager (type, position) VALUES ('[feature_type]', [position]);
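
When several tracks have to be ordered, the rows can be generated from a Python list instead of being typed one by one. The sketch below is an assumption-laden illustration: the helper name is hypothetical, and numbering positions from 1 is a guess about the convention expected by the application.

```python
# Parameterized form of the INSERT statement shown above.
INSERT_SQL = (
    "INSERT INTO bacteries.features_manager (type, position) VALUES (%s, %s);"
)


def track_order_rows(feature_types, first_position=1):
    """Pair each feature type with its display position, in list order."""
    return [
        (feature_type, position)
        for position, feature_type in enumerate(feature_types, start=first_position)
    ]


if __name__ == "__main__":
    # Feature types taken from the examples given for the features table.
    rows = track_order_rows(["CDS", "regulator", "promoter", "terminator"])
    for row in rows:
        print(INSERT_SQL, row)
```

With a DB-API driver such as psycopg2, the rows could then be sent in one call with cursor.executemany(INSERT_SQL, rows).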

Configuration file

Before deploying the application, it is necessary to modify the configuration file config.py.

Details about parameters:

DEPLOY, deployment environment (local, dev, prod or demo)
SPECIE, species accession number (AL009126 for Bacillus subtilis or CP000253 for Staphylococcus aureus for example)
DB_USER, DB_PWD, DB_NAME, DB_HOST and DB_PORT, respectively the database user, the password, the database name, the host server and port
NAME, the name of the deployment (for example: B. subtilis Expression Data Browser or S. aureus Expression Data Browser)
MIN_CUSTOM, MAX_CUSTOM, MIN_MEDIAN and MAX_MEDIAN, minimum and maximum values of the expression signal for each of the two normalizations
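
Put together, a config.py for the demo database created earlier might look like the sketch below. Only the parameter names come from this guide; the value types, the host, the choice of read_user for the application and the four signal bounds are assumptions or placeholders to be replaced with your own values.

```python
# config.py (illustrative values only)

DEPLOY = "local"      # one of: local, dev, prod, demo
SPECIE = "AL009126"   # accession number, here Bacillus subtilis

# Database connection, reusing the example names from the database creation step.
DB_USER = "read_user"     # assumption: the application connects as the read user
DB_PWD = "sebuser2020"
DB_NAME = "seb_demo"
DB_HOST = "localhost"     # assumption: database on the same machine
DB_PORT = 5432            # PostgreSQL default port; the expected type is an assumption

NAME = "B. subtilis Expression Data Browser"

# Placeholders: replace with the bounds measured on your own expression data.
MIN_CUSTOM = 0.0
MAX_CUSTOM = 1.0
MIN_MEDIAN = 0.0
MAX_MEDIAN = 1.0
```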

To get the values of MIN_CUSTOM, MAX_CUSTOM, MIN_MEDIAN and MAX_MEDIAN, you can use the check_min_max.py script.
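
For readers who want to see the idea behind these four values, here is a minimal sketch of computing the minimum and maximum signal per normalization. It is not the check_min_max.py script itself: how the signal values are stored and read is not described in this guide, so the input is simply assumed to be an iterable of (normalization, value) pairs, and the mapping of CustomCDS to MIN_CUSTOM/MAX_CUSTOM and Median to MIN_MEDIAN/MAX_MEDIAN is also an assumption.

```python
def min_max_by_normalization(records):
    """Return {normalization: (minimum, maximum)} over (normalization, value) pairs."""
    bounds = {}
    for normalization, value in records:
        # The first value seen for a normalization initializes both bounds.
        low, high = bounds.get(normalization, (value, value))
        bounds[normalization] = (min(low, value), max(high, value))
    return bounds


if __name__ == "__main__":
    # Made-up signal values, for illustration only.
    records = [
        ("CustomCDS", 2.5),
        ("Median", -1.0),
        ("CustomCDS", 9.0),
        ("Median", 3.0),
    ]
    bounds = min_max_by_normalization(records)
    min_custom, max_custom = bounds["CustomCDS"]
    min_median, max_median = bounds["Median"]
    print(min_custom, max_custom, min_median, max_median)
```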