3DFlu - database and exploration

Sources

Full database

Full database (no esp)

Data sources:

Sequences

Sequences of selected serotypes were extracted from the following databases:
UniProt (http://www.uniprot.org/),
NCBI (http://www.ncbi.nlm.nih.gov/),
GISAID (http://platform.gisaid.org/epi3/frontend#2f0a9a),
Influenza Research Database (http://www.fludb.org/brc/home.spg?decorator=influenza),
EMBL-EBI (http://www.ebi.ac.uk/)

Structures

Crystal structures of hemagglutinin (HA) proteins were downloaded from the Protein Data Bank (PDB) (http://www.rcsb.org).

Metadata

In addition, each entry includes metadata such as the virus subtype, the year of infection, the host species and the geographic location of the infected host. This information was obtained directly from the above-mentioned databases. Incomplete data were supplemented with information available in the literature.

Programs:

The following tools were used for the homology modeling procedure:
- BLASTp (version 2.2.28+) (http://blast.ncbi.nlm.nih.gov/Blast.cgi) was used to select the closest template meeting the following criteria: resolution < 3.0 Å and free R-value < 0.3.
- The sequence-to-structure alignment was calculated using PCMA (version 2.0) (http://prodata.swmed.edu/pcma/pcma.php) with default parameters.
- Structures were built using MODELLER (version 9.15) (https://salilab.org/modeller/). For each sequence, 10 structures were calculated. The MODELLER pseudo-energy criterion was applied to select the best candidate for each run.
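The two selection rules above (template quality cut-offs, then the best of the 10 candidate models) can be sketched in plain Python. This is only an illustration under stated assumptions: the `TemplateHit` class and its field names are hypothetical, "closest" is taken to mean the highest BLASTp bit score, and a lower pseudo-energy is taken to be better. It does not call BLAST or MODELLER.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class TemplateHit:
    """One BLASTp hit against PDB (hypothetical container, not a BLAST type)."""
    pdb_id: str
    bit_score: float    # assumption: closeness is ranked by bit score
    resolution: float   # crystallographic resolution in angstroms
    r_free: float       # free R-value


def select_template(hits: Iterable[TemplateHit],
                    max_resolution: float = 3.0,
                    max_r_free: float = 0.3) -> Optional[TemplateHit]:
    """Return the highest-scoring hit that passes both quality cut-offs."""
    eligible = [h for h in hits
                if h.resolution < max_resolution and h.r_free < max_r_free]
    return max(eligible, key=lambda h: h.bit_score, default=None)


def select_best_model(pseudo_energies: Dict[str, float]) -> str:
    """Return the model name with the lowest pseudo-energy (assumed best)."""
    return min(pseudo_energies, key=pseudo_energies.get)
```

`select_template` returns `None` when no hit satisfies both cut-offs, so the caller can decide how to handle sequences without a suitable template.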

To calculate the similarity between sequences from both crystals and models, the following tool was used:
- The Biopython module pairwise2 (http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html) was used to calculate pairwise sequence alignments. The e-values obtained were used as sequence similarity measures.
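To illustrate what a pairwise global alignment score is, here is a minimal pure-Python Needleman-Wunsch scorer. It is a stand-in for illustration, not Bio.pairwise2 itself, and the match, mismatch and gap values are arbitrary assumptions rather than the parameters used to build the database.

```python
def global_alignment_score(a: str, b: str,
                           match: float = 1.0,
                           mismatch: float = -1.0,
                           gap: float = -2.0) -> float:
    """Score of the best global alignment of a and b (linear gap penalty)."""
    # prev[j] holds the best score for aligning the processed prefix of a
    # with the first j characters of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

Only two rows of the dynamic programming table are kept, which is enough when the score is needed but the alignment itself is not.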

RMSD distances between protein structures from both crystals and models were calculated using the following tool:
- The TM-align method (http://zhanglab.ccmb.med.umich.edu/TM-align/) was used to superimpose the respective entries and calculate the RMSD between them.
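For reference, the RMSD itself is simple to compute once two structures are superimposed and their atoms are paired. The sketch below assumes both steps have already been done (TM-align performs them) and only evaluates the formula; the function name is illustrative.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(coords_a: Sequence[Point], coords_b: Sequence[Point]) -> float:
    """Root-mean-square deviation between paired, superimposed coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                  for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))
```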

The electrostatic potential (ESP) of both crystal structures and models was computed with:
- TM-align was used to perform the structural alignment of both models and crystal structures.
- APBS (version 1.4.1) (http://www.poissonboltzmann.org) was used to calculate the electrostatic potential grids of the above-mentioned structures. Only the structural region close to the binding site was included in the calculations.

To calculate the ESP distances between protein structures from both crystals and models, the following tool was used:
- PIPSA (version 3.1) (http://pipsa.eml.org/pipsa/) was used to compare electrostatic potential grids and generate similarity values.
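As a rough illustration of how two potentials can be turned into a similarity value, the sketch below computes the Hodgkin similarity index, which PIPSA is documented as using, together with the distance commonly derived from it. It assumes both potentials are already sampled at the same points. Whether the values stored in this database are indices or derived distances is not stated here, so treat this as an assumption-laden example rather than the exact procedure.

```python
import math
from typing import Sequence


def hodgkin_index(p1: Sequence[float], p2: Sequence[float]) -> float:
    """Hodgkin similarity index: 2 (p1 . p2) / (p1 . p1 + p2 . p2)."""
    if len(p1) != len(p2):
        raise ValueError("potentials must be sampled at the same points")
    cross = sum(a * b for a, b in zip(p1, p2))
    norm = sum(a * a for a in p1) + sum(b * b for b in p2)
    if norm == 0:
        raise ValueError("both potentials are zero everywhere")
    return 2.0 * cross / norm


def index_to_distance(similarity: float) -> float:
    """Distance sqrt(2 - 2 * SI): 0 for identical, 2 for anti-correlated."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * similarity))
```

The index ranges from -1 (potentials of equal magnitude and opposite sign) through 0 (uncorrelated) to +1 (identical).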

Residue mobility was estimated using the following Python package:
- The Gaussian Network Model (GNM) (http://prody.csb.pitt.edu/tutorials/enm_analysis/gnm.html) implemented in ProDy was applied for this purpose.

Secondary structure assignment was performed using:
- DSSP (http://www.cmbi.ru.nl/dssp.html) was used to assign secondary structure for proposed models

Multiple sequence alignments (MSA) of all the entries within the database were generated using the following methods:
- PCMA (http://prodata.swmed.edu/pcma/pcma.php): generates an MSA using protein sequences only
- 3D-Coffee (http://www.tcoffee.org/Projects/expresso/): generates a structure-based MSA using both sequence and structural information

Data Structure

root/

models_template_mapping.csv: Contains information about the template used for modeling a given sequence. The CSV file contains two columns:

p_id: sequence used for modeling,
template_id: chosen template
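A minimal sketch of reading this mapping with the standard library follows. It assumes the file starts with a header row naming the two columns exactly as listed above; the sample row pairs two identifiers used as examples elsewhere in this document and is made up for illustration.

```python
import csv
import io
from typing import Dict, TextIO


def load_template_mapping(handle: TextIO) -> Dict[str, str]:
    """Map each modeled sequence id (p_id) to its template id."""
    # Assumption: the first row is a header containing p_id and template_id.
    return {row["p_id"]: row["template_id"] for row in csv.DictReader(handle)}


# Made-up sample content; with the real file use
# open("models_template_mapping.csv", newline="") instead.
sample = io.StringIO("p_id,template_id\nAAA43205,4cqv\n")
mapping = load_template_mapping(sample)
```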

root/aligned_sequences/sequence_alignment

crystals_aln_seq.fasta: Multiple sequence alignment, calculated using PCMA, of the structures extracted from PDB, in FASTA format. Entries are labeled with the PDB id.
models_aln_seq.fasta: Multiple sequence alignment, calculated using PCMA, of the sequences selected for modeling, in FASTA format. Entries are labeled with the GenBank Protein Accession identifier.
crystal_model_aln_seq.fasta: Multiple sequence alignment, calculated using PCMA, of both the structures extracted from PDB and the sequences selected for modeling, in FASTA format. Crystals are labeled with the PDB id and sequences with the GenBank Protein Accession identifier.

root/aligned_sequences/structural_alignment

crystals_aln_str.fasta: Multiple sequence alignment, calculated using 3D-Coffee, of the structures extracted from PDB, in FASTA format. Entries are labeled with the PDB id.
models_aln_str.fasta: Multiple sequence alignment, calculated using 3D-Coffee, of the sequences selected for modeling, in FASTA format. Entries are labeled with the GenBank Protein Accession identifier.
crystal_model_aln_seq.fasta: Multiple sequence alignment, calculated using 3D-Coffee, of both the structures extracted from PDB and the sequences selected for modeling, in FASTA format. Crystals are labeled with the PDB id and sequences with the GenBank Protein Accession identifier.
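The alignment files can be read with a few lines of plain Python. The sketch below assumes standard FASTA conventions (a ">" header line followed by one or more sequence lines) and takes the first whitespace-delimited token of the header as the identifier; the helper names are illustrative.

```python
import io
from typing import Dict, List, TextIO


def read_fasta(handle: TextIO) -> Dict[str, str]:
    """Return {identifier: sequence} from a FASTA stream."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first token after ">" as the identifier.
            name = (line[1:].split() or [""])[0]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


def is_aligned(records: Dict[str, str]) -> bool:
    """True when all sequences (gaps included) have the same length."""
    return len({len(seq) for seq in records.values()}) <= 1
```

Checking `is_aligned` after loading is a cheap way to confirm that a file really holds an alignment rather than raw sequences.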

root/distances/sequence_similarity

crystal_seq_sim.csv

p_id: identifier of the 1st protein
p_id2: identifier of the 2nd protein
distance: FASTA score between the pair of protein sequences (a higher score represents more similar sequences)

models_seq_sim.csv

p_id: identifier of the 1st protein
p_id2: identifier of the 2nd protein
distance: FASTA score between the pair of protein sequences (a higher score represents more similar sequences)

crystals_models_seq_sim.csv

p_id: identifier of the 1st protein
p_id2: identifier of the 2nd protein
distance: FASTA score between the pair of protein sequences (a higher score represents more similar sequences)

root/distances/structural_similarity

crystal_str_sim.csv

p_id: identifier of the 1st protein
p_id2: identifier of the 2nd protein
distance: RMSD distance between protein structures

models_str_sim.csv

p_id: identifier of the 1st protein
p_id2: identifier of the 2nd protein
distance: RMSD distance between protein structures

crystals_models_str_sim.csv

p_id: identifier of the 1st protein
p_id2: identifier of the 2nd protein
distance: RMSD distance between protein structures

root/distances/esp_similarity

crystals_esp_sim.csv

p_id: identifier of the 1st protein
p_id2: identifier of the 2nd protein
distance: similarity value between the ESP maps of the compared proteins

models_esp_sim.csv

p_id: identifier of the 1st protein
p_id2: identifier of the 2nd protein
distance: similarity value between the ESP maps of the compared proteins

crystal_models_esp_sim.csv

p_id: identifier of the 1st protein
p_id2: identifier of the 2nd protein
distance: similarity value between the ESP maps of the compared proteins
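All files under root/distances share the same three-column layout, so one loader can serve them all. The following sketch assumes a header row with the column names listed above and assumes that each value applies to the pair in both directions; neither point is stated explicitly in this document.

```python
import csv
import io
from typing import Dict, TextIO


def load_pairwise(handle: TextIO) -> Dict[str, Dict[str, float]]:
    """Load p_id, p_id2, distance rows into a nested, symmetric lookup."""
    table: Dict[str, Dict[str, float]] = {}
    for raw in csv.DictReader(handle):
        # Tolerate stray spaces around the header names.
        row = {k.strip(): v for k, v in raw.items() if k is not None}
        first, second = row["p_id"], row["p_id2"]
        value = float(row["distance"])
        table.setdefault(first, {})[second] = value
        table.setdefault(second, {})[first] = value  # assumption: symmetric
    return table
```

After loading, `table[a][b]` returns the stored value for a pair regardless of the order in which the two identifiers appear in the file.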

root/metadata/

metadata_crystals.csv

p_id: PDB protein id (e.g. '4cqv')
chain: protein chain (e.g. 'A')
host: HA host (e.g. 'avian')
location: location of sample collection (e.g. 'Turkey')
year: year of sample collection (e.g. '2005')
subtype: HA subtype (e.g. 'H5N1')
continent: continent of sample collection, inferred from the location field (e.g. 'Europe')
strain_name: strain name of the virus from which HA protein originated (e.g. 'A/Aichi/68(H3N2)')

metadata_models.csv

p_id: GenBank Protein Accession identifier of the model (e.g. 'AAA43205')
chain: protein chain (e.g. 'A')
host: HA host (e.g. 'avian')
location: location of sample collection (e.g. 'Ontario')
year: year of sample collection (e.g. '1966')
subtype: HA subtype (e.g. 'H5N9')
continent: continent of sample collection, inferred from the location field (e.g. 'Europe')
strain_name: strain name of the virus from which HA protein originated (e.g. 'A/Aichi/68(H3N2)')
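The metadata tables lend themselves to simple grouping and filtering. As a minimal sketch, assuming each file has a header row with the field names listed above, entries can be grouped by any column; the sample rows are invented for illustration.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, TextIO


def ids_by_field(handle: TextIO, field: str = "subtype") -> Dict[str, List[str]]:
    """Group p_id values by the content of one metadata column."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for row in csv.DictReader(handle):
        groups[row[field]].append(row["p_id"])
    return dict(groups)


# Invented rows that follow the documented column layout.
sample = io.StringIO(
    "p_id,chain,host,location,year,subtype,continent,strain_name\n"
    "id1,A,avian,Turkey,2005,H5N1,Europe,strain-1\n"
    "id2,A,avian,Ontario,1966,H5N9,North America,strain-2\n"
    "id3,A,human,Turkey,2006,H5N1,Europe,strain-3\n"
)
by_subtype = ids_by_field(sample)
```

Passing `field="host"` or `field="continent"` groups the same entries by host or by continent instead.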

root/structures/crystals

This directory contains all the structures of HA proteins downloaded from the PDB database and filtered. The original PDB files underwent the following pre-processing:

For multi-chain structures (e.g. HA trimers), only a single chain corresponding to the HA1 subdomain was kept.
All heteroatoms were removed from the structures.
The C- and N-termini were trimmed after structural alignment in order to remove structurally variable regions.
The secondary structure assignment from DSSP was added to each PDB file.
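The first two pre-processing steps (keeping one chain and dropping heteroatoms) can be sketched with fixed-column parsing of PDB records. This is an illustration, not the script used to build the database: it assumes the caller already knows which chain id corresponds to HA1, and it relies on the PDB format placing the record name in columns 1 to 6 and the chain id in column 22.

```python
from typing import Iterable, List


def keep_chain_atoms(pdb_lines: Iterable[str], chain_id: str) -> List[str]:
    """Keep ATOM records of one chain; HETATM and other chains are dropped."""
    kept = []
    for line in pdb_lines:
        # Record name occupies columns 1-6, chain id is column 22 (index 21).
        if line.startswith("ATOM  ") and len(line) > 21 and line[21] == chain_id:
            kept.append(line)
    return kept
```

Header, TER and END records are also discarded by this filter, so a complete pipeline would need to write them back if downstream tools expect them.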

root/structures/models

This directory contains all the HA protein models available in this database. The secondary structure assignment from DSSP was added to each PDB file.

root/sequences/crystals

crystals_pdb.fas: This file contains the raw sequences of the analysed proteins in FASTA format. Here the sequences were extracted directly from the PDB structures.
crystals.fas: This file contains the raw sequences of the analysed proteins in FASTA format. Here the sequences were retrieved from the SEQRES records and then trimmed to the same length as the sequences extracted from the PDB structures. This procedure makes it possible to include all the amino acids within the selected intervals that are missing from some PDB structures (e.g. 3s12).

root/sequences/models

models.fa: This file contains raw sequences of protein models in FASTA format.

root/mobility

crystals_mob.csv: This file contains a mobility score for each amino acid of the protein crystals, computed according to the GNM (using the ProDy Python package). The number of comma-separated values for each protein equals the number of amino acids within the PDB file.
models_mob.csv: This file contains a mobility score for each amino acid of the protein models, computed according to the GNM (using the ProDy Python package). The number of comma-separated values for each protein equals the number of amino acids within the PDB file.

root/HQI

Selected amino acid indices were applied to the protein sequences in order to describe the proteins in terms of specific physico-chemical properties. The following High Quality Indices (HQI) were extracted from the Amino Acid Index Database and used for this purpose: BLAM930101, BIOV880101, MAXF760101, TSAJ990101, NAKH920108, CEDJ970104, LIFS790101, MIYS990104.
crystals.csv: This file summarizes the amino acid indices calculated for the crystals.

p_id: PDB protein identifier
HQI: each amino acid is represented by 8 floating point numbers. The amino acids within a given sequence are separated by semicolons (;), while the 8 physico-chemical properties of each amino acid are separated by commas (,).

models.csv: This file summarizes the amino acid indices calculated for the models.

p_id: GenBank Protein identifier
HQI: each amino acid is represented by 8 floating point numbers. The amino acids within a given sequence are separated by semicolons (;), while the 8 physico-chemical properties of each amino acid are separated by commas (,).
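The HQI field can be decoded into one list of 8 numbers per residue. The sketch below parses a single HQI string as described above; how that string is quoted or delimited inside the CSV row is not specified here, so extracting the field from the file is left to the caller, and the function name is illustrative.

```python
from typing import List


def parse_hqi(field: str, n_properties: int = 8) -> List[List[float]]:
    """Split an HQI string into per-residue lists of property values."""
    residues = []
    for chunk in field.strip().split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing semicolon or an empty field
        values = [float(v) for v in chunk.split(",")]
        if len(values) != n_properties:
            raise ValueError(
                f"expected {n_properties} values per residue, got {len(values)}")
        residues.append(values)
    return residues
```

The result has one row per amino acid, so its length can be compared with the length of the corresponding sequence as a consistency check.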

root/ESP/crystals

This directory contains the ESP grids in DX format for each crystal.

root/ESP/models

This directory contains the ESP grids in DX format for each model.
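A small reader for these grids can be written without external libraries. The sketch below assumes the OpenDX scalar layout that APBS typically writes (a gridpositions object with counts, an origin line, three delta lines, then the values after a "data follows" line); files that deviate from that layout would need a more careful parser.

```python
import io
from typing import Any, Dict, TextIO


def read_dx(handle: TextIO) -> Dict[str, Any]:
    """Read grid counts, origin, deltas and values from an OpenDX stream."""
    counts = origin = None
    deltas, values = [], []
    in_data = False
    for line in handle:
        tokens = line.split()
        if not tokens or tokens[0].startswith("#"):
            continue  # skip blank lines and comments
        if in_data:
            if tokens[0] in ("attribute", "object", "component"):
                break  # trailer reached, the data block is over
            values.extend(float(t) for t in tokens)
        elif tokens[:4] == ["object", "1", "class", "gridpositions"]:
            counts = tuple(int(t) for t in tokens[-3:])
        elif tokens[0] == "origin":
            origin = tuple(float(t) for t in tokens[1:4])
        elif tokens[0] == "delta":
            deltas.append(tuple(float(t) for t in tokens[1:4]))
        elif "data" in tokens and "follows" in tokens:
            in_data = True
    return {"counts": counts, "origin": origin, "deltas": deltas, "values": values}
```

A quick sanity check after loading is that the number of values equals the product of the three counts.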