SecStrAnnotator:Analysis
SecStrAnnotator Suite provides scripts (Python, R) for batch annotation of a whole protein family and for analysis of the annotation results.
Procedure
Data preparation
The directory scripts/secstrapi_data_preparation/ contains a pipeline for annotating the whole protein family, including:
- downloading the list of family members defined by CATH and Pfam,
- downloading their structures,
- selecting a non-redundant set,
- annotation,
- multiple sequence alignment for individual SSEs,
- formatting into SecStrAPI format,
- formatting into TSV format for further analyses.
The whole pipeline can be executed with scripts/SecStrAPI_pipeline.py.
Example usage:
python3 scripts/SecStrAPI_pipeline.py scripts/SecStrAPI_pipeline_settings.json --resume
Before running, modify the settings in SecStrAPI_pipeline_settings.json to set your family of interest, annotation template, data directory, etc. (see README.txt for more details).
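If the settings need to be changed for several families, the edit can be scripted instead of done by hand. The following is a minimal sketch only: the helper update_settings and the key names in the usage note are hypothetical, and the real keys are the ones listed in SecStrAPI_pipeline_settings.json and README.txt.

```python
import json
from pathlib import Path


def update_settings(path, overrides):
    """Load a JSON settings file, apply overrides, and write it back.

    Hypothetical helper, not part of SecStrAnnotator Suite.
    Keys that are not already in the file are rejected, so a typo in a
    key name fails loudly instead of being silently ignored.
    """
    settings_path = Path(path)
    settings = json.loads(settings_path.read_text(encoding="utf-8"))
    unknown = sorted(set(overrides) - set(settings))
    if unknown:
        raise KeyError(f"Keys not present in {settings_path.name}: {unknown}")
    settings.update(overrides)
    settings_path.write_text(json.dumps(settings, indent=4) + "\n", encoding="utf-8")
    return settings
```

A call might look like update_settings("scripts/SecStrAPI_pipeline_settings.json", {"some_existing_key": "new value"}), where some_existing_key stands for a key that is actually present in your settings file.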
Data analysis
The directory scripts/R_sec_str_anatomy_analysis/ contains a pipeline for statistical analysis of the annotation results on the whole protein family, including:
- reading the annotation results from TSV,
- generating plots,
- performing a statistical test to compare eukaryotic and bacterial structures (or any two sets of structures).
Example usage:
- Launch rstudio from this directory.
- In sec_str_anatomy.R, set DATADIR to the path to your annotation data created in the Data preparation step.
- In sec_str_anatomy_settings.R, modify the family-specific settings (list of helices and strands).
- Run sec_str_anatomy.R line by line.
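To illustrate what comparing two sets of structures can involve, here is a minimal Python sketch of a two-sided permutation test on the difference of group means. It is not the test implemented in the R scripts, which may use a different method. The function name and the example values are assumptions made for illustration.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Illustrative stand-in only; the R pipeline may use another test.
    Returns an estimated p-value.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the statistic.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical helix lengths (in residues) for two sets of structures.
eukaryotic = [10, 11, 12, 13, 14, 15, 16, 17]
bacterial = [20, 21, 22, 23, 24, 25, 26, 27]
p_value = permutation_test(eukaryotic, bacterial, n_permutations=2000)
```

A permutation test is used here because it makes no assumption about the shape of the length distribution. When many SSEs are tested, the resulting p-values would normally need a correction for multiple testing.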
Example case study: Cytochromes P450
Data
For the Cytochrome P450 family, structures of 1855 protein domains are available, located in 1012 PDB entries (updated on 7 July 2020). The analysis was performed on a non-redundant subset containing 183 protein domains.
The data are available here (structural files not included because of their size).
Occurrence of SSEs
The occurrence of an SSE is the percentage of structures in which that SSE is present.
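The definition of occurrence can be sketched in a few lines of Python. The input layout (a dictionary from domain identifier to the set of SSE labels found in that domain), the domain identifiers, and the labels are assumptions for illustration and do not reflect the actual TSV format.

```python
def occurrence_percent(annotations, sse_labels):
    """Return the percentage of structures in which each SSE is present.

    annotations: dict mapping a domain ID to the set of SSE labels
                 annotated in that domain (hypothetical layout).
    sse_labels:  the SSE labels to report.
    """
    n_structures = len(annotations)
    if n_structures == 0:
        raise ValueError("no structures given")
    result = {}
    for label in sse_labels:
        n_present = sum(label in found for found in annotations.values())
        result[label] = 100.0 * n_present / n_structures
    return result


# Hypothetical annotations for four domains.
example = {
    "dom1": {"A", "B"},
    "dom2": {"A", "B"},
    "dom3": {"A"},
    "dom4": {"A", "B"},
}
occurrence = occurrence_percent(example, ["A", "B", "C"])
```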
Length of SSEs
The length of an SSE is measured as the number of residues. The following violin plots show the distribution of length for each SSE.
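When the plots are not at hand, the same distribution can be summarised numerically. This is a minimal sketch, assuming the lengths of one SSE across the structures are already available as a list of integers. The function name and the choice of the inclusive quartile method are illustrative and not taken from the R scripts.

```python
from statistics import median, quantiles


def length_summary(lengths):
    """Summarise the length distribution of one SSE (lengths in residues)."""
    if len(lengths) < 2:
        raise ValueError("need at least two lengths to compute quartiles")
    # Quartiles computed with the 'inclusive' method (an illustrative choice).
    q1, _, q3 = quantiles(lengths, n=4, method="inclusive")
    return {
        "count": len(lengths),
        "min": min(lengths),
        "q1": q1,
        "median": median(lengths),
        "q3": q3,
        "max": max(lengths),
    }


# Hypothetical lengths of one helix in five structures.
summary = length_summary([10, 12, 14, 16, 18])
```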
Sequence of SSEs
The amino acid sequences for each SSE can be aligned and used to produce a sequence logo. Where the sequence conservation is sufficient, we can establish a generic numbering scheme: the most conserved residue in helix X serves as its reference residue and is numbered as @X.50. The remaining residues in the helix are numbered accordingly.
- Helices
- Beta strands
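The generic numbering scheme described above can be sketched as follows. The sketch assumes that "numbered accordingly" means the number changes by one per sequence position, so the residue before the reference is @X.49 and the residue after it is @X.51. The function name, the helix label, and the residue numbers are hypothetical.

```python
def generic_numbers(helix, residues, reference):
    """Map residue numbers of one helix to generic labels such as '@X.50'.

    helix:     helix label, e.g. "I".
    residues:  residue numbers of the helix in sequence order.
    reference: residue number of the most conserved (reference) residue.
    """
    if reference not in residues:
        raise ValueError(f"reference residue {reference} is not in helix {helix}")
    ref_index = residues.index(reference)
    # Offsets are positional, so gaps in the residue numbering do not
    # shift the generic numbers.
    return {
        res: f"@{helix}.{50 + i - ref_index}"
        for i, res in enumerate(residues)
    }


# Hypothetical helix "I" spanning residues 300-304 with reference residue 302.
labels = generic_numbers("I", [300, 301, 302, 303, 304], 302)
```

Here labels[302] is "@I.50", labels[301] is "@I.49", and labels[304] is "@I.52".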