SecStrAnnotator:Analysis
Latest revision as of 09:43, 16 July 2020
SecStrAnnotator Suite provides scripts (Python, R) for batch annotation of the whole family and analysis of the annotation results.
Procedure
Data preparation
The directory scripts/secstrapi_data_preparation/ contains a pipeline for annotating the whole protein family, including:
- downloading the list of family members defined by CATH and Pfam,
- downloading their structures,
- selecting a non-redundant set,
- annotation,
- multiple sequence alignment for individual SSEs,
- formatting into SecStrAPI format,
- formatting into TSV format for further analyses.
The whole pipeline can be executed with scripts/SecStrAPI_pipeline.py.
Example usage:
python3 scripts/SecStrAPI_pipeline.py scripts/SecStrAPI_pipeline_settings.json --resume
Before running, modify the settings in SecStrAPI_pipeline_settings.json to set your family of interest, annotation template, data directory, etc. (see README.txt for more details).
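When the pipeline is started from another Python script (for example as part of a larger batch job), the command above can be wrapped as in the following minimal sketch. The helper names build_pipeline_command and run_pipeline are hypothetical and not part of SecStrAnnotator Suite; only the script path, the settings-file argument and the --resume flag are taken from the example usage. The sketch assumes that the working directory is the root of the suite.

```python
import subprocess
from pathlib import Path


def build_pipeline_command(settings_file, resume=True, python="python3"):
    """Build the command line from the example usage.

    Hypothetical helper: only the script path, the settings argument and
    the --resume flag come from the documentation.
    """
    command = [python, "scripts/SecStrAPI_pipeline.py", str(settings_file)]
    if resume:
        command.append("--resume")
    return command


def run_pipeline(settings_file, resume=True):
    """Run the pipeline; raises CalledProcessError on a non-zero exit code."""
    settings_file = Path(settings_file)
    if not settings_file.is_file():
        raise FileNotFoundError(f"Settings file not found: {settings_file}")
    subprocess.run(build_pipeline_command(settings_file, resume), check=True)


# Usage (assumes the current directory is the suite root):
# run_pipeline("scripts/SecStrAPI_pipeline_settings.json")
```

Checking for the settings file before launching gives an early, readable error instead of a failure inside the pipeline.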
Data analysis
The directory scripts/R_sec_str_anatomy_analysis/ contains a pipeline for statistical analysis of the annotation results on the whole protein family, including:
- reading the annotation results from TSV,
- generating plots,
- performing statistical tests to compare eukaryotic and bacterial structures (or any two sets of structures).
Example usage:
- Launch rstudio from this directory
- In sec_str_anatomy.R, set DATADIR to the path to your annotation data created in the Data preparation step
- In sec_str_anatomy_settings.R, modify the family-specific settings (list of helices and strands)
- Run sec_str_anatomy.R line by line
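The comparison of two sets of structures mentioned above is implemented in the R scripts, and this page does not say which statistical test they use. Purely as an illustration of the idea, the following Python sketch compares one numeric property (for example the length of a given helix) between two groups with a two-sided permutation test. The choice of test, the function name and the example values are all assumptions, not the suite's method.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Returns the estimated p-value: the fraction of random relabelings
    whose absolute mean difference is at least the observed one.
    """
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    rng = random.Random(seed)  # fixed seed makes the result reproducible
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical helix lengths (in residues) in two sets of structures.
eukaryotic = [10, 11, 12, 13, 14, 15]
bacterial = [1, 2, 3, 4, 5, 6]
p_value = permutation_test(eukaryotic, bacterial, n_permutations=2000)
```

A permutation test makes no assumption about the shape of the distributions, which is why it is used here as a generic stand-in.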
Example case study: Cytochromes P450
Data
For the Cytochrome P450 family, structures of 1855 protein domains are available, located in 1012 PDB entries (updated on 7 July 2020). The analysis was performed on a non-redundant subset containing 183 protein domains.
The data are available at https://doi.org/10.5281/zenodo.3939133 (structural files are not included because of their size).
Occurrence of SSEs
The occurrence of an SSE is the percentage of structures in which that SSE is present.
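The definition can be illustrated with a minimal Python sketch. The input layout (a dictionary from domain identifier to the SSE labels annotated in that domain) and the example labels are hypothetical; the actual analysis reads the annotation results from TSV in the R scripts.

```python
from collections import Counter


def sse_occurrence(annotations):
    """Return {SSE label: percentage of structures containing it}.

    `annotations` maps each domain identifier to the collection of SSE
    labels annotated in that domain (hypothetical input layout).
    """
    n_structures = len(annotations)
    if n_structures == 0:
        return {}
    counts = Counter()
    for labels in annotations.values():
        counts.update(set(labels))  # count each SSE once per structure
    return {label: 100.0 * n / n_structures for label, n in counts.items()}


# Hypothetical annotations of four domains.
example = {
    "domain1": ["A", "B", "C"],
    "domain2": ["A", "C"],
    "domain3": ["A", "B'"],
    "domain4": ["A", "C"],
}
occurrence = sse_occurrence(example)
```

In this toy example, SSE A is present in all four domains (100 %), C in three of them (75 %), and B in one (25 %).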
Length of SSEs
The length of an SSE is measured as the number of residues. The following violin plots show the distribution of length for each SSE.
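As a minimal sketch, the length can be computed from the first and last residue of an SSE, and the distribution for one SSE can be summarized numerically. The function names are hypothetical, and the sketch assumes that residues are numbered consecutively between start and end (no gaps or insertion codes).

```python
from statistics import median


def sse_length(start, end):
    """Number of residues in an SSE spanning residues start..end inclusive.

    Assumes consecutive residue numbering between start and end.
    """
    if end < start:
        raise ValueError("end must not be smaller than start")
    return end - start + 1


def length_summary(lengths):
    """Minimum, median and maximum of a list of SSE lengths."""
    return {"min": min(lengths), "median": median(lengths), "max": max(lengths)}


# Hypothetical boundaries of the same helix in three structures.
boundaries = [(10, 25), (12, 29), (11, 25)]
lengths = [sse_length(start, end) for start, end in boundaries]
summary = length_summary(lengths)
```

Both boundary residues count as part of the SSE, hence the + 1 in the length.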
Sequence of SSEs
The amino acid sequences for each SSE can be aligned and used to produce a sequence logo. Where the sequence conservation is sufficient, we can establish a generic numbering scheme: the most conserved residue in helix X serves as its reference residue and is numbered as @X.50. The remaining residues in the helix are numbered accordingly.
- Helices
- Beta strands
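The generic numbering scheme described above can be sketched in Python as follows. Two points are assumptions of this sketch rather than statements from the page: conservation is measured simply as the frequency of the most common residue in an alignment column, and "numbered accordingly" is read as counting consecutively up and down from 50. The function names and the example alignment are hypothetical.

```python
from collections import Counter


def most_conserved_column(aligned_sequences):
    """Index of the alignment column with the highest frequency of a
    single residue type (gaps '-' are ignored). A simple stand-in for a
    proper conservation measure; ties go to the first such column."""
    n_columns = len(aligned_sequences[0])
    best_index, best_fraction = 0, -1.0
    for i in range(n_columns):
        column = [seq[i] for seq in aligned_sequences if seq[i] != "-"]
        if not column:
            continue
        top_count = Counter(column).most_common(1)[0][1]
        fraction = top_count / len(aligned_sequences)
        if fraction > best_fraction:
            best_index, best_fraction = i, fraction
    return best_index


def generic_numbers(sse_label, n_columns, reference_index, reference_number=50):
    """Label every alignment column as @<SSE>.<number>, giving the
    reference column the number 50 and counting consecutively from it."""
    return [
        f"@{sse_label}.{reference_number + (i - reference_index)}"
        for i in range(n_columns)
    ]


# Hypothetical alignment of helix X from four structures.
alignment = ["AFGRL", "SFGKL", "AFGRM", "-FGHL"]
ref = most_conserved_column(alignment)
labels = generic_numbers("X", len(alignment[0]), ref)
```

In this example the second column (F in all four sequences) becomes the reference residue @X.50, so the five columns are labeled @X.49 to @X.53.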