
SecStrAnnotator:Analysis

From WebChemistry Wiki

SecStrAnnotator Suite provides scripts (Python, R) for batch annotation of a whole protein family and for analysis of the annotation results.

Procedure


Data preparation


The directory scripts/secstrapi_data_preparation/ contains a pipeline for annotating the whole protein family, including:

  • downloading the list of family members defined by CATH and Pfam,
  • downloading their structures,
  • selecting a non-redundant set,
  • annotation,
  • multiple sequence alignment for individual SSEs,
  • formatting into SecStrAPI format,
  • formatting into TSV format for further analyses.

The whole pipeline can be executed with scripts/SecStrAPI_pipeline.py.

Example usage:

 python3 scripts/SecStrAPI_pipeline.py scripts/SecStrAPI_pipeline_settings.json --resume

Before running, modify the settings in SecStrAPI_pipeline_settings.json to set your family of interest, annotation template, data directory, etc. (see README.txt for more details).
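When the pipeline is launched from another Python script (for example, once per family), the command above can be assembled programmatically. The following is only a minimal sketch: the helper names build_pipeline_command and run_pipeline are hypothetical and not part of SecStrAnnotator Suite, and the --resume flag is passed exactly as in the example usage (see README.txt for its meaning).

```python
import subprocess
import sys
from pathlib import Path

# Path of the pipeline script, as given in the example usage above.
PIPELINE_SCRIPT = "scripts/SecStrAPI_pipeline.py"


def build_pipeline_command(settings_file, resume=True):
    """Assemble the command line from the example usage (hypothetical helper)."""
    command = [sys.executable, PIPELINE_SCRIPT, str(settings_file)]
    if resume:
        command.append("--resume")
    return command


def run_pipeline(settings_file, resume=True):
    """Run the pipeline and raise an error if it exits with a non-zero status."""
    settings_file = Path(settings_file)
    if not settings_file.is_file():
        raise FileNotFoundError(f"Settings file not found: {settings_file}")
    subprocess.run(build_pipeline_command(settings_file, resume), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_pipeline() to actually execute it.
    print(" ".join(build_pipeline_command("scripts/SecStrAPI_pipeline_settings.json")))
```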

Data analysis


The directory scripts/R_sec_str_anatomy_analysis/ contains a pipeline for statistical analysis of the annotation results on the whole protein family, including:

  • reading the annotation results from TSV,
  • generating plots,
  • performing statistical tests to compare eukaryotic and bacterial structures (or any two sets of structures).

Example usage:

  • Launch RStudio from this directory
  • In sec_str_anatomy.R, set DATADIR to the path to your annotation data created in the Data preparation step
  • In sec_str_anatomy_settings.R, modify the family-specific settings (list of helices and strands)
  • Run sec_str_anatomy.R line by line

Example case study: Cytochromes P450


Data


For the Cytochrome P450 family, structures of 1855 protein domains are available, located in 1012 PDB entries (updated on 7 July 2020). The analysis was performed on a non-redundant subset containing 183 protein domains.

The data are available here (structural files not included because of their size).

Occurrence of SSEs


The occurrence of an SSE is the percentage of structures in which that SSE is present.

  • Occurrence of particular SSEs in the whole set.
  • Occurrence of particular SSEs – comparison of bacterial and eukaryotic structures.
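The occurrence and a two-group comparison can be sketched in a few lines of Python. This is an illustration only, not the code of the R pipeline: the input format (a mapping from domain name to the set of SSE labels found in it), the toy domains and labels, and the choice of a two-sided Fisher's exact test are all assumptions; the R scripts may use a different test.

```python
from math import comb


def count_present(sses_by_domain, sse):
    """Number of domains in which the given SSE label is present."""
    return sum(1 for sses in sses_by_domain.values() if sse in sses)


def occurrence(sses_by_domain, sse):
    """Percentage of domains in which the given SSE label is present."""
    if not sses_by_domain:
        raise ValueError("empty set of domains")
    return 100.0 * count_present(sses_by_domain, sse) / len(sses_by_domain)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    p_value = sum(prob(x) for x in range(low, high + 1)
                  if prob(x) <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)


def compare_occurrence(group1, group2, sse):
    """P-value for the difference in occurrence of one SSE between two groups."""
    a = count_present(group1, sse)
    c = count_present(group2, sse)
    return fisher_exact_two_sided(a, len(group1) - a, c, len(group2) - c)


# Toy data (made up for illustration, not real annotation results).
bacterial = {"b1": {"A", "B", "J'"}, "b2": {"A", "B"}, "b3": {"A", "B"}, "b4": {"A"}}
eukaryotic = {"e1": {"A", "B", "J'"}, "e2": {"A", "J'"}}

print(occurrence(bacterial, "B"))                    # 75.0
print(compare_occurrence(bacterial, eukaryotic, "J'"))
```

When many SSEs are tested at once, a multiple-testing correction would normally be applied to the resulting p-values.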

Length of SSEs


The length of an SSE is measured as the number of residues. The following violin plots show the distribution of length for each SSE.

  • Length distribution of particular SSEs in the whole set.
  • Length distribution of particular SSEs – comparison of bacterial and eukaryotic structures.
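The numbers behind such plots can be summarized directly from a TSV table. The sketch below assumes a table with one row per annotated SSE and hypothetical column names label and length; the actual columns of the TSV files produced by the pipeline may differ, and the embedded table is toy data.

```python
import csv
import io
import statistics
from collections import defaultdict


def lengths_by_sse(tsv_text, label_column="label", length_column="length"):
    """Group SSE lengths (in residues) by SSE label."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    lengths = defaultdict(list)
    for row in reader:
        lengths[row[label_column]].append(int(row[length_column]))
    return dict(lengths)


def summarize(lengths):
    """Per-SSE count, minimum, median and maximum length."""
    return {
        label: {
            "n": len(values),
            "min": min(values),
            "median": statistics.median(values),
            "max": max(values),
        }
        for label, values in lengths.items()
    }


# Toy table (made up for illustration); column names are assumptions.
TOY_TSV = (
    "domain\tlabel\tlength\n"
    "d1\tA\t10\n"
    "d2\tA\t12\n"
    "d3\tA\t11\n"
    "d1\tB\t5\n"
)

summary = summarize(lengths_by_sse(TOY_TSV))
for label, stats in summary.items():
    print(label, stats)
```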

Sequence of SSEs


The amino acid sequences for each SSE can be aligned and used to produce a sequence logo. Where the sequence conservation is sufficient, we can establish a generic numbering scheme: the most conserved residue in helix X serves as its reference residue and is numbered as @X.50. The remaining residues in the helix are numbered accordingly.
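The numbering scheme can be illustrated with a short Python sketch. Two points are assumptions rather than the documented method: conservation is measured here as the frequency of the most common residue in an alignment column (gaps counting against it), and the remaining residues are numbered by their offset from the reference residue within the same sequence. The function names and the toy alignment are hypothetical.

```python
from collections import Counter

GAP = "-"


def most_conserved_column(alignment):
    """Index of the column whose most common residue is the most frequent."""
    if not alignment:
        raise ValueError("empty alignment")
    width = len(alignment[0])
    if any(len(seq) != width for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    best_col, best_score = None, -1.0
    for col in range(width):
        counts = Counter(seq[col] for seq in alignment if seq[col] != GAP)
        if not counts:
            continue  # column contains only gaps
        score = counts.most_common(1)[0][1] / len(alignment)
        if score > best_score:
            best_col, best_score = col, score
    if best_col is None:
        raise ValueError("alignment contains only gaps")
    return best_col


def generic_numbers(aligned_seq, ref_col, sse, ref_number=50):
    """Label each residue of one aligned sequence as @<sse>.<number>."""
    if aligned_seq[ref_col] == GAP:
        raise ValueError("sequence has no residue at the reference column")
    # Position of the reference residue within the ungapped sequence.
    ref_pos = sum(1 for char in aligned_seq[:ref_col] if char != GAP)
    residues = [char for char in aligned_seq if char != GAP]
    return [(f"@{sse}.{ref_number + i - ref_pos}", residue)
            for i, residue in enumerate(residues)]


# Toy alignment of helix X from three structures (made up for illustration).
alignment = ["-AGFL", "KAGWL", "RSGY-"]
ref_col = most_conserved_column(alignment)  # column 2, the fully conserved G
print(generic_numbers(alignment[0], ref_col, "X"))
# [('@X.49', 'A'), ('@X.50', 'G'), ('@X.51', 'F'), ('@X.52', 'L')]
```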



