
ValidatorDB: Difference between revisions

From WebChemistry Wiki
Deepti (talk | contribs)
Replaced content with "Content of this page was moved here."
 
(122 intermediate revisions by 3 users not shown)
 
Content of this page was moved [[ValidatorDB:UserManual | here]].
The advancement of research in structural biology has provided a large body of structural data deposited in various databases. One prominent example is the Protein Data Bank (PDB), which has been growing exponentially, and which currently consists of more than 100,000 structures of biomolecules and their complexes. Such large bodies of data, especially when accumulated over a short period of time using high-throughput techniques, will inherently be plagued by various problems.
 
Validation arose as a major issue in the structural biology community when it became apparent that some published structures contained serious errors, either documented (e.g., due to insufficient electron density in a certain area), or not. Structural databases generally require that new submissions be checked prior to acceptance. The tools employed for pre-submission validation work fairly well for well-studied residues like amino acids or nucleotides. However, an essential step in the validation process is checking the ligand structure, because ligands play a key role in protein function, and also because they are the main source of errors in structures. Ligand validation, as well as the validation of uncommon residues, is a very challenging task, because of the high diversity and nontriviality of their structure, and the general lack of information about correct structures. Therefore, software tools focused on ligand validation were developed relatively recently<ref>Lütteke T, von der Lieth C-W. BMC Bioinformatics (2004) 5, 69.</ref><ref>Kleywegt GJ, Harris MR. Acta Crystallographica D. Biological Crystallography (2007) 63: 935–8.</ref>, and the topic is still under active development [3]. These tools are able to validate one or more structures (even thousands of structures), but they are not able to provide the broad scientific community with a comprehensive picture of the quality of structures in dedicated and well-established structural databases. For example, a general overview and corresponding statistical evaluation of validation results for residues and ligands in the entire PDB is not yet available, despite the exponential growth of the PDB and the development of structural validation tools in recent years.
 
We recently developed MotiveValidator [4], an interactive platform for the speedy validation of ligands, residues and fragments using a novel, straightforward approach based on the validation of residue annotation. MotiveValidator employs advanced algorithms for the detection and comparison of structural motifs [5], along with tools for chirality verification [6] and interactive visualization of 3D structures [7]. Using MotiveValidator, we further created ValidatorDB, a comprehensive resource of validation results for residues and ligands in the Protein Data Bank. Along with validation results for individual residues and ligands, ValidatorDB also provides a summary and statistical evaluation of the validation results at various levels of detail within the PDB. Thus, ValidatorDB offers a comprehensive overview of the quality of the ligand structures in the entire PDB.
 
ValidatorDB contains precomputed validation results for ligands and residues in the Protein Data Bank. The database is updated on a weekly basis.
 
The residues deemed relevant for validation are all ligands and residues with reasonable size (more than six heavy atoms), with the exception of standard amino acids and nucleotides. The validation is performed using MotiveValidator, and the residue models from wwPDB Chemical Component Dictionary (wwPDB CCD) are used as reference structures for validation.
 
==Availability and technical details==
 
===Where to find ValidatorDB===
ValidatorDB has been freely available online since May 2014 at http://ncbr.muni.cz/MotiveValidatorDB. No login is required to access ValidatorDB.
 
===What you need in order to access ValidatorDB===
ValidatorDB is a database, or more precisely a collection of validation results for ligands and residues in the Protein Data Bank. The database is maintained on the ncbr.chemi.muni.cz server at the National Centre for Biomolecular Research within Masaryk University, Czech Republic, and is updated weekly. All you need in order to access ValidatorDB is an up-to-date internet browser with JavaScript enabled, and a working internet connection. The only functionality that relies on your system is the display of 3D models, for which your browser needs to support WebGL. If you experience trouble displaying the 3D models, please check http://get.webgl.org to find out how to enable WebGL on your system.
 
===How to get around the web page===
 
As soon as you type in the address http://ncbr.muni.cz/ValidatorDB, you will reach the ValidatorDB synopsis page, which contains a brief, general description of ValidatorDB, along with 3 tabs (Figure 1A). The tabs on the synopsis page provide access to overviews and statistical evaluations of validation results for the entire PDB, for each residue across all PDB entries containing that residue, and for all analyzed residues in each PDB entry, in graphical or tabular form. Click on each tab to discover what type of overview can be accessed. Further, the ValidatorDB specifics page (Figure 1B), which is accessible by looking up specific residues or PDB IDs on the synopsis page, lets you view the results for selected residues in more detail. The specifics page is also organized into tabs that allow different levels of analysis of the results. Last but not least, remember to check the tool tips by hovering the mouse cursor over any graphical or textual element in the ValidatorDB interface.
 
------------------------------------------------
Before moving on to more extensive descriptions of features, it is important to clearly establish the meaning of a few key terms and principles within the ValidatorDB environment. See [[MotiveValidator:Terminology|Terminology]].
 
------------------------------------------------
 
==Basic Principles==
 
===Residues and ligands relevant for validation===
As mentioned in section 1, well-studied residues like amino acids and nucleotides are routinely validated upon submission of new structures to the PDB. Furthermore, reports of the quality of their structure are already accessible. The challenge addressed by ValidatorDB lies in providing access to validation results for residues other than the well-studied amino acids and nucleotides. This generally includes ligands and uncommon residues (e.g., substituted amino acids), which exhibit high diversity and nontriviality in their structure, and for which there is generally much less information regarding correct structures. Thus, within the ValidatorDB environment, we further refine the meaning of the terms residue and ligand to refer to residues and ligands relevant for validation. Specifically, these are all ligands and residues of reasonable size (more than six heavy atoms), with the exception of standard amino acids and nucleotides. All other features of the terms residue and ligand described in sections 2.1 and 2.2 are maintained. Henceforth, all references to residues and ligands in this manual, as well as in the ValidatorDB web pages (including Wiki and tutorial), have the meaning of residues and ligands relevant for validation. The PDB currently holds over 17,000 residues and ligands relevant for validation.
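
The selection rule described above can be sketched in a few lines of Python. This is only an illustration, not the MotiveValidator implementation: the `Residue` record, the deliberately abbreviated `STANDARD_NAMES` set, and the helpers `heavy_atom_count` and `is_relevant` are all hypothetical names, and the sketch assumes that hydrogen (or deuterium) is the only non-heavy element.

```python
from dataclasses import dataclass

# Abbreviated, illustrative set of standard residue names to exclude;
# a real list would cover all standard amino acids and nucleotides.
STANDARD_NAMES = {"ALA", "GLY", "SER", "LYS", "A", "C", "G", "U", "DA", "DC", "DG", "DT"}

@dataclass
class Residue:
    name: str        # residue name (3-letter code)
    elements: list   # element symbol of each atom, e.g. ["C", "O", "H"]

def heavy_atom_count(residue):
    """Count atoms that are not hydrogen (or deuterium)."""
    return sum(1 for e in residue.elements if e.upper() not in ("H", "D"))

def is_relevant(residue):
    """More than six heavy atoms, and not a standard amino acid or nucleotide."""
    return residue.name not in STANDARD_NAMES and heavy_atom_count(residue) > 6

# Toy residues described only by their element composition.
galactose = Residue("GAL", ["C"] * 6 + ["O"] * 6 + ["H"] * 12)
water = Residue("HOH", ["O", "H", "H"])
lysine = Residue("LYS", ["N", "C", "C", "O", "C", "C", "C", "C", "N"])
```

In this toy data, galactose passes the filter, water is rejected for its size, and lysine is rejected by name even though it has more than six heavy atoms.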
 
===Validation===
[[File:VDB_manual_figures2.png|thumb|right|Caption for the image]]
As stated above, the validation results stored in ValidatorDB are updated every week. Within the ValidatorDB environment, the term validation refers to the process of determining whether a residue or ligand is structurally complete and correctly annotated. This means checking whether the topology and chirality of each motif of a validated residue (section 2.4) correspond to those of the model residue (section 2.5) with the same name as the validated residue.
The validation of residues and ligands in the entire PDB takes place in a few distinct steps.
First, for each PDB entry, the residues relevant for validation are detected based on their name (3-letter code) and number of atoms (more than 6 heavy atoms). Amino acid residues and nucleotides are excluded based on their residue name. Then, for each validated residue, the corresponding model (same 3-letter code as the validated residue) is retrieved from wwPDB CCD, and each motif of this residue is validated against the model.
ValidatorDB is then built as the collection of validation results for all motifs of all residues in all PDB entries (Figure 2A).
The validation of each motif against the model residue can be illustrated on a galactose (GAL) motif from the PDB entry 1bzw (Figure 2B). The validated residue GAL is extracted from PDB entry 1bzw in the form of an input motif, which contains all the atoms of the validated residue, together with all atoms found within one or two bonds of any atom from the validated residue (surroundings). Then, by superimposing the input motif and model residue, the validated motif is obtained as the subset of atoms in the input motif which have a corresponding atom in the model residue. Comparing each atom and bond in the validated motif to those in the model residue produces the validation results.
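
The last step, obtaining the validated motif from the input motif, can be pictured with a small Python sketch. It assumes the superimposition has already produced an atom-to-atom mapping; the function `validated_motif`, the integer atom ids and the toy mapping are hypothetical and do not reflect MotiveValidator's internal data structures.

```python
def validated_motif(input_atoms, model_to_input):
    """Return the input-motif atoms that have a counterpart in the model.

    input_atoms: set of atom ids in the input motif (residue + surroundings)
    model_to_input: dict mapping each model atom name to the matched
                    input atom id, or None when no match was found
    """
    matched = {a for a in model_to_input.values() if a is not None}
    return matched & input_atoms

# Toy data: atom ids 1-4 belong to the residue, 5 comes from the surroundings.
input_atoms = {1, 2, 3, 4, 5}
model_to_input = {"C1": 1, "C2": 2, "O1": 5, "O2": None}
motif = validated_motif(input_atoms, model_to_input)
```

Here the model atom `O2` has no counterpart, so it would later be reported as missing, while input atoms 3 and 4 are left out of the validated motif because nothing in the model maps to them.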
 
==Validation results==
For each validated motif, ValidatorDB contains several types of results. Since the evaluation of the validated motif relies on comparing all atoms and bonds in the validated motif to those in the model residue, the first results that can be encountered are
errors. Namely:
 
*Missing atoms: an atom in the model residue has no corresponding atom in the validated motif.
 
*Missing rings: at least one missing atom originates from cycles (rings).
 
*Wrong chirality: an atom from the validated motif has different chirality than the corresponding atom from the model residue.
 
*Wrong chirality (planar): the chirality error was found on a planar chiral center. Because of their spatial arrangement, planar chiral centers are very sensitive even to small perturbations in the position of the substituents. Therefore, some of the errors reported here might not be significant.
 
*Uncertain chirality: the presence of unusual bonds may cause an improper evaluation of chirality.
 
Chirality is only evaluated for those motifs which are complete. This is because the absence of some atoms can prevent the proper evaluation of chirality on the chiral centers present in the
validated motif. Therefore, note that all motifs which are counted in the Wrong chirality category are in fact complete. At the same time, the motifs with no missing atoms and no chirality error are
actually counted in a category called Correct chirality.
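
The ordering of these checks (completeness first, chirality only for complete motifs) can be sketched as follows. This is a simplified illustration under stated assumptions: `classify_motif` is a hypothetical helper, chirality is reduced to a text label per atom keyed by the model atom name, and the planar and uncertain chirality categories are omitted.

```python
def classify_motif(model_atoms, ring_atoms, mapping, model_chirality, motif_chirality):
    """Return one error category for a single validated motif.

    model_atoms: set of atom names in the model residue
    ring_atoms: subset of model_atoms that belong to rings
    mapping: dict model atom name -> matched motif atom (absent if missing)
    model_chirality, motif_chirality: dict model atom name -> chirality label
    """
    missing = {a for a in model_atoms if a not in mapping}
    if missing:
        # Chirality is not evaluated for incomplete motifs.
        return "Missing rings" if missing & ring_atoms else "Missing atoms"
    wrong = [a for a, label in model_chirality.items()
             if motif_chirality.get(a) != label]
    return "Wrong chirality" if wrong else "Correct chirality"

# Toy model residue with a two-atom "ring" and one substituent.
model_atoms = {"C1", "C2", "O1"}
ring_atoms = {"C1", "C2"}
complete = {"C1": "C1", "C2": "C2", "O1": "O1"}
result = classify_motif(model_atoms, ring_atoms, complete, {"C1": "R"}, {"C1": "S"})
```

Because the toy motif is complete but its `C1` label differs from the model, `result` falls in the Wrong chirality category.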
 
--------------------------
Suspicious discrepancies between the atoms and inter-atomic bonds in the validated motif and in the model residue are reported as
warnings. Namely:
 
*Substitution: an atom from the validated motif is of a different chemical element than the corresponding atom in the model residue (e.g. O mapped to N). This happens often at linkage sites.
 
*Different naming: an atom from the validated motif has a different PDB atom name than the corresponding atom from the model residue (e.g. the C1 atom mapped to the C7 atom). This happens often when the original PDB files were produced by different software.
 
*Foreign atom: an atom from the model residue was mapped to an atom from outside the validated residue (i.e. from its surroundings).
 
*Alternate locations: in the original PDB file, the validated residue contains atoms which were given in alternate locations (i.e., most probably different rotamers). Only the first rotamer was considered during validation.
 
*Zero model RMSD: the superimposition between the model residue and the validated motif has a root mean square deviation of zero, i.e., the validated motif is identical to the model residue used as reference.
 
Discrepancies between the atoms and inter-atomic bonds in the validated motif and in the model residue that are severe enough to prevent validation are reported as processing errors, and such motifs are not validated.
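
Three of the warnings listed above (Substitution, Different naming, Foreign atom) follow directly from the atom mapping, as the sketch below illustrates. The function `find_warnings` and its data layout are hypothetical, chosen only to make the definitions concrete; the remaining warnings depend on information not modeled here.

```python
def find_warnings(mapping, model, motif, residue_atom_ids):
    """List (warning, model atom name, motif atom name) tuples.

    mapping: dict model atom name -> motif atom id
    model: dict model atom name -> element symbol
    motif: dict motif atom id -> (atom name, element symbol)
    residue_atom_ids: ids of motif atoms that belong to the validated
                      residue itself (everything else is surroundings)
    """
    warnings = []
    for model_name, atom_id in mapping.items():
        name, element = motif[atom_id]
        if element != model[model_name]:
            warnings.append(("Substitution", model_name, name))
        if name != model_name:
            warnings.append(("Different naming", model_name, name))
        if atom_id not in residue_atom_ids:
            warnings.append(("Foreign atom", model_name, name))
    return warnings

# Toy data: atom 2 is a nitrogen where the model has oxygen,
# atom 3 is a differently named carbon from the surroundings.
model = {"C1": "C", "O1": "O", "C7": "C"}
motif = {1: ("C1", "C"), 2: ("O1", "N"), 3: ("C9", "C")}
mapping = {"C1": 1, "O1": 2, "C7": 3}
found = find_warnings(mapping, model, motif, residue_atom_ids={1, 2})
```

A single mapped atom can raise more than one warning, as the third atom does in this toy example.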
 
Typical validation results that can be found in ValidatorDB are illustrated on the galactose motif mentioned in section 2.6 (Figure 2C). As a general rule, in the ValidatorDB interface, errors are marked in red (missing atoms) or dark yellow (wrong chirality), correct structures in green, and warnings in cyan.
 
==Database contents==
ValidatorDB contains precomputed validation results for ligands and residues in the Protein Data Bank. The database is updated on a weekly basis. The validation is performed using MotiveValidator, and the residue models from wwPDB CCD are used as reference templates for validation. All residues of significant size (more than six heavy atoms) have been included in ValidatorDB, with the exception of amino acids and nucleotides, which are checked thoroughly upon submission of the structure to the PDB, and thus do not require additional validation.
 
The validation results available in ValidatorDB indicate whether each motif (occurrence, instance) of a ligand or residue in the PDB exhibits the topology and stereochemistry expected from its annotation (3-letter code), or how it differs from this annotation. Additionally, all issues related to incorrect or suspicious topology and stereochemistry are explicitly described in a comprehensive and intuitive manner (e.g., location of missing atoms or chirality inversions).
 
ValidatorDB is organized on two main levels, namely PDB-wide results (synopsis page), and results restricted to specific residues of interest (specifics page). The two levels present the same type of validation results (as described in section 2.7), although the available features differ to some extent (e.g., the
specifics page allows 3D visualization of motifs). We shall describe each level of the database in detail below.
 
==Synopsis page==
 
The ValidatorDB synopsis page (Figures 1A, 3) contains a brief description of ValidatorDB, along with information about the last database update (date and number of structures that were processed during the validation). Specifically, in May 2014, over 100,000 PDB entries had been processed, containing over 230,000 motifs of 17,000 residues relevant for validation.
Additionally, the synopsis page gives access to the validation results for specific residues of interest via the LookUp bar (Figure 1A). Simply type a comma-separated list of residue names (3-letter codes) into the LookUp bar, and you will be redirected to the specifics page containing validation results for the residues you requested. If you specify a list of PDB IDs (4-letter codes) instead, then the corresponding specifics page will contain validation results for all relevant residues and ligands in the PDB entries you specified. See section 3.2 for a description of the contents of the specifics page, and how to interpret these contents.
The ValidatorDB synopsis page further provides access to various data sets of PDB-wide validations via 3 different tabs, namely Overview, Details by Residue, and Details by PDB entry. A full description of each of these tabs is given below (sections 2.1.1-2.1.3).
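
If you prepare LookUp queries programmatically, the comma-separated input described above can be checked with a short Python helper before pasting it into the LookUp bar. This is a sketch only: `parse_lookup` is a hypothetical function, and telling residue names from PDB IDs purely by token length (up to 3 versus exactly 4 characters) is an assumption made for illustration, not a description of how the ValidatorDB server parses its input.

```python
def parse_lookup(query):
    """Split a comma-separated query into residue names and PDB IDs."""
    residues, pdb_ids = [], []
    for token in query.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate stray commas
        if len(token) == 4:
            pdb_ids.append(token.lower())
        elif len(token) <= 3:
            residues.append(token.upper())
        else:
            raise ValueError(f"unrecognized token: {token!r}")
    return residues, pdb_ids

residues, pdb_ids = parse_lookup("GAL, man ,1bzw")
```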
 
====Overview====
The Overview tab of the synopsis page provides a very general statistical evaluation of results across the entire PDB in graphical form (Figures 1A,3A). The elements of the graph represent
percentages of the total number of motifs (over 200,000) of residues relevant for validation.
A graphic element will be displayed in the Overview graph only if it represents at least 0.5% of the total number of motifs.
Each element of the graph is described in a tool tip, but note that here the term residue actually refers to occurrence of residue
(motif).
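
The display rule for the Overview graph amounts to a percentage cutoff, which can be written out as a small Python sketch. The function `overview_elements` is hypothetical, and the counts below are made-up numbers used only to show the arithmetic, not actual ValidatorDB statistics.

```python
def overview_elements(counts, total, threshold=0.5):
    """Return {category: percentage} for categories at or above the threshold."""
    shown = {}
    for category, count in counts.items():
        pct = 100.0 * count / total
        if pct >= threshold:
            shown[category] = round(pct, 2)
    return shown

# Made-up motif counts, for illustration only.
counts = {"Missing atoms": 12000, "Wrong chirality": 8000, "Foreign atom": 500}
shown = overview_elements(counts, total=200000)
```

With these toy numbers, the third category amounts to 0.25% of all motifs and would therefore not appear in the graph.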
 
The elements of the graph can be assigned to roughly 6 categories, depending on which kind of information they contain (e.g., incomplete residue, chirality issues, warnings, etc.). The categories are marked by different colors (Figure 3A). Most of the graph elements have been explained in section 3.7 of this manual. The additional elements are Analyzed, which refers to the total number of motifs that could be processed, Missing Atoms or Rings,
which is the sum of Missing (Only) Atoms and Missing Rings, and
Has All Atoms and Rings, which is the total number of complete
residues.
 
====Details by Residue====
 
The Details by Residue tab (Figure 3B) contains an interactive table summarizing the results for each residue validated across the entire PDB. Each row corresponds to one residue, identified by
its residue name (3-letter code). The information in the table is organized according to the validation results as presented in section 2.7 of this manual. The color coding for the table header
and the font inside the table is the same as in the categories defined in the Overview tab. Each element of the table header is described in a tool tip, but note that here the term residue
actually refers to occurrence of residue (motif).
 
The table is interactive. Clicking on any element in the table header sorts the table entries according to that element. Click on any residue name to access the ValidatorDB specifics page with detailed validation results for that residue (see section 3.2).

The filter at the top right corner retrieves the table row for a specific residue. Simply type the residue name into the filter. All results can be downloaded in .csv format using the download button at the top left corner.
 
====Details by PDB entry====
 
The Details by PDB Entry tab (Figure 3C) contains an interactive table summarizing the results for all residues validated in each PDB entry. Each row corresponds to one PDB entry, identified by
its PDB ID (4-letter code). The information in the table is organized according to the validation results as presented in section 2.7 of this manual. The color coding for the table header and the font inside the table is the same as in the categories defined in the Overview tab. Each element of the table header is described in a tool tip, but note that here the term residue
actually refers to occurrence of residue (motif).
 
The table is interactive. Clicking on any element in the table header sorts the table entries according to that element. Click on any PDB ID to access the ValidatorDB specifics page with detailed validation results for all residues in that PDB entry (see section 3.2).

The filter at the top right corner retrieves the table rows with a specific residue, or the table rows with selected PDB IDs. Simply type the residue name or PDB ID into the filter. All results can be downloaded in .csv format using the download button at the top left corner.
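
Once downloaded, the .csv tables can be sorted offline in the same way as in the browser. The Python sketch below uses only the standard `csv` module; the column names (`Residue`, `Analyzed`, `MissingAtoms`) and the sample rows are hypothetical stand-ins, since the actual header of the downloaded file is not described in this manual and may differ.

```python
import csv
import io

def rows_sorted_by(csv_text, column, descending=True):
    """Parse CSV text and sort its rows by a numeric column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sorted(reader, key=lambda row: int(row[column]), reverse=descending)

# Made-up sample standing in for a downloaded table.
sample = "Residue,Analyzed,MissingAtoms\nGAL,100,7\nMAN,250,31\nNAG,900,12\n"
top = rows_sorted_by(sample, "MissingAtoms")
names = [row["Residue"] for row in top]
```

For a real file, the text would be read from disk first and the column name adjusted to match the actual header.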
 
==Specifics page==
 
The ValidatorDB specifics page is accessible from the synopsis page, either via the LookUp bar on the Overview tab (Figure 1A), or via the residue names and PDB IDs in the interactive tables on the tabs Details by Residue (Figure 3B) and Details by PDB Entry (Figure 3C), respectively. Depending on how it was accessed, the specifics page might retrieve validation results for one or more residues, a fact mentioned at the very top of the page (Figure 1B).
 
The ValidatorDB specifics page (Figures 1B, 4, 5) provides a straightforward report of the validation results, including a summary and detailed information in both tabular and graphical form, along with a 3D structure visualizer for closer inspection of the problematic structures. These reports are accessible via several tabs on the specifics page, namely Summary, Details, Processing Errors/Warnings, and Overview. These tabs will be described in detail in sections 3.2.1 - 3.2.4. Inspecting the tabular and graphical validation reports accessible on the specifics page is the most comfortable and effective way to evaluate the results. Additionally, you may use the JSON Data download button at the top right corner of the specifics page in order to download the complete validation reports and perform any additional analyses on your own.
 
====Summary====
 
On the ValidatorDB specifics page, the first view of the results is available in the Summary tab (Figures 1B, 4A). For each validated residue, the Summary tab provides an overview of potential issues encountered, as described in section 2.7.
 
If more than one residue was validated in one run, a list of these residues will be at the top of the page. In order to examine the validation summary for each residue, you will need to either click on that specific residue in the list, or just scroll down the page until you reach it. Each validated residue is identified by its 3-letter code, as well as its chemical formula and common name. Validation statistics are given as absolute numbers and percentages over all the motifs that were processed for each residue.
 
The table with the validation report is organized into two main sections, referring to incomplete (Missing Atoms or Rings) and complete structures (With All Atoms and Rings), respectively. The formal distinction between ring atoms and non-ring atoms (simply denoted as atoms) is meant to allow a quick localization of potential issues in residues containing rings, especially where atom identifiers are not useful. Chirality is evaluated only for the complete structures, since the absence of some atoms makes it difficult to check the chirality of some of the remaining atoms. Further, the problematic atoms are highlighted, in order to better localize the problems in the structures.
 
Last, a 2D representation of the model residue, and a pie chart with the validation results are provided for visual representation purposes. You can download them via the small icon at the top
right corner of the chart, and later use them in your presentations.
 
====Details====
 
Whereas the Summary tab provides statistics of the issues over all validated motifs for each validated residue, the Details tab of the ValidatorDB specifics page allows you to inspect the issues in selected groups of motifs, and further in each individual motif (Figure 5A). Note that you may also access the details of any particular group of motifs by clicking on a specific issue in any Summary tab table.
 
The Details tab is organized into a table where each row contains information regarding a single validated motif. The content of the table (i.e., which motifs are included, and what information is displayed) is dictated by the values of three selection fields at the top of the table. Click on the first field, and select the validated residue by its name (3-letter code) from the drop down menu. Only the motifs that were matched to that residue name will be displayed in the table. Click on the second field and select the type of issue (e.g., wrong chirality) from the drop down menu. Only the motifs which exhibit that type of issue will be displayed in the table. The number of motifs that fit each selection is given in brackets. If you want to make your selection even more specific, use the third selection field, Id filter.
 
Which table columns are filled depends mostly on the type of issue selected in the filter. The most important columns are Id, Issues/Warnings, Missing atoms/rings, Atoms, and Processing warnings. The other columns give additional information, usually helpful in identifying the source of the error in the structure. The column Id refers to a unique identifier assigned to each motif in order to keep a transparent trace of the motif's origin: it contains the PDB ID, as well as the serial index of the first atom in the motif, as it appears in the original PDB entry. The column Issues/Warnings reports the number of issues or warnings found for each particular motif. The column Missing atoms/rings explains which atoms are missing in each validated motif, whereas Atoms shows the position of incorrect chirality. Missing atoms are listed by their atom identifier in the model, whereas atoms with wrong chirality are listed by their identifier in the validated motif. Clicking on a column header sorts the motifs according to the property specified in the header.
 
====3D visualization====
The 3D viewer implemented in the ValidatorDB interface takes the analysis of each individual validated motif one step further, and is accessible via the Details tab on the specifics page (Figure 5B). In the table, simply click on the Id of a motif of interest in order to open the 3D viewer, where you can inspect the structural inaccuracies more closely. Here you will be able to view and manipulate the 3D representations of the validated motif and model residue, to help you better assess the position and relevance of the structural issues found during validation. Additionally, a 2D representation of the model is provided for clarity, which is especially helpful for larger motifs.
Basic information about the validated motif is also given, along with a complete report of the validation results, where all the potential issues are listed.
 
====Processing warnings and processing errors====
 
The validation reports in ValidatorDB also mention various unusual aspects encountered during validation. Sometimes the processed PDB entries contain information that is ambiguous, conflicting or which deviates strongly from the expected reference. ValidatorDB
reports such events as processing warnings or processing errors,
depending on the severity of the deviations. Such information can be found in the Processing Errors/Warnings tab on the specifics page. The selection field at the top of the page helps filter the warnings and errors. Simply click on the drop down menu and select the category of warnings or errors that you would like to explore.
Processing warnings are issues that may cause incorrect validation, such as atoms that are too close in 3D space, or unusual bond lengths given by the CONECT records. ValidatorDB typically reports several kinds of warnings: substitutions, foreign atoms, different naming, alternate locations, zero model RMSD, planar chiral centers, unusual bond lengths, etc. It is always good to check and make sure that negative validation results (e.g., missing atoms) are not in fact caused by ignoring some atoms in an ill-formed structure.
 
Processing errors are major issues that prevent the validation from being completed, such as parts of the residue which are completely disconnected from the rest of the structure, probably due to missing atoms at multiple locations throughout the structure. Any such major errors in the input file are reported as processing errors, and the affected structures are not processed at all.
 
It is important to note the difference between processing warnings
and processing errors. A warning may simply lead to ignoring a faulty atom, but the motif is validated. On the other hand, a
processing error prevents entire motifs from being validated, so you will not find these motifs in the statistics available on either the synopsis or specifics page. The number of motifs with processing errors can be easily calculated as the difference between total motifs of relevant residues in the PDB, and the number of analyzed motifs (currently around 2000 motifs, about 0.5% of relevant motifs in the PDB). Further, because ValidatorDB
automatically extracts all motifs of a relevant residue and assigns them a unique and informative motif Id, you will be able to easily find the motif in its original PDB entry, and explore it.
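
The calculation mentioned above (total relevant motifs minus analyzed motifs) is simple enough to write out as a short Python sketch. The function name `processing_error_stats` is hypothetical, and the totals are made-up values chosen only to keep the arithmetic easy to follow; substitute the figures shown on the synopsis page.

```python
def processing_error_stats(total_relevant_motifs, analyzed_motifs):
    """Return (count, percentage) of motifs lost to processing errors."""
    failed = total_relevant_motifs - analyzed_motifs
    percentage = 100.0 * failed / total_relevant_motifs
    return failed, percentage

# Made-up totals, for illustration only.
failed, percentage = processing_error_stats(1000, 995)
```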
 
====Overview====
 
To keep consistency with the synopsis page, the specifics page also allows visualization of general validation statistics for a selected number of residues via the Overview tab. This representation is entirely compatible with that of the Overview tab on the synopsis page (Figure 5A), and in fact makes up a subset of that data set. All color coding conventions are kept, and tool tips provide descriptions of each graphical element.
 
==References==
{{Reflist}}

Latest revision as of 10:44, 7 October 2014

Content of this page was moved here.