Given a crystal's unit-cell dimensions (and, optionally, its space group), this program searches the Protein Data Bank (PDB) for structures with similar unit cells. Matching structures are grouped into families based on sequence. The command-line application returns results within about 1 s when multiprocessing is enabled. Because of the overhead of web services, however, this page will take a few seconds to return the results to you. In our test cases the slowest response took approximately 15 seconds.
We have created a database of all PDB entries which includes a table of reduced P1 cells derived from the Phenix "explore_metric_symmetry" routine. This table also stores Cartesian coordinates for the three points (A, B and C) corresponding to the fractional coordinates (1,0,0), (0,1,0) and (0,0,1). This database is updated weekly as new PDB entries are added.
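As an illustration of the coordinates stored in this table, the following minimal Python sketch converts unit-cell parameters into Cartesian coordinates for A, B and C. It assumes the common convention of placing a along x and b in the xy-plane; the orientation convention actually used by Nearest-cell is not stated here, and the function name cell_to_cartesian is hypothetical. Cell reduction (done by Phenix) is not shown.

```python
import math


def cell_to_cartesian(a, b, c, alpha, beta, gamma):
    """Cartesian coordinates of the fractional points (1,0,0), (0,1,0), (0,0,1).

    Assumed convention (not taken from Nearest-cell): a lies along x and
    b lies in the xy-plane. Angles are given in degrees.
    """
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    sg = math.sin(math.radians(gamma))
    # Unit-cell volume divided by a*b*c.
    vol = math.sqrt(1.0 - ca * ca - cb * cb - cg * cg + 2.0 * ca * cb * cg)
    point_a = (a, 0.0, 0.0)
    point_b = (b * cg, b * sg, 0.0)
    point_c = (c * cb, c * (ca - cb * cg) / sg, c * vol / sg)
    return point_a, point_b, point_c
```

For a cubic cell such as 10, 10, 10, 90, 90, 90 the three points lie on the x, y and z axes at a distance of 10 from the origin (up to floating-point rounding).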
When you provide unit-cell dimensions (and, optionally, a space group) above, a CGI script invokes a parallelized application (written mainly in C++, with some Python and Fortran). It first converts the input unit cell into the reduced P1 cell using Phenix and calculates the coordinates of the points A, B and C as described above. Together with the origin, O, these form a set of four points, which is superposed onto the equivalent points derived from each PDB entry, and the root-mean-square (RMS) difference is calculated. All right-handed combinations of (O, A, B, C) are tried and only the smallest RMS difference is recorded. All PDB entries for which the RMS difference is <= 2.5 Å are considered matches and are then used for sequence-based clustering. The new version of Nearest-cell is implemented almost entirely in Python.
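The comparison step can be sketched in pure Python as below. This is not the Nearest-cell implementation: the superposition uses Horn's quaternion method (a swapped-in, standard technique, with a small Jacobi eigenvalue routine), and "right-handed combinations" is interpreted, as an assumption, as the 24 permutations and sign changes of the query axes that keep the axis set right-handed. All function names are hypothetical.

```python
import math
from itertools import permutations, product


def _largest_eigenvalue(matrix, sweeps=20):
    """Largest eigenvalue of a small symmetric matrix (cyclic Jacobi method)."""
    a = [list(row) for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes a[p][q], kept within +/- 45 degrees.
                d = a[q][q] - a[p][p]
                sign = 1.0 if d >= 0.0 else -1.0
                theta = 0.5 * math.atan2(sign * 2.0 * a[p][q], abs(d))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # update columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # update rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return max(a[i][i] for i in range(n))


def superposed_rms(query, target):
    """RMS difference between two equal-length point lists after optimal superposition."""
    n = len(query)
    cq = [sum(p[i] for p in query) / n for i in range(3)]
    ct = [sum(p[i] for p in target) / n for i in range(3)]
    x = [[p[i] - cq[i] for i in range(3)] for p in query]
    y = [[p[i] - ct[i] for i in range(3)] for p in target]
    # Correlation matrix S[i][j] = sum over points of x_i * y_j.
    s = [[sum(u[i] * v[j] for u, v in zip(x, y)) for j in range(3)] for i in range(3)]
    # Horn's symmetric 4x4 matrix; its largest eigenvalue gives the best overlap.
    horn = [
        [s[0][0] + s[1][1] + s[2][2], s[1][2] - s[2][1], s[2][0] - s[0][2], s[0][1] - s[1][0]],
        [s[1][2] - s[2][1], s[0][0] - s[1][1] - s[2][2], s[0][1] + s[1][0], s[2][0] + s[0][2]],
        [s[2][0] - s[0][2], s[0][1] + s[1][0], -s[0][0] + s[1][1] - s[2][2], s[1][2] + s[2][1]],
        [s[0][1] - s[1][0], s[2][0] + s[0][2], s[1][2] + s[2][1], -s[0][0] - s[1][1] + s[2][2]],
    ]
    norm = sum(c * c for u in x + y for c in u)
    residual = max(norm - 2.0 * _largest_eigenvalue(horn), 0.0)
    return math.sqrt(residual / n)


def best_rms(query_abc, target_abc):
    """Smallest superposed RMS over the 24 right-handed relabellings of the query axes."""
    origin = (0.0, 0.0, 0.0)
    target = [origin] + [tuple(v) for v in target_abc]
    best = float("inf")
    for perm in permutations(range(3)):
        # Parity of the permutation: +1 if even, -1 if odd.
        parity = 1
        for i in range(3):
            for j in range(i + 1, 3):
                if perm[i] > perm[j]:
                    parity = -parity
        for signs in product((1, -1), repeat=3):
            if parity * signs[0] * signs[1] * signs[2] != 1:
                continue  # this combination would be left-handed
            axes = [tuple(sg * c for c in query_abc[k]) for k, sg in zip(perm, signs)]
            best = min(best, superposed_rms([origin] + axes, target))
    return best
```

An entry would then count as a match when best_rms(query, entry) is at most the 2.5 threshold described above. Two orthogonal cells that differ only by a cyclic relabelling of their axes (for example 10, 20, 30 against 20, 30, 10) give an RMS difference of essentially zero with best_rms, whereas a single fixed labelling does not.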
Clustering is based on information recorded in the SEQRES records of each PDB entry and is a two-stage process based on a slight modification of CD-HIT. First, all the distinct peptide chains in all the close PDB matches are passed to CD-HIT for clustering into families. We consider two sequences to be in the same family if they share >90% sequence identity over >80% of the residues in the shorter sequence. SEQRES records describe each individual protein species in the asymmetric unit, so a second step of processing is required in order to cluster at the level of PDB entries. The contents of an asymmetric unit can be simply described by the number of instances of each sequence family it contains. Multiplying by the number of symmetry operators then describes the content of the unit cell. PDB entries are clustered together if they contain the same families of proteins and the same number of each family in the unit cell.
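The second clustering stage can be illustrated with a short Python sketch. It assumes that each chain has already been assigned a family label by the first (CD-HIT) stage; the entry identifiers, family labels and function names below are hypothetical placeholders, not data or code from Nearest-cell.

```python
from collections import Counter, defaultdict


def unit_cell_content(chain_families, n_symops):
    """Sorted (family, copies in the unit cell) pairs for one PDB entry.

    chain_families: one family label per chain in the asymmetric unit.
    n_symops: number of symmetry operators of the entry's space group.
    """
    counts = Counter(chain_families)
    return tuple(sorted((family, n * n_symops) for family, n in counts.items()))


def cluster_entries(entries):
    """Group entry ids whose unit cells hold the same families in the same numbers."""
    clusters = defaultdict(list)
    for entry_id, (chain_families, n_symops) in entries.items():
        clusters[unit_cell_content(chain_families, n_symops)].append(entry_id)
    return dict(clusters)


# Hypothetical input: family labels per chain and number of symmetry operators.
example_entries = {
    "entry1": (["F1", "F1"], 4),  # 2 copies x 4 operators = 8 copies of F1
    "entry2": (["F1"], 8),        # 1 copy x 8 operators = 8 copies of F1
    "entry3": (["F1", "F2"], 4),  # 4 copies each of F1 and F2
}
example_clusters = cluster_entries(example_entries)
```

Here entry1 and entry2 fall into the same cluster because both unit cells contain eight copies of family F1, while entry3 forms its own cluster.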
For each PDB cluster (family), the output shows the member with the smallest RMS difference to the query unit cell. This gives a brief overview, usually in no more than 10 lines of output, which can be expanded to show all the PDB matches in each cluster.
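Choosing the entry shown for each cluster amounts to taking a minimum over RMS differences. A minimal Python sketch follows, with hypothetical names and made-up RMS values used purely for illustration:

```python
def cluster_representatives(clusters, rms):
    """Map each cluster key to its member with the smallest RMS difference.

    clusters: dict of cluster key -> list of entry ids.
    rms: dict of entry id -> RMS difference to the query unit cell.
    """
    return {
        key: min(members, key=lambda entry_id: rms[entry_id])
        for key, members in clusters.items()
    }


# Hypothetical clusters and RMS differences.
demo_clusters = {"family-A": ["entry1", "entry2"], "family-B": ["entry3"]}
demo_rms = {"entry1": 1.2, "entry2": 0.4, "entry3": 2.1}
demo_representatives = cluster_representatives(demo_clusters, demo_rms)
```

With these values, entry2 represents family-A and entry3 represents family-B; the remaining members are what the expanded view would list.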
Usage and Scripted Queries
Important: Please allow the script a few seconds to run. For non-P1 input cells, Phenix must be run to derive the corresponding P1 unit cell, which takes a few extra seconds. Exotic alternate space groups not supported by Phenix will probably not work, although the software is able to cope with some of these cases (list coming soon).
Unit cells should be provided in the order a, b, c, alpha, beta, gamma, with the values separated by commas or spaces:
e.g. 345 345 345 90 90 90
Letters in space groups can be capitalized or lowercase (they will be converted to capitals internally), and may contain spaces (which will be removed automatically):
e.g. P212121 or i 2 3
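The input rules above can be mirrored on the client side before submitting a query. The following Python sketch is an assumption about how such normalization might look, not the code run by the server; the function names are hypothetical.

```python
import re


def parse_unit_cell(text):
    """Parse 'a b c alpha beta gamma' given with commas and/or spaces."""
    parts = [p for p in re.split(r"[,\s]+", text.strip()) if p]
    if len(parts) != 6:
        raise ValueError("expected six unit-cell parameters, got %d" % len(parts))
    return tuple(float(p) for p in parts)


def normalize_space_group(text):
    """Remove all whitespace and convert letters to capitals."""
    return "".join(text.split()).upper()
```

For example, parse_unit_cell("345 345 345 90 90 90") yields six floating-point values, and normalize_space_group("i 2 3") gives "I23".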
If you wish to invoke this script through your own code via HTTP GET, please formulate your URLs strictly according to the following examples (i.e. commas between unit-cell parameters and no spaces throughout):
Example (Working link): http://www.strubi.ox.ac.uk/nearest-cell/nearest-cell.cgi?unit-cell=24,24,24,90,90,90&space-group=R3:R
Another example (automatically assumes space group is P1): http://www.strubi.ox.ac.uk/nearest-cell/nearest-cell.cgi?unit-cell=141.8,116.7,66.4,90,90,90
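A scripted query following the URL pattern above could be assembled as in this Python sketch, which uses only the standard library. The function names are hypothetical, and the format of the returned page is not documented here, so fetch_results simply returns the raw response text.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://www.strubi.ox.ac.uk/nearest-cell/nearest-cell.cgi"


def build_query_url(cell, space_group=None):
    """Build a GET URL with comma-separated cell parameters and no spaces."""
    params = [("unit-cell", ",".join(str(v) for v in cell))]
    if space_group is not None:
        # Strip whitespace so the URL contains no spaces.
        params.append(("space-group", "".join(space_group.split())))
    # Keep commas and colons literal, as in the documented example URLs.
    return BASE_URL + "?" + urlencode(params, safe=",:")


def fetch_results(cell, space_group=None, timeout=30):
    """Send the query and return the response body as text (format not assumed)."""
    with urlopen(build_query_url(cell, space_group), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling build_query_url((24, 24, 24, 90, 90, 90), "R3:R") reproduces the first example URL above; omitting the space group leaves the server to assume P1.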
Citation and Publication
Please cite the following paper if you use Nearest-cell:
Ramraj, V., Evans, G., Diprose, J. M. & Esnouf, R. M. (2012). Acta Cryst. D68, 1697-1700.
Requirements
- BASH shell
- A recent build of Phenix (with sourced phenix environment variables)
- Python (>= 2.7) (in addition to Phenix' bundled Python)
- GNU command line utilities (grep, zcat, gzip etc.)
- libpq, libpqxx (Postgres C, C++ libraries respectively)
- libxml2 (XML parsing)
- libexpat (XML parsing)
- gcc, g++, gfortran (>= 4.3) (f77 WILL NOT WORK)
- GNU autotools (autoconf, automake, libtool, m4)
This program is FREE SOFTWARE and is licensed under the GNU GPL v3.
This software was developed by Varun Ramraj, Robert Esnouf and Jonathan Diprose at the Division of Structural Biology, University of Oxford.