We provide the original dataset that we prepared to compile the 2010 Rotamer Library. We put considerable effort into preparing reliable data for the library computation, using a series of 10 steps. The preparation process is briefly described below. For more details, please refer to the publication indicated below, specifically the "Dataset preparation" part of the Supplemental Experimental Procedures, which describes all 10 steps.
The dataset is stored in a single self-descriptive file. It has a very simple text format that is easy to parse. Its header contains brief instructions, definitions, and descriptions. The file has more than 600 thousand record lines; each line describes one amino acid residue.
A few extracts of the dataset are shown below. To download the full version, please apply for a license.
We first determined the full list of protein-containing PDB entries for which we could obtain electron densities from the Uppsala Electron Density Server (EDS). We have shown previously that side chains with sp3-sp3 hybridized bonds and nonrotameric dihedral angles, that is, angles far from the typical mean values (60°, 180°, 300°), have much lower electron density than average.
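As a rough illustration of what "nonrotameric" means here, the sketch below measures how far a dihedral angle lies from the nearest of the three typical mean values. The function name `deviation_from_rotamer` is a hypothetical helper, and no cutoff for "far" is taken from the paper.

```python
CANONICAL_CHI = (60.0, 180.0, 300.0)  # typical sp3-sp3 rotamer means, in degrees


def deviation_from_rotamer(chi_deg: float) -> float:
    """Return the smallest angular distance (degrees) to a canonical value.

    Accepts angles in any range, e.g. -180..180 or 0..360.
    """
    def angular_distance(a: float, b: float) -> float:
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)  # go the short way around the circle

    return min(angular_distance(chi_deg, c) for c in CANONICAL_CHI)
```

Because the canonical values are 120° apart, the result always lies between 0° and 60°. Any threshold applied to it (for example, a 30° cutoff) would be an assumption of the reader, not a value defined by the dataset.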
This list was then filtered by the PISCES server and run through the SIOCS program to flip Asn, Gln, and His terminal dihedral angles to account for hydrogen bonding. We obtained a list of 3,985 protein chains from 3,845 entries with a resolution of 1.8 Å or better, an R-factor cutoff of 0.22, and mutual sequence identity between chains of 50% or less.
We calculated the electron density at the atom coordinates of the 3,985 chains and computed the geometric mean of the electron density at the atomic positions in each residue. This value served as a quality filter to remove disordered residues, defined as those with electron densities below the 25th percentile for each residue type. For the rotamer library calculations, the resulting number of unique residues totaled 581,128. We also accounted for incorrectly modeled leucine residues, and we analyzed trans and cis prolines separately, as well as disulfide-bonded and nondisulfide-bonded cysteines.
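The filtering idea can be sketched as follows. This is a simplified illustration, not the procedure used to build the dataset: the function names, the input layout, and the way the cutoff is picked from the ranked scores are all assumptions, and the density values are assumed to be strictly positive.

```python
import math
from collections import defaultdict


def geometric_mean(densities):
    """Geometric mean of positive electron-density values for one residue."""
    return math.exp(sum(math.log(d) for d in densities) / len(densities))


def drop_low_density(residues, fraction=0.25):
    """Drop residues scoring in the bottom `fraction` of their residue type.

    `residues` is a list of (residue_type, atom_densities) pairs
    (a hypothetical layout chosen for this example).
    """
    by_type = defaultdict(list)
    for res_type, densities in residues:
        by_type[res_type].append(geometric_mean(densities))

    # Assumed cutoff rule: the score at rank floor(fraction * n) within each type.
    cutoffs = {}
    for res_type, scores in by_type.items():
        ranked = sorted(scores)
        cutoffs[res_type] = ranked[int(fraction * len(ranked))]

    return [
        (res_type, densities)
        for res_type, densities in residues
        if geometric_mean(densities) >= cutoffs[res_type]
    ]
```

Cutoffs are computed per residue type, matching the statement that the percentile is taken "for each residue type", so a small residue is not compared against a large one.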
DatasetForBBDepRL2010.txt

# ALL COMMENT LINES START WITH "# ", PLEASE IGNORE THESE LINES WHEN PARSING
#
# Dataset used in computing of '2010 Backbone-dependent Rotamer Library'
# Copyright (c) 2007-2012
# Maxim V. Shapovalov and Roland L. Dunbrack Jr.
# Fox Chase Cancer Center
# Philadelphia, PA, USA
#
# File was generated in February, 2012
#
# ===============================================================================
# Please cite this paper when publishing results based on our dataset or library:
# ===============================================================================
# Shapovalov, M.S., and Dunbrack, R.L., Jr. (2011).
# "A smoothed backbone-dependent rotamer library for proteins derived from adaptive kernel density estimates and regressions." Structure, 19, 844-858.
#
# Column title abbreviations used:
# --------------------------------
#
# * RES - three-character code for the i-th amino acid residue type
#
# * PDB_ID - four-character PDB entry name
# * CHAIN_ID - one-character chain ID
# * RES_ID - residue ID from ATOM-record lines
# ***** WARNING: It includes insertion code which may be not numerical; treat this column as a string
# * ALTER_ID - one-character alternative conformation ID
# ***** WARNING: Please treat `!` alternative conformations as a single confirmation.
# A space or no character in this field is replaced with `!`. This is done for easier parsing.
#
# * OMEGA - omega torsion angle for i-th residue.
# A traditional definition is used, i.e. omega(i) precedes i-th residue not follows it:
# CA(i-1) - C(i-1) - N(i) - C(i). For example, cis proline means the cis peptide bond is before the proline.
#
# * CHI1 - chi1 torsion angle
# * CHI2 - chi2 torsion angle (for residue types with no chi2 angle, NaN is provided instead, i.e. NaN = Not A Number)
# * CHI3 - chi3 torsion angle (for residue types with no chi3 angle, NaN is provided instead)
# * CHI4 - chi4 torsion angle (for residue types with no chi4 angle, NaN is provided instead)
#
# * BBPERC - electron density percentile for a group of the backbone atoms
# * SCPERC - electron density percentile for a group of the side-chain atoms
# * RSPERC - electron density percentile for a group of all residue atoms
#
# * FLP_STATE - flip state for Asn, Gln and His concluded by Siocs software
# Possible values:
# `Kept` - original conformation is kept by Siocs
# `Flipped` - original conformation is flipped by Siocs
# `UNDEF` - no information is provided by Siocs for some reason
# `N./A.` - non-applicable value for the rest of residue types
#
# * FLP_CONFID - `Flipped` or `Kept` confidence code for Asn, Gln and His by Siocs software
# Possible values:
# `clear` - high level of confidence for either `Flipped` or `Kept` by Siocs software
# `probable` - middle level of confidence by Siocs software
# `unsure` - low level of confidence by Siocs software
# `UNDEF` - no information is provided by Siocs for some reason
# `N./A.` - non-applicable value for the rest of residue types
#
# * SS - one-letter secondary-structure codes for (i-1)-th, i-th and (i+1)th residues
#
# For details on electron density percentiles, please refer to the following paper:
#
# Shapovalov MV, Dunbrack RL Jr.
# "Statistical and conformational analysis of the electron density of protein side chains." Proteins, 2007 Feb 1;66(2):279-303.
#
# The following residue types are provided: ARG ASN ASP CPR CYD CYH CYS GLN GLU HIS ILE LEU LYS MET PHE PRO SER THR TPR TRP TYR VAL
# -----------------------------------------
#
# TPR are trans prolines
# CPR are cis prolines
# PRO include both trans and cis prolines
#
# CYH are nondisulfide-bonded cysteines
# CYD are disulfide-bonded cysteines
# CYS include both nondisulfide-bonded and disulfide-bonded cysteines
#
# All columns are tab-delimited.
# ------------------------------
#
# RES PDB_ID CHAIN_ID RES_ID ALTER_ID OMEGA PHI PSI CHI1 CHI2 CHI3 CHI4 BBPERC SCPERC RSPERC SS FLP_STATE FLP_CONFID
#
ARG 1a2p A 69 ! 175.764 -97.348 125.795 171.866 189.419 286.267 148.456 7.484 59.358 40.532 TTC N./A. N./A.
ARG 1a2p A 72 ! 173.119 -134.179 161.280 294.586 180.232 176.276 280.520 51.134 40.316 42.979 EEE N./A. N./A.
...
SER 1a2p A 38 A -173.147 -70.162 -11.738 288.835 NaN NaN NaN 78.319 52.440 68.080 GGG N./A. N./A.
SER 1a2p A 50 ! 172.341 -126.085 152.719 290.465 NaN NaN NaN 80.896 91.812 87.794 TEE N./A. N./A.
...
GLN 1a2p A 15 ! -174.003 -71.334 -15.915 288.411 198.670 -48.133 NaN 8.143 79.352 49.233 HHH Kept clear
GLN 1a2p A 31 A 178.544 -57.875 -41.798 288.200 170.764 -35.176 NaN 63.021 37.101 44.161 HHH Kept clear
GLN 1a2p A 104 ! 179.566 -79.879 -41.775 293.901 181.327 95.711 NaN 20.989 40.133 33.230 TTT Kept clear
GLN 1a3a A 31 ! 177.202 -74.527 -29.147 297.233 309.943 -45.855 NaN 86.350 69.375 76.526 HHH Flipped clear
...
PRO 1a2p A 21 ! 172.859 -62.606 165.053 -3.922 5.290 -4.354 NaN 24.435 63.777 38.465 CTT N./A. N./A.
PRO 1a2p A 47 ! -174.356 -54.393 131.720 -16.341 16.766 -10.288 NaN 48.102 56.398 51.823 TTT N./A. N./A.
...
CPR 1a4i A 102 ! 0.047 -82.903 156.917 29.614 -34.140 24.199 NaN 41.514 32.754 37.018 TTT N./A. N./A.
CPR 1a4i A 272 ! 0.843 -78.813 -172.686 25.221 -23.577 12.734 NaN 29.712 21.141 25.246 TTT N./A. N./A.
...
TPR 1a2p A 21 ! 172.859 -62.606 165.053 -3.922 5.290 -4.354 NaN 24.435 63.777 38.465 CTT N./A. N./A.
TPR 1a2p A 47 ! -174.356 -54.393 131.720 -16.341 16.766 -10.288 NaN 48.102 56.398 51.823 TTT N./A. N./A.
...
ASN 1a2p A 5 ! -175.640 -141.896 23.545 72.171 -16.270 NaN NaN 39.409 46.826 44.700 CCC Kept clear
...
ASN 1a3a A 120 ! 179.248 -74.424 -29.777 296.868 -20.719 NaN NaN 20.311 37.946 30.339 HHH Flipped clear
ASN 1a4i A 8 ! -175.298 -83.178 93.920 190.943 -23.282 NaN NaN 36.811 35.545 36.161 CCH Kept clear
...
CYH 1a3a A 82 ! -175.486 -116.448 122.085 306.796 NaN NaN NaN 94.560 87.639 94.289 EEE N./A. N./A.
CYH 1a4i A 147 ! -170.428 -67.142 -32.160 306.345 NaN NaN NaN 69.943 28.254 50.404 CHH N./A. N./A.
...
CYD 1a7s A 26 ! 179.191 -156.003 165.151 271.016 NaN NaN NaN 85.909 41.553 70.490 EEE N./A. N./A.
CYD 1a7s A 123 ! -179.581 -136.108 177.296 306.932 NaN NaN NaN 58.133 34.026 47.400 EEE N./A. N./A.
...
CYS 1a3a A 82 ! -175.486 -116.448 122.085 306.796 NaN NaN NaN 94.560 87.639 94.289 EEE N./A. N./A.
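The format described in the file header can be read with a few lines of Python. The sketch below is a minimal reader, assuming the columns are tab-delimited and appear in the order given in the header; the names `COLUMNS`, `NUMERIC`, and `parse_records` are illustrative and not part of the dataset.

```python
from typing import Iterable, Iterator

# Column order as listed in the file header.
COLUMNS = [
    "RES", "PDB_ID", "CHAIN_ID", "RES_ID", "ALTER_ID",
    "OMEGA", "PHI", "PSI", "CHI1", "CHI2", "CHI3", "CHI4",
    "BBPERC", "SCPERC", "RSPERC", "SS", "FLP_STATE", "FLP_CONFID",
]
NUMERIC = {"OMEGA", "PHI", "PSI", "CHI1", "CHI2", "CHI3", "CHI4",
           "BBPERC", "SCPERC", "RSPERC"}


def parse_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per residue record, skipping comments and blanks."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            continue  # e.g. the "..." elision markers in the extracts
        record = dict(zip(COLUMNS, fields))
        for name in NUMERIC:
            record[name] = float(record[name])  # "NaN" parses to float nan
        yield record
```

`RES_ID` is deliberately left as a string because it may carry an insertion code, and `ALTER_ID` keeps the `!` placeholder, as the header instructs. With the full file on disk, usage would look like `with open("DatasetForBBDepRL2010.txt") as fh: records = list(parse_records(fh))`.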
Maxim Shapovalov and Roland Dunbrack
A smoothed backbone-dependent rotamer library for proteins derived from adaptive kernel density estimates and regressions. Shapovalov, M.V., and Dunbrack, R.L., Jr., Structure 2011, 19, 844-858.