Usage¶
IDPConformerGenerator runs entirely from the command line. Follow the explanations on this page together with the help documentation of each command-line interface.
Command-lines¶
To execute the idpconfgen command line, run idpconfgen in your terminal window after installation:
idpconfgen
or:
idpconfgen -h
Both will output the help menu.
Note
All subclients have the -h
option to show help information.
idpconfgen has several interfaces that perform different functions.
However, there is a sequence of interfaces that must be used to prepare the
local torsion angle database and the files needed to build conformers. After
these operations have been executed, you will end up with a single json file that
you can use to feed the build calculations. The other files can safely be
removed.
IDPConfGen Small Peptide Example¶
The example/ folder contains instructions to set up the
IDPConformerGenerator database from scratch and generate conformers for a small
peptide.
Note
The purpose of this module is to test the installation of IDPConformerGenerator on a small example peptide. The database you build here should not be reused for other cases, much less in practice, because it contains only 100 PDB IDs.
To build the torsion angle database you need to provide IDPConfGen with a list of PDB/mmCIF files. Our advice is that you use a culled list of your choice from the Dunbrack PISCES database.
The PDB chain id list should have the format of the cull100
file in this
folder, which emulates the format provided by PISCES. Actually, the only column
that is read by IDPConfGen is the first column. No header lines are allowed.
12E8H 221 XRAY 1.900 0.22 0.27
16PKA 415 XRAY 1.600 0.19 0.23
1A05A 358 XRAY 2.000 0.20 0.28
1A0JA 223 XRAY 1.700 0.17 0.21
1A12A 413 XRAY 1.700 0.19 0.22
1A1XA 108 XRAY 2.000 0.21 0.25
The first four alphanumeric characters are the PDB ID code. The fifth character (and any that follow) is the PDB chain identifier. mmCIF files can have chain IDs of several characters.
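As a minimal sketch of this convention, the Python snippet below splits a PISCES-style entry into its four-character PDB ID and its chain identifier. The function name split_pdb_chain is hypothetical and is not part of IDPConfGen; it only illustrates how the first column of the cull100 file can be read.

```python
def split_pdb_chain(entry: str) -> tuple[str, str]:
    """Split a PISCES-style code such as '12E8H' into (PDB ID, chain ID)."""
    code = entry.split()[0]  # only the first column is relevant
    if len(code) < 5:
        raise ValueError(f"expected a PDB ID plus chain ID, got {code!r}")
    return code[:4], code[4:]


line = "12E8H 221 XRAY 1.900 0.22 0.27"
print(split_pdb_chain(line))
```

Chain identifiers longer than one character, as found in mmCIF files, are returned whole because everything after the fourth character is treated as the chain ID.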
Run the following command to download the PDB files:
idpconfgen pdbdl cull100 -u -n -d pdbs.tar
You can inspect all options of this (and any other) subclient with -h
:
idpconfgen pdbdl -h
This execution will create a pdbs.tar file with the parsed PDBs. Those
PDBs contain only the information needed for IDPConfGen. Unnecessary chains or
residues have been removed.
We now proceed to the identification of secondary structure elements. For this you need to have the DSSP program installed. IDPConfGen is built to be modular: you can use any other program to calculate secondary structure, but you would need to implement the respective parser. The DSSP parser is already implemented. To install DSSP, follow these instructions: https://github.com/julie-forman-kay-lab/IDPConformerGenerator/issues/48.
The following command will operate on the pdbs.tar
file and will create
temporary files and a result file with the DSSP information:
idpconfgen sscalc pdbs.tar -rd -n
You will see that the files sscalc.json and sscalc_splitted.tar
were created. sscalc.json matches the sequence information with that of
the secondary structure identity. sscalc_splitted.tar contains the PDB
chains split into continuous chains.
Now we need to calculate the torsion angles. There are several options available in the command line but these are good defaults:
idpconfgen torsions sscalc_splitted.tar -sc sscalc.json -o idpconfgen_database.json -n
This will create the final file idpconfgen_database.json
. This is the
file IDPConfGen needs to generate conformers. This is the torsion angle database
file. If you open it, you will see it is a regular human-readable json
file.
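Because the database is plain JSON, it can be inspected with Python's standard library. The sketch below only reports the top-level container type and its number of entries; it assumes nothing about the internal layout of the file, and summarize_json is a hypothetical helper rather than an IDPConfGen function.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Return the top-level type name and entry count of a JSON file."""
    data = json.loads(Path(path).read_text())
    size = len(data) if isinstance(data, (dict, list)) else 1
    return type(data).__name__, size


if __name__ == "__main__":
    db = Path("idpconfgen_database.json")
    if db.exists():  # only runs when the tutorial database is present
        print(summarize_json(db))
```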
Finally, to generate conformers you will use the build interface. The
build interface has several parameters that can be used to fine-tune the
conformer construction protocol. You can read more detailed instructions in the
documentation and client help. The following is a good default that uses
the FASPR method for adding side chains:
idpconfgen build -db idpconfgen_database.json -seq EGAAGAASS -nc 10 --dloop-off --dany -n
After some time you will see 10 conformers in the folder.
Please note that loop sampling (--dloop) is enabled by default.
Appending --dhelix --dstrand will extend sampling to alpha-helices and
beta-strands in addition to loops. For more information on usage, please view idpconfgen build -h.
If you would like to use a program to assign secondary structure bias based on NMR chemical shifts (e.g. CheSPI or δ2D) to
employ probabilistic custom secondary structure sampling (CSSS), the probs8_[ID].txt
output
from CheSPI or .TXT output from δ2D would have to be standardized into a user-editable text file indicating the
probability of secondary structures (based on DSSP codes) on a per residue basis.
The following example will process CheSPI output and assign probabilities to L/H/E based on H/G/I/E/ /T/S/B
structures on a per residue basis:
idpconfgen csssconv -p8 probs8_ex.txt -o csss_ex.json
For simplicity, secondary structures from CheSPI and δ2D are grouped into L/H/E as defined by idpconfgen.
If you do not want this grouping feature, please build the database above without -rd
and run csssconv
with --full
to avoid grouping.
To build with the CSSS file, -csss
would have to point to the CSSS.JSON file:
idpconfgen build -db idpconfgen_database.json -seq EGAAGAASS -nc 10 -csss csss_ex.json --dloop-off -n
After some time you will see 10 conformers in the folder with the probabilistic CSSS.
All IDPConfGen operations can be distributed over multiple cores. Use the flag
-n
to indicate the number of cores you wish to use. Appending only -n
will use all available CPU threads except for one.
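The rule described above (all available CPU threads except one) can be expressed in a few lines of Python. This is only an illustration of the arithmetic, not IDPConfGen's own code; the helper name default_core_count and the lower bound of one core on single-threaded machines are assumptions.

```python
import os


def default_core_count(available=None):
    """All available CPU threads except one, never fewer than one."""
    if available is None:
        available = os.cpu_count() or 1
    return max(1, available - 1)


print(default_core_count())
```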
This is all you need to know for a basic usage of IDPConfGen. Now you can use a larger PDB database to feed the conformational sampling.
Building With Variable Bond-Geometry Strategies¶
To increase user flexibility and build parameterization, IDPConformerGenerator
has four bond geometry strategies to choose from: sampling (default), fixed, exact,
and int2cart. These strategies can be selected with the --bgeo-strategy
flag during the build process. Please note that a new database needs to be
generated with the bgeodb subclient to use the exact strategy, and Int2Cart
must be installed to use the int2cart strategy (see below).
The default sampling
strategy aims to overcome limitations of having fixed bond
angles and bond lengths to increase the diversity of conformations sampled. The fixed
strategy uses average bond geometries on a per-residue basis derived from an extended
Dunbrack PISCES database cull_d200611/200611/cullpdb_pc90_res1.6_R0.25_d200611_chains8807
.
The exact
method uses exact bond/bend angles and bond lengths for each residue in a
fragment sampled from the database. To initialize the new, backwards-compatible, database
use the bgeodb
module on the idpconfgen_database.json
previously generated using
torsions
. Reminder, this requires the sscalc_splitted.tar
file generated by sscalc
in the early stages of creating the database:
idpconfgen bgeodb sscalc_splitted.tar -sc idpconfgen_database.json -o idpconfgen_extended_database.json -n
A Real Case Scenario¶
Note
The purpose of this module is to use IDPConformerGenerator as you would with a real protein of interest (unfolded drkN SH3 is presented here). The database procured from this module can be re-usable for future cases.
The example with a small peptide in the example folder is a good way to
get introduced to IDPConfGen. Although building other IDP conformer ensembles
uses the same workflow as the previous example, we will go over more detailed
usage examples with a well-studied IDP, the unfolded state of the drkN SH3 domain.
Chemical shift data for the unfolded state of the drkN SH3 domain (BMRB ID: 25501) has been already processed with
δ2D and CheSPI and secondary structure propensity calculations can be found in
example/drksh3_example
as drk_d2D.txt
and probs8_25501_unfolded.txt
respectively for δ2D and CheSPI output.
An extensive culled list is in cull.tar
. Unpack it with:
tar -xf cull.tar
This is the same culled list used in the IDPConfGen main publication. However, feel free to choose your own from the Dunbrack PISCES database.
Steps from now on will assume you’re in the working directory of example/drksh3_examples
.
To initialize the database if you do not already have one, we must download the PDB files from our culled list (can be found in the supplemental package in the IDPConformerGenerator paper):
idpconfgen pdbdl cullpdb_pc90_res2.0_R0.25_d201015_chains24003 -u -n -d pdbs.tar
Next we will create temporary files storing the secondary structure information for each PDB file downloaded. These will later be processed for their torsion angles:
idpconfgen sscalc pdbs.tar -rd -n -cmd <DSSP EXEC>
Please note that since IDPConfGen is a toolkit, many of these modules can be used with
custom folders or .tar
files.
Finally, torsion angles are extracted and the database we will use for future calculations
can be created with the torsions
subclient:
idpconfgen torsions sscalc_splitted.tar -sc sscalc.json -n -o idpconfgen_database.json
Now we’re ready to construct multiple conformer ensembles for the unfolded states of the drkN SH3 domain. The following command builds 100 conformers, sampling only loop regions, with the default limits on the backbone and side chain L-J energy potentials (100 kJ and 250 kJ, respectively), default fragment sizes, no substitutions, and side chains added with FASPR:
idpconfgen build \
-db idpconfgen_database.json \
-seq drksh3.fasta \
-nc 100 \
-of ./drk_L+_nosub_faspr \
-n
idpconfgen
is deterministic. Therefore, the random seed defines the sampling progression -
read here for more information.
To switch the side chain building algorithm to MC-SCE (recommended), you first have to install MC-SCE. Please revisit the installation page to get MC-SCE set up. Here is an example:
idpconfgen build \
-db idpconfgen_database.json \
-seq drksh3.fasta \
-nc 100 \
-scm mcsce \
-of ./drk_L+_nosub_mcsce \
-n
Note
Running MC-SCE within IDPConformerGenerator can be memory (RAM) intensive. Consider running with a lower number of CPU threads using the -n flag if necessary.
The default for --mcsce-n_trials is 16 when using --mcsce-mode exhaustive; however,
we recommend 100 or more trials for smaller conformer pools. In this exercise, we will be using the
default MC-SCE side chain building mode, simple.
If you encounter an error with MC-SCE running internally through IDPConformerGenerator, we recommend generating backbones first and packing sidechains afterwards. For example, the following commands generate backbones first and then sidechains:
idpconfgen build \
-db idpconfgen_database.json \
-seq drksh3.fasta \
-nc 100 \
-dsd \
-of ./drk_L+_nosub_bb \
-n
# Make the output folder for MC-SCE
mkdir ./drk_L+_nosub_mcsce
# Run MC-SCE in the idpconfgen environment because it's already installed
mcsce ./drk_L+_nosub_bb 100 \
-w \
-o ./drk_L+_nosub_mcsce \
-s \
-m simple \
-l drk_L+_nosub_mcsce_log
As stated in idpconfgen build -h, sampling using other secondary structure
parameters requires --dloop to be turned off with --dloop-off. For example, if we’d like to
sample only helices and extended strands:
idpconfgen build \
-db idpconfgen_database.json \
-seq drksh3.fasta \
-nc 100 \
-et 'pairs' \
--dstrand \
--dhelix \
--dloop-off \
-of ./drk_H+E+_nosub \
-n
For sampling loops, helices, and strands, we would specify --dhelix --dstrand
where --dloop
is turned on by default. However, sampling without biasing for secondary structure
can be done with --dany --dloop-off
.
To sample using custom secondary structure sampling (CSSS) a CSSS database (.JSON) file needs
to be created specifying the secondary structure probabilities for each residue. This can be
done using the makecsss
module if chemical shift data is not readily available, if you’d
like to edit a pre-existing CSSS.JSON, or create a new file. Here’s an example for making a
custom CSSS.JSON file that samples only helices for residues 15-25 of the unfolded state of the drkN SH3 domain
and loops for everything else:
idpconfgen makecsss -cp 1-14 L 1.0|15-25 H 1.0|26-59 L 1.0 -o cust_csss_drk.json
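For longer proteins it can be convenient to assemble the region specification programmatically. The sketch below joins (first residue, last residue, secondary structure code, probability) tuples into a string of the same shape as the -cp argument shown above. The format is inferred only from that example, and build_cp_spec is a hypothetical helper, not an IDPConfGen function.

```python
def build_cp_spec(regions):
    """Join (first, last, ss_code, probability) tuples into one string."""
    parts = [f"{first}-{last} {code} {prob}" for first, last, code, prob in regions]
    return "|".join(parts)


regions = [(1, 14, "L", 1.0), (15, 25, "H", 1.0), (26, 59, "L", 1.0)]
print(build_cp_spec(regions))
```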
If chemical shift files are readily available, consider using CheSPI or δ2D to generate the CSSS.JSON.
δ2D predictions have been included in the example/drksh3_ex_resources
folder as drk_d2D.txt
.
CheSPI probs8_*
predictions have been included in the example/drksh3_ex_resources
folder
as probs8_25501_unfolded.txt
.
To convert output from δ2D to CSSS, use the csssconv
subclient with flag -d2D
:
idpconfgen csssconv -d2D drk_d2D.txt -o csss_drk_d2D.json
To convert output from CheSPI to CSSS, use the csssconv
subclient with flag -p8
:
idpconfgen csssconv -p8 probs8_25501_unfolded.txt -o csss_drk_chespi.json
The outputted csss_*.json
files will be used for the -csss
flag in the build
subclient.
For example, constructing 100 conformers for the unfolded state of the drkN SH3 domain using the δ2D predictions and the same settings for
energy and MC-SCE as above:
idpconfgen build \
-db idpconfgen_database.json \
-seq drksh3.fasta \
-nc 100 \
-csss csss_drk_d2D.json \
--dloop-off \
-et 'pairs' \
-of ./drk_CSSSd2D_nosub \
-n
The default fragment size probabilities for building are (1, 1, 3, 3, 2) for fragment sizes of (1, 2, 3, 4, 5) respectively.
To change this, we would have to create a .TXT
file with two columns, the first specifying what fragment sizes
from lowest to highest, the second specifying their relative probabilities. We have provided an example in
example/drksh3_ex_resources
as customFragments.txt
. To use these custom fragment size probabilities with CSSS:
idpconfgen build \
-db idpconfgen_database.json \
-seq drksh3.fasta \
-nc 100 \
-xp customFragments.txt \
-csss csss_drk_d2D.json \
--dloop-off \
-et 'pairs' \
-of ./drk_fragN_CSSSd2D_nosub \
-n
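The two-column fragment-size file described above could be generated with a short script such as the following. It is a sketch under assumptions: the columns are taken to be separated by a single space, and the output name myFragments.txt is hypothetical. Compare the result with the provided customFragments.txt before relying on it.

```python
from pathlib import Path

# Default relative probabilities quoted above for fragment sizes 1 to 5.
DEFAULT_FRAGMENTS = {1: 1, 2: 1, 3: 3, 4: 3, 5: 2}


def fragment_table(weights):
    """Return 'size weight' lines, ordered from smallest to largest size."""
    return [f"{size} {weights[size]}" for size in sorted(weights)]


def normalized(weights):
    """Convert relative weights into probabilities that sum to one."""
    total = sum(weights.values())
    return {size: w / total for size, w in weights.items()}


if __name__ == "__main__":
    text = "\n".join(fragment_table(DEFAULT_FRAGMENTS)) + "\n"
    Path("myFragments.txt").write_text(text)
    print(normalized(DEFAULT_FRAGMENTS))
```

The normalized helper is included only to show what the relative weights mean: with the defaults, fragments of three or four residues are each drawn three times as often as single residues.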
Finally, to expand torsion angle sampling beyond the residue identity, we can provide a residue tolerance map using the -urestol
flag in the
build
subclient. For this example, we will be using columns 5, 3, and 2 from the EDSSMat50
substitution matrix:
idpconfgen build \
-db idpconfgen_database.json \
-seq drksh3.fasta \
-nc 100 \
--dany \
--dloop-off \
-urestol '{"R":"RK","D":"DE","C":"CY","C":"CW","Q":"QH","E":"ED","H":"HYQ","I":"IVM","I":"IL","K":"KR","M":"MI","M":"MVL","F":"FY","F":"FWL","W":"WYFC","Y":"YF","Y":"YC","Y":"YWH"}' \
-et 'pairs' \
-of ./drk_ANY_sub532 \
-n
Please note for the above run, we are sampling the torsion angle database disregarding secondary structure
with the --dany
flag.
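The tolerance map above repeats several keys (C, I, M, F, and Y). How idpconfgen resolves repeated keys is not stated on this page, but Python's standard json module keeps only the last value for a repeated key. The sketch below shows that behaviour on a shortened map and one possible way to merge the entries before passing them to -urestol; merge_duplicate_keys is a hypothetical helper.

```python
import json


def merge_duplicate_keys(pairs):
    """Merge repeated keys by concatenating values without repeating letters."""
    merged = {}
    for key, value in pairs:
        seen = merged.get(key, "")
        merged[key] = seen + "".join(ch for ch in value if ch not in seen)
    return merged


raw = '{"C":"CY","C":"CW","I":"IVM","I":"IL"}'

print(json.loads(raw))  # plain parsing keeps only the last value per key
print(json.loads(raw, object_pairs_hook=merge_duplicate_keys))
```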
Hopefully this more in-depth, realistic example with the unfolded state of the drkN SH3 domain has provided you with the utilities and usage examples to explore IDPConfGen further with your own protein systems.
Modeling Disordered Region Tails on a Folded Domain¶
Note
When modeling multi-chain complexes with the ldrs
subclient,
the FASTA file format for the -seq
parameter must be as follows with no
blank spaces.
>A
Sequence for chain A
>B
Sequence for chain B
If you would like to skip a chain while modeling multi-chain complexes,
you must have the identical sequence in the .fasta
file to the chains
in the template you would like to skip.
Clash-checking will be done with the skipped chains taken into consideration.
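A file in the multi-chain layout shown in this note can be written with a few lines of Python. The sketch below is illustrative only: format_multichain_fasta is a hypothetical helper, the sequences are placeholders, and it strips both spaces and line breaks from each sequence because the note asks for no blank spaces.

```python
def format_multichain_fasta(chains):
    """Format {chain_id: sequence} as FASTA text without blank lines."""
    lines = []
    for chain_id, sequence in chains.items():
        lines.append(f">{chain_id}")
        lines.append("".join(sequence.split()))  # drop spaces and line breaks
    return "\n".join(lines) + "\n"


print(format_multichain_fasta({"A": "MKT AYIAK", "B": "GSHMLE"}), end="")
```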
The following example will walk you through building N-terminal and C-terminal
IDR tails on folded regions using the Local Disordered Region Sampling (LDRS
ldrs
subclient).
For this exercise, we will be constructing the tails on the human CNOT7
deadenylase protein. Please enter the example/cnot7_example
folder where you will find the complete CNOT7 sequence: cnot7.fasta
, and a
PDB of the folded region from PDB ID 4GMJ: 4GMJ_CNOT7.pdb
.
Note
If your input PDB has phosphorylated residues such as phosphorylated
threonine and serine, please change the three-letter code in the PDB file
indicating the residue label to the non-modified version. For example:
TPO
phosphorylated threonine will become THR
and SEP
phosphorylated serine will become SER
.
Using the resre
subclient can help you with this.
Steps from now will assume you’re in the example/cnot7_example
directory and
have already created your preferred reusable IDPConformerGenerator database.
For instructions on the database, please visit the previous exercise “A Real
Case Scenario”.
To generate the disordered terminal tails on CNOT7 run the following command:
idpconfgen ldrs \
-db <PATH TO DATABASE.JSON> \
-seq cnot7.fasta \
-etbb 100 \
-etss 250 \
-nc 100 \
-fld 4GMJ_CNOT7.pdb \
-of ./cnot7_ldrs_L+_faspr \
-n
The ldrs subclient will automatically detect the N-IDR and C-IDR tails based on mismatches
in the primary sequence of the .fasta file (or input sequence from -seq) and the PDB
file of the folded domain. This command took approximately 3 minutes on a single workstation with
64 GB DDR4 RAM and 50 CPU threads (-n 50) clocked at 3.0 GHz.
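To make the idea of detecting tails from sequence mismatches concrete, here is a deliberately simplified Python sketch. It is not the LDRS implementation: it assumes the folded domain appears as one exact, contiguous substring of the full sequence, and both find_tails and the toy sequences are hypothetical.

```python
def find_tails(full_sequence, folded_sequence):
    """Return (N-terminal tail, C-terminal tail) around the folded domain."""
    start = full_sequence.find(folded_sequence)
    if start == -1:
        raise ValueError("folded sequence not found in the full sequence")
    end = start + len(folded_sequence)
    return full_sequence[:start], full_sequence[end:]


n_tail, c_tail = find_tails("MSTAGKVLLEEPQ", "GKVLL")
print(n_tail, c_tail)
```

Residues present in the full sequence but absent from the template on either side of the match are the candidates for the N-IDR and C-IDR.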
To check your outputs against what is expected for this tutorial section, please click
here
and download the archive named cnot7_ldrs_example.zip.
Note
Sidechain clashes may appear if you use the FASPR method for packing sidechains
above. To guarantee no sidechain clashes, we recommend either lowering the steric-clash
tolerance using the -tol flag above or generating backbone-only conformers first
and then packing sidechains with MC-SCE as described below.
The following commands generate backbone-only IDR tails on CNOT7 and then pack sidechains on the IDRs with MC-SCE.
We will be using agnostic secondary structure sampling here with --dany:
idpconfgen ldrs \
-db <PATH TO DATABASE.JSON> \
-seq cnot7.fasta \
--dloop-off \
--dany \
-etbb 100 \
-dsd \
-nc 1000 \
-fld 4GMJ_CNOT7.pdb \
-of ./cnot7_ldrs_ANY_bb \
-n
idpconfgen resre \
./cnot7_ldrs_ANY_bb/ \
-of ./cnot7_ldrs_ANY_bb_resre \
-pt 126:HIP,157:HIP,225:HIP,249:HIP,258:HIP, \
-n
mkdir cnot7_ldrs_ANY_mcsce
mcsce \
./cnot7_ldrs_ANY_bb_resre \
64 \
-w \
-o ./cnot7_ldrs_ANY_mcsce \
-l ./mcsce_log \
-s \
-m simple \
-f 12-262
Note
You can access the MC-SCE software here to ignore folded regions and add post-translational modifications during the sidechain packing process.
If you run into an error with mcsce and your input PDB has histidines labeled as HIS,
please change the three-letter code in the PDB file to HIP to account for all
protonation states.
Using the resre subclient as shown above can help you with this.
Any other parameters will only impact the disordered regions generated. Additional settings
include -tol
and -kt
, where the former sets a tolerance for clashes between the
disordered tail(s) and folded domain while the latter acts as a switch to retain the
disordered tail(s) generated in the building process. By default, disordered tail only
conformers are deleted after full length conformers are generated.
Modeling Disordered Regions Within Folded Domains¶
The following example will walk you through building an 86-residue-long IDR
connecting two folded domains. By now, we expect you to be familiar with the
Local Disordered Region Sampling (LDRS ldrs
subclient). If not, please visit
the cnot7_example
.
For this exercise, we will construct the intrinsically disordered region from
residues 568-652 on the STAS domain of SLC26A9 (PDB ID 7CH1, UniProt A0A7K9KBT8).
In the example/slc26a9_example
folder you will find the complete FASTA
sequence: SLC26A9_STAS.fasta
, and a PDB of the folded region from PDB ID
7CH1: 7CH1_SLC26A9.pdb
.
Note
Modeling an IDR between folded regions may take a while depending on various factors such as the length of the IDR to model, the distance between the chain breaks, the location of the chain break, and the presence of folded parts that restrict the growth of the chain.
To continue the tutorial, navigate to the example/slc26a9_example
directory.
Ensure you have already created your preferred reusable IDPConformerGenerator
database. For instructions on the database, please visit the previous exercise,
“A Real Case Scenario”.
We will be using the ldrs subclient to model 50 conformations of the
intrinsically disordered region. Sidechain clashes may exist if you decide
to use the FASPR method for generating sidechains, as shown below:
idpconfgen ldrs \
-db <PATH TO DATABASE.JSON> \
-seq SLC26A9_STAS.fasta \
-etbb 100 \
-etss 250 \
-nc 50 \
--dloop-off \
--dany \
-fld 7CH1_SLC26A9.pdb \
-of ./slc26a9_ldrs_ANY_faspr \
-n
From the .fasta
file, the ldrs
subclient will automatically identify the
N-IDR, the C-IDR, and any IDRs missing between folded domains; and construct
those. The speed of Linker-IDR generation varies with sequence length as well as its relative
position to the folded domain.
To guarantee no sidechain clashes, we recommend either lowering the steric-clash
tolerance using the -tol
flag above or generating backbone-only conformers first
then packing sidechains later with MC-SCE as described below in the Advanced LDRS
Usage section.
To check your outputs against what is expected for this tutorial section, please click
here
and download the archive named slc26a9_ldrs_example.zip.
Advanced LDRS Usage¶
For those more proficient in Python, the modularity of LDRS allows users access to advanced features that we describe here. Such features include profiting from better parallelization that allows modeling longer IDRs in a shorter time (e.g., 221 residues in one or two days). Here, we explain how to use the modularity of IDPConformerGenerator to exploit its total capacity for modeling IDRs by writing two new Python scripts that import IDPConfGen machinery.
For template PDB structures, please ensure they have the element name at the
second last column in the PDB file. If you’re unsure about the formatting, you can
use the Export Molecule
feature in PyMOL.
The element name column will be automatically added.
The logic behind the LDRS subclient for modeling an IDR connecting two folded domains assumes that we have an N-IDR-like case at the C-terminal region of the first folded domain and a C-IDR-like case at the N-terminal region of the second domain. Thus, when defining the IDR sequence in the FASTA file given to the -seq parameter, we need to provide two overlapping residues at each end. Those will be “QK” and “LA” in this example.
We have already prepared the IDR sequence for this example; see the
SLC26A9_IDR.fasta
file. You can perform sequence alignment between the
SLC26A9_IDR.fasta
file and the slc26a9_example.fasta
file to visualize
where the IDR fits within the whole protein sequence.
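The overlap requirement can be illustrated with a small Python helper that cuts an IDR out of a full sequence and keeps two flanking residues on each side. This is a sketch on a toy sequence; idr_with_overlap is a hypothetical function, residue numbers are taken to be 1-indexed and inclusive, and the numbering of your own template may differ from the positions in the FASTA file.

```python
def idr_with_overlap(sequence, first, last, overlap=2):
    """Return residues first..last (1-indexed, inclusive) plus flanking overlap."""
    begin = first - 1 - overlap
    stop = last + overlap
    if begin < 0 or stop > len(sequence):
        raise ValueError("not enough flanking residues for the requested overlap")
    return sequence[begin:stop]


print(idr_with_overlap("ACDEFGHIKL", 4, 6))
```

In the toy call, residues 4 to 6 form the IDR and the two residues on either side play the role that “QK” and “LA” play in the SLC26A9 example.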
Here is a brief overview of what we will do to speed up the process of closing the chain break (L-IDR) with an all-atom IDR model. We show two scripts you can use as templates for using the LDRS features of IDPConformerGenerator.
1. Generate 10,000 backbone-only structures (more = better sampling) of the IDR:
The idea is to have IDPConformerGenerator generate a large library of IDPs that
may represent the IDR to model. We are exploiting IDPConformerGenerator’s speed
and diversity for generating conformers. The time rate-limiting step here is the
next-seeker protocol (see scripts), where we have to compare all of the
structures in slc26a9_cterm to all structures in slc26a9_nterm to find
our candidates for sidechain addition.
2. Create the necessary folders for the script to run:
3. Change the paths in the script slc26a9_shortcut.py and run it; always use
the idpconfgen Python environment.
4. Use psurgeon() in the slc26a9_stitching.py script to attach the all-atom
IDR models to the folded domain. The output for this will be in slc26a9_results.
5. Use the resre module to rename any HIS to HIP that exist after stitching:
6. Model the sidechains onto the backbone-only L-IDRs stitched onto the folded region generated previously in the
results folder using the MC-SCE software:
To further save time, especially on a computing cluster, we can split the
conformers in the nterm
folder and run jobs in parallel or request more workers.
Furthermore, the conformers in slc26a9_results
can be split to run mcsce
in
parallel as well. Please note that this shortcut is not a memory-intensive
task, so 8 GB of RAM is sufficient to run the next-seeker
protocol.
Processing Low-Confidence Predicted Residues¶
Although the ldrs subclient accepts any PDB or mmCIF file as a template,
a script has been prepared in this folder, /remove_lowconfidence_residues.py,
to automate the removal of low-confidence predicted residues.
Sample thresholds have been given within the Python script that uses the idpconfgen
environment based on the source of each structure prediction algorithm. For example,
a threshold of 70 needs to be applied to AlphaFold structures and 0.7 for ESMFold structures.
Please edit the input_file
, output_file
and threshold
variables in the script
before running with python remove_lowconfidence_residues.py
in the idpconfgen
Python
environment.
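For readers who want to see what such a filter involves, the following Python sketch drops ATOM records whose confidence falls below a threshold. It is not the provided remove_lowconfidence_residues.py script. It assumes the confidence score is stored in the B-factor column (columns 61-66) of ATOM records, as is the case for AlphaFold model files, and the helper names and dummy records are hypothetical.

```python
def atom_line(serial, name, resname, chain, resseq, bfactor):
    """Build a fixed-width PDB ATOM record with dummy coordinates."""
    return (
        f"ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain}{resseq:4d}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{bfactor:6.2f}"
    )


def filter_by_confidence(pdb_lines, threshold):
    """Keep ATOM records whose B-factor column meets the threshold."""
    kept = []
    for line in pdb_lines:
        if line.startswith("ATOM"):
            confidence = float(line[60:66])  # columns 61-66, zero-indexed slice
            if confidence < threshold:
                continue
        kept.append(line)  # non-ATOM records are passed through unchanged
    return kept


records = [
    atom_line(1, "CA", "MET", "A", 1, 91.2),
    atom_line(2, "CA", "SER", "A", 2, 45.0),
    "TER",
]
print(len(filter_by_confidence(records, 70)))
```

The same function works for scores on a 0 to 1 scale if the threshold is given on that scale, for example 0.7.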
Note
It is sometimes preferable to remove residues manually using a molecular viewer such as PyMOL to avoid ending up with short segments of confident, yet unwanted residues.
For example, 2 “confident” residues in between 10 unconfident residues should be removed as well for optimal performance.
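The manual clean-up described in this note can also be approximated in code. The sketch below takes one boolean per residue (True for confident) and marks confident runs shorter than a minimum length as unconfident, so they are removed together with their surroundings. The helper drop_short_confident_runs and the default minimum length of 3 are assumptions chosen for illustration.

```python
def drop_short_confident_runs(confident, min_length=3):
    """Mark confident runs shorter than min_length as not confident."""
    result = list(confident)
    i = 0
    while i < len(result):
        if result[i]:
            j = i
            while j < len(result) and result[j]:
                j += 1  # advance to the end of this confident run
            if j - i < min_length:
                result[i:j] = [False] * (j - i)
            i = j
        else:
            i += 1
    return result


flags = [False] * 5 + [True] * 2 + [False] * 5 + [True] * 6
print(drop_short_confident_runs(flags))
```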
Modeling Disordered Regions in a Multi-Chain Protein Complex¶
The following example will walk you through building intrinsically disordered
regions on multiple chains of a protein complex using the Local Disordered
Region Sampling (LDRS ldrs
subclient).
Note
Please ensure all sequences are within the .FASTA file even for chains you are not interested in LDRS processing. This will help LDRS determine which chain to process.
Please refer to D1D2.fasta for a formatting example.
For this exercise, we will be constructing a combination of the three cases of
N-IDR, L-IDR, and C-IDR on both chain A and chain B of the crystal structure of the
D1D2 sub-complex from the human snRNP core domain. Please enter the
example/complex_example
folder where you will find the complete set of
sequences: D1D2.fasta
, and a PDB of the complex from the
RCSB PDB ID 1B34: 1B34.pdb
.
Steps from now will assume you’re in the example/complex_example
directory and
have already created your preferred reusable IDPConformerGenerator database.
For instructions on the database, please visit the previous exercise “A Real
Case Scenario”.
Due to the automated process of multi-chain detection and building we can generate a set of 10 structures with a single command:
idpconfgen ldrs \
-db <PATH TO DATABASE.JSON> \
-seq D1D2.fasta \
-nc 10 \
-fld 1B34.pdb \
--dloop-off \
--dany \
-of ./D1D2_ldrs_ANY_faspr \
-n
The ldrs subclient will automatically detect all IDRs and their corresponding
chains based on sequence similarity and mismatches in the primary sequence of the .fasta
file and the PDB file of the folded domain. This command took approximately an hour on a
single workstation with 64 GB DDR4 RAM and 10 CPU threads (-n 10) clocked at 3.0 GHz.
To check your outputs against what is expected for this tutorial section, please click
here
and download the archive named d1d2_complex_ldrs_example.zip.
Exploring IDPConfGen Analysis Functions¶
Our vision for IDPConformerGenerator as a platform includes the analysis of your database and the PDBs generated by IDPConfGen.
To get started, the stats
subclient is a quick way to check how many hits for different sequence fragment matches you will
find in the database for your protein system of choice. It is also possible to include different secondary structure filters as well
as amino-acid substitutions to get a more accurate representation of the number of hits in the database for your system:
idpconfgen stats \
-db idpconfgen_database.json \
-seq drksh3.fasta \
--dloop-off \
--dany \
-op drk_any \
-of ./drk_any_dbStats
Another tool to investigate the database is the search
subclient. To use this, you will need a tarball or folder of raw PDBs acquired
with the fetch subclient. The search
function goes through the PDB headers to find keywords of your choice and returns the
number of hits and their associated PDB IDs in .JSON format:
idpconfgen fetch \
../cull100 \
-d ./cull100pdbs/ \
-u \
-n
idpconfgen search \
-fpdb ./cull100pdbs/ \
-kw 'thermococcus,pro,beta' \
-n
After generating conformer ensembles with IDPConfGen, it is possible to do some basic plotting with the integrated plotting flags
in the torsions
and sscalc
subclients. For torsions
, you can choose to plot either omega, phi, or psi dihedral
angle distributions in a scatter plot format. For sscalc
, fractional secondary structure will be plotted in terms of DSSP codes
as well as fractions from the alpha, beta, or other regions of the Ramachandran space for your conformers of choice. The following example
plots the psi angle distributions and the fractional secondary structure of the drk_CSSSd2D_nosub_mcsce
ensemble generated in the previous
module:
idpconfgen torsions \
./drk_CSSSd2D_nosub_mcsce \
-deg \
-n \
--plot angtype=psi xlabel=drk_residues
To plot the fractional Ramachandran space information:
idpconfgen torsions \
./drk_CSSSd2D_nosub_mcsce \
-deg \
-n \
--ramaplot filename=fracDrkRama.png colors=['o', 'b', 'k']
To plot the fractional secondary structure information:
idpconfgen sscalc \
./drk_CSSSd2D_nosub_mcsce \
-u \
-rd \
-n \
--plot filename=dssp_reduced_drk_.png
To see which plotting parameters can be modified, please refer to src/idpconfgen/plotfuncs.py
. We have given a short list of modifiable parameters here:
--plot title=<TITLE> title_fs=<TITLE FONT SIZE> xlabel=<X-AXIS LABEL> xlabel_fs=<X-AXIS LABEL FONT SIZE> colors=<LIST_OF_COLORS>
Exploring MC-SCE and Int2Cart Integrations¶
Integrating the functions from our collaborators at the Head-Gordon Lab, IDPConformerGenerator has the ability to build with bond geometries derived from a recurrent neural network machine learning model Int2Cart. Furthermore, as we introduced the MC-SCE method for building sidechains in the previous modules, we would like to provide some examples on changing the default sidechain settings.
To use the Int2Cart method for bond geometries, the --bgeo-strategy
flag needs to be set to int2cart during the building stage:
idpconfgen build \
-db idpconfgen_database.json \
-seq drksh3.fasta \
-etbb 100 \
-etss 250 \
-nc 100 \
-csss csss_drk_d2D.json \
--dloop-off \
-et 'pairs' \
-scm mcsce \
--bgeo-strategy int2cart \
-of ./drk_CSSSd2D_nosub_int2cart_mcsce \
-n
To change the number of trials for MC-SCE to optimize success rate and overall speed:
idpconfgen build \
-db idpconfgen_database.json \
-seq drksh3.fasta \
-etbb 100 \
-etss 250 \
-nc 100 \
-csss csss_drk_d2D.json \
--dloop-off \
-et 'pairs' \
-scm mcsce \
--mcsce-n_trials 64 \
-of ./drk_CSSSd2D_nosub_32_trials_mcsce \
-n
How to Efficiently Set Jobs up for HPC Clusters¶
Using the sethpc subclient, users can generate bash scripts for SLURM-managed
systems. Due to the architecture of Python’s multiprocessing module, IDPConformerGenerator is
unable to utilize the resources of multiple nodes on HPC clusters. However, with sethpc,
users are able to request multiple nodes per job and sethpc will automatically generate
the SBATCH scripts needed, along with an all*.sh and cancel*.sh script to run or cancel
all of the generated jobs with ease.
Please note that on many HPC resources (such as Graham) your queuing priority will not change requesting 5 nodes per job or 1 node per 5 jobs, but this should be confirmed.
If multiple nodes are requested, the merge subclient can be run once all jobs have finished to
merge all of the generated conformers into one folder, with the option of modifying the naming pattern
for each structure. Please see below for an example of running sethpc and merge.
To request 3 nodes to generate 512,000 structures of the unfolded state of the drkN SH3 domain with 10 hours per node:
idpconfgen sethpc \
-des ./drk_hpc_jobs/ \
--account def-username \
--job-name drk_hpc \
--nodes 3 \
--ntasks-per-node 32 \
--mem 16g \
--time-per-node 0-10:00:00 \
--mail-user your@email.com \
-db idpconfgen_database.json \
-seq drksh3.fasta \
-etbb 100 \
-etss 250 \
-nc 512000 \
-csss csss_drk_d2D.json \
--dloop-off \
-et 'pairs' \
-scm mcsce \
--bgeo-strategy int2cart \
-of /scratch/user/drk/ \
-n 32 \
-rs 12
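How sethpc divides the requested conformers among nodes is not described on this page. Purely as an illustration of the arithmetic involved, the Python sketch below splits a total into near-equal integer shares; split_evenly is a hypothetical helper and does not reflect the subclient's actual behaviour.

```python
def split_evenly(total, parts):
    """Split total into parts integers that differ by at most one."""
    base, extra = divmod(total, parts)
    return [base + 1 if i < extra else base for i in range(parts)]


print(split_evenly(512000, 3))
```

For 512,000 conformers over 3 nodes this gives two shares of 170,667 and one of 170,666, which is a useful figure when estimating whether the requested time per node is sufficient.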
To merge all of the folders created by the multi-node jobs:
idpconfgen merge \
-tgt /scratch/user/drk/ \
-des /scratch/user/drk/drk_CSSSd2D_nosub_multiple_mcsce \
-pre drk_confs \
-del
Using IDPConfGen as a Python library¶
To use IDPConformerGenerator in your project, import it as a library:
import idpconfgen
From within the Python prompt you can get information on each module, class, and
function with help(idpconfgen)
. You can also access the whole API
documentation here at the reference page.
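Besides help(), the standard library can list the submodules of any installed package, which is a quick way to see what a library exposes. The sketch below uses only importlib and pkgutil; list_submodules is a hypothetical helper, and it is demonstrated on the built-in json package so that it runs without IDPConformerGenerator installed.

```python
import importlib
import pkgutil


def list_submodules(package_name):
    """Return the sorted names of the direct submodules of a package."""
    package = importlib.import_module(package_name)
    return sorted(info.name for info in pkgutil.iter_modules(package.__path__))


# "json" is used so the example runs anywhere; replace it with "idpconfgen"
# in an environment where IDPConformerGenerator is installed.
print(list_submodules("json"))
```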