Usage¶

IDPConformerGenerator runs entirely from the command line. Follow the explanations on this page together with the help documentation of the command-line interfaces themselves.

Command-lines¶

To run the idpconfgen command-line interface after installation, type idpconfgen in your terminal window:

idpconfgen

or:

idpconfgen -h

Both will output the help menu.

Note

All subclients have the -h option to show help information.

idpconfgen has several interfaces that perform different functions. However, a specific sequence of interfaces must be used to prepare the local torsion angle database and the files needed to build conformers. After these operations have been executed, you will end up with a single json file that you can use to feed the build calculations. The other files can safely be removed.

IDPConfGen Small Peptide Example¶

The example/ folder contains instructions to set up the IDPConformerGenerator database from scratch and generate conformers for a small peptide.

Note

The purpose of this module is to test the installation of IDPConformerGenerator on a small example peptide. The database you build here should not be used for further cases, much less in practice, because it contains only 100 PDB IDs.

To build the torsion angle database you need to provide IDPConfGen with a list of PDB/mmCIF files. Our advice is that you use a culled list of your choice from the Dunbrack PISCES database.

The PDB chain id list should have the format of the cull100 file in this folder, which emulates the format provided by PISCES. Actually, the only column that is read by IDPConfGen is the first column. No header lines are allowed.

12E8H       221  XRAY        1.900    0.22    0.27
16PKA       415  XRAY        1.600    0.19    0.23
1A05A       358  XRAY        2.000    0.20    0.28
1A0JA       223  XRAY        1.700    0.17    0.21
1A12A       413  XRAY        1.700    0.19    0.22
1A1XA       108  XRAY        2.000    0.21    0.25

The first four alphanumeric characters are the PDB ID code. The fifth (and any following) characters are the PDB chain identifier. mmCIF files can have chain IDs of several characters.
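As a minimal sketch of how an entry in the first column can be split, the helper below assumes a four-character PDB ID followed by one or more chain characters, as in the cull100 excerpt above. The function name split_pdb_chain is hypothetical and is not part of IDPConfGen.

```python
def split_pdb_chain(line):
    """Split the first column of a PISCES-style line into (PDB ID, chain ID)."""
    # Only the first whitespace-separated column is used; the rest is ignored.
    code = line.split()[0]
    if len(code) < 5:
        raise ValueError(f"expected a PDB ID plus a chain ID, got {code!r}")
    # The PDB ID is four characters; whatever follows is the chain identifier.
    return code[:4], code[4:]


pdb_id, chain = split_pdb_chain("12E8H       221  XRAY        1.900    0.22    0.27")
print(pdb_id, chain)
```

The same helper also handles multi-character chain identifiers, because everything after the fourth character is returned as the chain.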

Run the following command to download the PDB files:

idpconfgen pdbdl cull100 -u -n -d pdbs.tar

You can inspect all options of this (and any other) subclient with -h:

idpconfgen pdbdl -h

This execution will create a pdbs.tar file with the parsed PDBs. Those PDBs contain only the information needed for IDPConfGen. Unnecessary chains or residues have been removed.

We now proceed to the identification of secondary structure elements. For this you need to have the DSSP program installed. IDPConfGen is built to be modular. You can also use any other program to calculate secondary structure, but you would need to implement the respective parser. The DSSP parser is already implemented. To install DSSP follow these instructions: https://github.com/julie-forman-kay-lab/IDPConformerGenerator/issues/48.

The following command will operate on the pdbs.tar file and will create temporary files and a result file with the DSSP information:

idpconfgen sscalc pdbs.tar -rd -n

You will see that the files sscalc.json and sscalc_splitted.tar were created. sscalc.json matches the sequence information with that of the secondary structure identity. sscalc_splitted.tar contains the PDB chains split into continuous chains.

Now we need to calculate the torsion angles. There are several options available in the command line but these are good defaults:

idpconfgen torsions sscalc_splitted.tar -sc sscalc.json -o idpconfgen_database.json -n

This will create the final file idpconfgen_database.json. This is the file IDPConfGen needs to generate conformers. This is the torsion angle database file. If you open it, you will see it is a regular human-readable json file.
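Because the database is a regular JSON file, it can be inspected with Python's standard json module. The sketch below only counts the top-level entries; it makes no assumption about the layout inside each entry, and the function name count_database_entries is hypothetical.

```python
import json


def count_database_entries(path):
    """Return the number of top-level entries in a JSON database file."""
    with open(path) as fin:
        database = json.load(fin)
    # Works for a top-level JSON object (dict) or array (list).
    return len(database)
```

For instance, count_database_entries("idpconfgen_database.json") would report how many top-level entries the file created above contains.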

Finally, to generate conformers you will use the build interface. The build interface has several parameters that can be used to fine-tune the conformer construction protocol. You can read more detailed instructions in the documentation and client help. The following is a good default that uses the FASPR method for adding side chains:

idpconfgen build -db idpconfgen_database.json -seq EGAAGAASS -nc 10 --dloop-off --dany -n

After some time you will see 10 conformers in the folder. Please note that searching for loops is enabled by default for --dloop. Appending --dhelix --dstrand will extend sampling to alpha-helices and beta-strands in addition to loops. For more information on usage, please view idpconfgen build -h.

If you would like to use a program to assign secondary structure bias based on NMR chemical shifts (e.g. CheSPI or δ2D) to employ probabilistic custom secondary structure sampling (CSSS), the probs8_[ID].txt output from CheSPI or .TXT output from δ2D would have to be standardized into a user-editable text file indicating the probability of secondary structures (based on DSSP codes) on a per residue basis. The following example will process CheSPI output and assign probabilities to L/H/E based on H/G/I/E/ /T/S/B structures on a per residue basis:

idpconfgen csssconv -p8 probs8_ex.txt -o csss_ex.json

For simplicity, secondary structures from CheSPI and δ2D are grouped into L/H/E as defined by idpconfgen. If you do not want this grouping feature, please build the database above without -rd and run csssconv with --full to avoid grouping. To build with the CSSS file, -csss would have to point to the CSSS.JSON file:

idpconfgen build -db idpconfgen_database.json -seq EGAAGAASS -nc 10 -csss csss_ex.json --dloop-off -n

After some time you will see 10 conformers in the folder with the probabilistic CSSS.

All IDPConfGen operations can be distributed over multiple cores. Use the flag -n to indicate the number of cores you wish to use. Appending only -n will use all available CPU threads except for one.

This is all you need to know for basic usage of IDPConfGen. Now you can use a larger PDB database to feed the conformational sampling.

Building With Variable Bond-Geometry Strategies¶

To increase user flexibility and build parameterization, IDPConformerGenerator has 4 bond geometry strategies to choose from: sampling (default), fixed, exact, and int2cart. These strategies can be selected using the --bgeo-strategy flag during the build process. Please note that a new database needs to be generated with the bgeodb subclient to use the exact strategy, and Int2Cart will need to be installed to use the int2cart strategy (see below).

The default sampling strategy aims to overcome the limitations of having fixed bond angles and bond lengths to increase the diversity of conformations sampled. The fixed strategy uses average bond geometries on a per-residue basis derived from an extended Dunbrack PISCES database cull_d200611/200611/cullpdb_pc90_res1.6_R0.25_d200611_chains8807. The exact method uses exact bond/bend angles and bond lengths for each residue in a fragment sampled from the database. To initialize the new, backwards-compatible database, use the bgeodb module on the idpconfgen_database.json previously generated using torsions. As a reminder, this requires the sscalc_splitted.tar file generated by sscalc in the early stages of creating the database:

idpconfgen bgeodb sscalc_splitted.tar -sc idpconfgen_database.json -o idpconfgen_extended_database.json -n

A Real Case Scenario¶

Note

The purpose of this module is to use IDPConformerGenerator as you would with a real protein of interest (unfolded drkN SH3 is presented here). The database procured from this module can be re-usable for future cases.

The example with a small peptide in the example folder is a good way to get introduced to IDPConfGen. Although building other IDP conformer ensembles uses the same workflow as the previous example, we will go over more detailed usage examples with a well-studied IDP, the unfolded state of the drkN SH3 domain.

Chemical shift data for the unfolded state of the drkN SH3 domain (BMRB ID: 25501) has been already processed with δ2D and CheSPI and secondary structure propensity calculations can be found in example/drksh3_example as drk_d2D.txt and probs8_25501_unfolded.txt respectively for δ2D and CheSPI output.

An extensive culled list is in cull.tar. Unpack it with:

tar -xf cull.tar

This is the same culled list used in the IDPConfGen main publication. However feel free to choose your own from the Dunbrack PISCES database.

Steps from now on will assume you’re in the working directory of example/drksh3_examples.

To initialize the database if you do not already have one, we must download the PDB files from our culled list (can be found in the supplemental package in the IDPConformerGenerator paper):

idpconfgen pdbdl cullpdb_pc90_res2.0_R0.25_d201015_chains24003 -u -n -d pdbs.tar

Next we will create temporary files storing the secondary structure information for each PDB file downloaded, to be processed later for their torsion angles:

idpconfgen sscalc pdbs.tar -rd -n -cmd <DSSP EXEC>

Please note that since IDPConfGen is a toolkit, many of these modules can be used with custom folders or .tar files.

Finally, torsion angles are extracted and the database we will use for future calculations can be created with the torsions subclient:

idpconfgen torsions sscalc_splitted.tar -sc sscalc.json -n -o idpconfgen_database.json

Now we’re ready to construct multiple conformer ensembles for the unfolded state of the drkN SH3 domain. The following command builds 100 conformers, sampling only loop regions, with the default backbone and side chain L-J energy limits of 100 kJ and 250 kJ respectively, default fragment sizes, no substitutions, and side chains added with FASPR:

idpconfgen build \
    -db idpconfgen_database.json \
    -seq drksh3.fasta \
    -nc 100 \
    -of ./drk_L+_nosub_faspr \
    -n

idpconfgen is deterministic. Therefore, the random seed defines the sampling progression - read here for more information.

To switch the side chain building algorithm to MCSCE (recommended), you would first have to install MCSCE. Please revisit the installation page to get MCSCE set up. For example:

idpconfgen build \
    -db idpconfgen_database.json \
    -seq drksh3.fasta \
    -nc 100 \
    -scm mcsce \
    -of ./drk_L+_nosub_mcsce \
    -n

Note

Running MCSCE within IDPConformerGenerator can be memory (RAM) intensive. Consider running with a lower number of CPU threads using the -n flag if necessary.

The default for --mcsce-n_trials is 16 when using --mcsce-mode exhaustive; however, we recommend 100 or more trials for smaller conformer pools. In this exercise, we will be using the default MCSCE side chain building mode, simple.

If you encounter an error with MCSCE running internally through IDPConformerGenerator, we recommend generating backbones first and packing sidechains afterwards. For example, these are the commands required to generate backbones first and then sidechains:

idpconfgen build \
    -db idpconfgen_database.json \
    -seq drksh3.fasta \
    -nc 100 \
    -dsd \
    -of ./drk_L+_nosub_bb \
    -n

# Make the output folder for MCSCE
mkdir ./drk_L+_nosub_mcsce

# Run MCSCE in the idpconfgen environment because it's already installed
mcsce ./drk_L+_nosub_bb 100 \
    -w \
    -o ./drk_L+_nosub_mcsce \
    -s \
    -m simple \
    -l drk_L+_nosub_mcsce_log

As stated in idpconfgen build -h, sampling with other secondary structure parameters requires --dloop to be turned off with --dloop-off. For example, if we’d like to sample only helices and extended strands:

idpconfgen build \
    -db idpconfgen_database.json \
    -seq drksh3.fasta \
    -nc 100 \
    -et 'pairs' \
    --dstrand \
    --dhelix \
    --dloop-off \
    -of ./drk_H+E+_nosub \
    -n

For sampling loops, helices, and strands, we would specify --dhelix --dstrand where --dloop is turned on by default. However, sampling without biasing for secondary structure can be done with --dany --dloop-off.

To sample using custom secondary structure sampling (CSSS) a CSSS database (.JSON) file needs to be created specifying the secondary structure probabilities for each residue. This can be done using the makecsss module if chemical shift data is not readily available, if you’d like to edit a pre-existing CSSS.JSON, or create a new file. Here’s an example for making a custom CSSS.JSON file that samples only helices for residues 15-25 of the unfolded state of the drkN SH3 domain and loops for everything else:

idpconfgen makecsss -cp 1-14 L 1.0|15-25 H 1.0|26-59 L 1.0 -o cust_csss_drk.json
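The pattern given to -cp above is a set of "range code probability" entries joined by "|". As a small illustration, the hypothetical helper below assembles such a string from a list of regions; it only reproduces the text format shown in the command and does not call idpconfgen.

```python
def build_csss_pattern(regions):
    """Join (first, last, ss_code, probability) tuples into a '|'-separated pattern."""
    parts = []
    for first, last, ss_code, probability in regions:
        if first > last:
            raise ValueError(f"invalid residue range: {first}-{last}")
        # Each entry reads "<first>-<last> <code> <probability>".
        parts.append(f"{first}-{last} {ss_code} {probability}")
    return "|".join(parts)


regions = [(1, 14, "L", 1.0), (15, 25, "H", 1.0), (26, 59, "L", 1.0)]
print(build_csss_pattern(regions))
```

Note that "|" is also the pipe operator in most shells, so check idpconfgen makecsss -h for how the pattern should be quoted on your system.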

If chemical shift files are readily available, consider using CheSPI or δ2D to generate the CSSS.JSON. δ2D predictions have been included in the example/drksh3_ex_resources folder as drk_d2D.txt. CheSPI probs8_* predictions have been included in the example/drksh3_ex_resources folder as probs8_25501_unfolded.txt.

To convert output from δ2D to CSSS, use the csssconv subclient with flag -d2D:

idpconfgen csssconv -d2D drk_d2D.txt -o csss_drk_d2D.json

To convert output from CheSPI to CSSS, use the csssconv subclient with flag -p8:

idpconfgen csssconv -p8 probs8_25501_unfolded.txt -o csss_drk_chespi.json

The outputted csss_*.json files will be used for the -csss flag in the build subclient. For example, constructing 100 conformers for the unfolded state of the drkN SH3 domain using the δ2D predictions and the same settings for energy and MCSCE as above:

idpconfgen build \
    -db idpconfgen_database.json \
    -seq drksh3.fasta \
    -nc 100 \
    -csss csss_drk_d2D.json \
    --dloop-off \
    -et 'pairs' \
    -of ./drk_CSSSd2D_nosub \
    -n

The default fragment size probabilities for building are (1, 1, 3, 3, 2) for fragment sizes of (1, 2, 3, 4, 5) respectively. To change this, we would have to create a .TXT file with two columns, the first specifying the fragment sizes from lowest to highest, the second specifying their relative probabilities. We have provided an example in example/drksh3_ex_resources as customFragments.txt. To use these custom fragment size probabilities with CSSS:

idpconfgen build \
    -db idpconfgen_database.json \
    -seq drksh3.fasta \
    -nc 100 \
    -xp customFragments.txt \
    -csss csss_drk_d2D.json \
    --dloop-off \
    -et 'pairs' \
    -of ./drk_fragN_CSSSd2D_nosub \
    -n
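To make the meaning of relative probabilities concrete, the sketch below normalizes the default weights (1, 1, 3, 3, 2) and formats a two-column table. The whitespace separator and both function names are assumptions made for illustration; compare with the provided customFragments.txt before writing your own file.

```python
def normalize_weights(sizes, weights):
    """Convert relative weights into probabilities that sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {size: weight / total for size, weight in zip(sizes, weights)}


def format_fragment_table(sizes, weights):
    """Format fragment sizes and relative weights as two whitespace-separated columns."""
    return "".join(f"{size} {weight}\n" for size, weight in zip(sizes, weights))


sizes = (1, 2, 3, 4, 5)
weights = (1, 1, 3, 3, 2)
print(normalize_weights(sizes, weights))
print(format_fragment_table(sizes, weights))
```

With the default weights, fragments of size 3 and 4 are each three times as likely to be drawn as fragments of size 1 or 2.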

Finally, to expand torsion angle sampling beyond the residue identity, we can provide a residue tolerance map using the -urestol flag in the build subclient. For this example, we will be using columns 5, 3, and 2 from the EDSSMat50 substitution matrix:

idpconfgen build \
    -db idpconfgen_database.json \
    -seq drksh3.fasta \
    -nc 100 \
    --dany \
    --dloop-off \
    -urestol '{"R":"RK","D":"DE","C":"CY","C":"CW","Q":"QH","E":"ED","H":"HYQ","I":"IVM","I":"IL","K":"KR","M":"MI","M":"MVL","F":"FY","F":"FWL","W":"WYFC","Y":"YF","Y":"YC","Y":"YWH"}' \
    -et 'pairs' \
    -of ./drk_ANY_sub532 \
    -n

Please note for the above run, we are sampling the torsion angle database disregarding secondary structure with the --dany flag.
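The tolerance map above repeats some keys (for example "Y"). How idpconfgen treats repeated keys is not stated here. The sketch below only shows what Python's standard json module does with them (the last value wins), together with a hypothetical helper, merge_tolerance, that unions the allowed residues per key in case that is what you intend.

```python
import json


def merge_tolerance(pairs):
    """Union the allowed residues for each key, keeping first-seen order."""
    merged = {}
    for residue, allowed in pairs:
        current = merged.get(residue, "")
        for amino_acid in allowed:
            if amino_acid not in current:
                current += amino_acid
        merged[residue] = current
    return merged


raw = '{"Y":"YF","Y":"YC","Y":"YWH"}'
# Default behaviour of json.loads: later duplicates overwrite earlier ones.
last_wins = json.loads(raw)
# object_pairs_hook=list exposes every (key, value) pair, duplicates included.
all_pairs = json.loads(raw, object_pairs_hook=list)
merged = merge_tolerance(all_pairs)
print(last_wins)
print(json.dumps(merged))
```

The merged dictionary can be serialized with json.dumps to obtain a map with one entry per residue.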

Hopefully this more in-depth, realistic example with the unfolded state of the drkN SH3 domain has provided you with the utilities and usage examples to explore IDPConfGen further with your custom protein systems.

Modeling Disordered Region Tails on a Folded Domain¶

Note

When modeling multi-chain complexes with the ldrs subclient, the FASTA file format for the -seq parameter must be as follows, with no blank spaces.

>A
Sequence for chain A
>B
Sequence for chain B

If you would like to skip a chain while modeling multi-chain complexes, you must have the identical sequence in the .fasta file to the chains in the template you would like to skip.

Clash-checking will be done with skipped chains taken into consideration.
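To check a multi-chain FASTA file before passing it to -seq, a small parser can map each chain ID to its sequence. This is a minimal sketch assuming each header line holds only the chain ID after ">"; the function name and the toy sequences are made up for illustration.

```python
def parse_multichain_fasta(text):
    """Return a dict mapping chain ID to sequence from FASTA-formatted text."""
    chains = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new chain.
            current = line[1:].strip()
            chains[current] = ""
        elif current is None:
            raise ValueError("sequence found before any '>' header")
        else:
            # Sequence lines are concatenated under the current chain.
            chains[current] += line
    return chains


example = ">A\nMSTGAKLV\n>B\nEDPQRWY\n"
print(parse_multichain_fasta(example))
```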

The following example will walk you through building N-terminal and C-terminal IDR tails on folded regions using Local Disordered Region Sampling (LDRS, the ldrs subclient).

For this exercise, we will be constructing the tails on the human CNOT7 deadenylase protein. Please enter the example/cnot7_example folder where you will find the complete CNOT7 sequence: cnot7.fasta, and a PDB of the folded region from PDB ID 4GMJ: 4GMJ_CNOT7.pdb.

Note

If your input PDB has phosphorylated residues such as phosphorylated threonine and serine, please change the three-letter code in the PDB file indicating the residue label to the non-modified version. For example: TPO phosphorylated threonine will become THR and SEP phosphorylated serine will become SER.

Using the resre subclient can help you with this.

Steps from now will assume you’re in the example/cnot7_example directory and have already created your preferred reusable IDPConformerGenerator database. For instructions on the database, please visit the previous exercise “A Real Case Scenario”.

To generate the disordered terminal tails on CNOT7 run the following command:

idpconfgen ldrs \
    -db <PATH TO DATABASE.JSON> \
    -seq cnot7.fasta \
    -etbb 100 \
    -etss 250 \
    -nc 100 \
    -fld 4GMJ_CNOT7.pdb \
    -of ./cnot7_ldrs_L+_faspr \
    -n

The ldrs subclient will automatically detect the N-IDR and C-IDR tails based on mismatches between the primary sequence in the .fasta file (or the input sequence from -seq) and the PDB file of the folded domain. This command took approximately 3 minutes on a single workstation with 64 GB DDR4 RAM and 50 CPU threads (-n 50) clocked at 3.0 GHz.
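The idea of locating tails from a sequence mismatch can be illustrated with plain string matching. The sketch below is not the algorithm used by ldrs; it assumes the folded-domain sequence occurs exactly once and contiguously within the full sequence, and the function name and toy sequences are hypothetical.

```python
def find_tails(full_sequence, folded_sequence):
    """Return (n_tail, c_tail): residues before and after the folded domain."""
    start = full_sequence.find(folded_sequence)
    if start == -1:
        raise ValueError("folded sequence not found in the full sequence")
    end = start + len(folded_sequence)
    # Residues before the match form the N-terminal tail, those after the C-terminal tail.
    return full_sequence[:start], full_sequence[end:]


n_tail, c_tail = find_tails("MSTGAKLVEDPQR", "AKLVE")
print(n_tail, c_tail)
```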

To check your outputs against what is expected for this tutorial section, please click here and download the archive named cnot7_ldrs_example.zip.

Note

Sidechain clashes may appear if you use the FASPR method for packing sidechains as above. To guarantee no sidechain clashes, we recommend either lowering the steric-clash tolerance using the -tol flag above or generating backbone-only conformers first and then packing sidechains later with MCSCE as described below.

To generate backbone-only IDR tails on CNOT7 and then pack sidechains on the IDRs with MCSCE, run the following commands. We will be using agnostic secondary structure sampling here with --dany:

idpconfgen ldrs \
    -db <PATH TO DATABASE.JSON> \
    -seq cnot7.fasta \
    --dloop-off \
    --dany \
    -etbb 100 \
    -dsd \
    -nc 1000 \
    -fld 4GMJ_CNOT7.pdb \
    -of ./cnot7_ldrs_ANY_bb \
    -n

idpconfgen resre \
    ./cnot7_ldrs_ANY_bb/ \
    -of ./cnot7_ldrs_ANY_bb_resre \
    -pt 126:HIP,157:HIP,225:HIP,249:HIP,258:HIP, \
    -n

mkdir cnot7_ldrs_ANY_mcsce

mcsce \
    ./cnot7_ldrs_ANY_bb_resre \
    64 \
    -w \
    -o ./cnot7_ldrs_ANY_mcsce \
    -l ./mcsce_log \
    -s \
    -m simple \
    -f 12-262

Note

You can access the MCSCE software here to ignore folded regions and add post-translational modifications during the sidechain packing process.

If you run into an error with mcsce and your input PDB has histidines labeled as HIS, please change the three-letter code in the PDB file to HIP to account for all protonation states.

Using the resre subclient as shown above can help you with this.

Any other parameters will only impact the disordered regions generated. Additional settings include -tol and -kt, where the former sets a tolerance for clashes between the disordered tail(s) and folded domain while the latter acts as a switch to retain the disordered tail(s) generated in the building process. By default, disordered tail only conformers are deleted after full length conformers are generated.

Modeling Disordered Regions Within Folded Domains¶

The following example will walk you through building an 86-residue-long IDR connecting two folded domains. By now, we expect you to be familiar with Local Disordered Region Sampling (LDRS, the ldrs subclient). If not, please visit the cnot7_example.

For this exercise, we will construct the intrinsically disordered region from residues 568-652 on the STAS domain of SLC26A9 (PDB ID 7CH1, UniProt A0A7K9KBT8). In the example/slc26a9_example folder you will find the complete FASTA sequence: SLC26A9_STAS.fasta, and a PDB of the folded region from PDB ID 7CH1: 7CH1_SLC26A9.pdb.

Note

Modeling an IDR between folded regions may take a while depending on various factors such as the length of the IDR to model, the distance between the chain breaks, the location of the chain break, and the presence of folded parts that restrict the growth of the chain.

To continue the tutorial, navigate to the example/slc26a9_example directory. Ensure you have already created your preferred reusable IDPConformerGenerator database. For instructions on the database, please visit the previous exercise, “A Real Case Scenario”.

We will be using the ldrs subclient to model 50 conformations of the intrinsically disordered region. Sidechain clashes may exist if you decide to use the FASPR method for generating sidechains, as below:

idpconfgen ldrs \
    -db <PATH TO DATABASE.JSON> \
    -seq SLC26A9_STAS.fasta \
    -etbb 100 \
    -etss 250 \
    -nc 50 \
    --dloop-off \
    --dany \
    -fld 7CH1_SLC26A9.pdb \
    -of ./slc26a9_ldrs_ANY_faspr \
    -n

From the .fasta file, the ldrs subclient will automatically identify the N-IDR, the C-IDR, and any IDRs missing between folded domains; and construct those. The speed of Linker-IDR generation varies with sequence length as well as its relative position to the folded domain.

To guarantee no sidechain clashes, we recommend either lowering the steric-clash tolerance using the -tol flag above or generating backbone-only conformers first then packing sidechains later with MCSCE as described below in the Advanced LDRS Usage section.

To check your outputs against what is expected for this tutorial section, please click here and download the archive named slc26a9_ldrs_example.zip.

Advanced LDRS Usage¶

For those more proficient in Python, the modularity of LDRS allows users access to advanced features that we describe here. Such features include profiting from better parallelization that allows modeling longer IDRs in a shorter time (e.g., 221 residues in one or two days). Here, we explain how to use the modularity of IDPConformerGenerator to exploit its total capacity for modeling IDRs by writing two new Python scripts that import IDPConfGen machinery.

For template PDB structures, please ensure they have the element name at the second last column in the PDB file. If you’re unsure about the formatting, you can use the Export Molecule feature in PyMOL. The element name column will be automatically added.

The logic behind the LDRS subclient for modeling an IDR connecting two folded domains assumes that we have an N-IDR-like case at the C-terminal region of the first folded domain and a C-IDR-like case at the N-terminal region of the second domain. Thus, when defining the IDR sequence in the FASTA file given to the -seq parameter, we need to provide two overlapping residues at each end. Those will be “QK” and “LA” in this example.
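A short sketch of how such an IDR sequence could be cut from the full sequence, keeping a two-residue overlap with the folded domains on both sides. The function name, the 1-based inclusive numbering, and the toy sequence are assumptions for illustration only; the prepared SLC26A9_IDR.fasta remains the reference for this tutorial.

```python
def idr_with_overlap(full_sequence, first, last, overlap=2):
    """Return the IDR (1-based, inclusive) plus `overlap` flanking residues per side."""
    if not 1 <= first <= last <= len(full_sequence):
        raise ValueError("residue range is outside the sequence")
    # Convert to 0-based slice indices and extend by the overlap on both sides,
    # without running past either end of the sequence.
    start = max(0, first - 1 - overlap)
    end = min(len(full_sequence), last + overlap)
    return full_sequence[start:end]


print(idr_with_overlap("MKTAYIAKQR", 4, 7))
```

In the toy sequence, residues 4 to 7 are "AYIA"; the returned string adds "KT" before and "KQ" after them.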

We have already prepared the IDR sequence for this example; see the SLC26A9_IDR.fasta file. You can perform sequence alignment between the SLC26A9_IDR.fasta file and the slc26a9_example.fasta file to visualize where the IDR fits within the whole protein sequence.

Here is a brief overview of what we will do to speed up the process of closing the chain break (L-IDR) with an all-atom IDR model. We show two scripts you can use as templates for using the LDRS features of IDPConformerGenerator.

1. Generate 10,000 backbone-only structures of the IDR (more structures give better sampling).

The idea is to have IDPConformerGenerator generate a large library of IDPs that may represent the IDR to model. We are exploiting IDPConformerGenerator’s speed and diversity for generating conformers. The rate-limiting step here is the next-seeker protocol (see scripts), where we have to compare all of the structures in slc26a9_cterm to all structures in slc26a9_nterm to find our candidates for sidechain addition.

2. Create the necessary folders for the script to run.

3. Change the paths in the script slc26a9_shortcut.py and run it; always use the idpconfgen Python environment.

4. Use psurgeon() in slc26a9_stitching.py script to attach the all-atom IDR models to the folded domain. The output for this will be in slc26a9_results.

5. Use the resre module to rename any HIS to HIP that exist after stitching.

6. Model the sidechains onto the backbone-only L-IDRs stitched onto the folded region generated previously in the results folder using the MCSCE software.

To further save time, especially on a computing cluster, we can split the conformers in the nterm folder and run jobs in parallel or request more workers. Furthermore, the conformers in slc26a9_results can be split to run mcsce in parallel as well. Please note that this shortcut is not a memory-intensive task, so 8 GB of RAM is sufficient to run the next-seeker protocol.
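Splitting a list of conformer files into roughly equal groups, one per parallel job, can be sketched as follows. The helper name split_into_chunks is hypothetical; how each group is then submitted depends on your cluster.

```python
def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous groups whose sizes differ by at most one."""
    items = list(items)
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    size, extra = divmod(len(items), n_chunks)
    chunks = []
    start = 0
    for index in range(n_chunks):
        # The first `extra` chunks receive one additional item.
        stop = start + size + (1 if index < extra else 0)
        chunks.append(items[start:stop])
        start = stop
    return chunks


print(split_into_chunks(range(10), 3))
```

In practice, items could be the sorted file names of the conformers in the nterm folder, with each chunk copied or linked into its own job directory.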

Processing Low-Confidence Predicted Residues¶

Although the ldrs subclient accepts any PDB or mmCIF file as a template, a script has been prepared in this folder /remove_lowconfidence_residues.py to automate the removal of low-confidence predicted residues.

Sample thresholds, based on the source of each structure prediction algorithm, are given within the Python script, which runs in the idpconfgen environment. For example, a threshold of 70 should be applied to AlphaFold structures and 0.7 to ESMFold structures.

Please edit the input_file, output_file and threshold variables in the script before running with python remove_lowconfidence_residues.py in the idpconfgen Python environment.

Note

It is sometimes preferable to remove residues manually using a molecular viewer such as PyMOL to avoid ending up with short segments of confident, yet unwanted residues.

For example, 2 “confident” residues in between 10 unconfident residues should be removed as well for optimal performance.
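The general idea of threshold-based removal can be sketched as below. This is not the provided remove_lowconfidence_residues.py script; it is a minimal illustration that assumes the prediction confidence is stored in the B-factor field (columns 61-66) of PDB coordinate records, and the function name is hypothetical.

```python
def filter_by_confidence(pdb_lines, threshold):
    """Drop ATOM/HETATM records whose B-factor field is below the threshold."""
    kept = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            # Columns 61-66 of a PDB coordinate record hold the B-factor,
            # assumed here to carry the per-residue confidence score.
            confidence = float(line[60:66])
            if confidence < threshold:
                continue
        # Non-coordinate records are kept unchanged.
        kept.append(line)
    return kept
```

This per-record filter does not handle the case described in the note above, where short stretches of confident residues surrounded by low-confidence ones should also be removed.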

Modeling Disordered Regions in a Multi-Chain Protein Complex¶

The following example will walk you through building intrinsically disordered regions on multiple chains of a protein complex using Local Disordered Region Sampling (LDRS, the ldrs subclient).

Note

Please ensure all sequences are within the .FASTA file, even for chains you are not interested in processing with LDRS. This will help LDRS determine which chain to process.

Please refer to D1D2.fasta for a formatting example.

For this exercise, we will be constructing a combination of the three cases of N-IDR, L-IDR, and C-IDR on both chain A and chain B of the crystal structure of the D1D2 sub-complex from the human SNRNP core domain. Please enter the example/complex_example folder where you will find the complete set of sequences: D1D2.fasta, and a PDB of the complex from the RCSB PDB ID 1B34: 1B34.pdb.

Steps from now will assume you’re in the example/complex_example directory and have already created your preferred reusable IDPConformerGenerator database. For instructions on the database, please visit the previous exercise “A Real Case Scenario”.

Due to the automated process of multi-chain detection and building we can generate a set of 10 structures with a single command:

idpconfgen ldrs \
    -db <PATH TO DATABASE.JSON> \
    -seq D1D2.fasta \
    -nc 10 \
    -fld 1B34.pdb \
    --dloop-off \
    --dany \
    -of ./D1D2_ldrs_ANY_faspr \
    -n

The ldrs subclient will automatically detect all IDRs and their corresponding chains based on sequence similarity and mismatches between the primary sequence in the .fasta file and the PDB file of the folded domain. This command took approximately an hour on a single workstation with 64 GB DDR4 RAM and 10 CPU threads (-n 10) clocked at 3.0 GHz.

To check your outputs against what is expected for this tutorial section, please click here and download the archive named d1d2_complex_ldrs_example.zip.

Exploring IDPConfGen Analysis Functions¶

Our vision for IDPConformerGenerator as a platform includes the analysis of your database and the PDBs generated by IDPConfGen. To get started, the stats subclient is a quick way to check how many hits for different sequence fragment matches you will find in the database for your protein system of choice. It is also possible to include different secondary structure filters as well as amino-acid substitutions to get a more accurate representation of the number of hits in the database for your system:

idpconfgen stats \
    -db idpconfgen_database.json \
    -seq drksh3.fasta \
    --dloop-off \
    --dany \
    -op drk_any \
    -of ./drk_any_dbStats

Another tool to investigate the database is the search subclient. To use this, you will need a tarball or folder of raw PDBs acquired from the fetch subclient. The search function goes through the PDB headers to find keywords of your choice and returns the number of hits and their associated PDBIDs in .JSON format:

idpconfgen fetch \
    ../cull100 \
    -d ./cull100pdbs/ \
    -u \
    -n

idpconfgen search \
    -fpdb ./cull100pdbs/ \
    -kw 'thermococcus,pro,beta' \
    -n

After generating conformer ensembles with IDPConfGen, it is possible to do some basic plotting with the integrated plotting flags in the torsions and sscalc subclients. For torsions, you can choose to plot either omega, phi, or psi dihedral angle distributions in a scatter plot format. For sscalc, fractional secondary structure will be plotted in terms of DSSP codes as well as fractions from the alpha, beta, or other regions of the Ramachandran space for your conformers of choice. The following example plots the psi angle distributions and the fractional secondary structure of the drk_CSSSd2D_nosub_mcsce ensemble generated in the previous module:

idpconfgen torsions \
    ./drk_CSSSd2D_nosub_mcsce \
    -deg \
    -n \
    --plot angtype=psi xlabel=drk_residues

To plot the fractional Ramachandran space information:

idpconfgen torsions \
    ./drk_CSSSd2D_nosub_mcsce \
    -deg \
    -n \
    --ramaplot filename=fracDrkRama.png colors=['o', 'b', 'k']

To plot the fractional secondary structure information:

idpconfgen sscalc \
    ./drk_CSSSd2D_nosub_mcsce \
    -u \
    -rd \
    -n \
    --plot filename=dssp_reduced_drk_.png

To see which plotting parameters can be modified, please refer to src/idpconfgen/plotfuncs.py. We have given a short list of modifiable parameters here:

--plot title=<TITLE> title_fs=<TITLE FONT SIZE> xlabel=<X-AXIS LABEL> xlabel_fs=<X-AXIS LABEL FONT SIZE> colors=<LIST_OF_COLORS>

Exploring MCSCE and Int2Cart Integrations¶

Integrating the functions from our collaborators at the Head-Gordon Lab, IDPConformerGenerator has the ability to build with bond geometries derived from a recurrent neural network machine learning model Int2Cart. Furthermore, as we introduced the MCSCE method for building sidechains in the previous modules, we would like to provide some examples on changing the default sidechain settings.

To use the Int2Cart method for bond geometries, the --bgeo-strategy flag needs to be defined with int2cart during the building stage:

idpconfgen build \
    -db idpconfgen_database.json \
    -seq drksh3.fasta \
    -etbb 100 \
    -etss 250 \
    -nc 100 \
    -csss csss_drk_d2D.json \
    --dloop-off \
    -et 'pairs' \
    -scm mcsce \
    --bgeo-strategy int2cart \
    -of ./drk_CSSSd2D_nosub_int2cart_mcsce \
    -n

To change the number of trials for MCSCE to optimize success rate and overall speed:

idpconfgen build \
    -db idpconfgen_database.json \
    -seq drksh3.fasta \
    -etbb 100 \
    -etss 250 \
    -nc 100 \
    -csss csss_drk_d2D.json \
    --dloop-off \
    -et 'pairs' \
    -scm mcsce \
    --mcsce-n_trials 64 \
    -of ./drk_CSSSd2D_nosub_64_trials_mcsce \
    -n

Using MCSCE to Add Post-Translational Modifications (PTMs)¶

MCSCE can be used stand-alone if installed correctly. As described in the Bioinformatics publication, MCSCE has the ability to add all-atom PTMs (phosphorylation, methylation, N6-carboxylysine, and hydroxylation) on select residues.

The residue names will have to be changed manually or by using a script. Please refer to the MCSCE readme.rst for the alternative residue names that will be recognized to be PTM’d.

It is important to generate backbone-only conformers by using the -dsd flag in either the build or ldrs modules, followed by independent sidechain packing with MCSCE.

If you generated your conformers using ldrs where a folded domain is fixed, you can fix the sidechains of the template residues so that only the IDR backbones will have sidechain packing by providing the folded domain boundaries using the --fix flag in MCSCE. The energy of the entire structure will, however, be taken into consideration within MCSCE to ensure no steric clashes.

How to Efficiently Set Jobs up for HPC Clusters¶

Using the sethpc subclient, users can generate bash scripts for SLURM managed systems. Due to the architecture of Python’s multiprocessing module, IDPConformerGenerator is unable to utilize the resources of multiple nodes on HPC clusters. However, with sethpc, users are able to request multiple nodes per job and sethpc will automatically generate the SBATCH scripts needed, along with an all*.sh and cancel*.sh script to run/cancel all of the jobs generated with ease.

Please note that on many HPC resources (such as Graham) your queuing priority will not change requesting 5 nodes per job or 1 node per 5 jobs, but this should be confirmed.

If multiple nodes are requested, at the end of all jobs, the merge subclient can be run to merge all of the conformers generated into one folder with the option to modify the naming pattern for each structure. Please see below for an example of running sethpc and merge.

To request 3 nodes to generate 512,000 structures of the unfolded state of the drkN SH3 domain with 10 hours per node:

idpconfgen sethpc \
    -des ./drk_hpc_jobs/ \
    --account def-username \
    --job-name drk_hpc \
    --nodes 3 \
    --ntasks-per-node 32 \
    --mem 16g \
    --time-per-node 0-10:00:00 \
    --mail-user your@email.com \
    -db idpconfgen_database.json \
    -seq drksh3.fasta \
    -etbb 100 \
    -etss 250 \
    -nc 512000 \
    -csss csss_drk_d2D.json \
    --dloop-off \
    -et 'pairs' \
    -scm mcsce \
    --bgeo-strategy int2cart \
    -of /scratch/user/drk/ \
    -n 32 \
    -rs 12

To merge all of the folders created by the multi-node jobs:

idpconfgen merge \
    -tgt /scratch/user/drk/ \
    -des /scratch/user/drk/drk_CSSSd2D_nosub_multiple_mcsce \
    -pre drk_confs \
    -del

Using IDPConfgen as Python library¶

To use IDPConformerGenerator in your project, import it as a library:

import idpconfgen

From within the Python prompt you can get information on each module, class, and function with help(idpconfgen). You can also access the whole API documentation here at the reference page.
