Tutorial for a 3D Clustering analysis using PCA vectors from a cMD trajectory

This tutorial aims to produce a density-based clustering analysis of a cMD trajectory, employing as features the first 3 PCA vectors that are obtained from the GROMACS tools.

It requires a cMD *xtc trajectory file, a *tpr and a *gro file. In the first part, the trajectory is corrected with gmx trjconv and the PCA vectors are calculated for the protein mainchain. Then, the pca_dbscan_gmm.py uses DBSCAN to identify outliers and performs a clustering analysis with the Gaussian mixture models. Through a kernel density-based method, the script is also able to identify the PCA vectors that correspond to the highest density of frames in a cluster, and output the 5 frames that are closest to the identified point. Then, the frames are extracted from the trajectory to separate *.gro files.

I - Trajectory correction and extraction of the PCA vectors

We start by correcting the cMD trajectory using trjconv (this assumes a truncated octahedron box, change the routine according to your box type):

# Make a *ndx selection with the region of interest for the analysis.
# In this case we can use the mainchain of protein residues within 10 angstrom from the catalytic residues.
gmx make_ndx -f ref.gro -o act.ndx

# Correct the PBC of the octahedron box.
echo 1 0 | gmx trjconv -f trajectory.xtc -s ref.tpr -ur compact -pbc mol -center -o trajectory_pbc.xtc

# Fit the trajectory relative to the previously created selection.
echo 27 0 | gmx trjconv -f trajectory_pbc.xtc -s ref.gro -n act.ndx -fit rot+trans -o trajectory_fit.xtc

Then we extract the PCA vectors from the corrected trajectory:

# Covariance analysis to extract the eigenvectors from the cMD trajectory.
echo 27 27 | gmx covar -f trajectory_fit.xtc -s ref.gro -n act.ndx -ascii -v eigenvec.trr -last 3 

# Print the resulting PCA vectors to a pdb file.
echo 27 27 | gmx anaeig -f trajectory_fit.xtc -s ref.gro -n act.ndx -v eigenvec.trr -3d pc.pdb -last 3

Clean up the pc.pdb file to include only the PCA vectors:

cat pc.pdb | head -n -2 | tail -n +6 | awk '{print $6,$7,$8}' > temp && mv temp pc.pdb

II - Clustering of PCA vectors and identification of representative frames

Now we run the pca_dbscan_gmm.py script to obtain the clusters and the representative frames. The pca_dbscan_gmm.py script has the following usage:

python pca_dbscan_gmm.py <data_file> <eps> <min_samples> <n_components>

The <data_file> should be the processed pc.pdb file, <eps> and <min_samples> define the parameters for outlier identification using the DBSCAN method, and <n_components> defines the number of clusters in the Gaussian Mixture Models clustering. The script produces a 3D plot of the PCA vectors, where the outliers are represented as black markers, the frames closest to the highest density points as white markers, and each cluster displays a different color. Additionally, the density distribution curves of each cluster are plotted against each PCA vector, with markers representing the identified frames. Initially try different <eps> and <min_samples> values to see which and how many frames are being identified as outliers. Once you have an adequate number of outliers, try different <n_components> values to identify which number of clusters is more suitable. Also take a look at the kernel density plots to see if the density distributions have a regular shape, and the identified frames lie close to highest density points.

Number of DBSCAN outliers: 29
Total number of clusters (GMM): 4
Cluster 0: 595 frames
Top 5 closest frames for Cluster 0: [ 578  721  681  647 1544]
Cluster 1: 1198 frames
Top 5 closest frames for Cluster 1: [1232 1380 1293 1919 1708]
Cluster 2: 463 frames
Top 5 closest frames for Cluster 2: [114  69  68  64  67]
Cluster 3: 215 frames
Top 5 closest frames for Cluster 3: [2015 2076 2050 2052 2054]

A clusters.csv file is outputed with the cluster numbers that each frame corresponds to (outliers belong in the -1 cluster). A frames.dat is ouputed with the top 5 frames that are closest to the highest density point of each cluster.

III - Frame extraction

Use the extract_highdens.py script to extract the identified frames from the trajectory. The extract_highdens.py script usage follows:

python extract_highdens.py <xtc_file> <gro_file> <cluster_indices_file> <output_prefix>