This tutorial aims to produce a density-based clustering analysis of a cMD trajectory, employing as features the first 3 PCA vectors that are obtained from the GROMACS tools.
It requires a cMD *xtc trajectory file, a *tpr and a *gro file. In the first part, the trajectory is corrected with gmx trjconv and the PCA vectors are calculated for the protein mainchain. Then, the pca_dbscan_gmm.py uses DBSCAN to identify outliers and performs a clustering analysis with the Gaussian mixture models. Through a kernel density-based method, the script is also able to identify the PCA vectors that correspond to the highest density of frames in a cluster, and output the 5 frames that are closest to the identified point. Then, the frames are extracted from the trajectory to separate *.gro files.
I - Trajectory correction and extraction of the PCA vectors
We start by correcting the cMD trajectory using trjconv (this assumes a truncated octahedron box, change the routine according to your box type):
`# Make a *ndx selection with the region of interest for the analysis.`
`# In this case we can use the mainchain of protein residues within 10 angstrom from the catalytic residues.`
gmx make_ndx -f ref.gro -o act.ndx
`# Correct the PBC of the octahedron box.`
echo 1 0 | gmx trjconv -f trajectory.xtc -s ref.tpr -ur compact -pbc mol -center -o trajectory_pbc.xtc
`# Fit the trajectory relative to the previously created selection.`
echo 27 0 | gmx trjconv -f trajectory_pbc.xtc -s ref.gro -n act.ndx -fit rot+trans -o trajectory_fit.xtc
Then we extract the PCA vectors from the corrected trajectory:
`# Covariance analysis to extract the eigenvectors from the cMD trajectory.`
echo 27 27 | gmx covar -f trajectory_fit.xtc -s ref.gro -n act.ndx -ascii -v eigenvec.trr -last 3
`# Print the resulting PCA vectors to a pdb file.`
echo 27 27 | gmx anaeig -f trajectory_fit.xtc -s ref.gro -n act.ndx -v eigenvec.trr -3d pc.pdb -last 3
Clean up the pc.pdb file to include only the PCA vectors:
cat pc.pdb | head -n -2 | tail -n +6 | awk '{print $6,$7,$8}' > temp && mv temp pc.pdb
II - Clustering of PCA vectors and identification of representative frames
Now we run the pca_dbscan_gmm.py script to obtain the clusters and the representative frames. The pca_dbscan_gmm.py script has the following usage:
python pca_dbscan_gmm.py <data_file> <eps> <min_samples> <n_components>
The <data_file> should be the processed pc.pdb file, <eps> and <min_samples> define the parameters for outlier identification using the DBSCAN method, and <n_components> defines the number of clusters in the Gaussian Mixture Models clustering. The script produces a 3D plot of the PCA vectors, where the outliers are represented as black markers, the frames closest to the highest density points as white markers, and each cluster displays a different color. Additionally, the density distribution curves of each cluster are plotted against each PCA vector, with markers representing the identified frames. Initially try different <eps> and <min_samples> values to see which and how many frames are being identified as outliers. Once you have an adequate number of outliers, try different <n_components> values to identify which number of clusters is more suitable. Also take a look at the kernel density plots to see if the density distributions have a regular shape, and the identified frames lie close to highest density points.
Number of DBSCAN outliers: 29
Total number of clusters (GMM): 4
Cluster 0: 595 frames
Top 5 closest frames for Cluster 0: [ 578 721 681 647 1544]
Cluster 1: 1198 frames
Top 5 closest frames for Cluster 1: [1232 1380 1293 1919 1708]
Cluster 2: 463 frames
Top 5 closest frames for Cluster 2: [114 69 68 64 67]
Cluster 3: 215 frames
Top 5 closest frames for Cluster 3: [2015 2076 2050 2052 2054]
![](kernel_density_plot.png)
A clusters.csv file is outputed with the cluster numbers that each frame corresponds to (outliers belong in the -1 cluster). A frames.dat is ouputed with the top 5 frames that are closest to the highest density point of each cluster.
III - Frame extraction
Use the extract_highdens.py script to extract the identified frames from the trajectory. The extract_highdens.py script usage follows:
python extract_highdens.py <xtc_file> <gro_file> <cluster_indices_file> <output_prefix>