Water Clustering
Determination of conserved water positions based on clustering of oxygen atoms.
Overview of WaterClustering class
- class ConservedWaterSearch.water_clustering.WaterClustering(nsnaps: int, clustering_algorithm: str = 'OPTICS', water_types_to_find: tuple[str] | list[str] = ('FCW', 'HCW', 'WCW'), restart_after_found: bool = False, min_samples: list[int] | None = None, xis: tuple[float] | list[float] = (0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 1e-05), numbpct_oxygen: float = 0.8, normalize_orientations: bool = True, numbpct_hyd_orient_analysis: float = 0.85, kmeans_ang_cutoff: float = 120, kmeans_inertia_cutoff: float = 0.4, FCW_angdiff_cutoff: float = 5, FCW_angstd_cutoff: float = 17, other_waters_hyd_minsamp_pct: float = 0.15, nonFCW_angdiff_cutoff: float = 15, HCW_angstd_cutoff: float = 17, WCW_angstd_cutoff: float = 20, weakly_explained: float = 0.7, xiFCW: tuple[float] | list[float] = (0.03,), xiHCW: tuple[float] | list[float] = (0.05, 0.01), xiWCW: tuple[float] | list[float] = (0.05, 0.001), njobs: int = 1, verbose: int = 0, debugO: int = 0, debugH: int = 0, plotend: bool = False, plotreach: bool = False, restart_data_file: str | None = None, output_file: str | None = None)[source]
Bases: object
Class for performing water clustering.
First, oxygens are clustered using OPTICS or HDBSCAN, followed by analysis of orientations for classification of waters into one of 3 proposed conserved water types (for more information see Theory, Background, and Methods):
FCW (Fully Conserved Water): hydrogens are strongly oriented in two directions with an angle of 104.5°
HCW (Half Conserved Water): one set (cluster) of hydrogens is oriented in a single direction and the other hydrogen’s orientations are spread over different directions with an angle of 104.5°
WCW (Weakly Conserved Water): several orientation combinations exist with satisfying water angles
To run the calculation use either the multi_stage_reclustering() function to start the Multi Stage ReClustering (MSRC) procedure or single_clustering() to start a single clustering (SC) procedure. MSRC produces better results at the cost of computational time, while SC is very quick but gives worse results, and a significant number of waters might not be identified at all. For more details see [TFJB22].
- __init__(nsnaps: int, clustering_algorithm: str = 'OPTICS', water_types_to_find: tuple[str] | list[str] = ('FCW', 'HCW', 'WCW'), restart_after_found: bool = False, min_samples: list[int] | None = None, xis: tuple[float] | list[float] = (0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 1e-05), numbpct_oxygen: float = 0.8, normalize_orientations: bool = True, numbpct_hyd_orient_analysis: float = 0.85, kmeans_ang_cutoff: float = 120, kmeans_inertia_cutoff: float = 0.4, FCW_angdiff_cutoff: float = 5, FCW_angstd_cutoff: float = 17, other_waters_hyd_minsamp_pct: float = 0.15, nonFCW_angdiff_cutoff: float = 15, HCW_angstd_cutoff: float = 17, WCW_angstd_cutoff: float = 20, weakly_explained: float = 0.7, xiFCW: tuple[float] | list[float] = (0.03,), xiHCW: tuple[float] | list[float] = (0.05, 0.01), xiWCW: tuple[float] | list[float] = (0.05, 0.001), njobs: int = 1, verbose: int = 0, debugO: int = 0, debugH: int = 0, plotend: bool = False, plotreach: bool = False, restart_data_file: str | None = None, output_file: str | None = None) None[source]
Initialise WaterClustering class.
The input parameters determine the options for oxygen clustering and hydrogen orientation analysis if applicable.
- Parameters:
nsnaps (int) – Number of trajectory snapshots related to the data set.
clustering_algorithm (str, optional) – Options are “OPTICS” or “HDBSCAN”. OPTICS provides slightly better results, but is also slightly slower. Defaults to “OPTICS”.
water_types_to_find (tuple[str], optional) – Defines which water types to search for. Any combination of “FCW”, “HCW” and “WCW” is allowed, or “onlyO” for oxygen clustering only. Defaults to (“FCW”, “HCW”, “WCW”).
restart_after_found (bool, optional) – If True, restarts clustering after each water is found. False gives the quick version of the multi-stage reclustering approach. Defaults to False.
min_samples (list[int], optional) – List of minimum samples for OPTICS or HDBSCAN. If None, the range [int(0.25 * nsnaps), nsnaps] is used. For single clustering, users should provide a single integer between 0 and nsnaps in a list. Defaults to None.
xis (tuple[float], optional) – List or tuple of xis for OPTICS clustering. This is ignored for HDBSCAN. Defaults to (0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00001). For single clustering, users should provide a single float between 0 and 1 in a list/tuple.
numbpct_oxygen (float, optional) – Fraction of nsnaps required for an oxygen cluster to be considered valid and the water conserved. The check is enforced on the lower limit nsnaps * numbpct_oxygen as well as the upper limit nsnaps * (2-numbpct_oxygen). Defaults to 0.8.
normalize_orientations (bool, optional) – Whether orientations should be normalised to unit length. Defaults to True.
numbpct_hyd_orient_analysis (float, optional) – Minimum allowed size of the hydrogen orientation cluster. Defaults to 0.85.
kmeans_ang_cutoff (float, optional) – Maximum value of angle (in deg) allowed for FCW in kmeans clustering to be considered correct water angle. Defaults to 120.
kmeans_inertia_cutoff (float, optional) – Upper limit allowed on kmeans inertia (a measure of the spread of data in a cluster). Defaults to 0.4.
FCW_angdiff_cutoff (float, optional) – Maximum value of angle (in deg) allowed for FCW in OPTICS/HDBSCAN clustering to be considered correct water angle. Defaults to 5.
FCW_angstd_cutoff (float, optional) – Maximal standard deviation of angle distribution of orientations of two hydrogens allowed for water to be considered FCW. Defaults to 17.
other_waters_hyd_minsamp_pct (float, optional) – Minimum samples to choose for OPTICS or HDBSCAN clustering as percentage of number of water molecules considered for HCW and WCW. Defaults to 0.15.
nonFCW_angdiff_cutoff (float, optional) – Maximum standard deviation of angle allowed for HCW and WCW to be considered correct water angle. Defaults to 15.
HCW_angstd_cutoff (float, optional) – Maximum standard deviation cutoff for HCW angles to be considered correct water angles. Defaults to 17.
WCW_angstd_cutoff (float, optional) – Maximum standard deviation cutoff for WCW angles to be considered correct water angles. Defaults to 20.
weakly_explained (float, optional) – Fraction of explained hydrogen orientations required for a water to be considered WCW. Defaults to 0.7.
xiFCW (tuple[float], optional) – Xi value for hydrogen clustering of FCWs for OPTICS algorithm. Avoid changing the defaults if possible. Defaults to (0.03,).
xiHCW (tuple[float], optional) – Xi value for OPTICS clustering for HCW. Avoid changing the defaults if possible. Defaults to (0.05, 0.01).
xiWCW (tuple[float], optional) – Xi value for OPTICS clustering for WCW. Avoid changing the defaults if possible. Defaults to (0.05, 0.001).
njobs (int, optional) – How many CPU cores to use for clustering. Defaults to 1.
verbose (int, optional) – Verbosity of output. Defaults to 0.
debugO (int, optional) – Debug level for oxygen clustering. Defaults to 0.
debugH (int, optional) – Debug level for orientations. Defaults to 0.
plotend (bool, optional) – Whether to plot everything at the end of the run. Defaults to False.
plotreach (bool, optional) – Whether to plot the reachability plot for OPTICS when debugging. Defaults to False.
restart_data_file (str, optional) – Restart data file. If None, restarting is not possible and no restart file is generated. Both restart_data_file and output_file have to be provided for clustering restarting. Defaults to None.
output_file (str | None, optional) – If None, results are not saved to a file. If a string is provided, results (including temporary results) are saved to a file with that name. Both restart_data_file and output_file have to be provided for clustering restarting. Defaults to None.
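The size check described for numbpct_oxygen and the default min_samples range can be sketched in a few lines of plain Python. This is only an illustration of the documented limits, not the library's implementation: the helper names oxygen_cluster_size_ok and default_min_samples are hypothetical, and treating both limits as inclusive is an assumption.

```python
def oxygen_cluster_size_ok(cluster_size, nsnaps, numbpct_oxygen=0.8):
    """Check a cluster size against the documented lower and upper limits.

    Assumption: both limits are inclusive.
    """
    lower = nsnaps * numbpct_oxygen
    upper = nsnaps * (2 - numbpct_oxygen)
    return lower <= cluster_size <= upper


def default_min_samples(nsnaps):
    """End points of the range documented for min_samples=None (hypothetical helper)."""
    return int(0.25 * nsnaps), nsnaps


# With 200 snapshots and the default 0.8, the limits are roughly 160 and 240.
print(oxygen_cluster_size_ok(170, nsnaps=200))  # inside the limits
print(oxygen_cluster_size_ok(150, nsnaps=200))  # too few members
print(oxygen_cluster_size_ok(250, nsnaps=200))  # too many members
print(default_min_samples(200))
```

The upper limit guards against two nearby water sites being merged into a single oxygen cluster, which would otherwise contain more members than there are snapshots.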
- run(oxygen_positions, hydrogen1_positions, hydrogen2_positions)[source]
Run water clustering.
Results will be stored in self.water_clusters.
- Parameters:
oxygen_positions (np.ndarray) – Oxygen coordinates.
hydrogen1_positions (np.ndarray) – Hydrogen 1 orientations.
hydrogen2_positions (np.ndarray) – Hydrogen 2 orientations.
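The hydrogen arguments are described as orientations rather than absolute positions. Assuming that an orientation is the vector pointing from the oxygen to the hydrogen, optionally scaled to unit length (as the normalize_orientations option suggests), the following standard-library sketch shows how such vectors and the angle between them could be computed for one water molecule. The helper names and the example coordinates are illustrative and not part of the package.

```python
import math


def orientation(oxygen, hydrogen, normalize=True):
    """Vector from oxygen to hydrogen (assumed meaning of 'orientation')."""
    vec = [h - o for o, h in zip(oxygen, hydrogen)]
    if normalize:
        norm = math.sqrt(sum(c * c for c in vec))
        vec = [c / norm for c in vec]
    return vec


def angle_deg(u, v):
    """Angle between two vectors in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos_angle = max(-1.0, min(1.0, dot / norms))  # clamp against rounding error
    return math.degrees(math.acos(cos_angle))


# Illustrative water molecule: O-H distance 0.9572, H-O-H angle 104.5 degrees.
oxygen = [1.0, 2.0, 3.0]
theta = math.radians(104.5)
hydrogen1 = [1.0 + 0.9572, 2.0, 3.0]
hydrogen2 = [1.0 + 0.9572 * math.cos(theta), 2.0 + 0.9572 * math.sin(theta), 3.0]

h1_orientation = orientation(oxygen, hydrogen1)
h2_orientation = orientation(oxygen, hydrogen2)
hoh_angle = angle_deg(h1_orientation, h2_orientation)
print(round(hoh_angle, 1))
```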
- multi_stage_reclustering(Odata: ndarray, H1: ndarray | None, H2: ndarray | None, clustering_algorithm: str = 'OPTICS', lower_minsamp_pct: float = 0.25, every_minsamp: int = 1, xis: list[float] | None = None, whichH: list[str] | None = None) None[source]
Multi Stage ReClustering (MSRC) procedure.
Main loop - loops over water clustering parameter space (minsamp and xi) and clusters oxygens first - if a clustering with satisfactory oxygen clustering and hydrogen orientation clustering (optional) is found, elements of that water cluster are removed from the data set and water clustering starts from the beginning. Loops until no satisfactory clusterings are found. For more details see [TFJB22].
- Parameters:
Odata (np.ndarray) – Oxygen coordinates.
H1 (np.ndarray | None) – Hydrogen 1 orientations. If None, whichH must be “onlyO”.
H2 (np.ndarray | None) – Hydrogen 2 orientations. If None, whichH must be “onlyO”.
clustering_algorithm (str, optional) – Options are “OPTICS” or “HDBSCAN”. OPTICS provides slightly better results, but is also slightly slower. Defaults to “OPTICS”.
lower_minsamp_pct (float, optional) – Lowest minsamp value used for clustering. The range is from nsnaps to lower_minsamp_pct times nsnaps. Defaults to 0.25.
every_minsamp (int, optional) – Step for sampling of minsamp in the range from nsnaps to lower_minsamp_pct times nsnaps. If 1, all integer values in the range are used. Defaults to 1.
xis (list[float], optional) – List of xis for OPTICS clustering. This is ignored for HDBSCAN. Defaults to [0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00001].
whichH (list[str], optional) – Defines which water types to search for. Any combination of “FCW”, “HCW” and “WCW” is allowed, or “onlyO” for oxygen clustering only. Defaults to [“FCW”, “HCW”, “WCW”].
- quick_multi_stage_reclustering(Odata: ndarray, H1: ndarray | None, H2: ndarray | None, clustering_algorithm: str = 'OPTICS', lower_minsamp_pct: float = 0.25, every_minsamp: int = 1, xis: list[float] | None = None, whichH: list[str] | None = None) None[source]
Quick Multi Stage ReClustering (QMSRC) procedure.
Main loop - loops over water clustering parameter space (minsamp and xi) and clusters oxygens first - when clusters with satisfactory oxygen clustering and hydrogen orientation clustering (optional) are found, the elements of those water clusters are added to the list of conserved waters. The data for those waters is removed from the data set but clustering does not restart.
- Parameters:
Odata (np.ndarray) – Oxygen coordinates.
H1 (np.ndarray | None) – Hydrogen 1 orientations. If None, whichH must be “onlyO”.
H2 (np.ndarray | None) – Hydrogen 2 orientations. If None, whichH must be “onlyO”.
clustering_algorithm (str, optional) – Options are “OPTICS” or “HDBSCAN”. OPTICS provides slightly better results, but is also slightly slower. Defaults to “OPTICS”.
lower_minsamp_pct (float, optional) – Lowest minsamp value used for clustering. The range is from nsnaps to lower_minsamp_pct times nsnaps. Defaults to 0.25.
every_minsamp (int, optional) – Step for sampling of minsamp in the range from nsnaps to lower_minsamp_pct times nsnaps. If 1, all integer values in the range are used. Defaults to 1.
xis (list[float], optional) – List of xis for OPTICS clustering. This is ignored for HDBSCAN. Defaults to [0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00001].
whichH (list[str], optional) – Defines which water types to search for. Any combination of “FCW”, “HCW” and “WCW” is allowed, or “onlyO” for oxygen clustering only. Defaults to [“FCW”, “HCW”, “WCW”].
- single_clustering(Odata: ndarray, H1: ndarray | None, H2: ndarray | None, clustering_algorithm: str = 'OPTICS', minsamp: int | None = None, xi: float | None = None, whichH: list[str] | None = None) None[source]
Single clustering procedure.
In the single clustering procedure, oxygen clustering is run only once with the given minsamp and xi (if applicable - only for OPTICS).
- Parameters:
Odata (np.ndarray) – Oxygen coordinates.
H1 (np.ndarray | None) – Hydrogen 1 orientations. If None, whichH must be “onlyO”.
H2 (np.ndarray | None) – Hydrogen 2 orientations. If None, whichH must be “onlyO”.
clustering_algorithm (str, optional) – Options are “OPTICS” or “HDBSCAN”. OPTICS provides slightly better results, but is also slightly slower. Defaults to “OPTICS”.
minsamp (int | None, optional) – Minimum samples parameter for OPTICS or HDBSCAN. If None, numbpct_oxygen * nsnaps is used. Defaults to None.
xi (float | None, optional) – Xi value for OPTICS. If None, a value of 0.05 is used. If clustering_algorithm is HDBSCAN, it is ignored. Defaults to None.
whichH (list[str], optional) – Defines which water types to search for. Any combination of “FCW”, “HCW” and “WCW” is allowed, or “onlyO” for oxygen clustering only. Defaults to [“FCW”, “HCW”, “WCW”].
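The fallback values documented for minsamp and xi can be summarised as a small helper. This is a sketch of the documented defaults only: resolve_single_clustering_params is a hypothetical name, and rounding numbpct_oxygen * nsnaps to the nearest integer is an assumption, since the text does not say how the product is converted to an integer.

```python
def resolve_single_clustering_params(
    nsnaps,
    numbpct_oxygen=0.8,
    clustering_algorithm="OPTICS",
    minsamp=None,
    xi=None,
):
    """Apply the documented fallbacks for minsamp and xi (hypothetical helper)."""
    if minsamp is None:
        # Assumption: the product is rounded to the nearest integer.
        minsamp = round(numbpct_oxygen * nsnaps)
    if clustering_algorithm == "HDBSCAN":
        xi = None  # xi is ignored for HDBSCAN
    elif xi is None:
        xi = 0.05  # documented default for OPTICS
    return minsamp, xi


optics_defaults = resolve_single_clustering_params(nsnaps=100)
hdbscan_params = resolve_single_clustering_params(
    nsnaps=100, clustering_algorithm="HDBSCAN", minsamp=60, xi=0.01
)
print(optics_defaults)
print(hdbscan_params)
```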
- save_results(file_name: str) None[source]
Saves clustering results and parameters to a file.
Top of the results file contains clustering parameters after which results are saved in the same file.
- Parameters:
file_name (str) – File name to save results to.
- restart_cluster(partial_results_file: str, partial_data_file: str) None[source]
Read the options and results and restart the clustering procedure.
- read_and_set_water_clust_options(file_name: str) None[source]
Reads clustering options from file.
Reads all class clustering options from save file and sets the parameters. Reads all parameters except clustering protocol and protocol parameters.
- Parameters:
file_name (str) – Results or partial results file from which procedure parameters will be read.
- classmethod create_from_file(file_name: str) WaterClustering[source]
Create a WaterClustering class from a file.
- Parameters:
file_name (str) – Results or partial results file from which procedure parameters will be read.
- Returns:
Creates an instance of the WaterClustering class by reading options from a file.
- classmethod create_from_files_and_restart(partial_output: str, partial_data_file: str) WaterClustering[source]
Create a WaterClustering from file and restart the procedure.
- Parameters:
- Returns:
Creates an instance of the WaterClustering class and restarts clustering.
- visualise_pymol(aligned_protein: str | None = None, output_file: str | None = None, active_site_ids: list[int] | None = None, crystal_waters: str | None = None, ligand_resname: str | None = None, dist: float = 10.0, density_map: str | None = None, polar_contacts: bool = False, lunch_pymol: bool = True, reinitialize: bool = True) None[source]
Visualise results using pymol.
- Parameters:
aligned_protein (str, optional) – File name containing the protein configuration the trajectory was aligned to. If None, only waters are shown. Defaults to None.
output_file (str | None, optional) – File to save the visualisation state. If None, a pymol session is started (this probably doesn’t work on Mac OSX). Defaults to None.
active_site_ids (list[int] | None, optional) – Residue ids - numbers of amino acids in the active site. These are visualised as licorice. Defaults to None.
crystal_waters (str | None, optional) – PDB ID from which extraction of crystal waters will be attempted. Defaults to None.
ligand_resname (str | None, optional) – Residue name of the ligand around which crystal waters (oxygens) shall be selected. Defaults to None.
dist (float) – Distance from the centre of the ligand around which crystal waters shall be selected. Defaults to 10.0.
density_map (str | None, optional) – Water density map to add to visualisation session (usually .dx file). Defaults to None.
polar_contacts (bool, optional) – If True, polar contacts between waters and protein will be visualised. Defaults to False.
lunch_pymol (bool, optional) – If True, pymol will be launched in interactive mode. If False, pymol will be imported without launching. Defaults to True.
reinitialize (bool, optional) – If True pymol will be reinitialized (defaults restored and objects cleaned). Defaults to True.
- visualise_nglview(aligned_protein: str | None = None, active_site_ids: list[int] | None = None, crystal_waters: str | None = None, density_map: str | None = None) NGLWidget[source]
Visualise the results using nglview.
nglview can be used to visualise the results of the clustering procedure. We recommend using pymol visualisation as it is more informative and provides more options.
- Parameters:
aligned_protein (str, optional) – File containing the protein configuration the original trajectory was aligned to. Defaults to None.
active_site_ids (list[int] | None, optional) – Residue ids - numbers of amino acids in the active site. These are visualised as licorice. Defaults to None.
crystal_waters (str | None, optional) – PDB ID from which extraction of crystal waters will be attempted. Defaults to None.
density_map (str | None, optional) – Water density map to add to visualisation session (usually .dx file). Defaults to None.
- Returns:
Returns an nglview widget instance which can be run in IPython/Jupyter to create a visualisation instance.
- Return type:
NGLWidget
- property water_type: list[str]
List containing conserved water type classifications.
Contains conserved water type classifications in the same order as coordinates in waterO, waterH1 and waterH2. Water types:
FCW (Fully Conserved Water): hydrogens are strongly oriented in two directions with an angle of 104.5°
HCW (Half Conserved Water): one set (cluster) of hydrogens is oriented in certain directions and the others are spread over different orientations with an angle of 104.5°
WCW (Weakly Conserved Water): several orientation combinations exist with satisfying water angles
For more information see [TFJB22] and Theory, Background, and Methods.
- property waterO: list[ndarray]
Oxygen coordinates of water molecules classified using clustering.
- Returns:
Returns a list of 3D xyz coordinates of oxygen positions in space
- Return type:
list[np.ndarray]
- property waterH1: list[ndarray]
Coordinates of first Hydrogen atom of water molecules from clustering.
- Returns:
Returns a list of 3D xyz coordinates of first hydrogens’ positions in space
- Return type:
list[np.ndarray]
- property waterH2: list[ndarray]
Coordinates of second Hydrogen atom of water molecules from clustering.
- Returns:
Returns a list of 3D xyz coordinates of second hydrogens’ positions in space
- Return type:
list[np.ndarray]
- property water_clusters: list[dict]
A single list containing main results.
List of dicts containing coordinates of oxygen and two hydrogens and water classification. Each element in the list is a dictionary that contains keys “O”, “H1”, “H2” and “type” which correspond to oxygen coordinates, hydrogen 1 coordinates, hydrogen 2 coordinates and water classification respectively.
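Because each entry is a plain dictionary with the keys “O”, “H1”, “H2” and “type”, the results can be processed with ordinary Python. The snippet below works on a hand-written list that mimics the documented layout; the coordinate values are made up for illustration and do not come from any real clustering run.

```python
from collections import Counter

# Made-up entries following the documented key layout.
water_clusters = [
    {"O": [1.0, 2.0, 3.0], "H1": [1.8, 2.0, 3.0], "H2": [0.8, 2.8, 3.0], "type": "FCW"},
    {"O": [4.0, 5.0, 6.0], "H1": [4.8, 5.0, 6.0], "H2": [3.8, 5.8, 6.0], "type": "HCW"},
    {"O": [7.0, 8.0, 9.0], "H1": [7.8, 8.0, 9.0], "H2": [6.8, 8.8, 9.0], "type": "FCW"},
]

# Per-key lists in the same order as the entries.
oxygens = [water["O"] for water in water_clusters]
water_types = [water["type"] for water in water_clusters]

# How many waters of each type were found.
type_counts = Counter(water_types)

# Oxygen positions of fully conserved waters only.
fcw_oxygens = [water["O"] for water in water_clusters if water["type"] == "FCW"]

print(water_types)
print(type_counts["FCW"], type_counts["HCW"])
print(fcw_oxygens)
```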