Water Clustering

Determination of conserved water positions based on clustering of oxygen atoms.

Overview of WaterClustering class

ConservedWaterSearch.water_clustering.WaterClustering

Class for performing water clustering.

ConservedWaterSearch.water_clustering.WaterClustering.__init__

Initialise WaterClustering class.

ConservedWaterSearch.water_clustering.WaterClustering.run

Run water clustering.

ConservedWaterSearch.water_clustering.WaterClustering.multi_stage_reclustering

Multi Stage ReClustering (MSRC) procedure.

ConservedWaterSearch.water_clustering.WaterClustering.quick_multi_stage_reclustering

Quick Multi Stage ReClustering (QMSRC) procedure.

ConservedWaterSearch.water_clustering.WaterClustering.single_clustering

Single clustering procedure.

ConservedWaterSearch.water_clustering.WaterClustering.save_results

Saves clustering results and parameters to a file.

ConservedWaterSearch.water_clustering.WaterClustering.restart_cluster

Read the options and results and restart the clustering procedure.

ConservedWaterSearch.water_clustering.WaterClustering.read_and_set_water_clust_options

Reads clustering options from file.

ConservedWaterSearch.water_clustering.WaterClustering.create_from_file

Create a WaterClustering class from a file.

ConservedWaterSearch.water_clustering.WaterClustering.create_from_files_and_restart

Create a WaterClustering from file and restart the procedure.

ConservedWaterSearch.water_clustering.WaterClustering.visualise_nglview

Visualise the results using nglview.

ConservedWaterSearch.water_clustering.WaterClustering.visualise_pymol

Visualise results using pymol.

ConservedWaterSearch.water_clustering.WaterClustering.waterH1

Coordinates of first Hydrogen atom of water molecules from clustering.

ConservedWaterSearch.water_clustering.WaterClustering.waterH2

Coordinates of second Hydrogen atom of water molecules from clustering.

ConservedWaterSearch.water_clustering.WaterClustering.waterO

Oxygen coordinates of water molecules classified using clustering.

ConservedWaterSearch.water_clustering.WaterClustering.water_type

List containing conserved water type classifications.

ConservedWaterSearch.water_clustering.WaterClustering.water_clusters

A single list containing main results.

class ConservedWaterSearch.water_clustering.WaterClustering(nsnaps: int, clustering_algorithm: str = 'OPTICS', water_types_to_find: tuple[str] | list[str] = ('FCW', 'HCW', 'WCW'), restart_after_found: bool = False, min_samples: list[int] | None = None, xis: tuple[float] | list[float] = (0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 1e-05), numbpct_oxygen: float = 0.8, normalize_orientations: bool = True, numbpct_hyd_orient_analysis: float = 0.85, kmeans_ang_cutoff: float = 120, kmeans_inertia_cutoff: float = 0.4, FCW_angdiff_cutoff: float = 5, FCW_angstd_cutoff: float = 17, other_waters_hyd_minsamp_pct: float = 0.15, nonFCW_angdiff_cutoff: float = 15, HCW_angstd_cutoff: float = 17, WCW_angstd_cutoff: float = 20, weakly_explained: float = 0.7, xiFCW: tuple[float] | list[float] = (0.03,), xiHCW: tuple[float] | list[float] = (0.05, 0.01), xiWCW: tuple[float] | list[float] = (0.05, 0.001), njobs: int = 1, verbose: int = 0, debugO: int = 0, debugH: int = 0, plotend: bool = False, plotreach: bool = False, restart_data_file: str | None = None, output_file: str | None = None)[source]

Bases: object

Class for performing water clustering.

First, oxygens are clustered using OPTICS or HDBSCAN, followed by analysis of orientations for classification of waters into one of 3 proposed conserved water types (for more information see Theory, Background, and Methods):

  • FCW (Fully Conserved Water): hydrogens are strongly oriented in two directions with an angle of 104.5°

  • HCW (Half Conserved Water): one set (cluster) of hydrogens is oriented in a single direction, while the other hydrogen’s orientations are spread over different directions at an angle of 104.5°

  • WCW (Weakly Conserved Water): several orientation combinations with satisfactory water angles exist
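The water angle referred to in the list above is the angle between the two hydrogen orientation vectors. A minimal sketch of that computation follows; it is not taken from the library, and the helper name angle_deg and the example vectors are hypothetical.

```python
import math


def angle_deg(u, v):
    """Angle in degrees between two 3D vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))


# Two made-up O-H orientation vectors constructed to be 104.5 degrees apart.
half = math.radians(104.5 / 2)
h1 = (math.sin(half), math.cos(half), 0.0)
h2 = (-math.sin(half), math.cos(half), 0.0)
water_angle = angle_deg(h1, h2)
```

The angle cutoffs documented for the class are expressed in degrees, which is why the sketch converts from radians at the end.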

To run the calculation, use either the multi_stage_reclustering() function to start the Multi Stage ReClustering (MSRC) procedure or single_clustering() to start a single clustering (SC) procedure. MSRC produces better results at the cost of computational time, while SC is very quick but gives worse results and may fail to identify a significant number of waters. For more details see [TFJB22].

__init__(nsnaps: int, clustering_algorithm: str = 'OPTICS', water_types_to_find: tuple[str] | list[str] = ('FCW', 'HCW', 'WCW'), restart_after_found: bool = False, min_samples: list[int] | None = None, xis: tuple[float] | list[float] = (0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 1e-05), numbpct_oxygen: float = 0.8, normalize_orientations: bool = True, numbpct_hyd_orient_analysis: float = 0.85, kmeans_ang_cutoff: float = 120, kmeans_inertia_cutoff: float = 0.4, FCW_angdiff_cutoff: float = 5, FCW_angstd_cutoff: float = 17, other_waters_hyd_minsamp_pct: float = 0.15, nonFCW_angdiff_cutoff: float = 15, HCW_angstd_cutoff: float = 17, WCW_angstd_cutoff: float = 20, weakly_explained: float = 0.7, xiFCW: tuple[float] | list[float] = (0.03,), xiHCW: tuple[float] | list[float] = (0.05, 0.01), xiWCW: tuple[float] | list[float] = (0.05, 0.001), njobs: int = 1, verbose: int = 0, debugO: int = 0, debugH: int = 0, plotend: bool = False, plotreach: bool = False, restart_data_file: str | None = None, output_file: str | None = None) None[source]

Initialise WaterClustering class.

The input parameters determine the options for oxygen clustering and hydrogen orientation analysis if applicable.

Parameters:
  • nsnaps (int) – Number of trajectory snapshots related to the data set.

  • clustering_algorithm (str, optional) – Options are “OPTICS” or “HDBSCAN”. OPTICS provides slightly better results, but is also slightly slower. Defaults to “OPTICS”.

  • water_types_to_find (tuple[str], optional) – Defines which water types to search for. Any combination of “FCW”, “HCW” and “WCW” is allowed, or “onlyO” for oxygen clustering only. Defaults to (“FCW”, “HCW”, “WCW”).

  • restart_after_found (bool, optional) – If True, clustering restarts after each water is found. If False, the quick version of the multi-stage reclustering approach is used. Defaults to False.

  • min_samples (list[int], optional) – List of minimum samples for OPTICS or HDBSCAN. If None, the range [int(0.25 * nsnaps), nsnaps] is used. For single clustering, users should provide a single integer between 0 and nsnaps in a list. Defaults to None.

  • xis (tuple[float], optional) – List or tuple of xis for OPTICS clustering. This is ignored for HDBSCAN. Defaults to (0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00001). For single clustering, users should provide a single float between 0 and 1 in a list/tuple.

  • numbpct_oxygen (float, optional) – Percentage of nsnaps required for oxygen cluster to be considered valid and water conserved. The check is enforced on the lower limit nsnaps * numbpct_oxygen as well as the upper limit nsnaps * (2-numbpct_oxygen). Defaults to 0.8.

  • normalize_orientations (bool, optional) – Whether orientations should be normalised to unit length. Defaults to True.

  • numbpct_hyd_orient_analysis (float, optional) – Minimum allowed size of the hydrogen orientation cluster. Defaults to 0.85.

  • kmeans_ang_cutoff (float, optional) – Maximum value of angle (in deg) allowed for FCW in kmeans clustering to be considered correct water angle. Defaults to 120.

  • kmeans_inertia_cutoff (float, optional) – Upper limit allowed on kmeans inertia (a measure of the spread of data in a cluster). Defaults to 0.4.

  • FCW_angdiff_cutoff (float, optional) – Maximum value of angle (in deg) allowed for FCW in OPTICS/HDBSCAN clustering to be considered correct water angle. Defaults to 5.

  • FCW_angstd_cutoff (float, optional) – Maximal standard deviation of angle distribution of orientations of two hydrogens allowed for water to be considered FCW. Defaults to 17.

  • other_waters_hyd_minsamp_pct (float, optional) – Minimum samples to choose for OPTICS or HDBSCAN clustering as percentage of number of water molecules considered for HCW and WCW. Defaults to 0.15.

  • nonFCW_angdiff_cutoff (float, optional) – Maximum standard deviation of angle allowed for HCW and WCW to be considered correct water angle. Defaults to 15.

  • HCW_angstd_cutoff (float, optional) – Maximum standard deviation cutoff for HCW angles to be considered correct water angles. Defaults to 17.

  • WCW_angstd_cutoff (float, optional) – Maximum standard deviation cutoff for WCW angles to be considered correct water angles. Defaults to 20.

  • weakly_explained (float, optional) – Percentage of explained hydrogen orientations for water to be considered WCW. Defaults to 0.7.

  • xiFCW (tuple[float], optional) – Xi value for hydrogen clustering of FCWs for OPTICS algorithm. Avoid changing the defaults if possible. Defaults to (0.03,).

  • xiHCW (tuple[float], optional) – Xi value for OPTICS clustering for HCW. Avoid changing the defaults if possible. Defaults to (0.05, 0.01).

  • xiWCW (tuple[float], optional) – Xi value for OPTICS clustering for WCW. Avoid changing the defaults if possible. Defaults to (0.05, 0.001).

  • njobs (int, optional) – How many CPU cores to use for clustering. Defaults to 1.

  • verbose (int, optional) – Verbosity of output. Defaults to 0.

  • debugO (int, optional) – Debug level for oxygen clustering. Defaults to 0.

  • debugH (int, optional) – Debug level for orientations. Defaults to 0.

  • plotend (bool, optional) – Whether to plot everything at the end of the run. Defaults to False.

  • plotreach (bool, optional) – Whether to plot the reachability plot for OPTICS when debugging. Defaults to False.

  • restart_data_file (str, optional) – Restart data file. If None restarting is not possible and no restart file is generated. Both restart_data_file and output_file have to be provided for clustering restarting. Defaults to None.

  • output_file (str | None, optional) – If None results are not saved to a file. If string is provided results (including temporary results) are saved to a file with that name. Both restart_data_file and output_file have to be provided for clustering restarting. Defaults to None.
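The size limits described for numbpct_oxygen and the default min_samples range can be worked out by hand. The sketch below only restates the formulas given in the parameter list; it is not the library's code, the helper names are hypothetical, and treating both limits as inclusive is an assumption.

```python
def oxygen_cluster_size_bounds(nsnaps, numbpct_oxygen=0.8):
    """Lower and upper cluster-size limits as stated for numbpct_oxygen."""
    lower = nsnaps * numbpct_oxygen
    upper = nsnaps * (2 - numbpct_oxygen)
    return lower, upper


def is_valid_oxygen_cluster(size, nsnaps, numbpct_oxygen=0.8):
    """Check a cluster size against both limits (inclusive: an assumption)."""
    lower, upper = oxygen_cluster_size_bounds(nsnaps, numbpct_oxygen)
    return lower <= size <= upper


def default_min_samples_endpoints(nsnaps):
    """Endpoints of the range used when min_samples is None."""
    return int(0.25 * nsnaps), nsnaps


# Example with a hypothetical trajectory of 200 snapshots.
lower, upper = oxygen_cluster_size_bounds(200)
endpoints = default_min_samples_endpoints(200)
```

With 200 snapshots and the default numbpct_oxygen, an oxygen cluster would need roughly 160 to 240 members to pass the size check.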

run(oxygen_positions, hydrogen1_positions, hydrogen2_positions)[source]

Run water clustering.

Results will be stored in self.water_clusters.

Parameters:
  • oxygen_positions (np.ndarray) – Oxygen coordinates.

  • hydrogen1_positions (np.ndarray) – Hydrogen 1 orientations.

  • hydrogen2_positions (np.ndarray) – Hydrogen 2 orientations.

multi_stage_reclustering(Odata: ndarray, H1: ndarray | None, H2: ndarray | None, clustering_algorithm: str = 'OPTICS', lower_minsamp_pct: float = 0.25, every_minsamp: int = 1, xis: list[float] | None = None, whichH: list[str] | None = None) None[source]

Multi Stage ReClustering (MSRC) procedure.

Main loop - loops over water clustering parameter space (minsamp and xi) and clusters oxygens first - if a clustering with satisfactory oxygen clustering and hydrogen orientation clustering (optional) is found, elements of that water cluster are removed from the data set and water clustering starts from the beginning. Loops until no satisfactory clusterings are found. For more details see [TFJB22].

Parameters:
  • Odata (np.ndarray) – Oxygen coordinates.

  • H1 (np.ndarray | None) – Hydrogen 1 orientations. If None whichH must be “onlyO”.

  • H2 (np.ndarray | None) – Hydrogen 2 orientations. If None whichH must be “onlyO”.

  • clustering_algorithm (str, optional) – Options are “OPTICS” or “HDBSCAN”. OPTICS provides slightly better results, but is also slightly slower. Defaults to “OPTICS”.

  • lower_minsamp_pct (float, optional) – Lowest minsamp value used for clustering. The range is from nsnaps to lower_minsamp_pct times nsnaps. Defaults to 0.25.

  • every_minsamp (int, optional) – Step for sampling of minsamp in range from nsnaps to lower_minsamp_pct times nsnaps. If 1 uses all integer values in range. Defaults to 1.

  • xis (list[float], optional) – List of xis for OPTICS clustering. This is ignored for HDBSCAN. Defaults to [ 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00001].

  • whichH (list[str], optional) – Defines which water types to search for. Any combination of “FCW”, “HCW” and “WCW” is allowed, or “onlyO” for oxygen clustering only. Defaults to [“FCW”, “HCW”, “WCW”].
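The control flow of the MSRC main loop described above (scan the parameter grid, remove an accepted cluster, start again from the beginning) can be illustrated with a toy example. This is a sketch under stated assumptions, not the library's implementation: msrc_sketch and toy_find_cluster are hypothetical names, the data are one-dimensional numbers, and the toy cluster finder is a stand-in for OPTICS/HDBSCAN that ignores xi.

```python
from itertools import product


def toy_find_cluster(points, minsamp, xi):
    """Stand-in for oxygen clustering: bucket points by nearest integer.

    Returns the first bucket with at least minsamp members, or None.
    xi is accepted only to mirror the parameter grid; it is not used here.
    """
    buckets = {}
    for p in points:
        buckets.setdefault(round(p), []).append(p)
    for members in buckets.values():
        if len(members) >= minsamp:
            return members
    return None


def msrc_sketch(data, minsamps, xis, find_cluster):
    """Remove each accepted cluster, then rescan the whole parameter grid."""
    remaining = list(data)
    found = []
    restart = True
    while restart:
        restart = False
        for minsamp, xi in product(minsamps, xis):
            cluster = find_cluster(remaining, minsamp, xi)
            if cluster:
                found.append(cluster)
                remaining = [p for p in remaining if p not in cluster]
                restart = True
                break  # go back to the first parameter pair
    return found, remaining


found, remaining = msrc_sketch(
    [1.0, 1.1, 0.9, 5.0, 5.1, 9.0], minsamps=[3, 2], xis=[0.1],
    find_cluster=toy_find_cluster,
)
```

The loop ends only when a full pass over the grid finds nothing, which matches the "loops until no satisfactory clusterings are found" description and explains the higher cost of MSRC.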

quick_multi_stage_reclustering(Odata: ndarray, H1: ndarray | None, H2: ndarray | None, clustering_algorithm: str = 'OPTICS', lower_minsamp_pct: float = 0.25, every_minsamp: int = 1, xis: list[float] | None = None, whichH: list[str] | None = None) None[source]

Quick Multi Stage ReClustering (QMSRC) procedure.

Main loop - loops over water clustering parameter space (minsamp and xi) and clusters oxygens first. When clusters with satisfactory oxygen clustering and hydrogen orientation clustering (optional) are found, the elements of those water clusters are added to the list of conserved waters. The data for those waters is removed from the data set, but clustering does not restart.

Parameters:
  • Odata (np.ndarray) – Oxygen coordinates.

  • H1 (np.ndarray | None) – Hydrogen 1 orientations. If None whichH must be “onlyO”.

  • H2 (np.ndarray | None) – Hydrogen 2 orientations. If None whichH must be “onlyO”.

  • clustering_algorithm (str, optional) – Options are “OPTICS” or “HDBSCAN”. OPTICS provides slightly better results, but is also slightly slower. Defaults to “OPTICS”.

  • lower_minsamp_pct (float, optional) – Lowest minsamp value used for clustering. The range is from nsnaps to lower_minsamp_pct times nsnaps. Defaults to 0.25.

  • every_minsamp (int, optional) – Step for sampling of minsamp in range from nsnaps to lower_minsamp_pct times nsnaps. If 1 uses all integer values in range. Defaults to 1.

  • xis (list[float], optional) – List of xis for OPTICS clustering. This is ignored for HDBSCAN. Defaults to [ 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00001].

  • whichH (list[str], optional) – Defines which water types to search for. Any combination of “FCW”, “HCW” and “WCW” is allowed, or “onlyO” for oxygen clustering only. Defaults to [“FCW”, “HCW”, “WCW”].
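In contrast to MSRC, the quick procedure makes a single pass over the parameter grid. The toy sketch below illustrates only that control flow; it is not the library's code, qmsrc_sketch and toy_find_clusters are hypothetical names, and the bucket-based cluster finder is a stand-in for OPTICS/HDBSCAN that ignores xi.

```python
from itertools import product


def toy_find_clusters(points, minsamp, xi):
    """Stand-in for oxygen clustering: bucket points by nearest integer.

    Returns every bucket with at least minsamp members.
    xi is accepted only to mirror the parameter grid; it is not used here.
    """
    buckets = {}
    for p in points:
        buckets.setdefault(round(p), []).append(p)
    return [members for members in buckets.values() if len(members) >= minsamp]


def qmsrc_sketch(data, minsamps, xis, find_clusters):
    """Single pass over the grid: accept clusters, remove them, move on."""
    remaining = list(data)
    found = []
    for minsamp, xi in product(minsamps, xis):
        clusters = find_clusters(remaining, minsamp, xi)
        found.extend(clusters)
        taken = [p for cluster in clusters for p in cluster]
        # Accepted points leave the data set, but the loop does not restart.
        remaining = [p for p in remaining if p not in taken]
    return found, remaining


found, remaining = qmsrc_sketch(
    [1.0, 1.1, 0.9, 5.0, 5.1, 9.0], minsamps=[3, 2], xis=[0.1],
    find_clusters=toy_find_clusters,
)
```

Each parameter pair is visited exactly once, so the number of clustering runs is bounded by the grid size.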

single_clustering(Odata: ndarray, H1: ndarray | None, H2: ndarray | None, clustering_algorithm: str = 'OPTICS', minsamp: int | None = None, xi: float | None = None, whichH: list[str] | None = None) None[source]

Single clustering procedure.

In single clustering procedure oxygen clustering is run only once with given minsamp and xi (if applicable - only for OPTICS).

Parameters:
  • Odata (np.ndarray) – Oxygen coordinates.

  • H1 (np.ndarray | None) – Hydrogen 1 orientations. If None whichH must be “onlyO”.

  • H2 (np.ndarray | None) – Hydrogen 2 orientations. If None whichH must be “onlyO”.

  • clustering_algorithm (str, optional) – Options are “OPTICS” or “HDBSCAN”. OPTICS provides slightly better results, but is also slightly slower. Defaults to “OPTICS”.

  • minsamp (int | None, optional) – Minimum samples parameter for OPTICS or HDBSCAN. If None numbpct_oxygen * nsnaps is used. Defaults to None.

  • xi (float | None, optional) – Xi value for OPTICS. If None, a value of 0.05 is used. If clustering_algorithm is HDBSCAN, it is ignored. Defaults to None.

  • whichH (list[str], optional) – Defines which water types to search for. Any combination of “FCW”, “HCW” and “WCW” is allowed, or “onlyO” for oxygen clustering only. Defaults to [“FCW”, “HCW”, “WCW”].
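The fallback values for minsamp and xi described above can be written out as a small helper. This is a sketch only: resolve_single_clustering_params is a hypothetical name, and converting numbpct_oxygen * nsnaps to an integer with int() is an assumption, since the documentation does not state how the product is rounded.

```python
def resolve_single_clustering_params(nsnaps, numbpct_oxygen=0.8,
                                     minsamp=None, xi=None):
    """Fill in the documented fallbacks for minsamp and xi."""
    if minsamp is None:
        # Documented fallback: numbpct_oxygen * nsnaps (int() is an assumption).
        minsamp = int(numbpct_oxygen * nsnaps)
    if xi is None:
        # Documented fallback for OPTICS; ignored when HDBSCAN is used.
        xi = 0.05
    return minsamp, xi


defaults = resolve_single_clustering_params(200)
explicit = resolve_single_clustering_params(200, minsamp=150, xi=0.01)
```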

save_results(file_name: str) None[source]

Saves clustering results and parameters to a file.

Top of the results file contains clustering parameters after which results are saved in the same file.

Parameters:

file_name (str) – File name to save results to.

restart_cluster(partial_results_file: str, partial_data_file: str) None[source]

Read the options and results and restart the clustering procedure.

Parameters:
  • partial_data_file (str) – File name of the file containing intermediate set of data of hydrogen and oxygen coordinates.

  • partial_results_file (str) – File name containing partial results with determined water coordinates.

read_and_set_water_clust_options(file_name: str) None[source]

Reads clustering options from file.

Reads all class clustering options from save file and sets the parameters. Reads all parameters except clustering protocol and protocol parameters.

Parameters:

file_name (str) – Results or partial results file from which procedure parameters will be read.

classmethod create_from_file(file_name: str) WaterClustering[source]

Create a WaterClustering class from a file.

Parameters:

file_name (str) – Results or partial results file from which procedure parameters will be read.

Returns:

creates an instance of WaterClustering class by reading options from a file.

classmethod create_from_files_and_restart(partial_output: str, partial_data_file: str) WaterClustering[source]

Create a WaterClustering from file and restart the procedure.

Parameters:
  • partial_output (str) – Partial results file from which procedure parameters will be read.

  • partial_data_file (str) – Partial data file from which data will be read.

Returns:

creates an instance of WaterClustering class and restarts clustering

visualise_pymol(aligned_protein: str | None = None, output_file: str | None = None, active_site_ids: list[int] | None = None, crystal_waters: str | None = None, ligand_resname: str | None = None, dist: float = 10.0, density_map: str | None = None, polar_contacts: bool = False, lunch_pymol: bool = True, reinitialize: bool = True) None[source]

Visualise results using pymol.

Parameters:
  • aligned_protein (str, optional) – File containing the protein configuration the trajectory was aligned to. If None only waters are shown. Defaults to None.

  • output_file (str | None, optional) – File to save the visualisation state. If None, a pymol session is started (this probably doesn’t work on Mac OSX). Defaults to None.

  • active_site_ids (list[int] | None, optional) – Residue ids - numbers of aminoacids in active site. These are visualised as licorice. Defaults to None.

  • crystal_waters (str | None, optional) – PDBid from which an attempt will be made to extract crystal waters. Defaults to None.

  • ligand_resname (str | None, optional) – Residue name of the ligand around which crystal waters (oxygens) shall be selected. Defaults to None.

  • dist (float) – distance from the centre of ligand around which crystal waters shall be selected. Defaults to 10.0.

  • density_map (str | None, optional) – Water density map to add to visualisation session (usually .dx file). Defaults to None.

  • polar_contacts (bool, optional) – If True polar contacts between waters and protein will be visualised. Defaults to False.

  • lunch_pymol (bool, optional) – If True pymol will be launched in interactive mode. If False pymol will be imported without launching. Defaults to True.

  • reinitialize (bool, optional) – If True pymol will be reinitialized (defaults restored and objects cleaned). Defaults to True.

visualise_nglview(aligned_protein: str | None = None, active_site_ids: list[int] | None = None, crystal_waters: str | None = None, density_map: str | None = None) NGLWidget[source]

Visualise the results using nglview.

nglview can be used to visualise the results of the clustering procedure. We recommend using pymol visualisation as it is more informative and provides more options.

Parameters:
  • aligned_protein (str, optional) – File containing protein configuration the original trajectory was aligned to. Defaults to None.

  • active_site_ids (list[int] | None, optional) – Residue ids - numbers of aminoacids in active site. These are visualised as licorice. Defaults to None.

  • crystal_waters (str | None, optional) – PDBid from which an attempt will be made to extract crystal waters. Defaults to None.

  • density_map (str | None, optional) – Water density map to add to visualisation session (usually .dx file). Defaults to None.

Returns:

returns nglview instance widget which can be run in IPython/Jupyter to create a visualisation instance

Return type:

NGLWidget

property water_type: list[str]

List containing conserved water type classifications.

Contains conserved water type classifications in the same order as coordinates in waterO and waterH1 and waterH2. Water types:

  • FCW (Fully Conserved Water): hydrogens are strongly oriented in two directions with an angle of 104.5°

  • HCW (Half Conserved Water): one set (cluster) of hydrogens is oriented in a single direction, while the other hydrogen’s orientations are spread over different directions at an angle of 104.5°

  • WCW (Weakly Conserved Water): several orientation combinations with satisfactory water angles exist

For more information see [TFJB22] and Theory, Background, and Methods.

Returns:

Returns a list of strings containing water type classification - “FCW” or “HCW” or “WCW”. If “onlyO”, only oxygen clustering was performed.

Return type:

list[str]

property waterO: list[ndarray]

Oxygen coordinates of water molecules classified using clustering.

Returns:

Returns a list of 3D xyz coordinates of oxygen positions in space

Return type:

list[np.ndarray]

property waterH1: list[ndarray]

Coordinates of first Hydrogen atom of water molecules from clustering.

Returns:

Returns a list of 3D xyz coordinates of first hydrogens’ positions in space

Return type:

list[np.ndarray]

property waterH2: list[ndarray]

Coordinates of second Hydrogen atom of water molecules from clustering.

Returns:

Returns a list of 3D xyz coordinates of second hydrogens’ positions in space

Return type:

list[np.ndarray]

property water_clusters: list[dict]

A single list containing main results.

List of dicts containing coordinates of oxygen and two hydrogens and water classification. Each element in the list is a dictionary that contains keys “O”, “H1”, “H2” and “type” which correspond to oxygen coordinates, hydrogen 1 coordinates, hydrogen 2 coordinates and water classification respectively.
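The layout described above can be unpacked with ordinary list comprehensions. The example below uses made-up coordinates and plain Python lists in place of np.ndarray values; only the keys “O”, “H1”, “H2” and “type” come from the description.

```python
from collections import Counter

# Hypothetical results in the documented layout (coordinates are invented).
water_clusters = [
    {"O": [1.0, 2.0, 3.0], "H1": [1.8, 2.0, 3.6], "H2": [0.2, 2.0, 3.6], "type": "FCW"},
    {"O": [4.0, 5.0, 6.0], "H1": [4.8, 5.0, 6.6], "H2": [3.2, 5.0, 6.6], "type": "HCW"},
    {"O": [7.0, 8.0, 9.0], "H1": [7.8, 8.0, 9.6], "H2": [6.2, 8.0, 9.6], "type": "FCW"},
]

# Per-atom lists in the same order as the clusters.
oxygens = [water["O"] for water in water_clusters]
first_hydrogens = [water["H1"] for water in water_clusters]
second_hydrogens = [water["H2"] for water in water_clusters]
types = [water["type"] for water in water_clusters]

# Count how many waters of each classification were found.
type_counts = Counter(types)
```

The four derived lists share the ordering of water_clusters, mirroring the relationship documented between water_type, waterO, waterH1 and waterH2.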