af2rave.feature module

Feature analysis module for af2rave.

class FeatureSelection(input: str | list[str], ref_pdb: str | None = None)

Bases: object

Reads an ensemble of PDB files and performs feature selection.

Parameters:

input – The name(s) of the PDB file(s) generated from the reduced MSA. If a directory is provided, all PDB files in it are loaded. If a list of PDB files is provided, all listed files are loaded. Trajectory files can also be loaded, as long as MDTraj can read them.
ref_pdb – The name of the reference structure. If none is provided, the first frame of the input PDB file will be used as the reference.

apply_filter(*args: list[str]) → None

Apply a mask to the trajectory. Each mask is a list of the pdb names to keep. Multiple masks can be applied at once.

Example:

fs.apply_filter(mask)
fs.apply_filter(mask1, mask2)

Parameters: mask (list[str]) – The mask to apply.
Raises: ValueError – If the mask is invalid.
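
As a rough illustration of how several masks could combine, the sketch below keeps only the names that appear in every mask. The intersection semantics and the definition of an invalid mask are assumptions, not something this reference confirms, and combine_masks is a hypothetical helper rather than part of af2rave.

```python
def combine_masks(pdb_names, *masks):
    """Hypothetical helper: keep the names that appear in every mask.

    Assumptions: multiple masks combine by intersection, and a mask is
    invalid when it names a structure that is not loaded.
    """
    known = set(pdb_names)
    keep = set(pdb_names)
    for mask in masks:
        unknown = set(mask) - known
        if unknown:
            raise ValueError(f"Unknown pdb names in mask: {sorted(unknown)}")
        keep &= set(mask)
    # Preserve the original ordering of the loaded structures.
    return [name for name in pdb_names if name in keep]


names = ["a.pdb", "b.pdb", "c.pdb", "d.pdb"]
kept = combine_masks(
    names,
    ["a.pdb", "b.pdb", "c.pdb"],
    ["b.pdb", "c.pdb", "d.pdb"],
)
print(kept)
```

Under these assumptions, only structures that pass every filter survive, which matches the typical use of chaining the outputs of several filter methods.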

property atom_pairs: dict[str, ndarray[tuple[Any, ...], dtype[int64]]]: The atom pairs dictionary. The key is the feature name and the value is the atom pairs.

property feature_array: ndarray[tuple[Any, ...], dtype[float64]]

The feature array, with each feature stacked column-wise.

Returns: The feature array.
Return type: np.ndarray[float]

property features: dict[str, ndarray[tuple[Any, ...], dtype[float64]]]: The features dictionary. The key is the feature name and the value is the feature array.

get_chimera_plotscript(feature_name: list[str], add_header: bool = True) → str

Generate a Chimera plotscript to visualize the selected features.

Parameters:

feature_name – A list of feature names to visualize.
add_header – Whether to add the “open xxx.pdb” header to the plotscript.

Returns:

The Chimera plotscript as a string.

Raises:

ValueError – If feature_name is None or contains invalid names.

get_rmsd(selection: str = 'name CA') → dict[str, float]

Get the RMSD of the atoms in the selection for each frame in the trajectory. The reference structure is provided in the constructor.

Parameters: selection (str) – The selection string used to select the atoms.
Returns: Dictionary of pdb names and their RMSD values. Units: Angstrom.
Return type: dict[str, float]
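
To make the returned quantity concrete, here is a minimal pure-Python sketch of an RMSD over a set of selected atoms, collected into a dictionary keyed by pdb name. It assumes the structures are already superposed on the reference; whether af2rave aligns structures first is not stated here. The rmsd function, the file names, and the coordinates are all hypothetical.

```python
import math


def rmsd(coords, ref_coords):
    """RMSD between two equally sized lists of (x, y, z) tuples.

    Assumption: the structures are already superposed, so no alignment
    is performed. Units follow the input (Angstrom in this example).
    """
    if len(coords) != len(ref_coords) or not coords:
        raise ValueError("Coordinate lists must be non-empty and equal in length.")
    sq_sum = sum(math.dist(a, b) ** 2 for a, b in zip(coords, ref_coords))
    return math.sqrt(sq_sum / len(coords))


# Two selected atoms per structure, coordinates in Angstrom.
ref = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
frames = {
    "model_1.pdb": [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)],
    "model_2.pdb": [(0.0, 3.0, 0.0), (3.8, 0.0, 4.0)],
}
rmsd_by_name = {name: rmsd(xyz, ref) for name, xyz in frames.items()}
print(rmsd_by_name)
```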

property minimum_nonbonded_distance: dict[str, float]

property nonbonded_pairs: list[tuple[int, int]]: The non-bonded atom pairs in the structure.

pca(n_components: int = 2, **kwargs) → tuple[PCA, ndarray[tuple[Any, ...], dtype[float64]]]

Perform Principal Component Analysis (PCA) on the selected features.

Parameters:

n_components – The number of principal components to compute.
kwargs – Additional keyword arguments to pass to the PCA constructor.

Returns:

A tuple containing the fitted PCA object and the transformed data.

Raises:

ValueError – If no features are available for PCA.

property pdb_name: list[str]: The list of pdb names.

peptide_bond_filter(mean_cutoff=1.4, std_cutoff=0.2) → list[str]

Filter structures with a peptide bond cutoff.

Some AlphaFold2-generated structures have unrealistic backbones, often characterized by peptide bonds that are too long or too short. The mean and standard deviation of the peptide bond lengths are calculated for each structure. If either the mean or the standard deviation exceeds its cutoff, the structure is filtered out.

Parameters:

mean_cutoff (float) – Maximum allowed mean peptide bond length per structure. Default: 1.4 Angstrom
std_cutoff (float) – Maximum allowed standard deviation of peptide bond length per structure. Default: 0.2 Angstrom

Returns:

The pdb names of the selected structures.

Return type:

list[str]

Raises:

ValueError – If no structures meet the cutoff criteria.
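
The filtering rule described above can be sketched on plain Python data as follows. This is an illustration only: filter_by_peptide_bonds is a hypothetical stand-in that takes precomputed bond lengths, and the use of the population standard deviation is an assumption.

```python
import statistics


def filter_by_peptide_bonds(bond_lengths, mean_cutoff=1.4, std_cutoff=0.2):
    """Hypothetical sketch of the peptide bond filter on plain dicts.

    bond_lengths maps a pdb name to its peptide bond lengths in Angstrom.
    Assumption: the population standard deviation is used.
    """
    selected = []
    for name, lengths in bond_lengths.items():
        mean = statistics.fmean(lengths)
        std = statistics.pstdev(lengths)
        # Keep a structure only if both statistics are within their cutoffs.
        if mean <= mean_cutoff and std <= std_cutoff:
            selected.append(name)
    if not selected:
        raise ValueError("No structures meet the cutoff criteria.")
    return selected


bond_lengths = {
    "good.pdb": [1.32, 1.33, 1.34, 1.33],
    "stretched.pdb": [1.33, 1.90, 2.40, 1.35],  # mean exceeds 1.4
    "uneven.pdb": [0.90, 1.70, 0.95, 1.65],     # mean is fine, spread exceeds 0.2
}
selected = filter_by_peptide_bonds(bond_lengths)
print(selected)
```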

property peptide_bond_stats: dict[str, ndarray[tuple[Any, ...], dtype[float64]]]: Get the mean and standard deviation of the peptide bond lengths per structure. A dictionary with the pdb names as keys.

rank_feature(selection: str | tuple[str, str] | list[str | tuple[str, str]] = 'name CA') → tuple[list[str], ndarray[tuple[Any, ...], dtype[float64]]]

Rank the features by the coefficient of variation (CV). The argument selection can be:

A string:
Computes all pairs of atoms within the selection.

A tuple of two strings:
Computes all pairs of atoms between the two selections.

A list of strings or tuples:
Computes atom pairs for each selection in the list.

Parameters:

selection – The selection string(s) used to determine atom pairs.

Returns:

names: A list of feature names.
cv: A NumPy array containing the coefficient of variation values.

Raises:

ValueError – If selection is not a valid type.
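
The two ideas in this method, enumerating atom pairs and ranking by coefficient of variation (standard deviation divided by mean), can be sketched in plain Python. The atom indices, the feature names pair_a to pair_c, the descending sort order, and the use of the population standard deviation are all assumptions made for illustration; rank_by_cv is a hypothetical helper.

```python
import statistics
from itertools import combinations, product

# A single selection: all pairs of atoms within the selected indices.
within = list(combinations([0, 4, 8], 2))
# A tuple of two selections: all pairs between the two index sets.
between = list(product([0, 4], [8, 12]))


def rank_by_cv(features):
    """Rank features by coefficient of variation, largest first.

    Assumptions: population standard deviation, descending order.
    """
    cv = {
        name: statistics.pstdev(values) / statistics.fmean(values)
        for name, values in features.items()
    }
    names = sorted(cv, key=cv.get, reverse=True)
    return names, [cv[name] for name in names]


# Pairwise distances (Angstrom) across four structures, placeholder names.
features = {
    "pair_a": [10.0, 10.0, 10.0, 10.0],  # rigid pair
    "pair_b": [5.0, 15.0, 5.0, 15.0],    # highly variable pair
    "pair_c": [9.0, 11.0, 9.0, 11.0],    # mildly variable pair
}
names, cv = rank_by_cv(features)
print(names, cv)
```

Features with a high CV vary strongly across the ensemble relative to their mean, which is why they are natural candidates for downstream clustering.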

property ref_pdb: str: The reference pdb name.

regular_space_clustering(feature_name: list[str], min_dist: float, max_centers: int = 100, batch_size: int = 100, randomseed: int = 0) → tuple[ndarray[tuple[Any, ...], dtype[float64]], ndarray[tuple[Any, ...], dtype[int64]]]

Performs regular space clustering on the selected dimensions of features.

Parameters:

feature_name (list[str]) – List of feature names to use for clustering.
min_dist (float) – Minimum distance between cluster centers. Unit: Angstrom.
max_centers (int) – Maximum number of cluster centers. Default: 100.
batch_size (int) – Number of points to process in each batch. Default: 100.
randomseed (int) – Random seed for the permutation.

Returns:

A tuple containing:

center (np.ndarray): Cluster center coordinates.
center_id (np.ndarray): Indices of the cluster centers.

Raises:

ValueError – If max_centers is exceeded.
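
For readers unfamiliar with regular space clustering, the following is a minimal pure-Python sketch of the general technique: visit the points in a random permutation and promote a point to a new center whenever it lies at least min_dist from every existing center. It omits batching and is not af2rave's implementation; regspace_sketch and the sample points are hypothetical.

```python
import math
import random


def regspace_sketch(points, min_dist, max_centers=100, randomseed=0):
    """Hypothetical sketch of regular space clustering without batching."""
    order = list(range(len(points)))
    random.Random(randomseed).shuffle(order)  # seeded permutation of the points
    center_id = []
    for i in order:
        # A point becomes a center if it is far enough from all current centers.
        if all(math.dist(points[i], points[c]) >= min_dist for c in center_id):
            center_id.append(i)
            if len(center_id) > max_centers:
                raise ValueError("max_centers exceeded; try a larger min_dist.")
    centers = [points[i] for i in center_id]
    return centers, center_id


# Three well-separated groups in a two-feature space (Angstrom).
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.05, 0.0), (2.0, 0.0)]
centers, center_id = regspace_sketch(points, min_dist=0.5)
print(centers, center_id)
```

Which point represents each group depends on the permutation, but the number of centers and their minimum separation do not for this data set.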

rmsd_filter(selection='name CA', rmsd_cutoff: float = 10.0) → list[str]

Filter structures with an RMSD cutoff.

Filter out irrelevant structures by dropping those with an RMSD larger than a cutoff (in Angstrom). This returns a list of pdb names. The filter can subsequently be applied with the apply_filter method.

Parameters:

rmsd_cutoff (float) – The RMSD cutoff value. Default: 10.0 Angstrom
selection (str) – The selection string for the atoms used to calculate the RMSD. Default: “name CA”

Returns:

The pdb names of the selected structures.

Return type:

list[str]

Raises:

ValueError – If no structures meet the cutoff criteria.
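
The cutoff logic can be sketched on a dictionary shaped like the one get_rmsd returns. This is an illustrative stand-in, not the library code: filter_by_rmsd and the RMSD values are hypothetical, and treating a value exactly equal to the cutoff as passing is an assumption.

```python
def filter_by_rmsd(rmsd_by_name, rmsd_cutoff=10.0):
    """Hypothetical sketch: keep names whose RMSD does not exceed the cutoff."""
    selected = [name for name, value in rmsd_by_name.items() if value <= rmsd_cutoff]
    if not selected:
        raise ValueError("No structures meet the cutoff criteria.")
    return selected


# RMSD values in Angstrom, keyed by pdb name.
rmsd_by_name = {"model_1.pdb": 1.2, "model_2.pdb": 14.7, "model_3.pdb": 6.3}
mask = filter_by_rmsd(rmsd_by_name, rmsd_cutoff=10.0)
print(mask)
```

The resulting list has the same shape as the masks that apply_filter accepts.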

steric_clash_filter(min_non_bonded_cutoff=1.1) → list[str]

Filter structures based on non-bonded heavy atom distances.

Some AlphaFold2-generated structures have steric clashes between non-bonded atoms. This method filters out structures where non-bonded heavy atom distances are too short, leading to overlap in van der Waals radii.

Parameters: min_non_bonded_cutoff (float) – Minimum allowed non-bonded heavy atom distance. Default: 1.1 Angstrom
Returns: The pdb names of the selected structures.
Return type: list[str]
Raises: ValueError – If no structures meet the cutoff criteria.
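
A clash check of this kind reduces to finding the shortest distance over all non-bonded heavy-atom pairs in each structure and comparing it with the cutoff. The sketch below shows that idea on plain coordinates; the helper names, the three-atom structures, and the pair list are hypothetical.

```python
import math


def min_nonbonded_distance(coords, nonbonded_pairs):
    """Shortest distance (Angstrom) over the given non-bonded atom pairs."""
    return min(math.dist(coords[i], coords[j]) for i, j in nonbonded_pairs)


def filter_steric_clashes(structures, nonbonded_pairs, min_non_bonded_cutoff=1.1):
    """Hypothetical sketch: drop structures whose closest pair is below the cutoff."""
    selected = [
        name
        for name, coords in structures.items()
        if min_nonbonded_distance(coords, nonbonded_pairs) >= min_non_bonded_cutoff
    ]
    if not selected:
        raise ValueError("No structures meet the cutoff criteria.")
    return selected


# Atoms 0-1 and 1-2 are bonded, so (0, 2) is the only non-bonded pair.
nonbonded_pairs = [(0, 2)]
structures = {
    "ok.pdb": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)],
    "clash.pdb": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.8, 0.0, 0.0)],
}
selected = filter_steric_clashes(structures, nonbonded_pairs)
print(selected)
```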

property top: Topology: The topology of the reference structure.

property traj: Trajectory

Return an MDTraj object of all structures.

Returns: The MDTraj object.
Return type: md.Trajectory

af2rave.feature.utils

atom_name(top: Topology, index: int) → str

Get the atom name by its index. Example: “CA”.

Parameters:

top – The topology object.
index – The index of the atom.

Returns:

The atom name.

chain(top: Topology, index: int) → str

Get the chain ID by atom index. Example: “A”.

Parameters:

top – The topology object.
index – The index of the atom.

Returns:

The chain ID.

chimera_representation(top: Topology, index: int) → str

Get the ChimeraX representation of the atom by its index.

Format example: “/A:1@CA” for Gly-1 in chain A, CA atom.

Parameters:

top – The topology object.
index – The index of the atom.

Returns:

The ChimeraX representation of the atom.

representation(top: Topology, index: int) → str

Get a formatted atom representation.

Format example: “GLY1A-CA” for Gly-1 in chain A, CA atom.

Parameters:

top – The topology object.
index – The index of the atom.

Returns:

The formatted atom representation.
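
The two string formats quoted in this module ("GLY1A-CA" and "/A:1@CA") can be reproduced from their parts as shown below. The helpers take plain values instead of a Topology and an atom index, so they are hypothetical stand-ins that only illustrate the layout of the strings.

```python
def format_atom(resname, resid, chain, atom_name):
    """Hypothetical sketch of the compact format, e.g. 'GLY1A-CA'."""
    return f"{resname}{resid}{chain}-{atom_name}"


def format_chimera_atom(resid, chain, atom_name):
    """Hypothetical sketch of the ChimeraX-style format, e.g. '/A:1@CA'."""
    return f"/{chain}:{resid}@{atom_name}"


# Gly-1 in chain A, CA atom, with a residue ID that starts from 1.
print(format_atom("GLY", 1, "A", "CA"))
print(format_chimera_atom(1, "A", "CA"))
```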

resid(top: Topology, index: int) → int

Get the residue ID by atom index. This residue ID starts from 1.

Parameters:

top – The topology object.
index – The index of the atom.

Returns:

The residue ID.

resname(top: Topology, index: int) → str

Get the residue name by atom index. Example: “GLY”.

Parameters:

top – The topology object.
index – The index of the atom.

Returns:

The residue name.