Solvers#

Solvers for generating coresets.

class coreax.solvers.Solver[source]#

Bases: Module, Generic[_Coreset, _Data, _State]

Base class for coreset solvers.

Solver is generic on the type of data required by the reduce method, and the type of coreset returned, providing a convenient means to distinguish between solvers that take (weighted) data/supervised data, and those which produce coresets/coresubsets.

abstract reduce(dataset, solver_state=None)[source]#

Reduce ‘dataset’ to a coreset - solve the coreset problem.

Parameters:
  • dataset (Data) – The (potentially weighted and supervised) data to generate the coreset from

  • solver_state (Optional[Any]) – Solution state information, primarily used to cache expensive intermediate solution step information

Return type:

tuple[Coreset, Any]

Returns:

a tuple of the solved coreset and intermediate solver state information

class coreax.solvers.CoresubsetSolver[source]#

Bases: Solver[Coresubset[_Data], _Data, _State], Generic[_Data, _State]

Solver which returns a coreax.coreset.Coresubset.

A convenience class for the most common solver type in this package.

class coreax.solvers.RefinementSolver[source]#

Bases: CoresubsetSolver[_Data, _State], Generic[_Data, _State]

A CoresubsetSolver which supports refinement.

Some solvers assume implicitly/explicitly an initial coresubset on which the solution is dependent. Such solvers can be interpreted as refining the initial coresubset to produce another (solution) coresubset.

By providing a ‘refine’ method, one can compose the results of different solvers together, and/or repeatedly apply/chain the result of a refinement based solve.

# An example of repeated application/chaining of solutions/solvers.
result, state = solver.reduce(dataset)
refined_result, state = refine_solver.refine(result, state)
re_refined_result, state = refine_solver.refine(refined_result, state)
abstract refine(coresubset, solver_state=None)[source]#

Refine a coresubset - swap/update coresubset indices.

Parameters:
  • coresubset (Coresubset[Data]) – The coresubset to refine

  • solver_state (Optional[Any]) – Solution state information, primarily used to cache expensive intermediate solution step values.

Return type:

tuple[Coresubset[Data], Any]

Returns:

a refined coresubset and relevant intermediate solver state information

class coreax.solvers.ExplicitSizeSolver(coreset_size)[source]#

Bases: Solver

A Solver which produces a coreset of an explicitly specified size.

Parameters:

coreset_size (Any) – The desired size of the solved coreset

class coreax.solvers.PaddingInvariantSolver[source]#

Bases: Solver

A Solver whose results are invariant to zero weighted data.

In some cases, such as in coreax.solvers.MapReduce, there is a need to pad data to ensure shape stability. In these cases, we may assign zero weight to the padded data points, which allows certain ‘padding invariant’ solvers to return the same values on a call to reduce() as would have been returned if no padding were present.

Inheriting from this class is only a promise by the solver to obey the invariance property. Conformity with the property is not checked at runtime.

class coreax.solvers.CompositeSolver(base_solver)[source]#

Bases: Solver[_Coreset, _Data, _State], Generic[_Coreset, _Data, _State]

Base class for solvers that compose with/wrap other solvers.

Parameters:

base_solver (Solver[_Coreset, _Data, _State]) –

class coreax.solvers.MapReduce(base_solver, leaf_size, tree_type=<class 'sklearn.neighbors._kd_tree.KDTree'>)[source]#

Bases: CompositeSolver[_Coreset, _Data, _State], Generic[_Coreset, _Data, _State]

Calculate coreset of a given number of points using scalable reduction on blocks.

Uses a KDTree or BallTree to partition the original data into patches. Upon each of these a coreset of size base_solver.coreset_size is calculated. These coresets are concatenated to produce a larger coreset covering the whole of the original data, which has size greater than coreset_size. This coreset is now treated as the original data and reduced recursively until its size is equal to base_solver.coreset_size.

There is some intricate set-up:

  1. base_solver.coreset_size must be less than leaf_size.

  2. Zero weighted and valued padding will be used to ensure each partition has the same size if len(dataset) is not an integer multiple of leaf_size

Let \(n_k\) be the number of points after each recursion with \(n_0\) equal to the size of the original data. Then, each recursion reduces the size of the coreset such that

\[n_k <= \frac{n_{k - 1}}{\texttt{leaf_size}} \texttt{coreset_size},\]

so

\[n_k <= \left( \frac{\texttt{coreset_size}}{\texttt{leaf_size}} \right)^k n_0.\]

Thus, the number of iterations required is roughly (find \(k\) when \(n_k =\) base_solver.coreset_size)

\[\frac{ \log{\texttt{coreset_size}} - \log{\left(\text{original data size}\right)} }{ \log{\texttt{coreset_size}} - \log{\texttt{leaf_size}} } .\]
Parameters:
  • base_solver (Solver[Any, Data, Any]) – Solver to compose with; full support is currently provided for coreax.solvers.KernelHerding and coreax.solvers.SteinThinning; the solver’s result must be ignorant of/invariant to zero weighted padding.

  • leaf_size (Any) – Number of points to include in each partition; corresponds to 2*leaf_size in sklearn.neighbors.BinaryTree; must be greater than base_solver.coreset_size

  • tree_type (type[Union[KDTree, BallTree]]) – The type of binary tree based partitioning to use when splitting the dataset into smaller blocks.

tree_type#

alias of KDTree

reduce(dataset, solver_state=None)[source]#

Reduce ‘dataset’ to a coreset - solve the coreset problem.

Parameters:
  • dataset (Data) – The (potentially weighted and supervised) data to generate the coreset from

  • solver_state (Optional[Any]) – Solution state information, primarily used to cache expensive intermediate solution step information

Return type:

tuple[Any, Any]

Returns:

a tuple of the solved coreset and intermediate solver state information

class coreax.solvers.RandomSample(coreset_size, random_key, weighted=False, unique=True)[source]#

Bases: CoresubsetSolver[_Data, None], ExplicitSizeSolver

Reduce a dataset by randomly sampling a fixed number of points.

Parameters:
  • coreset_size (Any) – The desired size of the solved coreset

  • random_key (ArrayLike) – Key for random number generation

  • weighted (bool) – If to use dataset weights as selection probabilities

  • unique (bool) – If to sample without replacement

reduce(dataset, solver_state=None)[source]#

Reduce ‘dataset’ to a coreset - solve the coreset problem.

Parameters:
  • dataset (Data) – The (potentially weighted and supervised) data to generate the coreset from

  • solver_state (None) – Solution state information, primarily used to cache expensive intermediate solution step information

Return type:

tuple[Coresubset, None]

Returns:

a tuple of the solved coreset and intermediate solver state information

class coreax.solvers.HerdingState(gramian_row_mean)[source]#

Bases: Module

Intermediate KernelHerding solver state information.

Parameters:

gramian_row_mean (Array) – Cached Gramian row-mean.

class coreax.solvers.KernelHerding(coreset_size, kernel, unique=True, block_size=None, unroll=1)[source]#

Bases: RefinementSolver[Data, HerdingState], ExplicitSizeSolver, PaddingInvariantSolver

Kernel Herding - an explicitly sized coresubset refinement solver.

Solves the coresubset problem by taking a deterministic, iterative, and greedy approach to minimizing the (weighted) Maximum Mean Discrepancy (MMD) between the coresubset (the solution) and the problem dataset.

Given one has selected \(T\) data points for their compressed representation of the original dataset, kernel herding selects the next point using Equation 8 of [chen2012herding]:

\[x_{T+1} = \arg\max_{x} \left( \mathbb{E}[k(x, x')] - \frac{1}{T+1}\sum_{t=1}^T k(x, x_t) \right)\]

where \(k\) is the kernel used, the expectation \(\mathbb{E}\) is taken over the entire dataset, and the search is over the entire dataset. This can informally be seen as a balance between using points at which the underlying density is high (the first term) and exploration of distinct regions of the space (the second term).

Parameters:
reduce(dataset, solver_state=None)[source]#

Reduce ‘dataset’ to a coreset - solve the coreset problem.

Parameters:
  • dataset (Data) – The (potentially weighted and supervised) data to generate the coreset from

  • solver_state (Optional[HerdingState]) – Solution state information, primarily used to cache expensive intermediate solution step information

Return type:

tuple[Coresubset[Data], HerdingState]

Returns:

a tuple of the solved coreset and intermediate solver state information

refine(coresubset, solver_state=None)[source]#

Refine a coresubset with ‘Kernel Herding’.

We first compute the kernel’s Gramian row-mean if it is not given in the ‘solver_state’, and then iteratively swap points with the initial coreset, balancing selecting points in high density regions with selecting points far from those already in the coreset.

Warning

If the input coresubset is smaller than the requested coreset_size, it will be padded with zero-valued, zero-weighted indices. After the input coresubset has been refined, new indices will be chosen to fill the padding. If the input coresubset is larger than the requested coreset_size, the extra indices will not be optimised and will be clipped from the return coresubset.

Parameters:
  • coresubset (Coresubset[Data]) – The coresubset to refine

  • solver_state (Optional[HerdingState]) – Solution state information, primarily used to cache expensive intermediate solution step values.

Return type:

tuple[Coresubset[Data], HerdingState]

Returns:

A refined coresubset and relevant intermediate solver state information

class coreax.solvers.SteinThinning(coreset_size, kernel, score_matching=None, unique=True, regularise=True, block_size=None, unroll=1)[source]#

Bases: RefinementSolver[_Data, None], ExplicitSizeSolver, PaddingInvariantSolver

Stein Thinning - an explicitly sized coresubset refinement solver.

Solves the coresubset problem by taking a deterministic, iterative, and greedy approach to minimizing the Kernelised Stein Discrepancy (KSD) between the empirical distribution of the coresubset (the solution) and the distribution of the problem dataset, as characterised by the score function of the Stein Kernel.

Given one has selected \(T\) data points for their compressed representation of the original dataset, (regularised) Stein thinning selects the next point using the equations in Section 3.1 of [benard2023kernel]:

\[x_{T+1} = \arg\min_{x} \left( k_P(x, x) / 2 + \Delta^+ \log p(x) - \lambda T \log p(x) + \frac{1}{T+1}\sum_{t=1}^T k_P(x, x_t) \right)\]

where \(k\) is the Stein kernel induced by the supplied base kernel, \(\Delta^+\) is the non-negative Laplace operator, \(\lambda\) is a regularisation parameter, and the search is over the entire dataset.

Parameters:
reduce(dataset, solver_state=None)[source]#

Reduce ‘dataset’ to a coreset - solve the coreset problem.

Parameters:
  • dataset (Data) – The (potentially weighted and supervised) data to generate the coreset from

  • solver_state (None) – Solution state information, primarily used to cache expensive intermediate solution step information

Return type:

tuple[Coresubset[Data], None]

Returns:

a tuple of the solved coreset and intermediate solver state information

refine(coresubset, solver_state=None)[source]#

Refine a coresubset with ‘Stein thinning’.

We first compute a score function, and then the Stein kernel. This is used to greedily choose points in the coreset to minimise kernel Stein discrepancy (KSD).

Parameters:
  • coresubset (Coresubset[Data]) – The coresubset to refine

  • solver_state (None) – Solution state information, primarily used to cache expensive intermediate solution step values.

Return type:

tuple[Coresubset[Data], None]

Returns:

a refined coresubset and relevant intermediate solver state information

class coreax.solvers.RPCholeskyState(gramian_diagonal)[source]#

Bases: Module

Intermediate RPCholesky solver state information.

Parameters:

gramian_diagonal (Array) – Cached Gramian diagonal.

class coreax.solvers.RPCholesky(coreset_size, random_key, kernel, unique=True)[source]#

Bases: CoresubsetSolver[Data, RPCholeskyState], ExplicitSizeSolver

Randomly Pivoted Cholesky - an explicitly sized coresubset refinement solver.

Solves the coresubset problem by taking a stochastic, iterative, and greedy approach to approximating the Gramian of a given kernel, evaluated on the original dataset.

Parameters:
  • coreset_size (Any) – The desired size of the solved coreset

  • random_key (ArrayLike) – Key for random number generation

  • kernel (ScalarValuedKernel) – ScalarValuedKernel instance implementing a kernel function \(k: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}\)

  • unique (bool) – If each index in the resulting coresubset should be unique

reduce(dataset, solver_state=None)[source]#

Reduce ‘dataset’ to a Coresubset with ‘RPCholesky’.

This is done by first computing the kernel Gram matrix of the original data, and isolating the diagonal of this. A ‘pivot point’ is then sampled, where sampling probabilities correspond to the size of the elements on this diagonal. The data-point corresponding to this pivot point is added to the coreset, and the diagonal of the Gram matrix is updated to add a repulsion term of sorts - encouraging the coreset to select a range of distinct points in the original data. The pivot sampling and diagonal updating steps are repeated until \(M\) points have been selected.

Parameters:
  • dataset (Data) – The dataset to reduce to a coresubset

  • solver_state (Optional[RPCholeskyState]) – Solution state information, primarily used to cache expensive intermediate solution step values.

Return type:

tuple[Coresubset[Data], RPCholeskyState]

Returns:

a refined coresubset and relevant intermediate solver state information

class coreax.solvers.GreedyKernelPointsState(feature_gramian)[source]#

Bases: Module

Intermediate GreedyKernelPoints solver state information.

Parameters:

feature_gramian (Array) – Cached feature kernel gramian matrix, should be padded with an additional row and column of zeros.

class coreax.solvers.GreedyKernelPoints(coreset_size, random_key, feature_kernel, regularisation_parameter=1e-06, unique=True, batch_size=None, least_squares_solver=None)[source]#

Bases: RefinementSolver[SupervisedData, GreedyKernelPointsState], ExplicitSizeSolver

Apply GreedyKernelPoints to a supervised dataset.

GreedyKernelPoints is a deterministic, iterative and greedy approach to build a coreset adapted from the inducing point method developed in [nguyen2021meta].

Given one has an original dataset \(\mathcal{D}^{(1)} = \{(x_i, y_i)\}_{i=1}^n\) of \(n\) pairs with \(x\in\mathbb{R}^d\) and \(y\in\mathbb{R}^p\), and one has selected \(m\) data pairs \(\mathcal{D}^{(2)} = \{(\tilde{x}_i, \tilde{y}_i)\}_{i=1}^m\) already for their compressed representation of the original dataset, GreedyKernelPoints selects the next point to minimise the loss

\[L(\mathcal{D}^{(1)}, \mathcal{D}^{(2)}) = ||y^{(1)} - K^{(12)}(K^{(22)} + \lambda I_m)^{-1}y^{(2)} ||^2_{\mathbb{R}^n}\]

where \(y^{(1)}\in\mathbb{R}^n\) is the vector of responses from \(\mathcal{D}^{(1)}\), \(y^{(2)}\in\mathbb{R}^n\) is the vector of responses from \(\mathcal{D}^{(2)}\), \(K^{(12)} \in \mathbb{R}^{n\times m}\) is the cross-matrix of kernel evaluations between \(\mathcal{D}^{(1)}\) and \(\mathcal{D}^{(2)}\), \(K^{(22)} \in \mathbb{R}^{m\times m}\) is the kernel matrix on \(\mathcal{D}^{(2)}\), \(\lambda I_m \in \mathbb{R}^{m \times m}\) is the identity matrix and \(\lambda \in \mathbb{R}_{>0}\) is a regularisation parameter.

This class works with all children of ScalarValuedKernel. Note that GreedyKernelPoints does not support non-uniform weights and will only return coresubsets with uniform weights.

Parameters:
reduce(dataset, solver_state=None)[source]#

Reduce ‘dataset’ to a coreset - solve the coreset problem.

Parameters:
  • dataset (SupervisedData) – The (potentially weighted and supervised) data to generate the coreset from

  • solver_state (Optional[GreedyKernelPointsState]) – Solution state information, primarily used to cache expensive intermediate solution step information

Return type:

tuple[Coresubset[SupervisedData], GreedyKernelPointsState]

Returns:

a tuple of the solved coreset and intermediate solver state information

refine(coresubset, solver_state=None)[source]#

Refine a coresubset with ‘GreedyKernelPointsState’.

We first compute the various factors if they are not given in the solver_state, and then iteratively swap points with the initial coreset, selecting points which minimise the loss.

Warning

If the input coresubset is smaller than the requested coreset_size, it will be padded with indices that are invariant with respect to the loss. After the input coresubset has been refined, new indices will be chosen to fill the padding. If the input coresubset is larger than the requested coreset_size, the extra indices will not be optimised and will be clipped from the return coresubset.

Parameters:
Return type:

tuple[Coresubset[SupervisedData], GreedyKernelPointsState]

Returns:

A refined coresubset and relevant intermediate solver state information

class coreax.solvers.RecombinationSolver(test_functions=None, mode='implicit-explicit')[source]#

Bases: CoresubsetSolver[_Data, _State], Generic[_Data, _State]

Solver which returns a Coresubset via recombination.

Given \(m-1\) explicitly provided test-functions \(\Phi^\prime\), a recombination solver finds a coresubset with \(m^\prime \le m\) points, whose push-forward \(\hat{\mu}_{m^\prime}\) has the same “centre-of-mass” (CoM) as the dataset push-forward \(\mu_n := \Phi_* \nu_n\).

Parameters:
  • test_functions (Optional[Callable[[Array], Real[Array, 'm-1']]]) – A callable that applies a set of specified test-functions \(\Phi^\prime = \{\phi_1,\dots,\phi_{m-1}\}\) where each function is a map \(\phi_i \colon \Omega\to\mathbb{R}\); a value of None implies the identity map \(\Phi^\prime \colon x \mapsto x\), and necessarily assumes that \(x \in \Omega \subseteq \mathbb{R}^{m-1}\)

  • mode (Literal['implicit-explicit', 'implicit', 'explicit']) – ‘implicit-explicit’ explicitly removes \(n - m\) points, yielding a coreset of size \(m\), with \(m - m^\prime\) zero-weighted (implicitly removed) points; ‘implicit’ explicitly removes no points, yielding a coreset of size \(n\) with \(n - m^\prime\) zero-weighted (implicitly removed) points; ‘explicit’ explicitly removes \(n - m^\prime\) points, yielding a coreset of size \(m^\prime\), but unlike the other methods is not JIT compatible as the coreset size \(m^\prime\) is unknown at compile time.

class coreax.solvers.CaratheodoryRecombination(test_functions=None, mode='implicit-explicit', rcond=None)[source]#

Bases: RecombinationSolver[Data, None]

Recombination via Caratheodory measure reduction (Gaussian-Elimination).

Proposed in [tchernychova2016recombination] (see Chapter 1.3.3.3) as an alternative to the Simplex algorithm for solving the recombination problem.

Unlike the Simplex method, with time complexity \(\mathcal{O}(m^3 n + m n^2)\), Caratheodory recombination has time complexity of only \(\mathcal{O}(m n^2)\).

Note

Given \(n = cm\), for a rational constant \(c\), the above complexities can be alternatively represented as \(\mathcal{O}(m^4)\) for the Simplex method and \(\mathcal{O}(m^3)\) for Caratheodory recombination.

Parameters:
  • test_functions (Optional[Callable[[Array], Real[Array, 'm-1']]]) – A callable that applies a set of specified test-functions \(\Phi^\prime = \{\phi_1,\dots,\phi_{m-1}\}\) where each function is a map \(\phi_i \colon \Omega \to \mathbb{R}\); a value of non implies the identity map \(\Phi^\prime \colon x \mapsto x\), and necessarily assumes that \(x \in \Omega \subseteq \mathbb{R}^{m-1}\)

  • mode (Literal['implicit-explicit', 'implicit', 'explicit']) – ‘implicit-explicit’ explicitly removes \(n - m\) points, yielding a coreset of size \(m\), with \(m - m^\prime\) zero-weighted (implicitly removed) points; ‘implicit’ explicitly removes no points, yielding a coreset of size \(n\) with \(n - m^\prime\) zero-weighted (implicitly removed) points; ‘explicit’ explicitly removes \(n - m^\prime\) points, yielding a coreset of size \(m^\prime\), but unlike the other methods is not JIT compatible as the coreset size \(m^\prime\) is unknown at compile time.

  • rcond (Optional[float]) – A relative condition number; any singular value \(s\) below the threshold \(\text{rcond} * \text{max}(s)\) is treated as equal to zero; if rcond is None, it defaults to floating point eps * max(n, d)

reduce(dataset, solver_state=None)[source]#

Reduce ‘dataset’ to a coreset - solve the coreset problem.

Parameters:
  • dataset (Data) – The (potentially weighted and supervised) data to generate the coreset from

  • solver_state (None) – Solution state information, primarily used to cache expensive intermediate solution step information

Return type:

tuple[Coresubset, None]

Returns:

a tuple of the solved coreset and intermediate solver state information

class coreax.solvers.TreeRecombination(test_functions=None, mode='implicit-explicit', rcond=None, tree_reduction_factor=2)[source]#

Bases: RecombinationSolver[Data, None]

Tree recombination based coresubset solver.

Based on Algorithm 7 Chapter 3.3 of [tchernychova2016recombination], which is an order of magnitude more efficient than Algorithm 5 in Chapter 3.2, originally introduced in [litterer2012recombination].

The time complexity is of order \(\mathcal{O}(\log_2(\frac{n}{c_r m}) m^3)\), where :math`c_r` is the tree_reduction_factor. The time complexity can be equivalently expressed as \(\mathcal{O}(m^3)\), using the same arguments as used in CaratheodoryRecombination.

Note

As the ratio of \(n / m\) grows, the constant factor for the time complexity of TreeRecombination increases at a logarithmic rate, rather than at a quadratic rate for plain CaratheodoryRecombination. Hence, in general, we would expect TreeRecombination to be the more efficient choice for all but the smallest values of \(n / m\).

Parameters:
  • test_functions (Optional[Callable[[Array], Real[Array, 'm-1']]]) – the map \(\Phi^\prime = \{ \phi_1, \dots, \phi_{M-1} \}\) where each \(\phi_i \colon \Omega \to \mathbb{R}\) represents a linearly independent test-function; a value of None implies the identity function (necessarily assuming \(\Omega \subseteq \mathbb{R}^{M-1}\))

  • mode (Literal['implicit-explicit', 'implicit', 'explicit']) – ‘implicit-explicit’ explicitly removes \(n - m\) points, yielding a coreset of size \(m\), with \(m - m^\prime\) zero-weighted (implicitly removed) points; ‘implicit’ explicitly removes no points, yielding a coreset of size \(n\) with \(n - m^\prime\) zero-weighted (implicitly removed) points; ‘explicit’ explicitly removes \(n - m^\prime\) points, yielding a coreset of size \(m^\prime\), but unlike the other methods is not JIT compatible as the coreset size \(m^\prime\) is unknown at compile time.

  • rcond (Optional[float]) – a relative condition number; any singular value \(s\) below the threshold \(\text{rcond} * \text{max}(s)\) is treated as equal to zero; if rcond is None, it defaults to floating point eps * max(n, d)

  • tree_reduction_factor (int) – The factor by which each tree reduction step reduces the number of non-zero points; the remaining number of non-zero nodes, after performing recombination, is equal to n_nodes / tree_reduction_factor;

reduce(dataset, solver_state=None)[source]#

Reduce ‘dataset’ to a coreset - solve the coreset problem.

Parameters:
  • dataset (Data) – The (potentially weighted and supervised) data to generate the coreset from

  • solver_state (None) – Solution state information, primarily used to cache expensive intermediate solution step information

Return type:

tuple[Coresubset, None]

Returns:

a tuple of the solved coreset and intermediate solver state information