Solvers¶

Solvers for generating coresets.

class coreax.solvers.Solver[source]¶

Bases: Module, Generic[_Coreset, _Data, _State]

Base class for coreset solvers.

Solver is generic on the type of data required by the reduce method, and the type of coreset returned, providing a convenient means to distinguish between solvers that take (weighted) data/supervised data, and those which produce coresets/coresubsets.

abstract reduce(dataset, solver_state=None)[source]¶

Reduce ‘dataset’ to a coreset - solve the coreset problem.

Parameters:

dataset (Any) – The (potentially weighted and supervised) data to generate the coreset from
solver_state (Optional[Any]) – Solution state information, primarily used to cache expensive intermediate solution step information

Return type:

tuple[AbstractCoreset, Any]

Returns:

a tuple of the solved coreset and intermediate solver state information

class coreax.solvers.CoresubsetSolver[source]¶

Bases: Solver[Coresubset[_Data], _Data, _State], Generic[_Data, _State]

Solver which returns a coreax.coreset.Coresubset.

A convenience class for the most common solver type in this package.

class coreax.solvers.RefinementSolver[source]¶

Bases: CoresubsetSolver[_Data, _State], Generic[_Data, _State]

A CoresubsetSolver which supports refinement.

Some solvers assume implicitly/explicitly an initial coresubset on which the solution is dependent. Such solvers can be interpreted as refining the initial coresubset to produce another (solution) coresubset.

By providing a ‘refine’ method, one can compose the results of different solvers together, and/or repeatedly apply/chain the result of a refinement based solve.

# An example of repeated application/chaining of solutions/solvers.
result, state = solver.reduce(dataset)
refined_result, state = refine_solver.refine(result, state)
re_refined_result, state = refine_solver.refine(refined_result, state)

abstract refine(coresubset, solver_state=None)[source]¶

Refine a coresubset - swap/update coresubset indices.

Parameters:

coresubset (Coresubset[Any]) – The coresubset to refine
solver_state (Optional[Any]) – Solution state information, primarily used to cache expensive intermediate solution step values.

Return type:

tuple[Coresubset[Any], Any]

Returns:

a refined coresubset and relevant intermediate solver state information

class coreax.solvers.ExplicitSizeSolver(coreset_size)[source]¶

Bases: Solver[_Coreset, _Data, _State], Generic[_Coreset, _Data, _State]

A Solver which produces a coreset of an explicitly specified size.

Parameters:: coreset_size (Any) – The desired size of the solved coreset

class coreax.solvers.PaddingInvariantSolver[source]¶

Bases: Solver[_Coreset, _Data, _State], Generic[_Coreset, _Data, _State]

A Solver whose results are invariant to zero weighted data.

In some cases, such as in coreax.solvers.MapReduce, there is a need to pad data to ensure shape stability. In these cases, we may assign zero weight to the padded data points, which allows certain ‘padding invariant’ solvers to return the same values on a call to reduce() as would have been returned if no padding were present.

Inheriting from this class is only a promise by the solver to obey the invariance property. Conformity with the property is not checked at runtime.

class coreax.solvers.CompositeSolver(base_solver)[source]¶

Bases: Solver[_Coreset, _Data, _State], Generic[_Coreset, _Data, _State]

Base class for solvers that compose with/wrap other solvers.

Parameters:: base_solver (Solver[_Coreset, _Data, _State])

class coreax.solvers.MapReduce(base_solver, leaf_size, tree_type=<class 'sklearn.neighbors._kd_tree.KDTree'>)[source]¶

Bases: CompositeSolver[_Coreset, _Data, _State], Generic[_Coreset, _Data, _State]

Calculate coreset of a given number of points using scalable reduction on blocks.

Uses a KDTree or BallTree to partition the original data into patches. Upon each of these a coreset of size base_solver.coreset_size is calculated. These coresets are concatenated to produce a larger coreset covering the whole of the original data, which has size greater than coreset_size. This coreset is now treated as the original data and reduced recursively until its size is equal to base_solver.coreset_size.

There is some intricate set-up:

base_solver.coreset_size must be less than leaf_size.
Zero weighted and valued padding will be used to ensure each partition has the same size if len(dataset) is not an integer multiple of leaf_size

Let \(n_k\) be the number of points after each recursion with \(n_0\) equal to the size of the original data. Then, each recursion reduces the size of the coreset such that

\[n_k <= \frac{n_{k - 1}}{\texttt{leaf_size}} \texttt{coreset_size},\]

\[n_k <= \left( \frac{\texttt{coreset_size}}{\texttt{leaf_size}} \right)^k n_0.\]

Thus, the number of iterations required is roughly (find \(k\) when \(n_k =\) base_solver.coreset_size)

\[\frac{ \log{\texttt{coreset_size}} - \log{\left(\text{original data size}\right)} }{ \log{\texttt{coreset_size}} - \log{\texttt{leaf_size}} } .\]

Parameters:

base_solver (Solver[AbstractCoreset, Any, Any]) – Solver to compose with; full support is currently provided for coreax.solvers.KernelHerding and coreax.solvers.SteinThinning; the solver’s result must be ignorant of/invariant to zero weighted padding.
leaf_size (Any) – Number of points to include in each partition; corresponds to 2*leaf_size in sklearn.neighbors.BinaryTree; must be greater than base_solver.coreset_size
tree_type (type[Union[KDTree, BallTree]]) – The type of binary tree based partitioning to use when splitting the dataset into smaller blocks.

tree_type¶: alias of KDTree

reduce(dataset, solver_state=None)[source]¶

Reduce ‘dataset’ to a coreset - solve the coreset problem.

Parameters:

dataset (Any) – The (potentially weighted and supervised) data to generate the coreset from
solver_state (Optional[Any]) – Solution state information, primarily used to cache expensive intermediate solution step information

Return type:

tuple[AbstractCoreset, Any]

Returns:

a tuple of the solved coreset and intermediate solver state information

class coreax.solvers.RandomSample(coreset_size, random_key, weighted=False, unique=True)[source]¶

Bases: CoresubsetSolver[_Data, None], ExplicitSizeSolver

Reduce a dataset by randomly sampling a fixed number of points.

Parameters:

coreset_size (Any) – The desired size of the solved coreset
random_key (Array) – Key for random number generation
weighted (bool) – If to use dataset weights as selection probabilities
unique (bool) – If to sample without replacement

reduce(dataset, solver_state=None)[source]¶

Reduce ‘dataset’ to a coreset - solve the coreset problem.

Parameters:

dataset (Any) – The (potentially weighted and supervised) data to generate the coreset from
solver_state (None) – Solution state information, primarily used to cache expensive intermediate solution step information

Return type:

tuple[Coresubset, None]

Returns:

a tuple of the solved coreset and intermediate solver state information

class coreax.solvers.HerdingState(gramian_row_mean)[source]¶

Bases: Module

Intermediate KernelHerding solver state information.

Parameters:: gramian_row_mean (Array) – Cached Gramian row-mean.

class coreax.solvers.KernelHerding(coreset_size, kernel, unique=True, block_size=None, unroll=1, probabilistic=False, temperature=1.0, random_key=<factory>)[source]¶

Bases: RefinementSolver[_Data, HerdingState], ExplicitSizeSolver, PaddingInvariantSolver

Kernel Herding - an explicitly sized coresubset refinement solver.

Solves the coresubset problem by taking a deterministic, iterative, and greedy approach to minimizing the (weighted) Maximum Mean Discrepancy (MMD) between the coresubset (the solution) and the problem dataset.

Given one has selected \(T\) data points for their compressed representation of the original dataset, kernel herding selects the next point using Equation 8 of [chen2012herding]:

\[x_{T+1} = \arg\max_{x} \left( \mathbb{E}[k(x, x')] - \frac{1}{T+1}\sum_{t=1}^T k(x, x_t) \right)\]

where \(k\) is the kernel used, the expectation \(\mathbb{E}\) is taken over the entire dataset, and the search is over the entire dataset. This can informally be seen as a balance between using points at which the underlying density is high (the first term) and exploration of distinct regions of the space (the second term).

Optionally, the Kernel Herding procedure can be modified using the probabilistic and temperature parameters in the reduce and refine methods. If probabilistic is True, a single point \(x\) at each iteration is selected with probability proportional to \(\text{softmax}(\frac{\text{KHMetric(x)}}{T})\), where \(\text{ KHMetric}\) is given above and \(T\) is the temperature parameter. As \(T \rightarrow \infty\), the probabilities become uniform, resulting in a random sampling. As \(T \rightarrow 0\), the probabilities become concentrated at the point with the highest metric, recovering the original Kernel Herding procedure. This feature is experimental and does not come from the original paper ([chen2012herding]).

Parameters:

coreset_size (Any) – The desired size of the solved coreset
kernel (ScalarValuedKernel) – ScalarValuedKernel instance implementing a kernel function \(k: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}\)
unique (bool) – Boolean that ensures the resulting coresubset will only contain unique elements
block_size (Union[int, tuple[Optional[int], Optional[int]], None]) – Block size passed to compute_mean()
unroll (Union[int, bool, tuple[Union[int, bool], Union[int, bool]]]) – Unroll parameter passed to compute_mean()
probabilistic (bool) – If True, the elements are chosen probabilistically at each iteration. Otherwise, standard Kernel Herding is run.
temperature (Union[float, Shaped[Array, '']]) – Temperature parameter, which controls how uniform the probabilities are for probabilistic selection.
random_key (Array) – Key for random number generation, only used if probabilistic

reduce(dataset, solver_state=None)[source]¶

Reduce ‘dataset’ to a coreset - solve the coreset problem.

Parameters:

dataset (Any) – The (potentially weighted and supervised) data to generate the coreset from
solver_state (Optional[HerdingState]) – Solution state information, primarily used to cache expensive intermediate solution step information

Return type:

tuple[Coresubset[Any], HerdingState]

Returns:

a tuple of the solved coreset and intermediate solver state information

refine(coresubset, solver_state=None)[source]¶

Refine a coresubset with ‘Kernel Herding’.

We first compute the kernel’s Gramian row-mean if it is not given in the ‘solver_state’, and then iteratively swap points with the initial coreset, balancing selecting points in high density regions with selecting points far from those already in the coreset.

Warning

If the input coresubset is smaller than the requested coreset_size, it will be padded with zero-valued, zero-weighted indices. After the input coresubset has been refined, new indices will be chosen to fill the padding. If the input coresubset is larger than the requested coreset_size, the extra indices will not be optimised and will be clipped from the return coresubset.

Parameters:

coresubset (Coresubset[Any]) – The coresubset to refine
solver_state (Optional[HerdingState]) – Solution state information, primarily used to cache expensive intermediate solution step values.

Return type:

tuple[Coresubset[Any], HerdingState]

Returns:

A refined coresubset and relevant intermediate solver state information

class coreax.solvers.KernelThinning(coreset_size, kernel, random_key, delta, sqrt_kernel)[source]¶

Bases: CoresubsetSolver[_Data, None], ExplicitSizeSolver

Kernel Thinning - a hierarchical coreset construction solver.

Kernel Thinning is a hierarchical, and probabilistic algorithm for coreset construction. It builds a coreset by splitting the dataset into several candidate coresets by repeatedly halving the dataset and applying probabilistic swapping. The best of these candidates (the one with the lowest MMD) is chosen which is further refined to minimise the Maximum Mean Discrepancy (MMD) between the original dataset and the coreset. This implementation is a modification of the Kernel Thinning algorithm in [dwivedi2024kernelthinning] to make it an ExplicitSizeSolver.

Parameters:

kernel (ScalarValuedKernel) – A ~coreax.kernels.ScalarValuedKernel instance defining the primary kernel function used for choosing the best coreset and refining it.
random_key (Array) – Key for random number generation, enabling reproducibility of probabilistic components in the algorithm.
delta (float) – A float between 0 and 1 used to compute the swapping probability during the splitting process. A recommended value is \(\frac{1}{n \log (\log n)}\), where \(n\) is the length of the original dataset.
sqrt_kernel (ScalarValuedKernel) – A ~coreax.kernels.ScalarValuedKernel instance representing the square root kernel used for splitting the original dataset.
coreset_size (int)

reduce(dataset, solver_state=None)[source]¶

Reduce ‘dataset’ to a Coresubset with ‘KernelThinning’.

This is done by first computing the number of halving steps required, referred to as m. The original data is clipped so that it is divisible by a power of two. The kernel halving algorithm is then recursively applied to halve the data.

Subsequently, a baseline_coreset is added to the ensemble of coresets. The best coreset is selected to minimise the Maximum Mean Discrepancy (MMD) and finally, it is refined further for optimal results. This final refinement step can reintroduce the clipped data dataset if they are found to reduce the MMD.

Parameters:

dataset (Any) – The original dataset to be reduced.
solver_state (None) – The state of the solver.

Return type:

tuple[Coresubset[Any], None]

Returns:

A tuple containing the final coreset and the solver state (None).

kt_half_recursive(current_coreset, m, original_dataset)[source]¶

Recursively halve the original dataset into coresets.

Parameters:

current_coreset (Union[Any, Coresubset[Any]]) – The current coreset or dataset being partitioned.
m (int) – The remaining depth of recursion.
original_dataset (Any) – The original dataset.

Return type:

list[Coresubset[Any]]

Returns:

Fully partitioned list of coresets.

kt_half(dataset)[source]¶

Partition the given dataset into two subsets.

First, initialise two coresubsets, each of which will contain half the points of the original dataset. Divide the points of the original dataset into pairs and probabilistically decide which point of the pair should go to which of the coresets. This function uses variables such as a, b, sigma, and delta, they refer to the corresponding parameters in the paper [dwivedi2024kernelthinning].

Parameters:: dataset (Any) – The input dataset to be halved.
Return type:: tuple[Coresubset[Any], Coresubset[Any]]
Returns:: A list containing the two partitioned coresets.

get_baseline_coreset(dataset, baseline_coreset_size)[source]¶

Generate a baseline coreset by randomly sampling from the dataset.

Parameters:

dataset (Data) – The input dataset from which the baseline coreset is sampled.
baseline_coreset_size (int) – The number of dataset in the baseline coreset.

Return type:

Coresubset[Any]

Returns:

A randomly sampled baseline coreset with the specified size.

kt_choose(candidate_coresets, points)[source]¶

Select the best coreset from a list of candidate coresets based on MMD.

Parameters:

candidate_coresets (list[Coresubset[Any]]) – A list of candidate coresets to be evaluated.
points (Any) – The original dataset against which the coresets are compared.

Return type:

Shaped[Array, 'coreset_size']

Returns:

The coreset with the smallest MMD relative to the input dataset.

kt_refine(candidate_coreset)[source]¶

Refine the selected candidate coreset.

Use refine() which achieves the result of looping through each element in coreset replacing that element with a point in the original dataset to minimise MMD in each step.

Parameters:: candidate_coreset (Coresubset[Any]) – The candidate coreset to be refined.
Return type:: tuple[Coresubset[Any], None]
Returns:: The refined coreset.

class coreax.solvers.SteinThinning(coreset_size, kernel, score_matching=None, unique=True, regularise=True, regulariser_lambda=None, block_size=None, unroll=1)[source]¶

Bases: RefinementSolver[_Data, None], ExplicitSizeSolver, PaddingInvariantSolver

Stein Thinning - an explicitly sized coresubset refinement solver.

Solves the coresubset problem by taking a deterministic, iterative, and greedy approach to minimizing the Kernelised Stein Discrepancy (KSD) between the empirical distribution of the coresubset (the solution) and the distribution of the problem dataset, as characterised by the score function of the Stein Kernel.

Given one has selected \(t-1\) data points for their compressed representation of the original dataset, (regularised) Stein thinning selects the next point using the equations in Section 3.1 of [benard2023kernel]:

\[x_{t} = \arg\min_{x} \left( k_p(x, x) + \Delta^+ \log p(x) - \lambda t \log p(x) + 2 \sum_{j=1}^{t-1} k_p(x, x_j) \right)\]

where \(k\) is the Stein kernel induced by the supplied base kernel, \(\Delta^+\) is the non-negative Laplace operator, \(\lambda\) is a regularisation parameter, and the search is over the entire dataset.

Parameters:

coreset_size (Any) – The desired size of the solved coreset
kernel (ScalarValuedKernel) – ScalarValuedKernel instance implementing a kernel function \(k: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}\); if ‘kernel’ is a SteinKernel and score_matching is notNone, a new instance of the kernel will be generated where the score function is given by score_matching.match(...)
score_matching (Optional[ScoreMatching]) – Specifies/overwrite the score function of the implied/passed SteinKernel; if None, default to KernelDensityMatching unless ‘kernel’ is a SteinKernel, in which case the kernel’s existing score function is used.
unique (bool) – If each index in the resulting coresubset should be unique
regularise (bool) – Boolean that enforces regularisation, see Section 3.1 of [benard2023kernel].
regulariser_lambda (Optional[float]) – The entropic regularisation parameter, \(\lambda\). If None, defaults to \(1/\text{coreset_size}\) following [benard2023kernel].
block_size (Union[int, tuple[Optional[int], Optional[int]], None]) – Block size passed to compute_mean()
unroll (Union[int, bool, tuple[Union[int, bool], Union[int, bool]]]) – Unroll parameter passed to compute_mean()

reduce(dataset, solver_state=None)[source]¶

Reduce ‘dataset’ to a coreset - solve the coreset problem.

Parameters:

dataset (Any) – The (potentially weighted and supervised) data to generate the coreset from
solver_state (None) – Solution state information, primarily used to cache expensive intermediate solution step information

Return type:

tuple[Coresubset[Any], None]

Returns:

a tuple of the solved coreset and intermediate solver state information

refine(coresubset, solver_state=None)[source]¶

Refine a coresubset with ‘Stein thinning’.

We first compute a score function, and then the Stein kernel. This is used to greedily choose points in the coreset to minimise kernel Stein discrepancy (KSD).

Note

Only the score function, \(\nabla \log p(x)\), is provided to the solver. Since the lambda regularisation term relies on the density, \(p(x)\), directly, it is estimated using a Gaussian kernel density estimator.

Parameters:

coresubset (Coresubset[Any]) – The coresubset to refine
solver_state (None) – Solution state information, primarily used to cache expensive intermediate solution step values.

Return type:

tuple[Coresubset[Any], None]

Returns:

a refined coresubset and relevant intermediate solver state information

class coreax.solvers.RPCholeskyState(gramian_diagonal)[source]¶

Bases: Module

Intermediate RPCholesky solver state information.

Parameters:: gramian_diagonal (Array) – Cached Gramian diagonal.

class coreax.solvers.RPCholesky(coreset_size, random_key, kernel, unique=True)[source]¶

Bases: CoresubsetSolver[_Data, RPCholeskyState], ExplicitSizeSolver

Randomly Pivoted Cholesky - an explicitly sized coresubset refinement solver.

Solves the coresubset problem by taking a stochastic, iterative, and greedy approach to approximating the Gramian of a given kernel, evaluated on the original dataset.

Parameters:

coreset_size (Any) – The desired size of the solved coreset
random_key (Array) – Key for random number generation
kernel (ScalarValuedKernel) – ScalarValuedKernel instance implementing a kernel function \(k: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}\)
unique (bool) – If each index in the resulting coresubset should be unique

reduce(dataset, solver_state=None)[source]¶

Reduce ‘dataset’ to a Coresubset with ‘RPCholesky’.

This is done by first computing the kernel Gram matrix of the original data, and isolating the diagonal of this. A ‘pivot point’ is then sampled, where sampling probabilities correspond to the size of the elements on this diagonal. The data-point corresponding to this pivot point is added to the coreset, and the diagonal of the Gram matrix is updated to add a repulsion term of sorts - encouraging the coreset to select a range of distinct points in the original data. The pivot sampling and diagonal updating steps are repeated until \(M\) points have been selected.

Note

The RPCholesky algorithm sometimes converges before reaching self.coreset_size. In this case, the other coreset points are chosen uniformly at random, avoiding repetition.

Parameters:

dataset (Any) – Dataset to reduce to a coresubset.
solver_state (Optional[RPCholeskyState]) – Solution state information, primarily used to cache expensive intermediate solution step values.

Return type:

tuple[Coresubset[Any], RPCholeskyState]

Returns:

Refined coresubset and relevant intermediate solver state information.

class coreax.solvers.GreedyKernelPointsState(feature_gramian)[source]¶

Bases: Module

Intermediate GreedyKernelPoints solver state information.

Parameters:: feature_gramian (Array) – Cached feature kernel gramian matrix, should be padded with an additional row and column of zeros.

class coreax.solvers.GreedyKernelPoints(coreset_size, random_key, feature_kernel, regularisation_parameter=1e-06, unique=True, batch_size=None, least_squares_solver=None)[source]¶

Bases: RefinementSolver[SupervisedData, GreedyKernelPointsState], ExplicitSizeSolver

Apply GreedyKernelPoints to a supervised dataset.

GreedyKernelPoints is a deterministic, iterative and greedy approach to build a coreset adapted from the inducing point method developed in [nguyen2021meta].

Given one has an original dataset \(\mathcal{D}^{(1)} = \{(x_i, y_i)\}_{i=1}^n\) of \(n\) pairs with \(x\in\mathbb{R}^d\) and \(y\in\mathbb{R}^p\), and one has selected \(m\) data pairs \(\mathcal{D}^{(2)} = \{(\tilde{x}_i, \tilde{y}_i)\}_{i=1}^m\) already for their compressed representation of the original dataset, GreedyKernelPoints selects the next point to minimise the loss

\[L(\mathcal{D}^{(1)}, \mathcal{D}^{(2)}) = ||y^{(1)} - K^{(12)}(K^{(22)} + \lambda I_m)^{-1}y^{(2)} ||^2_{\mathbb{R}^n}\]

where \(y^{(1)}\in\mathbb{R}^n\) is the vector of responses from \(\mathcal{D}^{(1)}\), \(y^{(2)}\in\mathbb{R}^n\) is the vector of responses from \(\mathcal{D}^{(2)}\), \(K^{(12)} \in \mathbb{R}^{n\times m}\) is the cross-matrix of kernel evaluations between \(\mathcal{D}^{(1)}\) and \(\mathcal{D}^{(2)}\), \(K^{(22)} \in \mathbb{R}^{m\times m}\) is the kernel matrix on \(\mathcal{D}^{(2)}\), \(\lambda I_m \in \mathbb{R}^{m \times m}\) is the identity matrix and \(\lambda \in \mathbb{R}_{>0}\) is a regularisation parameter.

This class works with all children of ScalarValuedKernel. Note that GreedyKernelPoints does not support non-uniform weights and will only return coresubsets with uniform weights.

Parameters:

random_key (Array) – Key for random number generation
feature_kernel (ScalarValuedKernel) – ScalarValuedKernel instance implementing a kernel function \(k: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}\) on the feature space
regularisation_parameter (float) – Regularisation parameter for stable inversion of the feature Gramian
unique (bool) – If False, the resulting coresubset may contain the same point multiple times. If True (default), the resulting coresubset will not contain any duplicate points
batch_size (Optional[int]) – An integer representing the size of the batches of data pairs sampled at each iteration for consideration for adding to the coreset. If None (default), the search is performed over the entire dataset
least_squares_solver (Optional[RegularisedLeastSquaresSolver]) – Instance of coreax.least_squares.RegularisedLeastSquaresSolver, default value of None uses coreax.least_squares.MinimalEuclideanNormSolver
coreset_size (int)

reduce(dataset, solver_state=None)[source]¶

Reduce ‘dataset’ to a coreset - solve the coreset problem.

Parameters:

dataset (SupervisedData) – The (potentially weighted and supervised) data to generate the coreset from
solver_state (Optional[GreedyKernelPointsState]) – Solution state information, primarily used to cache expensive intermediate solution step information

Return type:

tuple[Coresubset[SupervisedData], GreedyKernelPointsState]

Returns:

a tuple of the solved coreset and intermediate solver state information

refine(coresubset, solver_state=None)[source]¶

Refine a coresubset with ‘GreedyKernelPointsState’.

We first compute the various factors if they are not given in the solver_state, and then iteratively swap points with the initial coreset, selecting points which minimise the loss.

Warning

If the input coresubset is smaller than the requested coreset_size, it will be padded with indices that are invariant with respect to the loss. After the input coresubset has been refined, new indices will be chosen to fill the padding. If the input coresubset is larger than the requested coreset_size, the extra indices will not be optimised and will be clipped from the return coresubset.

Parameters:

coresubset (Coresubset[SupervisedData]) – The coresubset to refine
solver_state (Optional[GreedyKernelPointsState]) – Solution state information, primarily used to cache expensive intermediate solution step values.

Return type:

tuple[Coresubset[SupervisedData], GreedyKernelPointsState]

Returns:

A refined coresubset and relevant intermediate solver state information

class coreax.solvers.RecombinationSolver(test_functions=None, mode='implicit-explicit')[source]¶

Bases: CoresubsetSolver[_Data, _State], Generic[_Data, _State]

Solver which returns a Coresubset via recombination.

Given \(m-1\) explicitly provided test-functions \(\Phi^\prime\), a recombination solver finds a coresubset with \(m^\prime \le m\) points, whose push-forward \(\hat{\mu}_{m^\prime}\) has the same “centre-of-mass” (CoM) as the dataset push-forward \(\mu_n := \Phi_* \nu_n\).

Parameters:

test_functions (Optional[Callable[[Array], Real[Array, 'm-1']]]) – A callable that applies a set of specified test-functions \(\Phi^\prime = \{\phi_1,\dots,\phi_{m-1}\}\) where each function is a map \(\phi_i \colon \Omega\to\mathbb{R}\); a value of None implies the identity map \(\Phi^\prime \colon x \mapsto x\), and necessarily assumes that \(x \in \Omega \subseteq \mathbb{R}^{m-1}\)
mode (Literal['implicit-explicit', 'implicit', 'explicit']) – ‘implicit-explicit’ explicitly removes \(n - m\) points, yielding a coreset of size \(m\), with \(m - m^\prime\) zero-weighted (implicitly removed) points; ‘implicit’ explicitly removes no points, yielding a coreset of size \(n\) with \(n - m^\prime\) zero-weighted (implicitly removed) points; ‘explicit’ explicitly removes \(n - m^\prime\) points, yielding a coreset of size \(m^\prime\), but unlike the other methods is not JIT compatible as the coreset size \(m^\prime\) is unknown at compile time.

class coreax.solvers.CaratheodoryRecombination(test_functions=None, mode='implicit-explicit', rcond=None)[source]¶

Bases: RecombinationSolver[Data, None]

Recombination via Caratheodory measure reduction (Gaussian-Elimination).

Proposed in [tchernychova2016recombination] (see Chapter 1.3.3.3) as an alternative to the Simplex algorithm for solving the recombination problem.

Unlike the Simplex method, with time complexity \(\mathcal{O}(m^3 n + m n^2)\), Caratheodory recombination has time complexity of only \(\mathcal{O}(m n^2)\).

Note

Given \(n = cm\), for a rational constant \(c\), the above complexities can be alternatively represented as \(\mathcal{O}(m^4)\) for the Simplex method and \(\mathcal{O}(m^3)\) for Caratheodory recombination.

Parameters:

test_functions (Optional[Callable[[Array], Real[Array, 'm-1']]]) – A callable that applies a set of specified test-functions \(\Phi^\prime = \{\phi_1,\dots,\phi_{m-1}\}\) where each function is a map \(\phi_i \colon \Omega \to \mathbb{R}\); a value of non implies the identity map \(\Phi^\prime \colon x \mapsto x\), and necessarily assumes that \(x \in \Omega \subseteq \mathbb{R}^{m-1}\)
mode (Literal['implicit-explicit', 'implicit', 'explicit']) – ‘implicit-explicit’ explicitly removes \(n - m\) points, yielding a coreset of size \(m\), with \(m - m^\prime\) zero-weighted (implicitly removed) points; ‘implicit’ explicitly removes no points, yielding a coreset of size \(n\) with \(n - m^\prime\) zero-weighted (implicitly removed) points; ‘explicit’ explicitly removes \(n - m^\prime\) points, yielding a coreset of size \(m^\prime\), but unlike the other methods is not JIT compatible as the coreset size \(m^\prime\) is unknown at compile time.
rcond (Optional[float]) – A relative condition number; any singular value \(s\) below the threshold \(\text{rcond} * \text{max}(s)\) is treated as equal to zero; if rcond is None, it defaults to floating point eps * max(n, d)

reduce(dataset, solver_state=None)[source]¶

Reduce ‘dataset’ to a coreset - solve the coreset problem.

Parameters:

dataset (Data) – The (potentially weighted and supervised) data to generate the coreset from
solver_state (None) – Solution state information, primarily used to cache expensive intermediate solution step information

Return type:

tuple[Coresubset, None]

Returns:

a tuple of the solved coreset and intermediate solver state information

class coreax.solvers.TreeRecombination(test_functions=None, mode='implicit-explicit', rcond=None, tree_reduction_factor=2)[source]¶

Bases: RecombinationSolver[Data, None]

Tree recombination based coresubset solver.

Based on Algorithm 7 Chapter 3.3 of [tchernychova2016recombination], which is an order of magnitude more efficient than Algorithm 5 in Chapter 3.2, originally introduced in [litterer2012recombination].

The time complexity is of order \(\mathcal{O}(\log_2(\frac{n}{c_r m}) m^3)\), where :math`c_r` is the tree_reduction_factor. The time complexity can be equivalently expressed as \(\mathcal{O}(m^3)\), using the same arguments as used in CaratheodoryRecombination.

Note

As the ratio of \(n / m\) grows, the constant factor for the time complexity of TreeRecombination increases at a logarithmic rate, rather than at a quadratic rate for plain CaratheodoryRecombination. Hence, in general, we would expect TreeRecombination to be the more efficient choice for all but the smallest values of \(n / m\).

Parameters:

test_functions (Optional[Callable[[Array], Real[Array, 'm-1']]]) – the map \(\Phi^\prime = \{ \phi_1, \dots, \phi_{M-1} \}\) where each \(\phi_i \colon \Omega \to \mathbb{R}\) represents a linearly independent test-function; a value of None implies the identity function (necessarily assuming \(\Omega \subseteq \mathbb{R}^{M-1}\))
mode (Literal['implicit-explicit', 'implicit', 'explicit']) – ‘implicit-explicit’ explicitly removes \(n - m\) points, yielding a coreset of size \(m\), with \(m - m^\prime\) zero-weighted (implicitly removed) points; ‘implicit’ explicitly removes no points, yielding a coreset of size \(n\) with \(n - m^\prime\) zero-weighted (implicitly removed) points; ‘explicit’ explicitly removes \(n - m^\prime\) points, yielding a coreset of size \(m^\prime\), but unlike the other methods is not JIT compatible as the coreset size \(m^\prime\) is unknown at compile time.
rcond (Optional[float]) – a relative condition number; any singular value \(s\) below the threshold \(\text{rcond} * \text{max}(s)\) is treated as equal to zero; if rcond is None, it defaults to floating point eps * max(n, d)
tree_reduction_factor (int) – The factor by which each tree reduction step reduces the number of non-zero points; the remaining number of non-zero nodes, after performing recombination, is equal to n_nodes / tree_reduction_factor;

reduce(dataset, solver_state=None)[source]¶

Reduce ‘dataset’ to a coreset - solve the coreset problem.

Parameters:

dataset (Data) – The (potentially weighted and supervised) data to generate the coreset from
solver_state (None) – Solution state information, primarily used to cache expensive intermediate solution step information

Return type:

tuple[Coresubset, None]

Returns:

a tuple of the solved coreset and intermediate solver state information

class coreax.solvers.CompressPlusPlus(coreset_size, g, kernel, random_key, delta, sqrt_kernel)[source]¶

Bases: CoresubsetSolver[_Data, None], ExplicitSizeSolver

Compress++ - A hierarchical coreset construction solver.

CompressPlusPlus is an efficient method for building coresets without compromising performance significantly. It operates in two steps: recursively halving and thinning.

In the recursive halving step, the dataset is partitioned into four subsets. Each subset is further divided into four subsets, repeated a predefined number of times. After this, each subset is halved using KernelThinning. The halved subsets are concatenated bottom-up to form a coreset.

Finally, the resulting coreset is thinned again using KernelThinning to obtain a coreset of the desired size. This implementation is an adaptation of the Compress++ algorithm in [shetty2022compress] to make it an explicit sized solver.

Parameters:

g (int) – The oversampling factor.
kernel (ScalarValuedKernel) – A ScalarValuedKernel for kernel thinning.
random_key (Array) – A random number generator key for the kernel thinning solver, ensuring reproducibility of probabilistic components in the algorithm.
delta (float) – A float between 0 and 1, representing the swapping probability during the dataset splitting. A recommended value is \(1 / \log(\log(n))\), where \(n\) is the size of the original dataset.
sqrt_kernel (ScalarValuedKernel) – A ScalarValuedKernel instance defining the square root kernel used in the kernel thinning solver.
coreset_size (int)

reduce(dataset, solver_state=None)[source]¶

Reduce a dataset to a coreset of the desired size using a hierarchical approach.

This method reduces the dataset by recursively partitioning it and applying kernel thinning. The dataset is clipped to the nearest power of four before processing. The dataset is partitioned into four subsets. Each subset is further divided into four subsets, repeated until we have a coreset of predefined size. After this, the partitioned are recursively concatenated (4 at a time) and halved using reduce(). Finally, the resulting coreset is thinned again using reduce() to obtain a coreset of the desired size.

Parameters:

dataset (Any) – The original dataset to be reduced.
solver_state (None) – The state of the solver.

Return type:

tuple[Coresubset[Any], None]

Returns:

A tuple containing the final coreset and the solver state (None).