Solvers¶
Solvers for generating coresets.
- class coreax.solvers.Solver[source]¶
Bases:
Module,Generic[_Coreset,_Data,_State]Base class for coreset solvers.
Solver is generic on the type of data required by the reduce method, and the type of coreset returned, providing a convenient means to distinguish between solvers that take (weighted) data/supervised data, and those which produce coresets/coresubsets.
- class coreax.solvers.CoresubsetSolver[source]¶
Bases:
Solver[Coresubset[_Data],_Data,_State],Generic[_Data,_State]Solver which returns a
coreax.coreset.Coresubset.A convenience class for the most common solver type in this package.
- class coreax.solvers.RefinementSolver[source]¶
Bases:
CoresubsetSolver[_Data,_State],Generic[_Data,_State]A
CoresubsetSolverwhich supports refinement.Some solvers assume implicitly/explicitly an initial coresubset on which the solution is dependent. Such solvers can be interpreted as refining the initial coresubset to produce another (solution) coresubset.
By providing a ‘refine’ method, one can compose the results of different solvers together, and/or repeatedly apply/chain the result of a refinement based solve.
# An example of repeated application/chaining of solutions/solvers. result, state = solver.reduce(dataset) refined_result, state = refine_solver.refine(result, state) re_refined_result, state = refine_solver.refine(refined_result, state)
- abstract refine(coresubset, solver_state=None)[source]¶
Refine a coresubset - swap/update coresubset indices.
- Parameters:
coresubset (
Coresubset[Any]) – The coresubset to refinesolver_state (
Optional[Any]) – Solution state information, primarily used to cache expensive intermediate solution step values.
- Return type:
tuple[Coresubset[Any],Any]- Returns:
a refined coresubset and relevant intermediate solver state information
- class coreax.solvers.ExplicitSizeSolver(coreset_size)[source]¶
Bases:
Solver[_Coreset,_Data,_State],Generic[_Coreset,_Data,_State]A
Solverwhich produces a coreset of an explicitly specified size.- Parameters:
coreset_size (
Any) – The desired size of the solved coreset
- class coreax.solvers.PaddingInvariantSolver[source]¶
Bases:
Solver[_Coreset,_Data,_State],Generic[_Coreset,_Data,_State]A
Solverwhose results are invariant to zero weighted data.In some cases, such as in
coreax.solvers.MapReduce, there is a need to pad data to ensure shape stability. In these cases, we may assign zero weight to the padded data points, which allows certain ‘padding invariant’ solvers to return the same values on a call toreduce()as would have been returned if no padding were present.Inheriting from this class is only a promise by the solver to obey the invariance property. Conformity with the property is not checked at runtime.
- class coreax.solvers.CompositeSolver(base_solver)[source]¶
Bases:
Solver[_Coreset,_Data,_State],Generic[_Coreset,_Data,_State]Base class for solvers that compose with/wrap other solvers.
- Parameters:
base_solver (Solver[_Coreset, _Data, _State])
- class coreax.solvers.MapReduce(base_solver, leaf_size, tree_type=<class 'sklearn.neighbors._kd_tree.KDTree'>)[source]¶
Bases:
CompositeSolver[_Coreset,_Data,_State],Generic[_Coreset,_Data,_State]Calculate coreset of a given number of points using scalable reduction on blocks.
Uses a
KDTreeorBallTreeto partition the original data into patches. Upon each of these a coreset of sizebase_solver.coreset_sizeis calculated. These coresets are concatenated to produce a larger coreset covering the whole of the original data, which has size greater thancoreset_size. This coreset is now treated as the original data and reduced recursively until its size is equal tobase_solver.coreset_size.There is some intricate set-up:
base_solver.coreset_sizemust be less thanleaf_size.Zero weighted and valued padding will be used to ensure each partition has the same size if
len(dataset)is not an integer multiple ofleaf_size
Let \(n_k\) be the number of points after each recursion with \(n_0\) equal to the size of the original data. Then, each recursion reduces the size of the coreset such that
\[n_k <= \frac{n_{k - 1}}{\texttt{leaf_size}} \texttt{coreset_size},\]so
\[n_k <= \left( \frac{\texttt{coreset_size}}{\texttt{leaf_size}} \right)^k n_0.\]Thus, the number of iterations required is roughly (find \(k\) when \(n_k =\)
base_solver.coreset_size)\[\frac{ \log{\texttt{coreset_size}} - \log{\left(\text{original data size}\right)} }{ \log{\texttt{coreset_size}} - \log{\texttt{leaf_size}} } .\]- Parameters:
base_solver (
Solver[AbstractCoreset,Any,Any]) – Solver to compose with; full support is currently provided forcoreax.solvers.KernelHerdingandcoreax.solvers.SteinThinning; the solver’s result must be ignorant of/invariant to zero weighted padding.leaf_size (
Any) – Number of points to include in each partition; corresponds to2*leaf_sizeinsklearn.neighbors.BinaryTree; must be greater thanbase_solver.coreset_sizetree_type (
type[Union[KDTree,BallTree]]) – The type of binary tree based partitioning to use when splitting the dataset into smaller blocks.
- class coreax.solvers.RandomSample(coreset_size, random_key, weighted=False, unique=True)[source]¶
Bases:
CoresubsetSolver[_Data,None],ExplicitSizeSolverReduce a dataset by randomly sampling a fixed number of points.
- Parameters:
- class coreax.solvers.HerdingState(gramian_row_mean)[source]¶
Bases:
ModuleIntermediate
KernelHerdingsolver state information.- Parameters:
gramian_row_mean (
Array) – Cached Gramian row-mean.
- class coreax.solvers.KernelHerding(coreset_size, kernel, unique=True, block_size=None, unroll=1, probabilistic=False, temperature=1.0, random_key=<factory>)[source]¶
Bases:
RefinementSolver[_Data,HerdingState],ExplicitSizeSolver,PaddingInvariantSolverKernel Herding - an explicitly sized coresubset refinement solver.
Solves the coresubset problem by taking a deterministic, iterative, and greedy approach to minimizing the (weighted) Maximum Mean Discrepancy (MMD) between the coresubset (the solution) and the problem dataset.
Given one has selected \(T\) data points for their compressed representation of the original dataset, kernel herding selects the next point using Equation 8 of [chen2012herding]:
\[x_{T+1} = \arg\max_{x} \left( \mathbb{E}[k(x, x')] - \frac{1}{T+1}\sum_{t=1}^T k(x, x_t) \right)\]where \(k\) is the kernel used, the expectation \(\mathbb{E}\) is taken over the entire dataset, and the search is over the entire dataset. This can informally be seen as a balance between using points at which the underlying density is high (the first term) and exploration of distinct regions of the space (the second term).
Optionally, the Kernel Herding procedure can be modified using the
probabilisticandtemperatureparameters in thereduceandrefinemethods. IfprobabilisticisTrue, a single point \(x\) at each iteration is selected with probability proportional to \(\text{softmax}(\frac{\text{KHMetric(x)}}{T})\), where \(\text{ KHMetric}\) is given above and \(T\) is thetemperatureparameter. As \(T \rightarrow \infty\), the probabilities become uniform, resulting in a random sampling. As \(T \rightarrow 0\), the probabilities become concentrated at the point with the highest metric, recovering the original Kernel Herding procedure. This feature is experimental and does not come from the original paper ([chen2012herding]).- Parameters:
coreset_size (
Any) – The desired size of the solved coresetkernel (
ScalarValuedKernel) –ScalarValuedKernelinstance implementing a kernel function \(k: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}\)unique (
bool) – Boolean that ensures the resulting coresubset will only contain unique elementsblock_size (
Union[int,tuple[Optional[int],Optional[int]],None]) – Block size passed tocompute_mean()unroll (
Union[int,bool,tuple[Union[int,bool],Union[int,bool]]]) – Unroll parameter passed tocompute_mean()probabilistic (
bool) – If True, the elements are chosen probabilistically at each iteration. Otherwise, standard Kernel Herding is run.temperature (
Union[float,Shaped[Array, '']]) – Temperature parameter, which controls how uniform the probabilities are for probabilistic selection.random_key (
Array) – Key for random number generation, only used if probabilistic
- reduce(dataset, solver_state=None)[source]¶
Reduce ‘dataset’ to a coreset - solve the coreset problem.
- Parameters:
dataset (
Any) – The (potentially weighted and supervised) data to generate the coreset fromsolver_state (
Optional[HerdingState]) – Solution state information, primarily used to cache expensive intermediate solution step information
- Return type:
- Returns:
a tuple of the solved coreset and intermediate solver state information
- refine(coresubset, solver_state=None)[source]¶
Refine a coresubset with ‘Kernel Herding’.
We first compute the kernel’s Gramian row-mean if it is not given in the ‘solver_state’, and then iteratively swap points with the initial coreset, balancing selecting points in high density regions with selecting points far from those already in the coreset.
Warning
If the input
coresubsetis smaller than the requestedcoreset_size, it will be padded with zero-valued, zero-weighted indices. After the inputcoresubsethas been refined, new indices will be chosen to fill the padding. If the inputcoresubsetis larger than the requestedcoreset_size, the extra indices will not be optimised and will be clipped from the returncoresubset.- Parameters:
coresubset (
Coresubset[Any]) – The coresubset to refinesolver_state (
Optional[HerdingState]) – Solution state information, primarily used to cache expensive intermediate solution step values.
- Return type:
- Returns:
A refined coresubset and relevant intermediate solver state information
- class coreax.solvers.KernelThinning(coreset_size, kernel, random_key, delta, sqrt_kernel)[source]¶
Bases:
CoresubsetSolver[_Data,None],ExplicitSizeSolverKernel Thinning - a hierarchical coreset construction solver.
Kernel Thinning is a hierarchical, and probabilistic algorithm for coreset construction. It builds a coreset by splitting the dataset into several candidate coresets by repeatedly halving the dataset and applying probabilistic swapping. The best of these candidates (the one with the lowest MMD) is chosen which is further refined to minimise the Maximum Mean Discrepancy (MMD) between the original dataset and the coreset. This implementation is a modification of the Kernel Thinning algorithm in [dwivedi2024kernelthinning] to make it an ExplicitSizeSolver.
- Parameters:
kernel (
ScalarValuedKernel) – A ~coreax.kernels.ScalarValuedKernel instance defining the primary kernel function used for choosing the best coreset and refining it.random_key (
Array) – Key for random number generation, enabling reproducibility of probabilistic components in the algorithm.delta (
float) – A float between 0 and 1 used to compute the swapping probability during the splitting process. A recommended value is \(\frac{1}{n \log (\log n)}\), where \(n\) is the length of the original dataset.sqrt_kernel (
ScalarValuedKernel) – A ~coreax.kernels.ScalarValuedKernel instance representing the square root kernel used for splitting the original dataset.coreset_size (int)
- reduce(dataset, solver_state=None)[source]¶
Reduce ‘dataset’ to a
Coresubsetwith ‘KernelThinning’.This is done by first computing the number of halving steps required, referred to as m. The original data is clipped so that it is divisible by a power of two. The kernel halving algorithm is then recursively applied to halve the data.
Subsequently, a baseline_coreset is added to the ensemble of coresets. The best coreset is selected to minimise the Maximum Mean Discrepancy (MMD) and finally, it is refined further for optimal results. This final refinement step can reintroduce the clipped data dataset if they are found to reduce the MMD.
- kt_half_recursive(current_coreset, m, original_dataset)[source]¶
Recursively halve the original dataset into coresets.
- Parameters:
- Return type:
- Returns:
Fully partitioned list of coresets.
- kt_half(dataset)[source]¶
Partition the given dataset into two subsets.
First, initialise two coresubsets, each of which will contain half the points of the original dataset. Divide the points of the original dataset into pairs and probabilistically decide which point of the pair should go to which of the coresets. This function uses variables such as a, b, sigma, and delta, they refer to the corresponding parameters in the paper [dwivedi2024kernelthinning].
- Parameters:
dataset (
Any) – The input dataset to be halved.- Return type:
tuple[Coresubset[Any],Coresubset[Any]]- Returns:
A list containing the two partitioned coresets.
- get_baseline_coreset(dataset, baseline_coreset_size)[source]¶
Generate a baseline coreset by randomly sampling from the dataset.
- Parameters:
- Return type:
- Returns:
A randomly sampled baseline coreset with the specified size.
- kt_choose(candidate_coresets, points)[source]¶
Select the best coreset from a list of candidate coresets based on MMD.
- Parameters:
candidate_coresets (
list[Coresubset[Any]]) – A list of candidate coresets to be evaluated.points (
Any) – The original dataset against which the coresets are compared.
- Return type:
Shaped[Array, 'coreset_size']- Returns:
The coreset with the smallest MMD relative to the input dataset.
- kt_refine(candidate_coreset)[source]¶
Refine the selected candidate coreset.
Use
refine()which achieves the result of looping through each element in coreset replacing that element with a point in the original dataset to minimise MMD in each step.- Parameters:
candidate_coreset (
Coresubset[Any]) – The candidate coreset to be refined.- Return type:
tuple[Coresubset[Any],None]- Returns:
The refined coreset.
- class coreax.solvers.SteinThinning(coreset_size, kernel, score_matching=None, unique=True, regularise=True, regulariser_lambda=None, block_size=None, unroll=1)[source]¶
Bases:
RefinementSolver[_Data,None],ExplicitSizeSolver,PaddingInvariantSolverStein Thinning - an explicitly sized coresubset refinement solver.
Solves the coresubset problem by taking a deterministic, iterative, and greedy approach to minimizing the Kernelised Stein Discrepancy (KSD) between the empirical distribution of the coresubset (the solution) and the distribution of the problem dataset, as characterised by the score function of the Stein Kernel.
Given one has selected \(t-1\) data points for their compressed representation of the original dataset, (regularised) Stein thinning selects the next point using the equations in Section 3.1 of [benard2023kernel]:
\[x_{t} = \arg\min_{x} \left( k_p(x, x) + \Delta^+ \log p(x) - \lambda t \log p(x) + 2 \sum_{j=1}^{t-1} k_p(x, x_j) \right)\]where \(k\) is the Stein kernel induced by the supplied base kernel, \(\Delta^+\) is the non-negative Laplace operator, \(\lambda\) is a regularisation parameter, and the search is over the entire dataset.
- Parameters:
coreset_size (
Any) – The desired size of the solved coresetkernel (
ScalarValuedKernel) –ScalarValuedKernelinstance implementing a kernel function \(k: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}\); if ‘kernel’ is aSteinKernelandscore_matching is notNone, a new instance of the kernel will be generated where the score function is given byscore_matching.match(...)score_matching (
Optional[ScoreMatching]) – Specifies/overwrite the score function of the implied/passedSteinKernel; ifNone, default toKernelDensityMatchingunless ‘kernel’ is aSteinKernel, in which case the kernel’s existing score function is used.unique (
bool) – If each index in the resulting coresubset should be uniqueregularise (
bool) – Boolean that enforces regularisation, see Section 3.1 of [benard2023kernel].regulariser_lambda (
Optional[float]) – The entropic regularisation parameter, \(\lambda\). IfNone, defaults to \(1/\text{coreset_size}\) following [benard2023kernel].block_size (
Union[int,tuple[Optional[int],Optional[int]],None]) – Block size passed tocompute_mean()unroll (
Union[int,bool,tuple[Union[int,bool],Union[int,bool]]]) – Unroll parameter passed tocompute_mean()
- reduce(dataset, solver_state=None)[source]¶
Reduce ‘dataset’ to a coreset - solve the coreset problem.
- Parameters:
- Return type:
tuple[Coresubset[Any],None]- Returns:
a tuple of the solved coreset and intermediate solver state information
- refine(coresubset, solver_state=None)[source]¶
Refine a coresubset with ‘Stein thinning’.
We first compute a score function, and then the Stein kernel. This is used to greedily choose points in the coreset to minimise kernel Stein discrepancy (KSD).
Note
Only the score function, \(\nabla \log p(x)\), is provided to the solver. Since the lambda regularisation term relies on the density, \(p(x)\), directly, it is estimated using a Gaussian kernel density estimator.
- Parameters:
coresubset (
Coresubset[Any]) – The coresubset to refinesolver_state (
None) – Solution state information, primarily used to cache expensive intermediate solution step values.
- Return type:
tuple[Coresubset[Any],None]- Returns:
a refined coresubset and relevant intermediate solver state information
- class coreax.solvers.RPCholeskyState(gramian_diagonal)[source]¶
Bases:
ModuleIntermediate
RPCholeskysolver state information.- Parameters:
gramian_diagonal (
Array) – Cached Gramian diagonal.
- class coreax.solvers.RPCholesky(coreset_size, random_key, kernel, unique=True)[source]¶
Bases:
CoresubsetSolver[_Data,RPCholeskyState],ExplicitSizeSolverRandomly Pivoted Cholesky - an explicitly sized coresubset refinement solver.
Solves the coresubset problem by taking a stochastic, iterative, and greedy approach to approximating the Gramian of a given kernel, evaluated on the original dataset.
- Parameters:
coreset_size (
Any) – The desired size of the solved coresetrandom_key (
Array) – Key for random number generationkernel (
ScalarValuedKernel) –ScalarValuedKernelinstance implementing a kernel function \(k: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}\)unique (
bool) – If each index in the resulting coresubset should be unique
- reduce(dataset, solver_state=None)[source]¶
Reduce ‘dataset’ to a
Coresubsetwith ‘RPCholesky’.This is done by first computing the kernel Gram matrix of the original data, and isolating the diagonal of this. A ‘pivot point’ is then sampled, where sampling probabilities correspond to the size of the elements on this diagonal. The data-point corresponding to this pivot point is added to the coreset, and the diagonal of the Gram matrix is updated to add a repulsion term of sorts - encouraging the coreset to select a range of distinct points in the original data. The pivot sampling and diagonal updating steps are repeated until \(M\) points have been selected.
Note
The RPCholesky algorithm sometimes converges before reaching
self.coreset_size. In this case, the other coreset points are chosen uniformly at random, avoiding repetition.- Parameters:
dataset (
Any) – Dataset to reduce to a coresubset.solver_state (
Optional[RPCholeskyState]) – Solution state information, primarily used to cache expensive intermediate solution step values.
- Return type:
- Returns:
Refined coresubset and relevant intermediate solver state information.
- class coreax.solvers.GreedyKernelPointsState(feature_gramian)[source]¶
Bases:
ModuleIntermediate
GreedyKernelPointssolver state information.- Parameters:
feature_gramian (
Array) – Cached feature kernel gramian matrix, should be padded with an additional row and column of zeros.
- class coreax.solvers.GreedyKernelPoints(coreset_size, random_key, feature_kernel, regularisation_parameter=1e-06, unique=True, batch_size=None, least_squares_solver=None)[source]¶
Bases:
RefinementSolver[SupervisedData,GreedyKernelPointsState],ExplicitSizeSolverApply GreedyKernelPoints to a supervised dataset.
GreedyKernelPoints is a deterministic, iterative and greedy approach to build a coreset adapted from the inducing point method developed in [nguyen2021meta].
Given one has an original dataset \(\mathcal{D}^{(1)} = \{(x_i, y_i)\}_{i=1}^n\) of \(n\) pairs with \(x\in\mathbb{R}^d\) and \(y\in\mathbb{R}^p\), and one has selected \(m\) data pairs \(\mathcal{D}^{(2)} = \{(\tilde{x}_i, \tilde{y}_i)\}_{i=1}^m\) already for their compressed representation of the original dataset, GreedyKernelPoints selects the next point to minimise the loss
\[L(\mathcal{D}^{(1)}, \mathcal{D}^{(2)}) = ||y^{(1)} - K^{(12)}(K^{(22)} + \lambda I_m)^{-1}y^{(2)} ||^2_{\mathbb{R}^n}\]where \(y^{(1)}\in\mathbb{R}^n\) is the vector of responses from \(\mathcal{D}^{(1)}\), \(y^{(2)}\in\mathbb{R}^n\) is the vector of responses from \(\mathcal{D}^{(2)}\), \(K^{(12)} \in \mathbb{R}^{n\times m}\) is the cross-matrix of kernel evaluations between \(\mathcal{D}^{(1)}\) and \(\mathcal{D}^{(2)}\), \(K^{(22)} \in \mathbb{R}^{m\times m}\) is the kernel matrix on \(\mathcal{D}^{(2)}\), \(\lambda I_m \in \mathbb{R}^{m \times m}\) is the identity matrix and \(\lambda \in \mathbb{R}_{>0}\) is a regularisation parameter.
This class works with all children of
ScalarValuedKernel. Note that GreedyKernelPoints does not support non-uniform weights and will only return coresubsets with uniform weights.- Parameters:
random_key (
Array) – Key for random number generationfeature_kernel (
ScalarValuedKernel) –ScalarValuedKernelinstance implementing a kernel function \(k: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}\) on the feature spaceregularisation_parameter (
float) – Regularisation parameter for stable inversion of the feature Gramianunique (
bool) – IfFalse, the resulting coresubset may contain the same point multiple times. IfTrue(default), the resulting coresubset will not contain any duplicate pointsbatch_size (
Optional[int]) – An integer representing the size of the batches of data pairs sampled at each iteration for consideration for adding to the coreset. IfNone(default), the search is performed over the entire datasetleast_squares_solver (
Optional[RegularisedLeastSquaresSolver]) – Instance ofcoreax.least_squares.RegularisedLeastSquaresSolver, default value ofNoneusescoreax.least_squares.MinimalEuclideanNormSolvercoreset_size (int)
- reduce(dataset, solver_state=None)[source]¶
Reduce ‘dataset’ to a coreset - solve the coreset problem.
- Parameters:
dataset (
SupervisedData) – The (potentially weighted and supervised) data to generate the coreset fromsolver_state (
Optional[GreedyKernelPointsState]) – Solution state information, primarily used to cache expensive intermediate solution step information
- Return type:
- Returns:
a tuple of the solved coreset and intermediate solver state information
- refine(coresubset, solver_state=None)[source]¶
Refine a coresubset with ‘GreedyKernelPointsState’.
We first compute the various factors if they are not given in the solver_state, and then iteratively swap points with the initial coreset, selecting points which minimise the loss.
Warning
If the input
coresubsetis smaller than the requestedcoreset_size, it will be padded with indices that are invariant with respect to the loss. After the inputcoresubsethas been refined, new indices will be chosen to fill the padding. If the inputcoresubsetis larger than the requestedcoreset_size, the extra indices will not be optimised and will be clipped from the returncoresubset.- Parameters:
coresubset (
Coresubset[SupervisedData]) – The coresubset to refinesolver_state (
Optional[GreedyKernelPointsState]) – Solution state information, primarily used to cache expensive intermediate solution step values.
- Return type:
- Returns:
A refined coresubset and relevant intermediate solver state information
- class coreax.solvers.RecombinationSolver(test_functions=None, mode='implicit-explicit')[source]¶
Bases:
CoresubsetSolver[_Data,_State],Generic[_Data,_State]Solver which returns a
Coresubsetvia recombination.Given \(m-1\) explicitly provided test-functions \(\Phi^\prime\), a recombination solver finds a coresubset with \(m^\prime \le m\) points, whose push-forward \(\hat{\mu}_{m^\prime}\) has the same “centre-of-mass” (CoM) as the dataset push-forward \(\mu_n := \Phi_* \nu_n\).
- Parameters:
test_functions (
Optional[Callable[[Array],Real[Array, 'm-1']]]) – A callable that applies a set of specified test-functions \(\Phi^\prime = \{\phi_1,\dots,\phi_{m-1}\}\) where each function is a map \(\phi_i \colon \Omega\to\mathbb{R}\); a value ofNoneimplies the identity map \(\Phi^\prime \colon x \mapsto x\), and necessarily assumes that \(x \in \Omega \subseteq \mathbb{R}^{m-1}\)mode (
Literal['implicit-explicit','implicit','explicit']) – ‘implicit-explicit’ explicitly removes \(n - m\) points, yielding a coreset of size \(m\), with \(m - m^\prime\) zero-weighted (implicitly removed) points; ‘implicit’ explicitly removes no points, yielding a coreset of size \(n\) with \(n - m^\prime\) zero-weighted (implicitly removed) points; ‘explicit’ explicitly removes \(n - m^\prime\) points, yielding a coreset of size \(m^\prime\), but unlike the other methods is not JIT compatible as the coreset size \(m^\prime\) is unknown at compile time.
- class coreax.solvers.CaratheodoryRecombination(test_functions=None, mode='implicit-explicit', rcond=None)[source]¶
Bases:
RecombinationSolver[Data,None]Recombination via Caratheodory measure reduction (Gaussian-Elimination).
Proposed in [tchernychova2016recombination] (see Chapter 1.3.3.3) as an alternative to the Simplex algorithm for solving the recombination problem.
Unlike the Simplex method, with time complexity \(\mathcal{O}(m^3 n + m n^2)\), Caratheodory recombination has time complexity of only \(\mathcal{O}(m n^2)\).
Note
Given \(n = cm\), for a rational constant \(c\), the above complexities can be alternatively represented as \(\mathcal{O}(m^4)\) for the Simplex method and \(\mathcal{O}(m^3)\) for Caratheodory recombination.
- Parameters:
test_functions (
Optional[Callable[[Array],Real[Array, 'm-1']]]) – A callable that applies a set of specified test-functions \(\Phi^\prime = \{\phi_1,\dots,\phi_{m-1}\}\) where each function is a map \(\phi_i \colon \Omega \to \mathbb{R}\); a value of non implies the identity map \(\Phi^\prime \colon x \mapsto x\), and necessarily assumes that \(x \in \Omega \subseteq \mathbb{R}^{m-1}\)mode (
Literal['implicit-explicit','implicit','explicit']) – ‘implicit-explicit’ explicitly removes \(n - m\) points, yielding a coreset of size \(m\), with \(m - m^\prime\) zero-weighted (implicitly removed) points; ‘implicit’ explicitly removes no points, yielding a coreset of size \(n\) with \(n - m^\prime\) zero-weighted (implicitly removed) points; ‘explicit’ explicitly removes \(n - m^\prime\) points, yielding a coreset of size \(m^\prime\), but unlike the other methods is not JIT compatible as the coreset size \(m^\prime\) is unknown at compile time.rcond (
Optional[float]) – A relative condition number; any singular value \(s\) below the threshold \(\text{rcond} * \text{max}(s)\) is treated as equal to zero; if rcond isNone, it defaults to floating point eps * max(n, d)
- class coreax.solvers.TreeRecombination(test_functions=None, mode='implicit-explicit', rcond=None, tree_reduction_factor=2)[source]¶
Bases:
RecombinationSolver[Data,None]Tree recombination based coresubset solver.
Based on Algorithm 7 Chapter 3.3 of [tchernychova2016recombination], which is an order of magnitude more efficient than Algorithm 5 in Chapter 3.2, originally introduced in [litterer2012recombination].
The time complexity is of order \(\mathcal{O}(\log_2(\frac{n}{c_r m}) m^3)\), where :math`c_r` is the tree_reduction_factor. The time complexity can be equivalently expressed as \(\mathcal{O}(m^3)\), using the same arguments as used in
CaratheodoryRecombination.Note
As the ratio of \(n / m\) grows, the constant factor for the time complexity of
TreeRecombinationincreases at a logarithmic rate, rather than at a quadratic rate for plainCaratheodoryRecombination. Hence, in general, we would expectTreeRecombinationto be the more efficient choice for all but the smallest values of \(n / m\).- Parameters:
test_functions (
Optional[Callable[[Array],Real[Array, 'm-1']]]) – the map \(\Phi^\prime = \{ \phi_1, \dots, \phi_{M-1} \}\) where each \(\phi_i \colon \Omega \to \mathbb{R}\) represents a linearly independent test-function; a value ofNoneimplies the identity function (necessarily assuming \(\Omega \subseteq \mathbb{R}^{M-1}\))mode (
Literal['implicit-explicit','implicit','explicit']) – ‘implicit-explicit’ explicitly removes \(n - m\) points, yielding a coreset of size \(m\), with \(m - m^\prime\) zero-weighted (implicitly removed) points; ‘implicit’ explicitly removes no points, yielding a coreset of size \(n\) with \(n - m^\prime\) zero-weighted (implicitly removed) points; ‘explicit’ explicitly removes \(n - m^\prime\) points, yielding a coreset of size \(m^\prime\), but unlike the other methods is not JIT compatible as the coreset size \(m^\prime\) is unknown at compile time.rcond (
Optional[float]) – a relative condition number; any singular value \(s\) below the threshold \(\text{rcond} * \text{max}(s)\) is treated as equal to zero; ifrcond is None, it defaults to floating point eps * max(n, d)tree_reduction_factor (
int) – The factor by which each tree reduction step reduces the number of non-zero points; the remaining number of non-zero nodes, after performing recombination, is equal to n_nodes / tree_reduction_factor;
- class coreax.solvers.CompressPlusPlus(coreset_size, g, kernel, random_key, delta, sqrt_kernel)[source]¶
Bases:
CoresubsetSolver[_Data,None],ExplicitSizeSolverCompress++ - A hierarchical coreset construction solver.
CompressPlusPlus is an efficient method for building coresets without compromising performance significantly. It operates in two steps: recursively halving and thinning.
In the recursive halving step, the dataset is partitioned into four subsets. Each subset is further divided into four subsets, repeated a predefined number of times. After this, each subset is halved using
KernelThinning. The halved subsets are concatenated bottom-up to form a coreset.Finally, the resulting coreset is thinned again using
KernelThinningto obtain a coreset of the desired size. This implementation is an adaptation of the Compress++ algorithm in [shetty2022compress] to make it an explicit sized solver.- Parameters:
g (
int) – The oversampling factor.kernel (
ScalarValuedKernel) – AScalarValuedKernelfor kernel thinning.random_key (
Array) – A random number generator key for the kernel thinning solver, ensuring reproducibility of probabilistic components in the algorithm.delta (
float) – A float between 0 and 1, representing the swapping probability during the dataset splitting. A recommended value is \(1 / \log(\log(n))\), where \(n\) is the size of the original dataset.sqrt_kernel (
ScalarValuedKernel) – AScalarValuedKernelinstance defining the square root kernel used in the kernel thinning solver.coreset_size (int)
- reduce(dataset, solver_state=None)[source]¶
Reduce a dataset to a coreset of the desired size using a hierarchical approach.
This method reduces the dataset by recursively partitioning it and applying kernel thinning. The dataset is clipped to the nearest power of four before processing. The dataset is partitioned into four subsets. Each subset is further divided into four subsets, repeated until we have a coreset of predefined size. After this, the partitioned are recursively concatenated (4 at a time) and halved using
reduce(). Finally, the resulting coreset is thinned again usingreduce()to obtain a coreset of the desired size.