Solvers#
Solvers for generating coresets.
- class coreax.solvers.CompositeSolver(base_solver)[source]#
Base class for solvers that compose with/wrap other solvers.
- Parameters:
base_solver (Solver[_Coreset, _Data, _State]) –
- class coreax.solvers.CoresubsetSolver[source]#
Solver which returns a
coreax.coreset.Coresubset.A convenience class for the most common solver type in this package.
- class coreax.solvers.Solver[source]#
Base class for coreset solvers.
Solver is generic on the type of data required by the reduce method, and the type of coreset returned, providing a convenient means to distinguish between solvers that take (weighted) data/supervised data, and those which produce coresets/coresubsets.
- class coreax.solvers.RefinementSolver[source]#
A
CoresubsetSolverwhich supports refinement.Some solvers assume implicitly/explicitly an initial coresubset on which the solution is dependent. Such solvers can be interpreted as refining the initial coresubset to produce another (solution) coresubset.
By providing a ‘refine’ method, one can compose the results of different solvers together, and/or repeatedly apply/chain the result of a refinement based solve.
# An example of repeated application/chaining of solutions/solvers. result, state = solver.reduce(dataset) refined_result, state = refine_solver.refine(result, state) re_refined_result, state = refine_solver.refine(refined_result, state)
- abstract refine(coresubset, solver_state=None)[source]#
Refine a coresubset - swap/update coresubset indices.
- Parameters:
coresubset (
Coresubset) – The coresubset to refinesolver_state (
Optional[Any]) – Solution state information, primarily used to cache expensive intermediate solution step values.
- Return type:
- Returns:
a refined coresubset and relevant intermediate solver state information
- class coreax.solvers.ExplicitSizeSolver(coreset_size)[source]#
A
Solverwhich produces a coreset of an explicitly specified size.- Parameters:
coreset_size (
Any) – The desired size of the solved coreset
- class coreax.solvers.HerdingState(gramian_row_mean)[source]#
Intermediate
KernelHerdingsolver state information.- Parameters:
gramian_row_mean (
Array) – Cached Gramian row-mean.
- class coreax.solvers.KernelHerding(coreset_size, kernel, unique=True, block_size=None, unroll=1)[source]#
Kernel Herding - an explicitly sized coresubset refinement solver.
Solves the coresubset problem by taking a deterministic, iterative, and greedy approach to minimizing the (weighted) Maximum Mean Discrepancy (MMD) between the coresubset (the solution) and the problem dataset.
Given one has selected \(T\) data points for their compressed representation of the original dataset, kernel herding selects the next point using Equation 8 of [chen2012herding]:
\[x_{T+1} = \arg\max_{x} \left( \mathbb{E}[k(x, x')] - \frac{1}{T+1}\sum_{t=1}^T k(x, x_t) \right)\]where \(k\) is the kernel used, the expectation \(\mathbb{E}\) is taken over the entire dataset, and the search is over the entire dataset. This can informally be seen as a balance between using points at which the underlying density is high (the first term) and exploration of distinct regions of the space (the second term).
- Parameters:
coreset_size (
Any) – The desired size of the solved coresetkernel (
Kernel) –Kernelinstance implementing a kernel function \(k: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}\)unique (
bool) – Boolean that ensures the resulting coresubset will only contain unique elementsblock_size (
Union[int,None,tuple[Optional[int],Optional[int]]]) – Block size passed tocompute_mean()unroll (
Union[int,bool,tuple[Union[int,bool],Union[int,bool]]]) – Unroll parameter passed tocompute_mean()
- reduce(dataset, solver_state=None)[source]#
Reduce ‘dataset’ to a coreset - solve the coreset problem.
- Parameters:
dataset (
Data) – The (potentially weighted and supervised) data to generate the coreset fromsolver_state (
Optional[HerdingState]) – Solution state information, primarily used to cache expensive intermediate solution step information
- Return type:
- Returns:
a tuple of the solved coreset and intermediate solver state information
- refine(coresubset, solver_state=None)[source]#
Refine a coresubset with ‘Kernel Herding’.
We first compute the kernel’s Gramian row-mean if it is not given in the ‘solver_state’, and then iteratively swap points with the initial coreset, balancing selecting points in high density regions with selecting points far from those already in the coreset.
- Parameters:
coresubset (
Coresubset[Data]) – The coresubset to refinesolver_state (
Optional[HerdingState]) – Solution state information, primarily used to cache expensive intermediate solution step values.
- Return type:
- Returns:
A refined coresubset and relevant intermediate solver state information
- class coreax.solvers.RandomSample(coreset_size, random_key, weighted=False, unique=True)[source]#
Reduce a dataset by randomly sampling a fixed number of points.
- Parameters:
- class coreax.solvers.RPCholesky(coreset_size, random_key, kernel, unique=True)[source]#
Randomly Pivoted Cholesky - an explicitly sized coresubset refinement solver.
Solves the coresubset problem by taking a stochastic, iterative, and greedy approach to approximating the Gramian of a given kernel, evaluated on the original dataset.
- Parameters:
coreset_size (
Any) – The desired size of the solved coresetrandom_key (
ArrayLike) – Key for random number generationkernel (
Kernel) –Kernelinstance implementing a kernel function \(k: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}\)unique (
bool) – If each index in the resulting coresubset should be unique
- reduce(dataset, solver_state=None)[source]#
Reduce ‘dataset’ to a
Coresubsetwith ‘RPCholesky’.This is done by first computing the kernel Gram matrix of the original data, and isolating the diagonal of this. A ‘pivot point’ is then sampled, where sampling probabilities correspond to the size of the elements on this diagonal. The data-point corresponding to this pivot point is added to the coreset, and the diagonal of the Gram matrix is updated to add a repulsion term of sorts - encouraging the coreset to select a range of distinct points in the original data. The pivot sampling and diagonal updating steps are repeated until \(M\) points have been selected.
- Parameters:
dataset (
Data) – The dataset to reduce to a coresubsetsolver_state (
Optional[RPCholeskyState]) – Solution state information, primarily used to cache expensive intermediate solution step values.
- Return type:
- Returns:
a refined coresubset and relevant intermediate solver state information
- class coreax.solvers.RPCholeskyState(gramian_diagonal)[source]#
Intermediate
RPCholeskysolver state information.- Parameters:
gramian_diagonal (
Array) – Cached Gramian diagonal.
- class coreax.solvers.SteinThinning(coreset_size, kernel, score_matching=None, unique=True, regularise=True, block_size=None, unroll=1)[source]#
Stein Thinning - an explicitly sized coresubset refinement solver.
Solves the coresubset problem by taking a deterministic, iterative, and greedy approach to minimizing the Kernelised Stein Discrepancy (KSD) between the empirical distribution of the coresubset (the solution) and the distribution of the problem dataset, as characterised by the score function of the Stein Kernel.
Given one has selected \(T\) data points for their compressed representation of the original dataset, (regularised) Stein thinning selects the next point using the equations in Section 3.1 of [benard2023kernel]:
\[x_{T+1} = \arg\min_{x} \left( k_P(x, x) / 2 + \Delta^+ \log p(x) - \lambda T \log p(x) + \frac{1}{T+1}\sum_{t=1}^T k_P(x, x_t) \right)\]where \(k\) is the Stein kernel induced by the supplied base kernel, \(\Delta^+\) is the non-negative Laplace operator, \(\lambda\) is a regularisation parameter, and the search is over the entire dataset.
- Parameters:
coreset_size (
Any) – The desired size of the solved coresetkernel (
Kernel) –Kernelinstance implementing a kernel function \(k: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}\); if ‘kernel’ is aSteinKernelandscore_matching is not None, a new instance of the kernel will be generated where the score function is given byscore_matching.match(...)score_matching (
Optional[ScoreMatching]) – Specifies/overwrite the score function of the implied/passedSteinKernel; ifNone, default toKernelDensityMatchingunless ‘kernel’ is aSteinKernel, in which case the kernel’s existing score function is used.unique (
bool) – If each index in the resulting coresubset should be uniqueregularise (
bool) – Boolean that enforces regularisation, see Section 3.1 of [benard2023kernel].block_size (
Union[int,None,tuple[Optional[int],Optional[int]]]) – Block size passed tocompute_mean()unroll (
Union[int,bool,tuple[Union[int,bool],Union[int,bool]]]) – Unroll parameter passed tocompute_mean()
- reduce(dataset, solver_state=None)[source]#
Reduce ‘dataset’ to a coreset - solve the coreset problem.
- Parameters:
- Return type:
tuple[Coresubset[Data],None]- Returns:
a tuple of the solved coreset and intermediate solver state information
- refine(coresubset, solver_state=None)[source]#
Refine a coresubset with ‘Stein thinning’.
We first compute a score function, and then the Stein kernel. This is used to greedily choose points in the coreset to minimise kernel Stein discrepancy (KSD).
- Parameters:
coresubset (
Coresubset[Data]) – The coresubset to refinesolver_state (
None) – Solution state information, primarily used to cache expensive intermediate solution step values.
- Return type:
tuple[Coresubset[Data],None]- Returns:
a refined coresubset and relevant intermediate solver state information
- class coreax.solvers.MapReduce(base_solver, leaf_size, tree_type=<class 'sklearn.neighbors._kd_tree.KDTree'>)[source]#
Calculate coreset of a given number of points using scalable reduction on blocks.
Uses a
KDTreeorBallTreeto partition the original data into patches. Upon each of these a coreset of sizebase_solver.coreset_sizeis calculated. These coresets are concatenated to produce a larger coreset covering the whole of the original data, which has size greater thancoreset_size. This coreset is now treated as the original data and reduced recursively until its size is equal tobase_solver.coreset_size.There is some intricate set-up:
base_solver.coreset_sizemust be less thanleaf_size.Zero weighted and valued padding will be used to ensure each partition has the same size if
len(dataset)is not an integer multiple ofleaf_size
Let \(n_k\) be the number of points after each recursion with \(n_0\) equal to the size of the original data. Then, each recursion reduces the size of the coreset such that
\[n_k <= \frac{n_{k - 1}}{\texttt{leaf_size}} \texttt{coreset_size},\]so
\[n_k <= \left( \frac{\texttt{coreset_size}}{\texttt{leaf_size}} \right)^k n_0.\]Thus, the number of iterations required is roughly (find \(k\) when \(n_k =\)
base_solver.coreset_size)\[\frac{ \log{\texttt{coreset_size}} - \log{\left(\text{original data size}\right)} }{ \log{\texttt{coreset_size}} - \log{\texttt{leaf_size}} } .\]- Parameters:
base_solver (
Solver[Any,Data,Any]) – Solver to compose with; full support is currently provided forcoreax.solvers.KernelHerdingandcoreax.solvers.SteinThinning; the solver’s result must be ignorant of/invariant to zero weighted padding.leaf_size (
Any) – Number of points to include in each partition; corresponds to2*leaf_sizeinsklearn.neighbors.BinaryTree; must be greater thanbase_solver.coreset_sizetree_type (
type[Union[KDTree,BallTree]]) – The type of binary tree based partitioning to use when splitting the dataset into smaller blocks.
- class coreax.solvers.PaddingInvariantSolver[source]#
A
Solverwhose results are invariant to zero weighted data.In some cases, such as in
coreax.solvers.MapReduce, there is a need to pad data to ensure shape stability. In some cases, we may assign zero weight to the padded data points, which allows certain ‘padding invariant’ solvers to return the same values on a call toreduce()as would have been returned if no padding were present.Inheriting from this class is only a promise by the solver to obey the invariance property. Conformity with the property is not checked at runtime.