Solvers#

Solvers for generating coresets.

class coreax.solvers.CompositeSolver(base_solver)[source]#

Base class for solvers that compose with/wrap other solvers.

Parameters:: base_solver (Solver[_Coreset, _Data, _State]) –

class coreax.solvers.CoresubsetSolver[source]#

Solver which returns a coreax.coreset.Coresubset.

A convenience class for the most common solver type in this package.

class coreax.solvers.Solver[source]#

Base class for coreset solvers.

Solver is generic on the type of data required by the reduce method, and the type of coreset returned, providing a convenient means to distinguish between solvers that take (weighted) data/supervised data, and those which produce coresets/coresubsets.

abstract reduce(dataset, solver_state=None)[source]#

Reduce ‘dataset’ to a coreset - solve the coreset problem.

Parameters:

dataset (Data) – The (potentially weighted and supervised) data to generate the coreset from
solver_state (Optional[Any]) – Solution state information, primarily used to cache expensive intermediate solution step information

Return type:

tuple[Coreset, Any]

Returns:

a tuple of the solved coreset and intermediate solver state information

class coreax.solvers.RefinementSolver[source]#

A CoresubsetSolver which supports refinement.

Some solvers assume implicitly/explicitly an initial coresubset on which the solution is dependent. Such solvers can be interpreted as refining the initial coresubset to produce another (solution) coresubset.

By providing a ‘refine’ method, one can compose the results of different solvers together, and/or repeatedly apply/chain the result of a refinement based solve.

# An example of repeated application/chaining of solutions/solvers.
result, state = solver.reduce(dataset)
refined_result, state = refine_solver.refine(result, state)
re_refined_result, state = refine_solver.refine(refined_result, state)

abstract refine(coresubset, solver_state=None)[source]#

Refine a coresubset - swap/update coresubset indices.

Parameters:

coresubset (Coresubset) – The coresubset to refine
solver_state (Optional[Any]) – Solution state information, primarily used to cache expensive intermediate solution step values.

Return type:

tuple[Coresubset, Any]

Returns:

a refined coresubset and relevant intermediate solver state information

class coreax.solvers.ExplicitSizeSolver(coreset_size)[source]#

A Solver which produces a coreset of an explicitly specified size.

Parameters:: coreset_size (Any) – The desired size of the solved coreset

class coreax.solvers.HerdingState(gramian_row_mean)[source]#

Intermediate KernelHerding solver state information.

Parameters:: gramian_row_mean (Array) – Cached Gramian row-mean.

class coreax.solvers.KernelHerding(coreset_size, kernel, unique=True, block_size=None, unroll=1)[source]#

Kernel Herding - an explicitly sized coresubset refinement solver.

Solves the coresubset problem by taking a deterministic, iterative, and greedy approach to minimizing the (weighted) Maximum Mean Discrepancy (MMD) between the coresubset (the solution) and the problem dataset.

Given one has selected \(T\) data points for their compressed representation of the original dataset, kernel herding selects the next point using Equation 8 of [chen2012herding]:

\[x_{T+1} = \arg\max_{x} \left( \mathbb{E}[k(x, x')] - \frac{1}{T+1}\sum_{t=1}^T k(x, x_t) \right)\]

where \(k\) is the kernel used, the expectation \(\mathbb{E}\) is taken over the entire dataset, and the search is over the entire dataset. This can informally be seen as a balance between using points at which the underlying density is high (the first term) and exploration of distinct regions of the space (the second term).

Parameters:

coreset_size (Any) – The desired size of the solved coreset
kernel (Kernel) – Kernel instance implementing a kernel function \(k: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}\)
unique (bool) – Boolean that ensures the resulting coresubset will only contain unique elements
block_size (Union[int, None, tuple[Optional[int], Optional[int]]]) – Block size passed to compute_mean()
unroll (Union[int, bool, tuple[Union[int, bool], Union[int, bool]]]) – Unroll parameter passed to compute_mean()

reduce(dataset, solver_state=None)[source]#

Reduce ‘dataset’ to a coreset - solve the coreset problem.

Parameters:

dataset (Data) – The (potentially weighted and supervised) data to generate the coreset from
solver_state (Optional[HerdingState]) – Solution state information, primarily used to cache expensive intermediate solution step information

Return type:

tuple[Coresubset[Data], HerdingState]

Returns:

a tuple of the solved coreset and intermediate solver state information

refine(coresubset, solver_state=None)[source]#

Refine a coresubset with ‘Kernel Herding’.

We first compute the kernel’s Gramian row-mean if it is not given in the ‘solver_state’, and then iteratively swap points with the initial coreset, balancing selecting points in high density regions with selecting points far from those already in the coreset.

Parameters:

coresubset (Coresubset[Data]) – The coresubset to refine
solver_state (Optional[HerdingState]) – Solution state information, primarily used to cache expensive intermediate solution step values.

Return type:

tuple[Coresubset[Data], HerdingState]

Returns:

A refined coresubset and relevant intermediate solver state information

class coreax.solvers.RandomSample(coreset_size, random_key, weighted=False, unique=True)[source]#

Reduce a dataset by randomly sampling a fixed number of points.

Parameters:

coreset_size (Any) – The desired size of the solved coreset
random_key (ArrayLike) – Key for random number generation
weighted (bool) – If to use dataset weights as selection probabilities
unique (bool) – If to sample without replacement

reduce(dataset, solver_state=None)[source]#

Reduce ‘dataset’ to a coreset - solve the coreset problem.

Parameters:

dataset (Data) – The (potentially weighted and supervised) data to generate the coreset from
solver_state (None) – Solution state information, primarily used to cache expensive intermediate solution step information

Return type:

tuple[Coresubset, None]

Returns:

a tuple of the solved coreset and intermediate solver state information

class coreax.solvers.RPCholesky(coreset_size, random_key, kernel, unique=True)[source]#

Randomly Pivoted Cholesky - an explicitly sized coresubset refinement solver.

Solves the coresubset problem by taking a stochastic, iterative, and greedy approach to approximating the Gramian of a given kernel, evaluated on the original dataset.

Parameters:

coreset_size (Any) – The desired size of the solved coreset
random_key (ArrayLike) – Key for random number generation
kernel (Kernel) – Kernel instance implementing a kernel function \(k: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}\)
unique (bool) – If each index in the resulting coresubset should be unique

reduce(dataset, solver_state=None)[source]#

Reduce ‘dataset’ to a Coresubset with ‘RPCholesky’.

This is done by first computing the kernel Gram matrix of the original data, and isolating the diagonal of this. A ‘pivot point’ is then sampled, where sampling probabilities correspond to the size of the elements on this diagonal. The data-point corresponding to this pivot point is added to the coreset, and the diagonal of the Gram matrix is updated to add a repulsion term of sorts - encouraging the coreset to select a range of distinct points in the original data. The pivot sampling and diagonal updating steps are repeated until \(M\) points have been selected.

Parameters:

dataset (Data) – The dataset to reduce to a coresubset
solver_state (Optional[RPCholeskyState]) – Solution state information, primarily used to cache expensive intermediate solution step values.

Return type:

tuple[Coresubset[Data], RPCholeskyState]

Returns:

a refined coresubset and relevant intermediate solver state information

class coreax.solvers.RPCholeskyState(gramian_diagonal)[source]#

Intermediate RPCholesky solver state information.

Parameters:: gramian_diagonal (Array) – Cached Gramian diagonal.

class coreax.solvers.SteinThinning(coreset_size, kernel, score_matching=None, unique=True, regularise=True, block_size=None, unroll=1)[source]#

Stein Thinning - an explicitly sized coresubset refinement solver.

Solves the coresubset problem by taking a deterministic, iterative, and greedy approach to minimizing the Kernelised Stein Discrepancy (KSD) between the empirical distribution of the coresubset (the solution) and the distribution of the problem dataset, as characterised by the score function of the Stein Kernel.

Given one has selected \(T\) data points for their compressed representation of the original dataset, (regularised) Stein thinning selects the next point using the equations in Section 3.1 of [benard2023kernel]:

\[x_{T+1} = \arg\min_{x} \left( k_P(x, x) / 2 + \Delta^+ \log p(x) - \lambda T \log p(x) + \frac{1}{T+1}\sum_{t=1}^T k_P(x, x_t) \right)\]

where \(k\) is the Stein kernel induced by the supplied base kernel, \(\Delta^+\) is the non-negative Laplace operator, \(\lambda\) is a regularisation parameter, and the search is over the entire dataset.

Parameters:

coreset_size (Any) – The desired size of the solved coreset
kernel (Kernel) – Kernel instance implementing a kernel function \(k: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}\); if ‘kernel’ is a SteinKernel and score_matching is not None, a new instance of the kernel will be generated where the score function is given by score_matching.match(...)
score_matching (Optional[ScoreMatching]) – Specifies/overwrite the score function of the implied/passed SteinKernel; if None, default to KernelDensityMatching unless ‘kernel’ is a SteinKernel, in which case the kernel’s existing score function is used.
unique (bool) – If each index in the resulting coresubset should be unique
regularise (bool) – Boolean that enforces regularisation, see Section 3.1 of [benard2023kernel].
block_size (Union[int, None, tuple[Optional[int], Optional[int]]]) – Block size passed to compute_mean()
unroll (Union[int, bool, tuple[Union[int, bool], Union[int, bool]]]) – Unroll parameter passed to compute_mean()

reduce(dataset, solver_state=None)[source]#

Reduce ‘dataset’ to a coreset - solve the coreset problem.

Parameters:

dataset (Data) – The (potentially weighted and supervised) data to generate the coreset from
solver_state (None) – Solution state information, primarily used to cache expensive intermediate solution step information

Return type:

tuple[Coresubset[Data], None]

Returns:

a tuple of the solved coreset and intermediate solver state information

refine(coresubset, solver_state=None)[source]#

Refine a coresubset with ‘Stein thinning’.

We first compute a score function, and then the Stein kernel. This is used to greedily choose points in the coreset to minimise kernel Stein discrepancy (KSD).

Parameters:

coresubset (Coresubset[Data]) – The coresubset to refine
solver_state (None) – Solution state information, primarily used to cache expensive intermediate solution step values.

Return type:

tuple[Coresubset[Data], None]

Returns:

a refined coresubset and relevant intermediate solver state information

class coreax.solvers.MapReduce(base_solver, leaf_size, tree_type=<class 'sklearn.neighbors._kd_tree.KDTree'>)[source]#

Calculate coreset of a given number of points using scalable reduction on blocks.

Uses a KDTree or BallTree to partition the original data into patches. Upon each of these a coreset of size base_solver.coreset_size is calculated. These coresets are concatenated to produce a larger coreset covering the whole of the original data, which has size greater than coreset_size. This coreset is now treated as the original data and reduced recursively until its size is equal to base_solver.coreset_size.

There is some intricate set-up:

base_solver.coreset_size must be less than leaf_size.
Zero weighted and valued padding will be used to ensure each partition has the same size if len(dataset) is not an integer multiple of leaf_size

Let \(n_k\) be the number of points after each recursion with \(n_0\) equal to the size of the original data. Then, each recursion reduces the size of the coreset such that

\[n_k <= \frac{n_{k - 1}}{\texttt{leaf_size}} \texttt{coreset_size},\]

\[n_k <= \left( \frac{\texttt{coreset_size}}{\texttt{leaf_size}} \right)^k n_0.\]

Thus, the number of iterations required is roughly (find \(k\) when \(n_k =\) base_solver.coreset_size)

\[\frac{ \log{\texttt{coreset_size}} - \log{\left(\text{original data size}\right)} }{ \log{\texttt{coreset_size}} - \log{\texttt{leaf_size}} } .\]

Parameters:

base_solver (Solver[Any, Data, Any]) – Solver to compose with; full support is currently provided for coreax.solvers.KernelHerding and coreax.solvers.SteinThinning; the solver’s result must be ignorant of/invariant to zero weighted padding.
leaf_size (Any) – Number of points to include in each partition; corresponds to 2*leaf_size in sklearn.neighbors.BinaryTree; must be greater than base_solver.coreset_size
tree_type (type[Union[KDTree, BallTree]]) – The type of binary tree based partitioning to use when splitting the dataset into smaller blocks.

tree_type#: alias of KDTree

reduce(dataset, solver_state=None)[source]#

Reduce ‘dataset’ to a coreset - solve the coreset problem.

Parameters:

dataset (Data) – The (potentially weighted and supervised) data to generate the coreset from
solver_state (Optional[Any]) – Solution state information, primarily used to cache expensive intermediate solution step information

Return type:

tuple[Any, Any]

Returns:

a tuple of the solved coreset and intermediate solver state information

class coreax.solvers.PaddingInvariantSolver[source]#

A Solver whose results are invariant to zero weighted data.

In some cases, such as in coreax.solvers.MapReduce, there is a need to pad data to ensure shape stability. In some cases, we may assign zero weight to the padded data points, which allows certain ‘padding invariant’ solvers to return the same values on a call to reduce() as would have been returned if no padding were present.

Inheriting from this class is only a promise by the solver to obey the invariance property. Conformity with the property is not checked at runtime.