Coresets#
Module for defining coreset data structures.
- class coreax.coreset.Coreset(nodes, pre_coreset_data)[source]#
Bases:
Module,Generic[_Data]Data structure for representing a coreset.
TLDR: a coreset is a reduced set of \(\hat{n}\) (potentially weighted) data points that, in some sense, best represent the “important” properties of a larger set of \(n > \hat{n}\) (potentially weighted) data points.
Given a dataset \(X = \{x_i\}_{i=1}^n, x \in \Omega\), where each node is paired with a non-negative (probability) weight \(w_i \in \mathbb{R} \ge 0\), there exists an implied discrete (probability) measure over \(\Omega\)
\[\eta_n = \sum_{i=1}^{n} w_i \delta_{x_i}.\]If we then specify a set of test-functions \(\Phi = {\phi_1, \dots, \phi_M}\), where \(\phi_i \colon \Omega \to \mathbb{R}\), which somehow capture the “important” properties of the data, then there also exists an implied push-forward measure over \(\mathbb{R}^M\)
\[\mu_n = \sum_{i=1}^{n} w_i \delta_{\Phi(x_i)}.\]A coreset is simply a reduced measure containing \(\hat{n} < n\) updated nodes \(\hat{x}_i\) and weights \(\hat{w}_i\), such that the push-forward measure of the coreset \(\nu_\hat{n}\) has (approximately for some algorithms) the same “centre-of-mass” as the push-forward measure for the original data \(\mu_n\)
\[\text{CoM}(\mu_n) = \text{CoM}(\nu_\hat{n}), \text{CoM}(\nu_\hat{n}) = \int_\Omega \Phi(\omega) d\nu_\hat{x}(\omega), \text{CoM}(\nu_\hat{n}) = \sum_{i=1}^\hat{n} \hat{w}_i \delta_{\Phi(\hat{x}_i)}.\]Note
Depending on the algorithm, the test-functions may be explicitly specified by the user, or implicitly defined by the algorithm’s specific objectives.
- Parameters:
nodes (
Data) – The (weighted) coreset nodes, math:x_i in text{supp}(nu_hat{n}); once instantiated, the nodes should be accessed viaCoresubset.coreset()pre_coreset_data (
Data) – The dataset \(X\) used to construct the coreset.
- property coreset: _Data#
Materialised coreset.
- solve_weights(solver, **solver_kwargs)[source]#
Return a copy of ‘self’ with weights solved by ‘solver’.
- Return type:
Self- Parameters:
solver (WeightsOptimiser[_Data]) –
- class coreax.coreset.Coresubset(nodes, pre_coreset_data)[source]#
Bases:
Coreset[Data],Generic[_Data]Data structure for representing a coresubset.
A coresubset is a :class`Coreset`, with the additional condition that the support of the reduced measure (the coreset), must be a subset of the support of the original measure (the original data), such that
\[\hat{x}_i = x_i, \forall i \in I, I \subset \{1, \dots, n\}, text{card}(I) = \hat{n}.\]Thus, a coresubset, unlike a coreset, ensures that feasibility constraints on the support of the measure are maintained [litterer2012recombination]. This is vital if, for example, the test-functions are only defined on the support of the original measure/nodes, rather than all of \(\Omega\).
In coresubsets, the measure reduction can be implicit (setting weights/nodes to zero for all \(i \in I \ {1, \dots, n}\)) or explicit (removing entries from the weight/node arrays). The implicit approach is useful when input/output array shape stability is required (E.G. for some JAX transformations); the explicit approach is more similar to a standard coreset.
- Parameters:
nodes (
Data) – The (weighted) coresubset node indices, \(I\); the materialised coresubset nodes should be accessed viaCoresubset.coreset().pre_coreset_data (
Data) – The dataset \(X\) used to construct the coreset.
- property coreset: _Data#
Materialise the coresubset from the indices and original data.
- property unweighted_indices: Shaped[Array, 'n']#
Unweighted Coresubset indices - attribute access helper.