Coresets#

Module for defining coreset data structures.

class coreax.coreset.Coreset(nodes, pre_coreset_data)[source]#

Bases: Module, Generic[_Data]

Data structure for representing a coreset.

TLDR: a coreset is a reduced set of \(\hat{n}\) (potentially weighted) data points that, in some sense, best represent the “important” properties of a larger set of \(n > \hat{n}\) (potentially weighted) data points.

Given a dataset \(X = \{x_i\}_{i=1}^n, x \in \Omega\), where each node is paired with a non-negative (probability) weight \(w_i \in \mathbb{R} \ge 0\), there exists an implied discrete (probability) measure over \(\Omega\)

\[\eta_n = \sum_{i=1}^{n} w_i \delta_{x_i}.\]

If we then specify a set of test-functions \(\Phi = {\phi_1, \dots, \phi_M}\), where \(\phi_i \colon \Omega \to \mathbb{R}\), which somehow capture the “important” properties of the data, then there also exists an implied push-forward measure over \(\mathbb{R}^M\)

\[\mu_n = \sum_{i=1}^{n} w_i \delta_{\Phi(x_i)}.\]

A coreset is simply a reduced measure containing \(\hat{n} < n\) updated nodes \(\hat{x}_i\) and weights \(\hat{w}_i\), such that the push-forward measure of the coreset \(\nu_\hat{n}\) has (approximately for some algorithms) the same “centre-of-mass” as the push-forward measure for the original data \(\mu_n\)

\[\text{CoM}(\mu_n) = \text{CoM}(\nu_\hat{n}), \text{CoM}(\nu_\hat{n}) = \int_\Omega \Phi(\omega) d\nu_\hat{x}(\omega), \text{CoM}(\nu_\hat{n}) = \sum_{i=1}^\hat{n} \hat{w}_i \delta_{\Phi(\hat{x}_i)}.\]

Note

Depending on the algorithm, the test-functions may be explicitly specified by the user, or implicitly defined by the algorithm’s specific objectives.

Parameters:

nodes (Data) – The (weighted) coreset nodes, math:x_i in text{supp}(nu_hat{n}); once instantiated, the nodes should be accessed via Coresubset.coreset()
pre_coreset_data (Data) – The dataset \(X\) used to construct the coreset.

nodes: Data#

pre_coreset_data: Data#

property coreset: _Data#: Materialised coreset.

solve_weights(solver, **solver_kwargs)[source]#

Return a copy of ‘self’ with weights solved by ‘solver’.

Return type:: Self
Parameters:: solver (WeightsOptimiser[_Data]) –

compute_metric(metric, **metric_kwargs)[source]#

Return metric-distance between self.pre_coreset_data and self.coreset.

Return type:: Shaped[Array, '']
Parameters:: metric (Metric[_Data]) –

class coreax.coreset.Coresubset(nodes, pre_coreset_data)[source]#

Bases: Coreset[Data], Generic[_Data]

Data structure for representing a coresubset.

A coresubset is a :class`Coreset`, with the additional condition that the support of the reduced measure (the coreset), must be a subset of the support of the original measure (the original data), such that

\[\hat{x}_i = x_i, \forall i \in I, I \subset \{1, \dots, n\}, text{card}(I) = \hat{n}.\]

Thus, a coresubset, unlike a coreset, ensures that feasibility constraints on the support of the measure are maintained [litterer2012recombination]. This is vital if, for example, the test-functions are only defined on the support of the original measure/nodes, rather than all of \(\Omega\).

In coresubsets, the measure reduction can be implicit (setting weights/nodes to zero for all \(i \in I \ {1, \dots, n}\)) or explicit (removing entries from the weight/node arrays). The implicit approach is useful when input/output array shape stability is required (E.G. for some JAX transformations); the explicit approach is more similar to a standard coreset.

Parameters:

nodes (Data) – The (weighted) coresubset node indices, \(I\); the materialised coresubset nodes should be accessed via Coresubset.coreset().
pre_coreset_data (Data) – The dataset \(X\) used to construct the coreset.

property coreset: _Data#: Materialise the coresubset from the indices and original data.

property unweighted_indices: Shaped[Array, 'n']#: Unweighted Coresubset indices - attribute access helper.