thematic_search.SoftClusterTree

class thematic_search.SoftClusterTree(cluster_matrices: list, cluster_tree: dict, sparsity_threshold: float = 0.0)

A hierarchical soft clustering structure that stores inclusion strengths as sparse uint8 matrices (values 0-255; divide by 255 to recover floats in [0, 1]).

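The cluster hierarchy may be a DAG (directed acyclic graph), meaning a node may have multiple parents. Edges must respect the layer ordering: if (s, k) is a parent of (l, i), then s > l. The inclusion strength consistency assumption must also hold for every such edge: c^s_k(r) >= c^l_i(r) for all records r, where c^s_k(r) denotes the inclusion strength of record r in cluster k of layer s. In other words, a record belongs to a parent cluster at least as strongly as it belongs to any of that parent's children.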
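The two edge conditions above can be checked before building the tree. The following is a minimal sketch, not part of the library: `check_edges` is a hypothetical helper, and plain nested lists stand in for the per-layer arrays so that the example has no dependencies.

```python
def check_edges(cluster_tree, cluster_matrices):
    """Return a list of violation messages; an empty list means every edge is valid.

    Hypothetical helper for illustration only.
    cluster_matrices[l][r][i] is the strength of record r in cluster i of layer l.
    """
    problems = []
    for (s, k), children in cluster_tree.items():
        for (l, i) in children:
            # Layer ordering: the parent must sit on a strictly higher layer.
            if not s > l:
                problems.append(f"edge ({s}, {k}) -> ({l}, {i}) violates s > l")
                continue
            # Consistency: parent strength >= child strength for every record.
            rows = zip(cluster_matrices[s], cluster_matrices[l])
            for r, (parent_row, child_row) in enumerate(rows):
                if parent_row[k] < child_row[i]:
                    problems.append(
                        f"record {r}: parent ({s}, {k}) is weaker than child ({l}, {i})"
                    )
    return problems


# Two layers over three records: layer 0 has two clusters, layer 1 has one.
layer0 = [[0.9, 0.0], [0.2, 0.7], [0.0, 1.0]]
layer1_ok = [[0.9], [0.7], [1.0]]
layer1_bad = [[0.9], [0.5], [1.0]]  # record 1 is weaker than in child (0, 1)
tree = {(1, 0): [(0, 0), (0, 1)]}

valid_report = check_edges(tree, [layer0, layer1_ok])
invalid_report = check_edges(tree, [layer0, layer1_bad])
```

Here `valid_report` is empty, while `invalid_report` contains a single message for record 1.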

Parameters:
  • cluster_matrices (list of np.ndarray) – List of L dense float arrays, one per layer. cluster_matrices[l] has shape (n_docs, n_clusters_at_layer_l), with values in [0, 1].

  • cluster_tree (dict) – A dict mapping (layer, cluster_number) tuples to lists of child tuples, e.g. {(2, 0): [(1, 0), (1, 1)], (2, 1): [(1, 2)], …}. The unique root node is the key with no parents, i.e. the one key that does not appear in any value list.

  • sparsity_threshold (float, optional (default=0.0)) – Inclusion strengths below this value are set to zero before sparsification. Useful for cleaning up near-zero soft memberships.
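As an illustration of the parameter descriptions above, the sketch below shows how the root can be identified from cluster_tree and what thresholding does to a matrix. Both `find_root` and `apply_threshold` are hypothetical helpers written for this example; the class's internal implementation may differ, and nested lists again stand in for arrays.

```python
def find_root(cluster_tree):
    """Return the single key that never appears in any list of children."""
    all_children = {child for kids in cluster_tree.values() for child in kids}
    roots = [node for node in cluster_tree if node not in all_children]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {roots}")
    return roots[0]


def apply_threshold(matrix, sparsity_threshold=0.0):
    """Set every strength strictly below the threshold to zero."""
    return [
        [value if value >= sparsity_threshold else 0.0 for value in row]
        for row in matrix
    ]


# A three-layer hierarchy with one top-level node (illustrative data).
cluster_tree = {
    (3, 0): [(2, 0), (2, 1)],
    (2, 0): [(1, 0), (1, 1)],
    (2, 1): [(1, 2)],
}
root = find_root(cluster_tree)

# Near-zero memberships are dropped; values at or above the threshold survive.
cleaned = apply_threshold([[0.01, 0.5], [0.05, 0.0]], sparsity_threshold=0.05)
```

With this data, `root` is `(3, 0)`, and only the 0.01 entry in `cleaned` is zeroed.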

__init__(cluster_matrices: list, cluster_tree: dict, sparsity_threshold: float = 0.0)

Methods

__init__(cluster_matrices, cluster_tree[, ...])

children(idx)

Return the child indices of a cluster.

cluster(layer, cluster_number)

Convenience method to construct a Cluster from a (layer, cluster_number) pair.

inside(expr[, threshold])

Return indices of documents satisfying the cluster expression with inclusion strength >= threshold.
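For the simplest case of a single cluster, the selection rule reduces to an inclusive comparison against one column of strengths. The sketch below illustrates only that case; `inside_single` is a hypothetical stand-in, and how compound cluster expressions are evaluated is not covered here.

```python
def inside_single(column, threshold):
    """Return indices of documents whose strength is at least the threshold.

    Hypothetical illustration: `column` holds one cluster's strengths as floats.
    """
    return [doc for doc, strength in enumerate(column) if strength >= threshold]


# A document sitting exactly at the threshold is included.
selected = inside_single([0.9, 0.2, 0.5], threshold=0.5)
```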

join(indices)

Return the indices of the least upper bounds (LUBs) of a set of clusters, i.e. their lowest common ancestors in the DAG.
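Because the hierarchy may be a DAG, a set of clusters can have more than one least upper bound. The sketch below shows one way to compute them from the children mapping. It is not the library's implementation: `join_sketch` and its helpers are hypothetical, and it assumes that a node counts as its own ancestor.

```python
def invert(cluster_tree):
    """Build a child -> list-of-parents mapping from the parent -> children dict."""
    parents = {}
    for parent, children in cluster_tree.items():
        for child in children:
            parents.setdefault(child, []).append(parent)
    return parents


def ancestors(node, parents):
    """Return the node together with everything reachable by following parents."""
    seen, stack = {node}, [node]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def join_sketch(nodes, cluster_tree):
    """Return the minimal common ancestors of `nodes`, sorted for stable output."""
    if not nodes:
        return []
    parents = invert(cluster_tree)
    common = set.intersection(*(ancestors(n, parents) for n in nodes))
    # Discard any common ancestor that lies strictly above another one.
    return sorted(
        a for a in common
        if not any(b != a and a in ancestors(b, parents) for b in common)
    )


# Both leaves share the same two parents, so there are two LUBs.
tree = {
    (2, 0): [(1, 0), (1, 1)],
    (1, 0): [(0, 0), (0, 1)],
    (1, 1): [(0, 0), (0, 1)],
}
leaf_join = join_sketch([(0, 0), (0, 1)], tree)
mid_join = join_sketch([(1, 0), (1, 1)], tree)
```

In this example `leaf_join` contains both `(1, 0)` and `(1, 1)`, while `mid_join` contains only the root `(2, 0)`.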

parents(idx)

Return the parent indices of a cluster, or an empty list if it is the root.

strengths(expr[, indices, as_float])

Return the inclusion strengths of a set of documents for a given cluster expression.

to_float(uint8_value)

Convert uint8 inclusion strength to float in [0, 1].

to_int(float_value)

Convert float inclusion strength in [0, 1] to uint8.
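The two conversions above amount to scaling by 255. The following is a minimal sketch of that mapping, assuming round-to-nearest when quantizing; the rounding rule actually used by the class is not stated here, so `to_int_sketch` and `to_float_sketch` are hypothetical stand-ins rather than the real methods.

```python
def to_int_sketch(float_value):
    """Quantize a strength in [0, 1] to an integer in 0..255 (assumed rounding)."""
    if not 0.0 <= float_value <= 1.0:
        raise ValueError("inclusion strength must lie in [0, 1]")
    return int(round(float_value * 255))


def to_float_sketch(uint8_value):
    """Recover an approximate float strength by dividing by 255."""
    return uint8_value / 255


# Under this rounding assumption, a round trip loses at most half a
# quantization step, i.e. 1 / 510.
round_trip_errors = [
    abs(to_float_sketch(to_int_sketch(x)) - x) for x in (0.0, 0.1, 0.4, 0.77, 1.0)
]
```

The endpoints 0.0 and 1.0 map to 0 and 255 and are recovered exactly; interior values are recovered to within the quantization step.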

Attributes

cluster_matrices

Reconstruct the dense cluster matrices for saving purposes.

topics