thematic_search.TopicDatabase

class thematic_search.TopicDatabase(soft_cluster_tree: SoftClusterTree, embedding_vectors: ndarray, reduced_vectors: ndarray = None, sample_df: DataFrame = None, topic_df: DataFrame = None, cluster_labels: dict = None, embedding_model=None, default_k: int = 15)

A hierarchical soft-clustering database for thematic search.

Packages together:
  • A SoftClusterTree for the hierarchical intuitionistic database

  • A pynndescent NNDescent index for the vector database

  • A document metadata DataFrame

  • A topic metadata DataFrame

Parameters:
  • soft_cluster_tree (SoftClusterTree) – The hierarchical soft clustering structure.

  • embedding_vectors (np.ndarray) – The embedding vectors of the documents, shape (n_docs, n_features).

  • reduced_vectors (np.ndarray, optional) – The reduced vectors of the documents, shape (n_docs, n_reduced_features).

  • sample_df (pd.DataFrame, optional) – Document metadata. If None, a minimal DataFrame with just indices is created.

  • topic_df (pd.DataFrame, optional) – Topic metadata. Must have an ‘index’ column as primary key if provided. If None, a minimal DataFrame with idx, layer and cluster_number is created.

  • cluster_labels (dict, optional) – A dict mapping (layer, cluster_number) tuples to original node labels, as returned by convert_tree(). If provided, topic_df.index is expected to use those original labels and will be re-indexed to numeric indices internally. The mapping is stored as self.cluster_labels for display purposes.

  • embedding_model (optional) – A SentenceTransformer model for use with topicdb.q.search(). If None, search() will raise a helpful error.

  • default_k (int, optional (default=15)) – Default number of nearest neighbours for nearby() queries.
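The cluster_labels mapping described above can be pictured with a small standalone sketch. It does not call thematic_search: the label strings and the helper name reindex_labels are hypothetical, and the ordering used here (sorted by layer, then cluster number) is an assumption rather than the library's documented behaviour.

```python
# Hypothetical illustration only: this does not use thematic_search.
# Keys are (layer, cluster_number) tuples; values are original node labels.
cluster_labels = {
    (0, 0): "node_a",
    (0, 1): "node_b",
    (1, 0): "node_c",
}


def reindex_labels(labels):
    """Map each original node label to a numeric index.

    Assumption: indices follow sorted (layer, cluster_number) order.
    """
    ordered_keys = sorted(labels)
    return {labels[key]: position for position, key in enumerate(ordered_keys)}


label_to_index = reindex_labels(cluster_labels)
```

A topic_df indexed by the original labels could then be aligned to numeric indices through a lookup of this kind, while the original mapping is kept for display.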

__init__(soft_cluster_tree: SoftClusterTree, embedding_vectors: ndarray, reduced_vectors: ndarray = None, sample_df: DataFrame = None, topic_df: DataFrame = None, cluster_labels: dict = None, embedding_model=None, default_k: int = 15)

Methods

__init__(soft_cluster_tree, embedding_vectors)

from_file(path)

Load a TopicDatabase from a tm.zip file.

from_lance(path)

Load a TopicDatabase from a LanceDB folder.

from_topic_model(topic_model)

Create a TopicDatabase from a Toponymy TopicModel.

to_file(path)

Save a TopicDatabase to a tm.zip file.

to_lance(path)

Save a TopicDatabase to a LanceDB folder.
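A save-and-reload round trip with to_file() and from_file() might look like the sketch below. It assumes from_file() can be called on the class and returns a new instance, which the listing above does not state. StubDatabase is a hypothetical stand-in so the example runs without thematic_search installed; it writes plain text, not a real tm.zip archive.

```python
import os
import tempfile


def round_trip(db, path):
    """Save db with to_file(path), then reload it with from_file(path).

    Assumption: from_file is callable on the class and returns a new instance.
    """
    db.to_file(path)
    return type(db).from_file(path)


class StubDatabase:
    """Hypothetical stand-in for TopicDatabase; stores only default_k."""

    def __init__(self, default_k=15):
        self.default_k = default_k

    def to_file(self, path):
        # Plain text for illustration; the real method writes a tm.zip file.
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(str(self.default_k))

    @classmethod
    def from_file(cls, path):
        with open(path, encoding="utf-8") as handle:
            return cls(default_k=int(handle.read()))


with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "example.tm.zip")
    restored = round_trip(StubDatabase(default_k=20), target)
```

With the real class, the same round_trip helper would be passed a TopicDatabase instance in place of the stub, provided the assumption about from_file() holds.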

Attributes

cluster_matrix

Return the fuzzy inclusion cluster matrix.

q

Entry point for all queries.

topics

tree

Return the database tree as a dictionary.