thematic_search.TopicDatabase
- class thematic_search.TopicDatabase(soft_cluster_tree: SoftClusterTree, embedding_vectors: ndarray, reduced_vectors: ndarray = None, sample_df: DataFrame = None, topic_df: DataFrame = None, cluster_labels: dict = None, embedding_model=None, default_k: int = 15)
A hierarchical soft-clustering database for thematic search.
Packages together:
- A SoftClusterTree for the hierarchical intuitionistic database
- A pynndescent NNDescent index for the vector database
- A document metadata DataFrame
- A topic metadata DataFrame
- Parameters:
soft_cluster_tree (SoftClusterTree) – The hierarchical soft clustering structure.
embedding_vectors (np.ndarray) – The embedding vectors of the documents, shape (n_docs, n_features).
reduced_vectors (np.ndarray, optional) – The reduced vectors of the documents, shape (n_docs, n_reduced_features).
sample_df (pd.DataFrame, optional) – Document metadata. If None, a minimal DataFrame with just indices is created.
topic_df (pd.DataFrame, optional) – Topic metadata. Must have an ‘index’ column as primary key if provided. If None, a minimal DataFrame with idx, layer and cluster_number is created.
cluster_labels (dict, optional) – A dict mapping (layer, cluster_number) tuples to original node labels, as returned by convert_tree(). If provided, topic_df.index is expected to use those original labels and will be re-indexed to numeric indices internally. The mapping is stored as self.cluster_labels for display purposes.
embedding_model (optional) – A SentenceTransformer model for use with topicdb.q.search(). If None, search() will raise a helpful error.
default_k (int, optional (default=15)) – Default number of nearest neighbours for nearby() queries.
- __init__(soft_cluster_tree: SoftClusterTree, embedding_vectors: ndarray, reduced_vectors: ndarray = None, sample_df: DataFrame = None, topic_df: DataFrame = None, cluster_labels: dict = None, embedding_model=None, default_k: int = 15)
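The re-indexing described for cluster_labels can be pictured with a small, library-free sketch. It is not the TopicDatabase implementation: the helper name build_topic_index, the sample labels, and the ordering by (layer, cluster_number) are all assumptions made for illustration. It only shows how a dict keyed by (layer, cluster_number) tuples can yield both a label-to-numeric-index lookup and minimal topic rows with idx, layer and cluster_number.

```python
def build_topic_index(cluster_labels):
    """Derive numeric topic indices from a (layer, cluster_number) -> label dict.

    Hypothetical helper: the ordering used here (sorted by layer, then by
    cluster number) is an assumption, not documented TopicDatabase behaviour.
    """
    label_to_idx = {}
    rows = []
    # Sorting the tuple keys makes the numbering deterministic.
    for idx, (layer, cluster_number) in enumerate(sorted(cluster_labels)):
        label = cluster_labels[(layer, cluster_number)]
        label_to_idx[label] = idx
        rows.append(
            {
                "idx": idx,
                "layer": layer,
                "cluster_number": cluster_number,
                "label": label,
            }
        )
    return label_to_idx, rows


# Made-up labels standing in for the original node labels.
cluster_labels = {(0, 0): "node_a", (0, 1): "node_b", (1, 0): "node_root"}
label_to_idx, rows = build_topic_index(cluster_labels)
```

Here label_to_idx plays the role of the translation from original node labels to numeric indices, while the original dict is what would be kept for display purposes.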
Methods
__init__(soft_cluster_tree, embedding_vectors)
from_file(path) – Load a TopicDatabase from a tm.zip file.
from_lance(path) – Load a TopicDatabase from a LanceDB folder.
from_topic_model(topic_model) – Integration with Toponymy's TopicModel class.
to_file(path) – Save a TopicDatabase to a tm.zip file.
to_lance(path) – Save a TopicDatabase to a LanceDB folder.
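A save-and-reload round trip with the file methods might look like the sketch below. It uses only to_file(path) and from_file(path) as listed above. It assumes that from_file is a classmethod (or otherwise callable on the class) and that a plain string path is accepted; neither is confirmed here. The helper name save_and_reload is hypothetical.

```python
def save_and_reload(db, path):
    """Save a database with to_file() and load it back with from_file().

    Hypothetical helper. Assumes from_file can be called on the class of
    `db` and that `path` may be a plain string such as "corpus.tm.zip".
    """
    db.to_file(path)
    # type(db) avoids hard-coding the class, so subclasses round-trip too.
    return type(db).from_file(path)
```

Given an existing TopicDatabase instance db, calling save_and_reload(db, "corpus.tm.zip") would write the archive and return a freshly loaded copy. The same pattern would apply to to_lance() and from_lance() with a folder path.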
Attributes
cluster_matrix – Return the fuzzy inclusion cluster matrix.
q – Entry point for all queries.
topics
tree – Return the database tree as a dictionary.