Quickstart
==========

Installation
------------

This is an alpha version, so it must still be installed from source::

    git clone git@github.com:kalebruscitti/thematic-search.git
    pip install ./thematic-search

Basic Usage
-----------

To use Thematic Search, you need a hierarchical topic model of your dataset.
The minimal ingredients are:

- ``embedding_vectors``: an ``(n_docs, d)`` float array of document embeddings
- ``cluster_tree``: a dictionary ``{(l, i): [children]}`` representing your
  topic hierarchy, where the keys ``(l, i)`` are tuples of ints.
- ``cluster_layers``: a list of ``(n_docs, n_clusters)``-shaped float arrays
  in ``[0,1]``, one per layer, where ``cluster_layers[l][j, i]`` is the
  inclusion strength of document ``j`` in the ``i``-th cluster at layer ``l``
  (i.e. node ``(l,i)`` of the cluster tree).

Optionally you can also provide:

- ``topic_metadata``: a ``DataFrame`` with a row for each node in
  ``cluster_tree``, indexed by the same node labels
- ``document_metadata``: a ``DataFrame`` with a row for each document
- ``reduced_vectors``: an ``(n_docs, 2)`` array of low-dimensional vectors for
  visualisation

Converting your cluster tree
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Your ``cluster_tree`` can use your own node labels.
If you have a ``cluster_tree`` in the format ``{node: [children]}``, where
``node`` is any hashable label for each vertex of the tree, the
``convert_tree`` utility will convert it into the ``(layer, index)`` tuple
format required by ``SoftClusterTree``. It also returns a ``cluster_labels``
mapping that lets ``TopicDatabase`` automatically align your
``topic_metadata``::

    from thematic_search.utilities import convert_tree

    # Example: a simple tree with string node labels
    my_tree = {
        "root": ["science", "sports"],
        "science": ["physics", "biology"],
        "sports": ["football", "tennis"],
        "physics": [],
        "biology": [],
        "football": [],
        "tennis": [],
    }

    cluster_tree, cluster_labels = convert_tree(my_tree)

If your nodes are not naturally arranged into layers, ``convert_tree`` will
assign layers automatically: leaves get layer 0, and each internal node gets
one layer above its deepest child. You can override this by passing a custom
``layers`` dictionary::

    cluster_tree, cluster_labels = convert_tree(my_tree, layers={"root": 3, ...})

Initializing a TopicDatabase
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Pass ``cluster_labels`` to ``TopicDatabase`` alongside your metadata. If you
provide a ``topic_metadata`` DataFrame indexed by your original node labels,
it will be re-indexed automatically::

    from thematic_search import TopicDatabase, SoftClusterTree

    topicdb = TopicDatabase(
        SoftClusterTree(cluster_layers, cluster_tree),
        embedding_vectors=embedding_vectors,
        reduced_vectors=reduced_vectors,   # optional
        sample_df=document_metadata,       # optional
        topic_df=topic_metadata,           # indexed by your node labels
        cluster_labels=cluster_labels,     # from convert_tree
    )

Querying
--------

The query interface is accessed via ``topicdb.q``. Queries are chainable and
follow the arrows in the database schema: you can start from a text string, a
set of document indices, or a known topic, and navigate between documents and
topics using the methods below.

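
To experiment with the queries in this section, a small synthetic dataset is
enough. The following is a minimal sketch, assuming only NumPy, of inputs with
the shapes described under Basic Usage. The sizes, the random values, the use
of ``(l, i)`` tuples for children, and the choice to derive parent strengths by
summing child strengths are all illustrative assumptions, not behaviour taken
from the library::

    import numpy as np

    rng = np.random.default_rng(0)
    n_docs, d = 6, 8  # assumed toy sizes

    # (n_docs, d) float array of document embeddings
    embedding_vectors = rng.normal(size=(n_docs, d))

    # Topic hierarchy keyed by (layer, index); leaves sit at layer 0.
    # Assumption: children are written as (layer, index) tuples too.
    cluster_tree = {
        (2, 0): [(1, 0), (1, 1)],
        (1, 0): [(0, 0), (0, 1)],
        (1, 1): [(0, 2), (0, 3)],
        (0, 0): [], (0, 1): [], (0, 2): [], (0, 3): [],
    }

    # Layer 0: soft memberships over the 4 leaf clusters (each row sums to 1).
    layer0 = rng.dirichlet(np.ones(4), size=n_docs)
    # Layer 1: assumed here to be the summed strength of each node's children.
    layer1 = np.clip(
        np.stack([layer0[:, :2].sum(axis=1), layer0[:, 2:].sum(axis=1)], axis=1),
        0.0, 1.0,
    )
    # Layer 2: a single root cluster covering every document.
    layer2 = np.clip(layer1.sum(axis=1, keepdims=True), 0.0, 1.0)

    cluster_layers = [layer0, layer1, layer2]

These arrays have the shapes expected by the ``SoftClusterTree`` and
``TopicDatabase`` initialization example above.
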

Semantic search
~~~~~~~~~~~~~~~

Find the documents nearest to a query string in embedding space::

    topicdb.q.neighbours("Advancements in space technology").metadata()

This requires an ``embedding_model`` property to be provided to the
``TopicDatabase``.

Thematic search
~~~~~~~~~~~~~~~

Find the most specific topic that best covers the nearest neighbours of a
query string::

    topicdb.q.neighbours("Advancements in space technology").theme().metadata()

Find all documents inside a given topic with at least 75% inclusion strength::

    topicdb.q.topic_name("science").samples(0.75).metadata()

Composing queries
~~~~~~~~~~~~~~~~~

Queries can be composed. For example, to find the theme of the documents
inside the parent of a known topic::

    topicdb.q.topic_name("physics").parents().samples().theme().metadata()
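
The inclusion-strength threshold used by ``samples(0.75)`` under Thematic
search can be read directly off the ``cluster_layers`` arrays. The following is
a minimal NumPy sketch of that idea, not the library's implementation: the
helper ``documents_in_topic`` and the toy strengths are hypothetical, and the
inclusive ``>=`` comparison is an assumption based on the phrase "at least
75%"::

    import numpy as np

    def documents_in_topic(cluster_layers, node, threshold=0.75):
        """Return indices of documents whose inclusion strength in `node`
        is at least `threshold`. `node` is a (layer, index) tuple."""
        layer, index = node
        strengths = cluster_layers[layer][:, index]
        return np.flatnonzero(strengths >= threshold)

    # Toy data: 4 documents, 2 clusters at layer 0, 1 cluster at layer 1.
    cluster_layers = [
        np.array([[0.9, 0.1], [0.2, 0.8], [0.75, 0.25], [0.5, 0.5]]),
        np.ones((4, 1)),
    ]

    selected = documents_in_topic(cluster_layers, (0, 0))
    print(selected)  # [0 2]

Documents 0 and 2 are selected because their strengths in node ``(0, 0)`` are
0.9 and 0.75; the others fall below the threshold.
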