Quickstart
Installation
This is an alpha version, so it must still be installed from source:
git clone git@github.com:kalebruscitti/thematic-search.git
pip install ./thematic-search
Basic Usage
To use Thematic Search, you need a hierarchical topic model of your dataset. The minimal ingredients are:
- embedding_vectors: an (n_docs, d) float array of document embeddings
- cluster_tree: a dictionary {(l, i): [children]} representing your topic hierarchy, where the keys (l, i) are tuples of ints
- cluster_layers: a list of (n_docs, n_clusters)-shaped float arrays with values in [0, 1], one per layer, where cluster_layers[l][j, i] is the inclusion strength of document j in the i-th cluster at layer l (i.e. node (l, i) of the cluster tree)
Optionally, you can also provide:
- topic_metadata: a DataFrame with a row for each node in cluster_tree, indexed by the same node labels
- document_metadata: a DataFrame with a row for each document
- reduced_vectors: an (n_docs, 2) array of low-dimensional vectors for visualisation
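As a rough illustration of the shapes described above, the following sketch builds toy inputs with NumPy and checks that they are mutually consistent. The data are random and the check_inputs helper is hypothetical, not part of thematic-search. It assumes that the index i of a node (l, i) is a column index into cluster_layers[l], that children are given as (l, i) tuples, and that leaves appear as keys with empty child lists.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, d = 6, 8

# Toy document embeddings: one d-dimensional row per document.
embedding_vectors = rng.normal(size=(n_docs, d))

# Two-layer hierarchy: layer 0 has four leaf clusters, layer 1 has two parents.
cluster_tree = {
    (1, 0): [(0, 0), (0, 1)],
    (1, 1): [(0, 2), (0, 3)],
    (0, 0): [], (0, 1): [], (0, 2): [], (0, 3): [],
}

# One (n_docs, n_clusters) array per layer, with inclusion strengths in [0, 1].
cluster_layers = [
    rng.uniform(size=(n_docs, 4)),  # layer 0
    rng.uniform(size=(n_docs, 2)),  # layer 1
]


def check_inputs(embedding_vectors, cluster_tree, cluster_layers):
    """Hypothetical sanity check; not part of thematic-search."""
    n_docs = embedding_vectors.shape[0]
    for layer in cluster_layers:
        assert layer.shape[0] == n_docs, "one row per document"
        assert layer.min() >= 0.0 and layer.max() <= 1.0, "strengths in [0, 1]"
    for (l, i), children in cluster_tree.items():
        assert 0 <= l < len(cluster_layers), f"node {(l, i)} has no layer"
        assert 0 <= i < cluster_layers[l].shape[1], f"node {(l, i)} has no column"
        for child in children:
            assert child in cluster_tree, f"child {child} is not a node"
    return True


check_inputs(embedding_vectors, cluster_tree, cluster_layers)
```

Running a check like this before constructing the database can catch a mismatch between the tree and the layer arrays early.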
Converting your cluster tree
Your cluster_tree can use your own node labels. If you have a cluster_tree
in the format {node: [children]}, where node is any hashable label for each vertex
of the tree, the convert_tree utility will convert it into the (layer, index) tuple format
required by SoftClusterTree. It also returns a cluster_labels mapping that
lets TopicDatabase automatically align your topic_metadata:
from thematic_search.utilities import convert_tree
# Example: a simple tree with string node labels
my_tree = {
"root": ["science", "sports"],
"science": ["physics", "biology"],
"sports": ["football", "tennis"],
"physics": [], "biology": [], "football": [], "tennis": [],
}
cluster_tree, cluster_labels = convert_tree(my_tree)
If your nodes are not naturally arranged into layers, convert_tree will
assign layers automatically: leaves get layer 0, and each internal node is placed
one layer above the highest layer among its children. You can override this by passing a custom
layers dictionary:
cluster_tree, cluster_labels = convert_tree(my_tree, layers={"root": 3, ...})
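The default layering rule described above can be sketched in a few lines. This is one reading of the rule (an internal node sits one layer above its highest child), not the library's actual implementation, and assign_layers is a hypothetical name.

```python
def assign_layers(tree):
    """Hypothetical re-implementation of the default layering rule."""
    layers = {}

    def height(node):
        if node not in layers:
            children = tree.get(node, [])
            # Leaves sit at layer 0; internal nodes sit one above their highest child.
            layers[node] = 1 + max(map(height, children)) if children else 0
        return layers[node]

    for node in tree:
        height(node)
    return layers


my_tree = {
    "root": ["science", "sports"],
    "science": ["physics", "biology"],
    "sports": ["football", "tennis"],
    "physics": [], "biology": [], "football": [], "tennis": [],
}
layers = assign_layers(my_tree)
# layers["root"] == 2, layers["science"] == 1, layers["physics"] == 0
```

A dictionary of this shape, mapping each node label to an integer layer, is presumably what the layers argument expects.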
Initializing a TopicDatabase
Pass cluster_labels to TopicDatabase alongside your metadata. If you
provide a topic_metadata DataFrame indexed by your original node labels,
it will be re-indexed automatically:
from thematic_search import TopicDatabase, SoftClusterTree
topicdb = TopicDatabase(
SoftClusterTree(cluster_layers, cluster_tree),
embedding_vectors=embedding_vectors,
reduced_vectors=reduced_vectors, # optional
sample_df=document_metadata, # optional
topic_df=topic_metadata, # indexed by your node labels
cluster_labels=cluster_labels, # from convert_tree
)
Querying
The query interface is accessed via topicdb.q. Queries are chainable and
follow the arrows in the database schema: you can start from a text string, a
set of document indices, or a known topic, and navigate between documents and
topics using the methods below.
Semantic search
Find the documents nearest to a query string in embedding space:
topicdb.q.neighbours("Advancements in space technology").metadata()
This requires the TopicDatabase to have been given an embedding_model, which is used to embed the query string.
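Conceptually, this step embeds the query string and ranks documents by proximity to it. The sketch below assumes cosine similarity (the metric thematic-search actually uses is not stated here) and takes an already-embedded query vector. nearest_documents is a hypothetical helper, not part of the library.

```python
import numpy as np


def nearest_documents(query_vector, embedding_vectors, k=5):
    """Hypothetical helper: indices of the k documents most similar to the
    query, ranked by cosine similarity (an assumed metric)."""
    docs = embedding_vectors / np.linalg.norm(embedding_vectors, axis=1, keepdims=True)
    query = query_vector / np.linalg.norm(query_vector)
    similarity = docs @ query
    # argsort is ascending, so negate to get the most similar documents first.
    return np.argsort(-similarity)[:k]


# Toy 2-d embeddings for four documents and one query.
embedding_vectors = np.array([[1.0, 0.0],
                              [0.0, 1.0],
                              [0.7, 0.7],
                              [-1.0, 0.0]])
query_vector = np.array([1.0, 0.1])
top = nearest_documents(query_vector, embedding_vectors, k=2)
# top is array([0, 2]): document 0 is closest, then document 2
```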
Thematic search
Find the most specific topic that best covers the nearest neighbours of a query string:
topicdb.q.neighbours("Advancements in space technology").theme().metadata()
Find all documents inside a given topic with at least 75% inclusion strength:
topicdb.q.topic_name("science").samples(0.75).metadata()
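Read against the definition of cluster_layers, the 75% threshold amounts to a filter on one column. The following is a minimal NumPy sketch, assuming node (l, i) maps to column i of cluster_layers[l] and that "at least" means greater than or equal. samples_in_topic is a hypothetical helper, not the library's samples() method.

```python
import numpy as np


def samples_in_topic(cluster_layers, node, threshold=0.75):
    """Hypothetical helper: indices of documents whose inclusion strength
    in `node` is at least `threshold`."""
    l, i = node
    strengths = cluster_layers[l][:, i]
    return np.flatnonzero(strengths >= threshold)


# Toy layer 0 with two clusters and four documents.
cluster_layers = [np.array([[0.9, 0.1],
                            [0.75, 0.2],
                            [0.4, 0.6],
                            [0.1, 0.95]])]

doc_ids = samples_in_topic(cluster_layers, (0, 0))
# doc_ids is array([0, 1]): documents 0 and 1 reach the 0.75 threshold
```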
Composing queries
Queries can be composed. For example, to find the theme of the documents inside the parent of a known topic:
topicdb.q.topic_name("physics").parents().samples().theme().metadata()