Queries
On this page, we will go over all the queries available with a TopicDatabase.
First, let’s load a TopicDatabase from a file as well as our embedding model. This database contains data from the 20 Newsgroups dataset - you can check out the 20-Newsgroups example to see how it was made.
[ ]:
import thematic_search as ts
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
topicdb = ts.TopicDatabase.from_file("20ng-topicdb.tm.zip")
topicdb.embedding_model = model
Query Types
Fundamentally, a TopicDatabase consists of two types of objects, topics and samples, and the hierarchical topic model that links these together. The soft cluster strengths of the topic model are stored as a matrix TopicDatabase.cluster_matrix, whose rows are indexed by samples and whose columns are indexed by topics.
For this reason, Thematic Search has three query classes that are used to compose queries. These are SampleQuery, TopicQuery and FuzzyQuery. Each class is initialized when a query is made and stores the output of the query as an attribute. For SampleQuery and TopicQuery, the result is stored in query.indices as a numpy array of the indices of the samples/topics that satisfy the query. For FuzzyQuery, query.matrix stores the entire [0,1]-valued matrix satisfying the query.
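To make the three result shapes concrete, here is a minimal NumPy sketch. It does not use Thematic Search itself: the toy matrix and all variable names are assumptions for illustration, and representing the fuzzy result as a submatrix is only one plausible reading of the description above.

```python
import numpy as np

# Hypothetical soft cluster strengths: 4 samples (rows) x 3 topics (columns),
# with every entry in [0, 1].
cluster_matrix = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.0],
    [0.0, 0.3, 0.7],
    [0.5, 0.5, 0.0],
])

# SampleQuery-like result: an array of row (sample) indices.
sample_indices = np.array([0, 3])

# TopicQuery-like result: an array of column (topic) indices.
topic_indices = np.array([0, 1])

# FuzzyQuery-like result: the [0, 1]-valued block for those rows and columns.
fuzzy_matrix = cluster_matrix[np.ix_(sample_indices, topic_indices)]
print(fuzzy_matrix.shape)
```

The point is only that sample and topic results are index arrays, while a fuzzy result keeps the graded strengths.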
Query Entrypoints
To initialize a query, you can use TopicDatabase.q, and then call any of the query entrypoint functions. The diagram above this paragraph conveniently displays each query function’s name and what type of query object it returns. In more detail, the queries are:
- samples_where("query_string") - returns a FuzzyQuery filtered by indices of documents satisfying the query string (using Pandas’ .query() method).
- topics_where("query_string") - returns a FuzzyQuery filtered by topics whose metadata satisfies the query string (using Pandas’ .query() method).
- samples([i1,i2,...,ik]) - returns a SampleQuery with indices [i1,...,ik].
- topic((layer, cluster)) - returns a TopicQuery with topic (layer, cluster).
- topic_name("my-topic-string") - returns a TopicQuery with the topic whose name is “my-topic-string”.
- topic_idx(15) - returns a TopicQuery with the topic whose index in topic_df is 15.
- neighbours("my search string") - embeds the string “my search string” using TopicDatabase.embedding_model and then returns a SampleQuery with its nearest neighbours.
Let’s see these all in code:
[ ]:
print(
"samples_where:", topicdb.q.samples_where("newsgroup == 'rec.sport.hockey'")
)
print(
"topics_where:", topicdb.q.topics_where("layer==0 & cluster<=2")
)
print(
"samples:", topicdb.q.samples([10,20,30,40])
)
print(
"topic:", topicdb.q.topic(1,0)
)
print(
"topic_name:", topicdb.q.topic_name("alt.atheism")
)
print(
"topic_idx:", topicdb.q.topic_idx(15)
)
print(
"neighbours:", topicdb.q.neighbours("comparison of Mac and PC computers")
)
Composing Queries
Each Query class also has methods for converting data between the different types of query data. Once again, the function names and their return types are shown in the diagram above this paragraph. There are also SampleQuery.topics and TopicQuery.samples which are equal to SampleQuery.to_fuzzy().topics and TopicQuery.to_fuzzy().samples respectively.
The samples and theme queries involve thresholding the cluster strength matrix, and thus take a float parameter threshold in [0,1] (1.0 by default). For any value > 0, these are destructive operations, since entries below the threshold are discarded.
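The thresholding step can be sketched in plain NumPy. This is not the library's implementation: the helper name topics_for_samples, the use of >= for the comparison, and the rule that a topic is kept if any selected sample reaches the threshold are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical strengths: 3 samples (rows) x 3 topics (columns).
strengths = np.array([
    [1.0, 0.4, 0.0],
    [0.6, 1.0, 0.0],
    [0.0, 0.2, 0.9],
])

def topics_for_samples(strengths, sample_indices, threshold=1.0):
    """Indices of topics whose strength reaches `threshold` for any given sample.

    Assumption: ">=" comparison and "any sample" aggregation.
    """
    rows = strengths[sample_indices]
    keep = (rows >= threshold).any(axis=0)
    return np.flatnonzero(keep)

strict = topics_for_samples(strengths, [0])                 # full-strength memberships only
loose = topics_for_samples(strengths, [0], threshold=0.3)   # also admits weaker memberships
print(strict, loose)
```

Either way the output is a plain index array, so the graded strengths are no longer available after the conversion.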
Here is an example of initializing a query and then composing:
[ ]:
topicdb.q.neighbours("advancements in computer graphics").theme().to_fuzzy()
FuzzyQuery(944 samples, 1 topics)
Query Endomorphisms and Endpoints
Each query type also has endomorphisms, which is a fancy way of saying methods that return the same type of query. For example, SampleQuery has a neighbours method that computes the k nearest neighbours of each sample in SampleQuery.indices and then returns a new SampleQuery containing all those documents.
In addition to endomorphisms, each query type has endpoints, which are methods that return some non-query object you may be interested in. This includes pandas dataframes containing metadata or numpy arrays of embedding/reduced vectors.
SampleQuery Methods
SampleQuery’s endomorphisms are:
- SampleQuery.neighbours(k) returns a SampleQuery containing the k nearest neighbours of the input samples.
- SampleQuery.where(query_string) filters the input samples using the condition in query_string (Pandas DataFrame.query format).
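The idea behind neighbours(k), taking the k nearest neighbours of each input sample and merging them into one set, can be sketched as follows. The function knn_union is hypothetical, and both the Euclidean metric and the fact that each sample counts as its own nearest neighbour are assumptions; the library may behave differently.

```python
import numpy as np

def knn_union(embeddings, indices, k):
    """Union of the k nearest neighbours (Euclidean) of each sample in `indices`."""
    queries = embeddings[indices]  # shape (m, d)
    # Distances between each query sample and every sample, shape (m, n).
    dists = np.linalg.norm(queries[:, None, :] - embeddings[None, :, :], axis=2)
    # Each query sample is at distance 0 from itself, so it appears in its own result.
    nearest = np.argsort(dists, axis=1)[:, :k]
    return np.unique(nearest)

# Hypothetical 2-D embeddings for five samples.
embeddings = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [6.0, 0.0], [20.0, 0.0]])
result = knn_union(embeddings, [0, 3], k=2)
print(result)
```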
SampleQuery’s endpoints are:
- SampleQuery.metadata() returns a Pandas DataFrame of metadata for the input samples.
- SampleQuery.embeddings() returns the embedding vectors for the input samples.
- SampleQuery.strengths() returns the cluster inclusion strength matrix for the input samples.
- SampleQuery.indices (an attribute, not a method) is the raw set of indices for the samples.
TopicQuery Methods
TopicQuery’s endomorphisms are:
- TopicQuery.parents() returns the parents of the input topics.
- TopicQuery.children() returns the children of the input topics.
- TopicQuery.least_upper_bound() returns the least upper bound in the topic tree of the input topics.
- TopicQuery.where(query_string) filters the input topics based on the query string (Pandas DataFrame.query format).
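For readers unfamiliar with the term, the least upper bound of a set of topics is the lowest topic in the tree that contains all of them. Here is a small pure-Python sketch on a made-up tree; the parent mapping, the layer numbering and the helper functions are hypothetical and are not taken from the library.

```python
# Hypothetical topic tree: each (layer, cluster) topic maps to its parent.
parent = {
    (0, 0): (1, 0),
    (0, 1): (1, 0),
    (0, 2): (1, 1),
    (1, 0): (2, 0),
    (1, 1): (2, 0),
    (2, 0): None,  # root
}

def ancestors(topic):
    """Chain from `topic` up to the root, starting with `topic` itself."""
    chain = []
    while topic is not None:
        chain.append(topic)
        topic = parent[topic]
    return chain

def children(topic):
    """Topics whose parent is `topic`, in sorted order."""
    return sorted(t for t, p in parent.items() if p == topic)

def least_upper_bound(topics):
    """Lowest topic that is an ancestor of (or equal to) every input topic."""
    common = set(ancestors(topics[0]))
    for t in topics[1:]:
        common &= set(ancestors(t))
    # Walking up from the first topic, the first shared topic is the lowest one.
    return next(t for t in ancestors(topics[0]) if t in common)

print(least_upper_bound([(0, 0), (0, 1)]))
print(least_upper_bound([(0, 0), (0, 2)]))
```

Two siblings meet at their shared parent, while topics from different branches only meet further up the tree.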
TopicQuery’s endpoints are:
- TopicQuery.metadata() returns a Pandas DataFrame of metadata for the input topics.
FuzzyQuery Methods
FuzzyQuery’s endomorphisms are:
- FuzzyQuery.topics_where(query_string) filters the columns of the cluster_matrix by topic metadata (Pandas DataFrame.query format).
- FuzzyQuery.samples_where(query_string) filters the rows of the cluster_matrix by sample metadata (Pandas DataFrame.query format).
At the time of writing, FuzzyQuery has no endpoints.
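Filtering a strength matrix by metadata can be pictured with plain pandas and NumPy. The sketch below is a stand-in rather than the library's code: the two DataFrames, their column names and the toy matrix are all hypothetical, and only DataFrame.query and np.ix_ are real library calls.

```python
import numpy as np
import pandas as pd

# Hypothetical metadata, one row per sample and one row per topic.
sample_df = pd.DataFrame({"newsgroup": ["sci.space", "rec.autos", "sci.med"]})
topic_df = pd.DataFrame({"layer": [0, 0, 1], "cluster": [0, 1, 0]})

# Hypothetical strengths: rows follow sample_df, columns follow topic_df.
cluster_matrix = np.array([
    [0.9, 0.0, 1.0],
    [0.0, 0.8, 0.2],
    [0.7, 0.1, 1.0],
])

# DataFrame.query keeps the matching rows; with the default RangeIndex,
# their index labels are also their positions in the matrix.
rows = sample_df.query("newsgroup != 'rec.autos'").index.to_numpy()
cols = topic_df.query("layer == 0").index.to_numpy()

# Keep only the selected rows and columns; the strengths themselves are unchanged.
filtered = cluster_matrix[np.ix_(rows, cols)]
print(filtered)
```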
Endpoint Examples
[ ]:
topicdb.q.neighbours("nasa and space exploration").theme().parents().metadata()
| index | name | layer | cluster |
|---|---|---|---|
| 27 | sci | 1 | 7 |