{ "cells": [ { "cell_type": "markdown", "id": "a433ee4d", "metadata": {}, "source": [ "# Queries\n", "\n", "On this page, we will go over all the queries available with a TopicDatabase.\n", "\n", "First, let's load a TopicDatabase from a file as well as our embedding model. This database contains data from the 20 Newsgroups dataset; you can check out the [20-Newsgroups example](/newsgroups.html) to see how it was made." ] }, { "cell_type": "code", "execution_count": null, "id": "db0bf3c2", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.\n" ] } ], "source": [ "import thematic_search as ts\n", "from sentence_transformers import SentenceTransformer\n", "\n", "model = SentenceTransformer(\"sentence-transformers/all-mpnet-base-v2\")\n", "\n", "topicdb = ts.TopicDatabase.from_file(\"20ng-topicdb.tm.zip\")\n", "topicdb.embedding_model = model" ] }, { "cell_type": "markdown", "id": "779f28b1", "metadata": {}, "source": [ "## Query Types\n", "\n", "Fundamentally, a TopicDatabase consists of two types of objects, *topics* and *samples*, and the hierarchical topic model that links these together. The soft cluster strengths of the topic model are stored as a matrix `TopicDatabase.cluster_matrix`, whose rows are indexed by samples and whose columns are indexed by topics.\n", "\n", "For this reason, Thematic Search has three query classes that are used to compose queries. These are ``SampleQuery``, ``TopicQuery`` and ``FuzzyQuery``. 
Each class is initialized when a query is made and stores the output of the query as an attribute. For ``SampleQuery`` and ``TopicQuery``, the result is stored in `query.indices` as a NumPy array of the indices of the samples/topics that satisfy the query. For ``FuzzyQuery``, `query.matrix` stores the entire [0,1]-valued matrix satisfying the query.\n", "\n", "## Query Entrypoints\n", "\n", "![A diagram of all query entrypoints](query_entries.PNG)\n", "\n", "\n", "To initialize a query, you can use `TopicDatabase.q`, and then call any of the query entrypoint functions. The diagram above this paragraph displays each query function's name and the type of query object it returns. In more detail, the queries are:\n", "\n", "- `samples_where(\"query_string\")` - returns a FuzzyQuery filtered by samples whose metadata satisfies the query string (using Pandas' query() method).\n", "- `topics_where(\"query_string\")` - returns a FuzzyQuery filtered by topics whose metadata satisfies the query string (using Pandas' query() method).\n", "- `samples([i1,i2,...,ik])` - returns a SampleQuery with indices `[i1,...,ik]`\n", "- `topic((layer, cluster))` - returns a TopicQuery with topic `(layer, cluster)`\n", "- `topic_name(\"my-topic-string\")` - returns a TopicQuery with the topic whose name is \"my-topic-string\"\n", "- `topic_idx(15)` - returns a TopicQuery with the topic whose index in `topic_df` is 15\n", "- `neighbours(\"my search string\")` - embeds the string \"my search string\" using `TopicDatabase.embedding_model` and then returns a SampleQuery with its nearest neighbours.\n", "\n", "Let's see these all in code:\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "41e5c4ac", "metadata": {}, "outputs": [], "source": [ "print(\n", " \"samples_where:\", topicdb.q.samples_where(\"newsgroup == 'rec.sport.hockey'\")\n", ")\n", "print(\n", " \"topics_where:\", topicdb.q.topics_where(\"layer==0 & cluster<=2\")\n", ")\n", "print(\n", " 
\"samples:\", topicdb.q.samples([10,20,30,40])\n", ")\n", "print(\n", " \"topic:\", topicdb.q.topic(1,0)\n", ")\n", "print(\n", " \"topic_name:\", topicdb.q.topic_name(\"alt.atheism\")\n", ")\n", "print(\n", " \"topic_idx:\", topicdb.q.topic_idx(15)\n", ")\n", "print(\n", " \"neighbours:\", topicdb.q.neighbours(\"comparison of Mac and PC computers\")\n", ")" ] }, { "cell_type": "markdown", "id": "9299aee0", "metadata": {}, "source": [ "## Composing Queries\n", "\n", "![A diagram of query compositions](data_migrations.PNG)\n", "\n", "\n", "Each Query class also has methods for converting data between the different types of query data. Once again, the function names and their return types are shown in the diagram above this paragraph. There are also `SampleQuery.topics` and `TopicQuery.samples`, which are equivalent to `SampleQuery.to_fuzzy().topics` and `TopicQuery.to_fuzzy().samples` respectively.\n", "\n", "The `samples` and `theme` queries involve thresholding the cluster strength matrix, and thus take a float parameter `threshold` in [0,1] (1.0 by default). For any threshold greater than 0, these are destructive operations.\n", "\n", "Here is an example of initializing a query and then composing:\n" ] }, { "cell_type": "code", "execution_count": null, "id": "b0570815", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "FuzzyQuery(944 samples, 1 topics)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "topicdb.q.neighbours(\"advancements in computer graphics\").theme().to_fuzzy()" ] }, { "cell_type": "markdown", "id": "b5cb837a", "metadata": {}, "source": [ "## Query Endomorphisms and Endpoints\n", "\n", "Each query type also has endomorphisms, which is a fancy way of saying methods that return the same type of query. 
For example, `SampleQuery` has a `neighbours` method that computes the k nearest neighbours of each sample in `SampleQuery.indices` and then returns a new SampleQuery containing all those samples.\n", "\n", "In addition to endomorphisms, each query type has *endpoints*, which are methods that return some non-query object you may be interested in. These include pandas DataFrames containing metadata and NumPy arrays of embedding/reduced vectors.\n", "\n", "### SampleQuery Methods\n", "\n", "SampleQuery's endomorphisms are:\n", "\n", "- `SampleQuery.neighbours(k)` returns a SampleQuery containing the k nearest neighbours of the input samples.\n", "- `SampleQuery.where(query_string)` filters the input samples using the condition in `query_string` (Pandas DataFrame.query format)\n", "\n", "SampleQuery's endpoints are:\n", "\n", "- `SampleQuery.metadata()` returns a Pandas DataFrame of metadata for the input samples.\n", "- `SampleQuery.embeddings()` returns the embedding vectors for the input samples.\n", "- `SampleQuery.strengths()` returns the cluster inclusion strength matrix for the input samples.\n", "- `SampleQuery.indices` (an attribute, not a method) holds the raw array of indices for the samples.\n", "\n", "### TopicQuery Methods\n", "\n", "TopicQuery's endomorphisms are:\n", "\n", "- `TopicQuery.parents()` returns the parents of the input topics\n", "- `TopicQuery.children()` returns the children of the input topics\n", "- `TopicQuery.least_upper_bound()` returns the least upper bound in the topic tree of the input topics\n", "- `TopicQuery.where(query_string)` filters the input topics based on the query string (Pandas DataFrame.query format)\n", "\n", "TopicQuery's endpoints are:\n", "\n", "- `TopicQuery.metadata()` returns a Pandas DataFrame of metadata for the input topics\n", "\n", "### FuzzyQuery Methods\n", "\n", "FuzzyQuery's endomorphisms are:\n", "\n", "- `FuzzyQuery.topics_where(query_string)` filters the columns of the cluster_matrix by topic metadata 
(Pandas DataFrame.query format)\n", "- `FuzzyQuery.samples_where(query_string)` filters the rows of the cluster_matrix by sample metadata (Pandas DataFrame.query format)\n", "\n", "At the time of writing, FuzzyQuery has no endpoints.\n", "\n", "### Endpoint Examples\n" ] }, { "cell_type": "code", "execution_count": null, "id": "e5a7c5e3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
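<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>name</th><th>layer</th><th>cluster</th></tr>\n", "    <tr><th>index</th><th></th><th></th><th></th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>27</th><td>sci</td><td>1</td><td>7</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>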
" ], "text/plain": [ " name layer cluster\n", "index \n", "27 sci 1 7" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "topicdb.q.neighbours(\"nasa and space exploration\").theme().parents().metadata()" ] } ], "metadata": { "kernelspec": { "display_name": "venv-tdb", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.7" } }, "nbformat": 4, "nbformat_minor": 5 }