{
"cells": [
{
"cell_type": "markdown",
"id": "314c2924",
"metadata": {},
"source": [
"# What is 'Thematic Search'?\n",
"\n",
"What do we even mean by *thematic search?* How does it differ from semantic search? How does it work mathematically? \n",
"\n",
"On this page, we will answer these questions through a mix of theory and example. First, our starting assumptions. Our baseline assumption is that we have a corpus of documents, and a hierarchical topic model for this corpus. In particular, a \"topic\" here means a set (or fuzzy set) of documents, and \"hierarchical\" means that topics come with some measure of coarseness or specificity, such as depth in a topic tree.\n",
"\n",
"Conceptually, *thematic search* refers to two dual search operations:\n",
"1. Given a topic, which documents belong to that topic? \n",
"2. Given a set of documents, what is their *theme*?\n",
"\n",
"By \"theme\" we mean the *most specific topic containing the documents*. The first operation searchs the set of documents and conversely the second operation searchs the set of topics.\n",
"\n",
"As a running example, we will use the 20 Newsgroups dataset. Samples in this dataset are Usenet forum posts, and they are labeled with a 'newsgroup' attribute that says which topic they were posted to. Let's load up a TopicDatabase of 20 Newsgroups. The page [Searching 20-Newsgroups ](/newsgroups.html) shows how we prepared the data for thematic search."
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "ec29a6e3",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "9e06c61b59bd4c81a3f2748bcd3600e9",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Loading weights: 0%| | 0/199 [00:00, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\u001b[1mMPNetModel LOAD REPORT\u001b[0m from: sentence-transformers/all-mpnet-base-v2\n",
"Key | Status | | \n",
"------------------------+------------+--+-\n",
"embeddings.position_ids | UNEXPECTED | | \n",
"\n",
"\u001b[3mNotes:\n",
"- UNEXPECTED\u001b[3m\t:can be ignored when loading from different task/architecture; not ok if you expect identical arch.\u001b[0m\n"
]
}
],
"source": [
"import thematic_search as ts\n",
"from sentence_transformers import SentenceTransformer\n",
"\n",
"model = SentenceTransformer(\"sentence-transformers/all-mpnet-base-v2\")\n",
"\n",
"topicdb = ts.TopicDatabase.from_file('docs/source/20ng-topicdb.tm.zip')\n",
"topicdb.embedding_model = model"
]
},
{
"cell_type": "markdown",
"id": "8f6de401",
"metadata": {},
"source": [
"\n",
"Here is a sub-tree of the topic tree of the dataset:\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"id": "3a56207e",
"metadata": {},
"source": [
"## Searching documents by topic\n",
"\n",
"The more straight-forward operation is searching the set of documents by topic. For example, searching for documents about science should return us documents from the four sub-newsgroups `sci.crypt, sci.electronics, sci.med` and `sci.space`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a4f6dc09",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" post | \n",
" newsgroup | \n",
"
\n",
" \n",
" \n",
" \n",
" | 5 | \n",
" \\n\\nBack in high school I worked as a lab assi... | \n",
" sci.electronics | \n",
"
\n",
" \n",
" | 11 | \n",
" >say they have a \"history of untrustworthy be... | \n",
" sci.crypt | \n",
"
\n",
" \n",
" | 13 | \n",
" How about Kirlian imaging ? I believe the FAQ... | \n",
" sci.med | \n",
"
\n",
" \n",
" | 16 | \n",
" Many thanks to those who replied to my appeal ... | \n",
" sci.electronics | \n",
"
\n",
" \n",
" | 17 | \n",
" .........\\nI, some years ago, almost became a ... | \n",
" sci.electronics | \n",
"
\n",
" \n",
" | ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" | 18143 | \n",
" \\nYou'd have to purify the river water first. ... | \n",
" sci.med | \n",
"
\n",
" \n",
" | 18158 | \n",
" \\nProbably keep quiet and take it, lest they g... | \n",
" sci.crypt | \n",
"
\n",
" \n",
" | 18159 | \n",
" \\nThey've been out of busines for years.\\n\\n\\n... | \n",
" sci.electronics | \n",
"
\n",
" \n",
" | 18165 | \n",
" DN> From: nyeda@cnsvax.uwec.edu (David Nye)\\nD... | \n",
" sci.med | \n",
"
\n",
" \n",
" | 18166 | \n",
" \\nNot in isolated ground recepticles (usually ... | \n",
" sci.electronics | \n",
"
\n",
" \n",
"
\n",
"
3804 rows × 2 columns
\n",
"
"
],
"text/plain": [
" post newsgroup\n",
"5 \\n\\nBack in high school I worked as a lab assi... sci.electronics\n",
"11 >say they have a \"history of untrustworthy be... sci.crypt\n",
"13 How about Kirlian imaging ? I believe the FAQ... sci.med\n",
"16 Many thanks to those who replied to my appeal ... sci.electronics\n",
"17 .........\\nI, some years ago, almost became a ... sci.electronics\n",
"... ... ...\n",
"18143 \\nYou'd have to purify the river water first. ... sci.med\n",
"18158 \\nProbably keep quiet and take it, lest they g... sci.crypt\n",
"18159 \\nThey've been out of busines for years.\\n\\n\\n... sci.electronics\n",
"18165 DN> From: nyeda@cnsvax.uwec.edu (David Nye)\\nD... sci.med\n",
"18166 \\nNot in isolated ground recepticles (usually ... sci.electronics\n",
"\n",
"[3804 rows x 2 columns]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"topicdb.q.topic_name(\"sci\").samples().metadata()"
]
},
{
"cell_type": "markdown",
"id": "54594ec0",
"metadata": {},
"source": [
"Although it is straight-forward, it is surprisingly hard to replicate with keyword or semantic search. For example, searching by the keyword \"science\" gives:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0f979ae7",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" post | \n",
" newsgroup | \n",
"
\n",
" \n",
" \n",
" \n",
" | 275 | \n",
" \\nReading this definition, I wonder: when shou... | \n",
" sci.med | \n",
"
\n",
" \n",
" | 388 | \n",
" #In article <1r3tqo$ook@horus.ap.mchp.sni.de>\\... | \n",
" alt.atheism | \n",
"
\n",
" \n",
" | 389 | \n",
" -*-----\\n\\nI think the question is: What is ex... | \n",
" sci.med | \n",
"
\n",
" \n",
" | 435 | \n",
" \\n\\nPoint 1:\\n\\nI'm beginning to see that *par... | \n",
" sci.med | \n",
"
\n",
" \n",
" | 479 | \n",
" I am posting this for a friend without interne... | \n",
" sci.space | \n",
"
\n",
" \n",
" | ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" | 17671 | \n",
" \\n\\tFirst of all, I resent your assumption tha... | \n",
" soc.religion.christian | \n",
"
\n",
" \n",
" | 17811 | \n",
" Brian Yamauchi asks: [Regarding orbital billbo... | \n",
" sci.space | \n",
"
\n",
" \n",
" | 17815 | \n",
" \\n\\n\\nSorry, but yes he does, by your own desc... | \n",
" talk.religion.misc | \n",
"
\n",
" \n",
" | 17823 | \n",
" Reposted by request ... these images are great... | \n",
" comp.graphics | \n",
"
\n",
" \n",
" | 18013 | \n",
" \\n\\nOops, sorry, my words, not the words of th... | \n",
" alt.atheism | \n",
"
\n",
" \n",
"
\n",
"
251 rows × 2 columns
\n",
"
"
],
"text/plain": [
" post \\\n",
"275 \\nReading this definition, I wonder: when shou... \n",
"388 #In article <1r3tqo$ook@horus.ap.mchp.sni.de>\\... \n",
"389 -*-----\\n\\nI think the question is: What is ex... \n",
"435 \\n\\nPoint 1:\\n\\nI'm beginning to see that *par... \n",
"479 I am posting this for a friend without interne... \n",
"... ... \n",
"17671 \\n\\tFirst of all, I resent your assumption tha... \n",
"17811 Brian Yamauchi asks: [Regarding orbital billbo... \n",
"17815 \\n\\n\\nSorry, but yes he does, by your own desc... \n",
"17823 Reposted by request ... these images are great... \n",
"18013 \\n\\nOops, sorry, my words, not the words of th... \n",
"\n",
" newsgroup \n",
"275 sci.med \n",
"388 alt.atheism \n",
"389 sci.med \n",
"435 sci.med \n",
"479 sci.space \n",
"... ... \n",
"17671 soc.religion.christian \n",
"17811 sci.space \n",
"17815 talk.religion.misc \n",
"17823 comp.graphics \n",
"18013 alt.atheism \n",
"\n",
"[251 rows x 2 columns]"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"topicdb.q.samples_where(\"post.str.contains('science')\").metadata()"
]
},
{
"cell_type": "markdown",
"id": "8dcdefaf",
"metadata": {},
"source": [
"We see that we've only found 251 posts, compared to 3804 from the thematic search, and the topics of these posts are much more spread out. \n",
"\n",
"On the other hand, maybe we can do better with a semantic search? Well, semantic search immediately raises the issue that we must choose the number of nearest-neighbours to search for. We know from our thematic search that there are 3804 documents in the science topic, so what if we set $k=3804$?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dc0bd84e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"unique newsgroups: {'talk.politics.mideast', 'soc.religion.christian', 'sci.crypt', 'rec.motorcycles', 'rec.autos', 'talk.politics.misc', 'alt.atheism', 'misc.forsale', 'rec.sport.hockey', 'sci.med', 'sci.space', 'comp.graphics', 'rec.sport.baseball', 'talk.politics.guns', 'comp.sys.mac.hardware', 'sci.electronics', 'talk.religion.misc', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.windows.x'}\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" post | \n",
" newsgroup | \n",
"
\n",
" \n",
" \n",
" \n",
" | 1835 | \n",
" \\nScience is the process of modeling the real ... | \n",
" alt.atheism | \n",
"
\n",
" \n",
" | 3283 | \n",
" ... | \n",
" rec.sport.baseball | \n",
"
\n",
" \n",
" | 13984 | \n",
" \\n\\t\\t\\t ^^^^\\n\\nJust what are these \"scient... | \n",
" soc.religion.christian | \n",
"
\n",
" \n",
" | 10076 | \n",
" -- \\njamiller@kuhub.cc.ukans.edu\\nJames Miller | \n",
" misc.forsale | \n",
"
\n",
" \n",
" | 15310 | \n",
" \\n\\t\\t\\t\\t\\tart\\n | \n",
" comp.sys.mac.hardware | \n",
"
\n",
" \n",
" | ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" | 15229 | \n",
" I believe that the large number of digits... | \n",
" rec.sport.hockey | \n",
"
\n",
" \n",
" | 16154 | \n",
" It is NOT a homeopathic remedy. Improvement be... | \n",
" sci.med | \n",
"
\n",
" \n",
" | 14837 | \n",
" \\nWhere did you hear this? If it is printed i... | \n",
" comp.windows.x | \n",
"
\n",
" \n",
" | 14 | \n",
" \\n\\n\\tThere is no notion of heliocentric, or e... | \n",
" alt.atheism | \n",
"
\n",
" \n",
" | 4313 | \n",
" For the second straight game, California score... | \n",
" rec.sport.baseball | \n",
"
\n",
" \n",
"
\n",
"
3804 rows × 2 columns
\n",
"
"
],
"text/plain": [
" post \\\n",
"1835 \\nScience is the process of modeling the real ... \n",
"3283 ... \n",
"13984 \\n\\t\\t\\t ^^^^\\n\\nJust what are these \"scient... \n",
"10076 -- \\njamiller@kuhub.cc.ukans.edu\\nJames Miller \n",
"15310 \\n\\t\\t\\t\\t\\tart\\n \n",
"... ... \n",
"15229 I believe that the large number of digits... \n",
"16154 It is NOT a homeopathic remedy. Improvement be... \n",
"14837 \\nWhere did you hear this? If it is printed i... \n",
"14 \\n\\n\\tThere is no notion of heliocentric, or e... \n",
"4313 For the second straight game, California score... \n",
"\n",
" newsgroup \n",
"1835 alt.atheism \n",
"3283 rec.sport.baseball \n",
"13984 soc.religion.christian \n",
"10076 misc.forsale \n",
"15310 comp.sys.mac.hardware \n",
"... ... \n",
"15229 rec.sport.hockey \n",
"16154 sci.med \n",
"14837 comp.windows.x \n",
"14 alt.atheism \n",
"4313 rec.sport.baseball \n",
"\n",
"[3804 rows x 2 columns]"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results = topicdb.q.neighbours(\"science\", k=3804).metadata()\n",
"print('unique newsgroups:', set(results.newsgroup.values))\n",
"results"
]
},
{
"cell_type": "markdown",
"id": "671ca6c8",
"metadata": {},
"source": [
"This results in posts from literally *every* topic in the dataset. It could be however that in fact these posts *are* about science- for example that first post (index 1835) looks promising, let's look at the full text:"
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "558b1da0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Science is the process of modeling the real world based on commonly agreed\n",
"interpretations of our observations (perceptions).\n",
"\n",
"\n",
"Values can also refer to meaning. For example in computer science the\n",
"value of 1 is TRUE, and 0 is FALSE. Science is based on commonly agreed\n",
"values (interpretation of observations), although science can result in a\n",
"reinterpretation of these values.\n",
"\n",
"\n",
"The values underlaying science are not objective since they have never been\n",
"fully agreed, and the change with time. The values of Newtonian physic are\n",
"certainly different to those of Quantum Mechanics.\n"
]
}
],
"source": [
"print(results.post.values[0])"
]
},
{
"cell_type": "markdown",
"id": "0b75ae39",
"metadata": {},
"source": [
"Sure, that seems like it is \"about science\". How about that last post (index 4313)?"
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "63dc70be",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"For the second straight game, California scored a ton of late runs to crush\n",
"the Brewhas. It was six runs in the 8th for a 12-5 win Monday and five in\n",
"the 8th and six in the 9th for a 12-2 win yesterday. Jamie Navarro pitched\n",
"seven strong innings, but Orosco, Austin, Manzanillo an ...\n"
]
}
],
"source": [
"print(results.post.values[-1][0:280], \"...\")"
]
},
{
"cell_type": "markdown",
"id": "fb373dbd",
"metadata": {},
"source": [
"That's not about science! Of course, in practice we'd probably take a much smaller number of nearest-neighbours, and doing that gives pretty good results:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b673d3c7",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" post | \n",
" newsgroup | \n",
"
\n",
" \n",
" \n",
" \n",
" | 1835 | \n",
" \\nScience is the process of modeling the real ... | \n",
" alt.atheism | \n",
"
\n",
" \n",
" | 3283 | \n",
" ... | \n",
" rec.sport.baseball | \n",
"
\n",
" \n",
" | 13984 | \n",
" \\n\\t\\t\\t ^^^^\\n\\nJust what are these \"scient... | \n",
" soc.religion.christian | \n",
"
\n",
" \n",
" | 10076 | \n",
" -- \\njamiller@kuhub.cc.ukans.edu\\nJames Miller | \n",
" misc.forsale | \n",
"
\n",
" \n",
" | 15310 | \n",
" \\n\\t\\t\\t\\t\\tart\\n | \n",
" comp.sys.mac.hardware | \n",
"
\n",
" \n",
" | 1230 | \n",
" -*----\\nI think that part of the problem is th... | \n",
" sci.med | \n",
"
\n",
" \n",
" | 14489 | \n",
" Posted by Cathy Smith for L. Neil Smith\\n\\n ... | \n",
" talk.politics.guns | \n",
"
\n",
" \n",
" | 9688 | \n",
" Archive-name: space/mnemonics\\nLast-modified: ... | \n",
" sci.space | \n",
"
\n",
" \n",
" | 2290 | \n",
" \\nCrullerian.\\n\\n\\nCrullerian photography isn'... | \n",
" sci.med | \n",
"
\n",
" \n",
" | 3449 | \n",
" \\nSays who? Other than a hear-say god.\\n\\n\\nYo... | \n",
" alt.atheism | \n",
"
\n",
" \n",
" | 8740 | \n",
" \\nNowadays, usually with a computer. No theory... | \n",
" sci.space | \n",
"
\n",
" \n",
" | 5387 | \n",
" -- \\n ____\\n Y_,_|[]| Ernest Stalnaker... | \n",
" comp.sys.mac.hardware | \n",
"
\n",
" \n",
" | 4625 | \n",
" [...stuff deleted...]\\n\\nThank you. I thought... | \n",
" alt.atheism | \n",
"
\n",
" \n",
" | 389 | \n",
" -*-----\\n\\nI think the question is: What is ex... | \n",
" sci.med | \n",
"
\n",
" \n",
" | 11666 | \n",
" \\nWhether a scientific idea comes while one is... | \n",
" sci.med | \n",
"
\n",
" \n",
" | 13976 | \n",
" Brandon Wise\\nbwise@nyx.cs.du.edu\\n\\n\\n | \n",
" comp.os.ms-windows.misc | \n",
"
\n",
" \n",
" | 4637 | \n",
" \\ntry sci.energy | \n",
" sci.electronics | \n",
"
\n",
" \n",
" | 1136 | \n",
" In-Reply-To: <20APR199312262902@rigel.tamu.edu... | \n",
" comp.graphics | \n",
"
\n",
" \n",
" | 4047 | \n",
" \\nRobert McElwaine is the authoritative source... | \n",
" sci.space | \n",
"
\n",
" \n",
" | 15026 | \n",
" \\n\\nFor a brief, but pretty detailed account, ... | \n",
" sci.med | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" post \\\n",
"1835 \\nScience is the process of modeling the real ... \n",
"3283 ... \n",
"13984 \\n\\t\\t\\t ^^^^\\n\\nJust what are these \"scient... \n",
"10076 -- \\njamiller@kuhub.cc.ukans.edu\\nJames Miller \n",
"15310 \\n\\t\\t\\t\\t\\tart\\n \n",
"1230 -*----\\nI think that part of the problem is th... \n",
"14489 Posted by Cathy Smith for L. Neil Smith\\n\\n ... \n",
"9688 Archive-name: space/mnemonics\\nLast-modified: ... \n",
"2290 \\nCrullerian.\\n\\n\\nCrullerian photography isn'... \n",
"3449 \\nSays who? Other than a hear-say god.\\n\\n\\nYo... \n",
"8740 \\nNowadays, usually with a computer. No theory... \n",
"5387 -- \\n ____\\n Y_,_|[]| Ernest Stalnaker... \n",
"4625 [...stuff deleted...]\\n\\nThank you. I thought... \n",
"389 -*-----\\n\\nI think the question is: What is ex... \n",
"11666 \\nWhether a scientific idea comes while one is... \n",
"13976 Brandon Wise\\nbwise@nyx.cs.du.edu\\n\\n\\n \n",
"4637 \\ntry sci.energy \n",
"1136 In-Reply-To: <20APR199312262902@rigel.tamu.edu... \n",
"4047 \\nRobert McElwaine is the authoritative source... \n",
"15026 \\n\\nFor a brief, but pretty detailed account, ... \n",
"\n",
" newsgroup \n",
"1835 alt.atheism \n",
"3283 rec.sport.baseball \n",
"13984 soc.religion.christian \n",
"10076 misc.forsale \n",
"15310 comp.sys.mac.hardware \n",
"1230 sci.med \n",
"14489 talk.politics.guns \n",
"9688 sci.space \n",
"2290 sci.med \n",
"3449 alt.atheism \n",
"8740 sci.space \n",
"5387 comp.sys.mac.hardware \n",
"4625 alt.atheism \n",
"389 sci.med \n",
"11666 sci.med \n",
"13976 comp.os.ms-windows.misc \n",
"4637 sci.electronics \n",
"1136 comp.graphics \n",
"4047 sci.space \n",
"15026 sci.med "
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"topicdb.q.neighbours(\"science\", k=20).metadata()"
]
},
{
"cell_type": "markdown",
"id": "98ca9c48",
"metadata": {},
"source": [
"However it shows that searching by topic does something new and conceptually different than other search techniques. "
]
},
{
"cell_type": "markdown",
"id": "a90d37e1",
"metadata": {},
"source": [
"## Finding the theme of a set of documents\n",
"\n",
"Going in the other direction, we can take a set of documents, and ask for their theme. This is a useful thing to ask in exploratory analysis of data. We want to understand what these documents *have in common*, and as specifically as possible.\n",
"\n",
"For example, lets take a handful of documents about baseball and hockey and ask for their theme. Refering to the 20 Newsgroups topic-tree at the top of the page, we can see that the most-specific topic that contains `rec.sport.baseball` and `rec.sport.hockey` is `rec.sport`. That is, the *theme* is *sports*.\n",
"\n",
"In code:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "76ea93a1",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" post | \n",
" newsgroup | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" \\n\\nI am sure some bashers of Pens fans are pr... | \n",
" rec.sport.hockey | \n",
"
\n",
" \n",
" | 7 | \n",
" \\n[stuff deleted]\\n\\nOk, here's the solution t... | \n",
" rec.sport.hockey | \n",
"
\n",
" \n",
" | 8 | \n",
" \\n\\n\\nYeah, it's the second one. And I believ... | \n",
" rec.sport.hockey | \n",
"
\n",
" \n",
" | 33 | \n",
" \\nBe patient. He has a sore shoulder from cras... | \n",
" rec.sport.baseball | \n",
"
\n",
" \n",
" | 18132 | \n",
" Can someone send me ticket ordering informatio... | \n",
" rec.sport.baseball | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" post newsgroup\n",
"0 \\n\\nI am sure some bashers of Pens fans are pr... rec.sport.hockey\n",
"7 \\n[stuff deleted]\\n\\nOk, here's the solution t... rec.sport.hockey\n",
"8 \\n\\n\\nYeah, it's the second one. And I believ... rec.sport.hockey\n",
"33 \\nBe patient. He has a sore shoulder from cras... rec.sport.baseball\n",
"18132 Can someone send me ticket ordering informatio... rec.sport.baseball"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"input_docs = topicdb.q.samples([0,7,8,33,18132]).metadata()\n",
"input_docs"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fe411ff4",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" layer | \n",
" cluster | \n",
"
\n",
" \n",
" | uid | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | AAQH | \n",
" rec.sport | \n",
" 1 | \n",
" 6 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name layer cluster\n",
"uid \n",
"AAQH rec.sport 1 6"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"topicdb.q.samples([0,7,8,33,18132]).theme().metadata()"
]
},
{
"cell_type": "markdown",
"id": "32c64351",
"metadata": {},
"source": [
"Mathematically, one could define the theme of the set of documents to be the least upper bound (in the topic tree) of the set of topics containing at least one of the documents. That is, *theme* should be equivalent to *topics + least upper bound*:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c6820012",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" layer | \n",
" cluster | \n",
"
\n",
" \n",
" | uid | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | AAQH | \n",
" rec.sport | \n",
" 1 | \n",
" 6 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name layer cluster\n",
"uid \n",
"AAQH rec.sport 1 6"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"topicdb.q.samples([0,7,8,33,18132]).topics().least_upper_bound().metadata()"
]
},
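{
"cell_type": "markdown",
"id": "7c1e5a20",
"metadata": {},
"source": [
"To make the least upper bound concrete, here is a minimal, self-contained sketch that does not use `thematic_search` at all. It assumes, purely for illustration, that each topic is identified by its dotted newsgroup name and that its ancestors are the dotted prefixes of that name (so `rec.sport.hockey` sits under `rec.sport`, which sits under `rec`). The helper `least_upper_bound_of_names` is hypothetical and is not part of the library; a real topic tree need not follow this naming scheme."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7c1e5a21",
"metadata": {},
"outputs": [],
"source": [
"def least_upper_bound_of_names(names):\n",
"    # Hypothetical helper: longest shared dotted prefix, or \"root\" if there is none.\n",
"    paths = [name.split(\".\") for name in names]\n",
"    common = []\n",
"    # zip stops at the shortest path; keep components while every path agrees.\n",
"    for parts in zip(*paths):\n",
"        if len(set(parts)) != 1:\n",
"            break\n",
"        common.append(parts[0])\n",
"    return \".\".join(common) if common else \"root\"\n",
"\n",
"\n",
"print(least_upper_bound_of_names([\"rec.sport.hockey\", \"rec.sport.baseball\"]))  # rec.sport\n",
"print(least_upper_bound_of_names([\"rec.sport.baseball\", \"sci.space\"]))  # root"
]
},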
{
"cell_type": "markdown",
"id": "df0680b0",
"metadata": {},
"source": [
"Although this makes mathematical sense, it has one major drawback in practice. Often you will have a set of documents where *most* of the documents share a common theme, but there are a couple spurrious other documents. These spurrious documents can be on topics in a completely separate branch of the topic tree, meaning that the least upper bound of the set of all topics ends up very far up the tree.\n",
"\n",
"For example, let's throw in an extra document from `rec.autos` in there and compare the two queries: "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3b96243b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" post newsgroup\n",
"0 \\n\\nI am sure some bashers of Pens fans are pr... rec.sport.hockey\n",
"7 \\n[stuff deleted]\\n\\nOk, here's the solution t... rec.sport.hockey\n",
"8 \\n\\n\\nYeah, it's the second one. And I believ... rec.sport.hockey\n",
"33 \\nBe patient. He has a sore shoulder from cras... rec.sport.baseball\n",
"18132 Can someone send me ticket ordering informatio... rec.sport.baseball\n",
"18169 After a tip from Gary Crum (crum@fcom.cc.utah.... rec.autos\n",
"==================================================\n",
"Least Upper Bound of Topics is: rec\n",
"Theme of the Documents is: rec.sport\n"
]
}
],
"source": [
"print(topicdb.q.samples([0,7,8,33,18132,18169]).metadata())\n",
"print(\"=\"*50)\n",
"least_upper_bound_result = topicdb.q.samples([0,7,8,33,18132,18169]).topics( \n",
").least_upper_bound().metadata()['name'].values[0]\n",
"\n",
"theme_result = topicdb.q.samples([0,7,8,33,18132,18169]).theme().metadata()['name'].values[0]\n",
"\n",
"print(f\"Least Upper Bound of Topics is: {least_upper_bound_result}\")\n",
"print(f\"Theme of the Documents is: {theme_result}\")\n"
]
},
{
"cell_type": "markdown",
"id": "5e490df3",
"metadata": {},
"source": [
"The least upper bound of `{rec.sport.hockey, rec.sport.baseball, rec.autos}` is `rec`, but the overall theme of the documents is still `rec.sport`. The difference can be even more pronounced if we do nearest-neighbour search to obtain our input documents:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3d9bdf85",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" post \\\n",
"6623 BASEBALL.....ALWAYS\\nhow he\\n\\n \n",
"12791 Since someone brought up sports radio, howabou... \n",
"7945 I found this press release from Trial Lawyers ... \n",
"13714 boxscores.\\n\\nBeware. The original poster loo... \n",
"15310 \\n\\t\\t\\t\\t\\tart\\n \n",
"4950 Can some on e give me some stats on Forsrg in ... \n",
"11528 \\n\\n\\nWoops! This is rec.sport.hockey! Not re... \n",
"1423 Archive-name: rec-autos/part1\\n\\n[most recent ... \n",
"15420 ------------------------- Original Article ---... \n",
"11480 One week to the Robot Olympic games. Fire up ... \n",
"5376 \\nDancing With Idjits.\\n\\n\\n \n",
"15541 I sent a version of this post out a while ago,... \n",
"15206 Here's an easy question for someone who knows ... \n",
"11932 \\nAnd thus, we come to one of the true beautie... \n",
"16342 \\n\\n\\n\\nI hear ya, brother.\\n\\n ^^^^^^^... \n",
"\n",
" newsgroup \n",
"6623 rec.sport.baseball \n",
"12791 rec.sport.baseball \n",
"7945 rec.sport.hockey \n",
"13714 rec.sport.baseball \n",
"15310 comp.sys.mac.hardware \n",
"4950 rec.sport.hockey \n",
"11528 rec.sport.hockey \n",
"1423 rec.autos \n",
"15420 rec.sport.baseball \n",
"11480 sci.electronics \n",
"5376 rec.motorcycles \n",
"15541 rec.sport.baseball \n",
"15206 rec.sport.baseball \n",
"11932 rec.sport.baseball \n",
"16342 rec.sport.baseball \n",
"==================================================\n",
"Least Upper Bound of Topics is: root\n",
"Theme of the Documents is: rec.sport\n"
]
}
],
"source": [
"print(topicdb.q.neighbours(\"sports\", k=15).metadata())\n",
"print(\"=\"*50)\n",
"least_upper_bound_result = topicdb.q.neighbours(\"sports\", k=15).topics( \n",
").least_upper_bound().metadata()['name'].values[0]\n",
"\n",
"theme_result = topicdb.q.neighbours(\"sports\", k=15).theme().metadata()['name'].values[0]\n",
"\n",
"print(f\"Least Upper Bound of Topics is: {least_upper_bound_result}\")\n",
"print(f\"Theme of the Documents is: {theme_result}\")\n"
]
},
{
"cell_type": "markdown",
"id": "6b813e21",
"metadata": {},
"source": [
"The nearest-neighbour results are sufficient spread across the set of topics that the their least upper bound in the topic tree is the root node!\n",
"\n",
"So how does the `theme` function get around this? Theme searches for a topic that minimizes an objective function that simultaneously tries to maximize the number of documents included in the topic (weighted by inclusion strength) while minimizing how far up the topic tree the node is. Instead of a \"least upper bound\", its more of a \"lowish, almost-upper-bound\". \n",
"\n",
"This trade-off makes the thematic search much more robust to imperfectly categorized data, especially when we are working with datasets that have soft cluster strengths."
]
}
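,
{
"cell_type": "markdown",
"id": "7c1e5a22",
"metadata": {},
"source": [
"The trade-off described above can be illustrated with a toy scoring rule. The sketch below is an assumption for illustration only and is not the objective that `thematic_search` actually uses. It treats dotted newsgroup names as paths in the topic tree, counts each document as fully inside or outside a topic (no soft inclusion strengths), and scores a candidate topic as the fraction of documents it covers plus a hypothetical `depth_bonus` for every level below the root. The names `covers`, `toy_theme` and `depth_bonus` are made up for this example, and the result depends on the value chosen for `depth_bonus`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7c1e5a23",
"metadata": {},
"outputs": [],
"source": [
"def covers(topic, label):\n",
"    # True if `label` is `topic` itself or lies below it in the dotted-name tree.\n",
"    return topic == \"root\" or label == topic or label.startswith(topic + \".\")\n",
"\n",
"\n",
"def toy_theme(labels, depth_bonus=0.25):\n",
"    # Candidates: the root plus every dotted prefix of every label.\n",
"    candidates = {\"root\"}\n",
"    for label in labels:\n",
"        parts = label.split(\".\")\n",
"        candidates.update(\".\".join(parts[:i]) for i in range(1, len(parts) + 1))\n",
"\n",
"    def score(topic):\n",
"        # Reward covering many documents and reward sitting deep in the tree.\n",
"        coverage = sum(covers(topic, label) for label in labels) / len(labels)\n",
"        depth = 0 if topic == \"root\" else topic.count(\".\") + 1\n",
"        return coverage + depth_bonus * depth\n",
"\n",
"    # Sort first so that ties are broken deterministically by name.\n",
"    return max(sorted(candidates), key=score)\n",
"\n",
"\n",
"# Same newsgroup labels as the six-document example above.\n",
"labels = [\"rec.sport.hockey\"] * 3 + [\"rec.sport.baseball\"] * 2 + [\"rec.autos\"]\n",
"print(toy_theme(labels))  # rec.sport"
]
}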
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.14"
}
},
"nbformat": 4,
"nbformat_minor": 5
}