Searching 20-Newsgroups
In this notebook, we will show you how to prepare the 20-Newsgroups dataset for Thematic Search.
This dataset consists of ~18,000 posts from Usenet which are sorted into newsgroups that have a hierarchical structure. For example, two newsgroups are called rec.sport.hockey and rec.sport.baseball, which are both groups under the namespace rec.sport, itself under the namespace rec. We can use this hierarchical structure to build the topic tree for our thematic search.
To begin, we will fetch a dataset from HuggingFace that contains 20-Newsgroups with precomputed embeddings and UMAP reduced vectors:
[1]:
import thematic_search as ts
import numpy as np
import pandas as pd
newsgroups_df = pd.read_parquet("hf://datasets/lmcinnes/20newsgroups_embedded/data/train-00000-of-00001.parquet")
newsgroups_df.head(1)
[1]:
| | post | newsgroup | embedding | map |
|---|---|---|---|---|
| 0 | \n\nI am sure some bashers of Pens fans are pr... | rec.sport.hockey | [-0.04380008950829506, 0.08495834469795227, -0... | [-0.13199903070926666, 10.1972017288208] |
Building the Topic Tree
The first thing we need to do is build the cluster tree that we will pass to thematic_search.TopicDatabase. The cluster tree should be a dictionary with entries { vertex:[child_1, child_2,...,child_n]}, where each vertex is a tuple (layer, cluster).
First, we will build a tree of strings by assigning each vertex a parent according to its name structure: the parent of rec.sport.hockey will be rec.sport. Using this, we can form a dictionary whose keys are vertices and whose values are lists of children.
Then we can convert this dictionary into the required form using thematic_search.utils.convert_tree. This takes a tree of strings and converts it to a tree of tuples. It returns the converted tree, cluster_tree, and a dictionary, cluster_labels, that maps each cluster (l, c) to its string name.
The utility thematic_search.utils.print_tree can be used to print the tree and check that it is correct.
[2]:
from collections import defaultdict

tags = np.unique(newsgroups_df['newsgroup'].to_numpy())

def build_tree(paths):
    # Map each node to the set of its children; accessing a key creates it.
    tree = defaultdict(set)
    tree["root"]
    for p in paths:
        parts = p.split(".")
        for i in range(len(parts)):
            node = ".".join(parts[:i+1])
            parent = "root" if i == 0 else ".".join(parts[:i])
            tree[parent].add(node)
            tree[node]  # make sure leaves also appear as keys
    return {k: sorted(v) for k, v in tree.items()}

tree = build_tree(tags)
cluster_tree, cluster_labels = ts.utils.convert_tree(tree)
ts.utils.print_tree(cluster_tree, cluster_labels=cluster_labels)
root
--alt
----alt.atheism
--comp
----comp.graphics
----comp.os
------comp.os.ms-windows
--------comp.os.ms-windows.misc
----comp.sys
------comp.sys.ibm
--------comp.sys.ibm.pc
----------comp.sys.ibm.pc.hardware
------comp.sys.mac
--------comp.sys.mac.hardware
----comp.windows
------comp.windows.x
--misc
----misc.forsale
--rec
----rec.autos
----rec.motorcycles
----rec.sport
------rec.sport.baseball
------rec.sport.hockey
--sci
----sci.crypt
----sci.electronics
----sci.med
----sci.space
--soc
----soc.religion
------soc.religion.christian
--talk
----talk.politics
------talk.politics.guns
------talk.politics.mideast
------talk.politics.misc
----talk.religion
------talk.religion.misc
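To make the (layer, cluster) format concrete, here is a minimal sketch of one way a tree of strings could be relabelled with tuples. It is not the library's implementation: the function name label_by_height and the toy tree are hypothetical, and the convention used (layer is the node's height above its deepest leaf, cluster is the node's position among the sorted names in that layer) is an assumption that merely agrees with the labels visible in this notebook's outputs.

```python
def label_by_height(string_tree, root="root"):
    """Map each node name to a (layer, cluster) tuple.

    Assumed convention (not taken from the library source):
    layer   = height of the node above its deepest leaf (leaves are layer 0)
    cluster = index of the node among the sorted names in its layer
    """
    heights = {}

    def height(node):
        if node not in heights:
            children = string_tree.get(node, [])
            heights[node] = 1 + max(map(height, children)) if children else 0
        return heights[node]

    height(root)

    # Group node names by layer, then number them in sorted order.
    layers = {}
    for node, h in heights.items():
        layers.setdefault(h, []).append(node)

    labels = {}
    for h, names in layers.items():
        for idx, name in enumerate(sorted(names)):
            labels[name] = (h, idx)
    return labels


# Hypothetical toy tree in the same shape as the output of build_tree.
toy_tree = {
    "root": ["rec", "sci"],
    "rec": ["rec.autos", "rec.sport"],
    "rec.sport": ["rec.sport.baseball", "rec.sport.hockey"],
    "sci": ["sci.space"],
    "rec.autos": [],
    "rec.sport.baseball": [],
    "rec.sport.hockey": [],
    "sci.space": [],
}

labels = label_by_height(toy_tree)
tuple_tree = {
    labels[parent]: [labels[child] for child in children]
    for parent, children in toy_tree.items()
}
```

Note that under this convention a vertex's children need not all sit in the layer directly below it: here rec is in layer 2, while its child rec.autos is a leaf in layer 0.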
Building the Topic Metadata
Next, we need to make a pandas dataframe with metadata about the topics. This dataframe’s index must match the keys of your cluster tree. In our case, our cluster tree’s keys are the newsgroups’ names. Having these equal allows the TopicDatabase to correctly match rows of the topic dataframe with vertices of the cluster tree.
For the 20-newsgroups dataset, each topic has a string name that we also want to store, and we may as well also include the layer and cluster number in the metadata:
[3]:
tag_to_tuple = {v: k for k, v in cluster_labels.items()}
data = []
for tag in tree.keys():
    layer, cluster = tag_to_tuple[tag]
    data.append({
        'index': tag,
        'name': tag,
        'layer': layer,
        'cluster': cluster,
    })
topic_df = pd.DataFrame(data)
topic_df = topic_df.set_index('index')
topic_df.head(2)
[3]:
| index | name | layer | cluster |
|---|---|---|---|
| root | root | 5 | 0 |
| alt | alt | 1 | 0 |
Building the Topic Inclusion Matrices
The final bit of information we need to construct a TopicDatabase are the matrices encoding the inclusion strengths of each sample in each topic. For 20-newsgroups, this will be a binary matrix.
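As a small illustration of what a binary inclusion matrix looks like, the sketch below builds one for a handful of made-up labels using NumPy broadcasting. The data and variable names are hypothetical; the only assumption carried over from this notebook is the layout, with one row per document and one column per leaf topic.

```python
import numpy as np

# Hypothetical toy data: one newsgroup label per document.
doc_labels = np.array(["sci.space", "rec.autos", "sci.space", "sci.med"])

# np.unique returns the distinct labels in sorted order; use it as column order.
leaf_names = np.unique(doc_labels)

# Compare every document label with every leaf name:
# entry (i, j) is 1.0 when document i belongs to leaf j, else 0.0.
inclusion = (doc_labels[:, None] == leaf_names[None, :]).astype(float)
```

Each row contains exactly one 1.0 because every post belongs to exactly one newsgroup. A soft clustering would instead place fractional strengths in each row.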
We need a list of matrices, one for each layer of the cluster hierarchy. For 20-newsgroups, the upper topics (such as alt) don’t have their own posts, but we will consider any post in a topic’s children to also be in the topic. As this is a common situation, there is a utility, thematic_search.utils.cluster_layers_from_leaf_matrix, that only requires us to build the matrix for the layer-0 nodes.
[4]:
leaves = [(l, c) for l, c in cluster_tree.keys() if l == 0]
n_samples = len(newsgroups_df)
n_leaves = len(leaves)
leaf_matrix = np.zeros((n_samples, n_leaves))
for l, c in leaves:
    tag_name = cluster_labels[(l, c)]
    leaf_matrix[:, c] = newsgroups_df['newsgroup'].apply(
        lambda x: x == tag_name
    )
cluster_matrices = ts.utils.cluster_layers_from_leaf_matrix(
    cluster_tree, leaf_matrix
)
for matrix in cluster_matrices:
    print(matrix.shape)
(18170, 20)
(18170, 11)
(18170, 5)
(18170, 1)
(18170, 1)
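The idea of treating a post in a child topic as also belonging to the parent can be sketched as follows. This is not the code of cluster_layers_from_leaf_matrix; it assumes that a parent's membership is the element-wise maximum of its children's memberships (which for binary matrices is a logical OR), and the helper propagate_up and the toy data are hypothetical.

```python
import numpy as np

def propagate_up(string_tree, leaf_columns, node):
    """Return the membership vector of `node`.

    A leaf returns its own column; an internal node returns the
    element-wise maximum over the vectors of its children.
    """
    children = string_tree.get(node, [])
    if not children:
        return leaf_columns[node]
    return np.maximum.reduce(
        [propagate_up(string_tree, leaf_columns, child) for child in children]
    )

# Hypothetical toy subtree and four documents (the last one is in no leaf here).
toy_tree = {
    "rec": ["rec.autos", "rec.sport"],
    "rec.sport": ["rec.sport.baseball", "rec.sport.hockey"],
}
leaf_columns = {
    "rec.autos": np.array([1.0, 0.0, 0.0, 0.0]),
    "rec.sport.baseball": np.array([0.0, 1.0, 0.0, 0.0]),
    "rec.sport.hockey": np.array([0.0, 0.0, 1.0, 0.0]),
}

rec_sport = propagate_up(toy_tree, leaf_columns, "rec.sport")
rec = propagate_up(toy_tree, leaf_columns, "rec")
```

Stacking the vectors of all nodes in one layer as columns gives a matrix with one row per document, which matches the shapes printed above.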
Initializing the Database
Now we have everything we need to initialize the database for thematic search. We’re going to load the sentence-transformers model that was used to embed the dataset, so we have it available for semantic nearest-neighbour search. Everything else was constructed in the previous steps!
[5]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
soft_cluster_tree = ts.SoftClusterTree(
    cluster_matrices,
    cluster_tree,
)
topic_db = ts.TopicDatabase(
    soft_cluster_tree=soft_cluster_tree,
    embedding_vectors=np.stack(newsgroups_df['embedding'].values),
    reduced_vectors=np.stack(newsgroups_df['map'].values),
    document_df=newsgroups_df[['post', 'newsgroup']],
    topic_df=topic_df,
    cluster_labels=cluster_labels,
    embedding_model=model,
)
topic_db
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
MPNetModel LOAD REPORT from: sentence-transformers/all-mpnet-base-v2
Key | Status | |
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED | |
Notes:
- UNEXPECTED :can be ignored when loading from different task/architecture; not ok if you expect identical arch.
[5]:
<thematic_search.topicdatabase.TopicDatabase at 0x1d186bc7b60>
Example Queries
Finally, all that’s left to do is query the dataset!
First, we can query by topic name; we will query for “documents inside the topic named rec.sport”.
[6]:
topic_db.q.topic_name('rec.sport').inside().metadata()
[6]:
| | post | newsgroup |
|---|---|---|
| 0 | \n\nI am sure some bashers of Pens fans are pr... | rec.sport.hockey |
| 7 | \n[stuff deleted]\n\nOk, here's the solution t... | rec.sport.hockey |
| 8 | \n\n\nYeah, it's the second one. And I believ... | rec.sport.hockey |
| 24 | I don't know the exact coverage in the states.... | rec.sport.hockey |
| 33 | \nBe patient. He has a sore shoulder from cras... | rec.sport.baseball |
| ... | ... | ... |
| 18132 | Can someone send me ticket ordering informatio... | rec.sport.baseball |
| 18135 | \n\n\n\nSaku isn't that small any longer I gue... | rec.sport.hockey |
| 18152 | \n \n Well, I'm a Wings fan and I think the F... | rec.sport.hockey |
| 18154 | \n\n\n\n\n Anaheim. | rec.sport.baseball |
| 18161 | \nAnd won't they have to change their name to ... | rec.sport.hockey |
1927 rows × 2 columns
Next, let’s query by semantic search. We can query for “information about the theme of the documents semantically close to the string ‘Recent advancements in space exploration’”.
[7]:
topic_db.q.neighbours("Recent advancements in space exploration").neighbours().theme().info()
[7]:
| index | name | layer | cluster |
|---|---|---|---|
| 14 | sci.space | 0 | 14 |
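For intuition about the semantic step, the sketch below ranks documents by cosine similarity to a query vector using plain NumPy. It is only an illustration under stated assumptions: the function nearest_neighbours and the two-dimensional toy vectors are hypothetical, and the library may use a different similarity measure or an approximate index. In the notebook, the query vector would come from encoding the query string with the same sentence-transformers model that produced the stored embeddings.

```python
import numpy as np

def nearest_neighbours(query_vec, doc_vecs, k=3):
    """Return the indices of the k documents most similar to the query.

    Similarity is cosine similarity, an assumption made for this sketch.
    """
    doc_unit = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    similarities = doc_unit @ query_unit
    # Sort by descending similarity and keep the first k indices.
    return np.argsort(-similarities)[:k]

# Hypothetical toy embeddings, one row per document.
doc_vecs = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])
query_vec = np.array([1.0, 0.0])

top = nearest_neighbours(query_vec, doc_vecs, k=2)
```

The theme of the query is then derived from the topics of the returned documents rather than from the query text itself.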
Let’s do one more example of searching for a theme. We will pick out a handful of documents from rec.sport.hockey and rec.sport.baseball, and ask what their theme is:
[8]:
print(topic_db.q.from_docs([0,7,8,33,18132]).metadata())
topic_db.q.from_docs([0,7,8,33,18132]).theme().info()
post newsgroup
0 \n\nI am sure some bashers of Pens fans are pr... rec.sport.hockey
7 \n[stuff deleted]\n\nOk, here's the solution t... rec.sport.hockey
8 \n\n\nYeah, it's the second one. And I believ... rec.sport.hockey
33 \nBe patient. He has a sore shoulder from cras... rec.sport.baseball
18132 Can someone send me ticket ordering informatio... rec.sport.baseball
[8]:
| index | name | layer | cluster |
|---|---|---|---|
| 26 | rec.sport | 1 | 6 |
As you probably expected, the theme is rec.sport: this is the least upper bound of the documents in our query, that is, the lowest topic in the tree that contains all of them.
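Because the newsgroup names encode the hierarchy, the least upper bound of a set of hard-labelled documents can be checked by hand as the longest shared dotted prefix of their newsgroup names. The sketch below does exactly that. It is a sanity check on the names only, not how theme() is computed by the library, and the helper common_newsgroup is hypothetical.

```python
def common_newsgroup(names, root="root"):
    """Return the deepest newsgroup namespace shared by all names.

    Falls back to `root` when the names share no leading component.
    """
    split_names = [name.split(".") for name in names]
    shared = []
    # Walk the name components in parallel until they stop agreeing.
    for parts in zip(*split_names):
        if len(set(parts)) != 1:
            break
        shared.append(parts[0])
    return ".".join(shared) if shared else root


theme = common_newsgroup([
    "rec.sport.hockey",
    "rec.sport.hockey",
    "rec.sport.hockey",
    "rec.sport.baseball",
    "rec.sport.baseball",
])
```

With the five documents queried above, this gives the same answer as the theme query, and mixing in a post from rec.autos would move the result up to rec.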