Searching 20-Newsgroups

In this notebook, we will show you how to prepare the 20-Newsgroups dataset for Thematic Search.

This dataset consists of ~18,000 posts from Usenet which are sorted into newsgroups that have a hierarchical structure. For example, two newsgroups are called rec.sport.hockey and rec.sport.baseball, which are both groups under the namespace rec.sport, itself under the namespace rec. We can use this hierarchical structure to build the topic tree for our thematic search.

To begin, we will fetch a dataset from HuggingFace that contains 20-Newsgroups with precomputed embeddings and UMAP-reduced vectors:

[1]:
import thematic_search as ts
import numpy as np
import pandas as pd

newsgroups_df = pd.read_parquet("hf://datasets/lmcinnes/20newsgroups_embedded/data/train-00000-of-00001.parquet")
newsgroups_df.head(1)
[1]:
post newsgroup embedding map
0 \n\nI am sure some bashers of Pens fans are pr... rec.sport.hockey [-0.04380008950829506, 0.08495834469795227, -0... [-0.13199903070926666, 10.1972017288208]

Building the Topic Tree

The first thing we need to do is build the cluster tree that we will pass to thematic_search.TopicDatabase. The cluster tree should be a dictionary with entries { vertex:[child_1, child_2,...,child_n]}, where each vertex is a tuple (layer, cluster).
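
To make the dictionary format concrete, here is a small hand-written tree in that shape, together with a sanity check. This is only an illustrative sketch: the helper `check_tree` is hypothetical (it is not part of thematic_search), the layer numbers are made up with leaves placed at layer 0, and giving leaves their own empty entries is an assumption.

```python
# Toy cluster tree in the {vertex: [children]} form described above.
# Each vertex is a (layer, cluster) tuple; leaves sit at layer 0 here.
toy_cluster_tree = {
    (2, 0): [(1, 0), (0, 2)],   # root: one internal child and one leaf child
    (1, 0): [(0, 0), (0, 1)],   # internal vertex with two leaf children
    (0, 0): [],
    (0, 1): [],
    (0, 2): [],
}


def check_tree(tree):
    """Hypothetical helper: True if the dictionary describes a single rooted tree.

    Every child must itself be a key, no vertex may have two parents,
    and exactly one vertex (the root) may never appear as a child.
    """
    seen = set()
    for children in tree.values():
        for child in children:
            if child not in tree or child in seen:
                return False
            seen.add(child)
    return len(tree) - len(seen) == 1


print(check_tree(toy_cluster_tree))
```

A check like this can catch a mistyped child or a vertex listed under two parents before the tree is handed to the database.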

First, we will build a tree of strings by assigning each vertex a parent according to its name structure: the parent of rec.sport.hockey is rec.sport. Using this, we can form a dictionary whose keys are vertices and whose values are lists of children.

Then we can convert this dictionary into the required form using thematic_search.utils.convert_string_tree. This takes a tree of strings and converts it to a tree of tuples. It returns the converted tree, cluster_tree, and a dictionary, cluster_labels, that maps clusters (l,c) to their string names.
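
To give a feel for what such a conversion involves, the sketch below relabels a string tree with (layer, cluster) tuples. It is not the library's implementation. The function `string_tree_to_tuples` is hypothetical, and its numbering scheme is an assumption: layer is the node's height above the leaves, and cluster is the node's position in sorted-name order within its layer. That scheme agrees with the few labels visible in this notebook (root at layer 5, rec.sport at (1, 6), sci.space at (0, 14)), but the real utility may assign numbers differently.

```python
def string_tree_to_tuples(tree, root="root"):
    """Hypothetical sketch: relabel a {name: [child names]} tree with tuples."""
    heights = {}

    def height(node):
        # Leaves have height 0; a parent sits one above its tallest child.
        if node not in heights:
            children = tree.get(node, [])
            heights[node] = 1 + max(map(height, children)) if children else 0
        return heights[node]

    height(root)

    # Number the nodes of each layer in sorted-name order (an assumption).
    counters = {}
    labels = {}  # name -> (layer, cluster)
    for name in sorted(heights):
        layer = heights[name]
        labels[name] = (layer, counters.get(layer, 0))
        counters[layer] = counters.get(layer, 0) + 1

    tuple_tree = {
        labels[name]: [labels[child] for child in tree.get(name, [])]
        for name in heights
    }
    names = {tup: name for name, tup in labels.items()}
    return tuple_tree, names


toy_tree = {
    "root": ["rec", "sci"],
    "rec": ["rec.autos", "rec.sport"],
    "rec.sport": ["rec.sport.baseball", "rec.sport.hockey"],
    "sci": ["sci.space"],
    "rec.autos": [],
    "rec.sport.baseball": [],
    "rec.sport.hockey": [],
    "sci.space": [],
}

toy_cluster_tree, toy_labels = string_tree_to_tuples(toy_tree)
print(toy_labels[(1, 0)], toy_cluster_tree[(1, 0)])
```

In this toy tree rec.sport and sci both land in layer 1 because all of their children are leaves, while rec is pushed up to layer 2 by its child rec.sport.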

The utility thematic_search.utils.print_tree can be used to print the tree and check that it is correct.

[2]:
tags = np.unique(newsgroups_df['newsgroup'].to_numpy())

from collections import defaultdict
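def build_tree(paths):
    """Map each node name to a sorted list of its children's names."""
    tree = defaultdict(set)
    tree["root"]  # touching the key creates the (initially empty) root entry
    for p in paths:
        parts = p.split(".")
        for i in range(len(parts)):
            # Every dotted prefix is a node: rec -> rec.sport -> rec.sport.hockey
            node = ".".join(parts[:i+1])
            parent = "root" if i == 0 else ".".join(parts[:i])
            tree[parent].add(node)
            tree[node]  # make sure leaves also appear as keys
    return {k: sorted(v) for k, v in tree.items()}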

tree = build_tree(tags)

cluster_tree, cluster_labels = ts.utils.convert_tree(tree)
ts.utils.print_tree(cluster_tree, cluster_labels=cluster_labels)
root
--alt
----alt.atheism
--comp
----comp.graphics
----comp.os
------comp.os.ms-windows
--------comp.os.ms-windows.misc
----comp.sys
------comp.sys.ibm
--------comp.sys.ibm.pc
----------comp.sys.ibm.pc.hardware
------comp.sys.mac
--------comp.sys.mac.hardware
----comp.windows
------comp.windows.x
--misc
----misc.forsale
--rec
----rec.autos
----rec.motorcycles
----rec.sport
------rec.sport.baseball
------rec.sport.hockey
--sci
----sci.crypt
----sci.electronics
----sci.med
----sci.space
--soc
----soc.religion
------soc.religion.christian
--talk
----talk.politics
------talk.politics.guns
------talk.politics.mideast
------talk.politics.misc
----talk.religion
------talk.religion.misc

Building the Topic Metadata

Next, we need to make a pandas dataframe with metadata about the topics. This dataframe’s index must match the keys of your cluster tree. In our case, the index holds the newsgroups’ names, which are the keys of the string tree we built above and the values of cluster_labels. Having these equal allows the TopicDatabase to correctly match rows of the topic dataframe with vertices of the cluster tree.

For the 20-newsgroups dataset, each topic has a string name that we also want to store, and we may as well also include the layer and cluster number in the metadata:

[3]:
tag_to_tuple = {v:k for k,v in cluster_labels.items()}

data = []
for tag in tree.keys():
    layer, cluster = tag_to_tuple[tag]
    data.append({
        'index':tag,
        'name':tag,
        'layer':layer,
        'cluster':cluster,
    })

topic_df = pd.DataFrame(data)
topic_df = topic_df.set_index('index')
topic_df.head(2)
[3]:
name layer cluster
index
root root 5 0
alt alt 1 0

Building the Topic Inclusion Matrices

The final bit of information we need to construct a TopicDatabase are the matrices encoding the inclusion strengths of each sample in each topic. For 20-newsgroups, this will be a binary matrix.

We need a list of matrices, one for each layer of the cluster hierarchy. For 20-newsgroups, the upper topics (such as alt) don’t have their own posts, but we will consider any post in a topic’s children to also be in the topic. As this is a common situation, there is a utility, thematic_search.utils.cluster_layers_from_leaf_matrix, that only requires us to build the matrix for the layer-0 nodes.
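
The idea of deriving a parent layer from its children can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the library's code: the helper `parent_layer` and the child lists are hypothetical, and taking the element-wise maximum over child columns is just one natural choice for binary memberships. The real utility may aggregate differently.

```python
import numpy as np

# Toy leaf matrix: 4 documents (rows) and 3 layer-0 topics (columns).
leaf_matrix = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
], dtype=float)

# Hypothetical layer-1 topics, each listing the columns of its layer-0 children.
layer1_children = [[0, 1], [2]]


def parent_layer(child_matrix, children_per_parent):
    """A document belongs to a parent if it belongs to any of its children."""
    columns = [child_matrix[:, kids].max(axis=1) for kids in children_per_parent]
    return np.column_stack(columns)


layer1 = parent_layer(leaf_matrix, layer1_children)
print(layer1)
```

For binary memberships the maximum acts as a logical OR, so documents 0, 1 and 3 end up in the first parent and document 2 in the second.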

[4]:
leaves = [(l,c) for l,c in cluster_tree.keys() if l == 0]

n_samples = len(newsgroups_df)
n_leaves = len(leaves)

leaf_matrix = np.zeros((n_samples, n_leaves))

for l,c in leaves:
    tag_name = cluster_labels[(l, c)]
    # Column c is 1 for posts in this newsgroup and 0 otherwise.
    leaf_matrix[:, c] = (newsgroups_df['newsgroup'] == tag_name).to_numpy()

cluster_matrices = ts.utils.cluster_layers_from_leaf_matrix(
    cluster_tree, leaf_matrix
)

for matrix in cluster_matrices:
    print(matrix.shape)

(18170, 20)
(18170, 11)
(18170, 5)
(18170, 1)
(18170, 1)

Initializing the Database

Now we have everything we need to initialize the database for thematic search. We’re going to load the sentence-transformers model that was used to embed the dataset, so we have it available for semantic nearest-neighbour search. Everything else was constructed in the previous steps!

[5]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

soft_cluster_tree = ts.SoftClusterTree(
    cluster_matrices,
    cluster_tree,
)
topic_db = ts.TopicDatabase(
    soft_cluster_tree = soft_cluster_tree,
    embedding_vectors = np.stack(newsgroups_df['embedding'].values),
    reduced_vectors = np.stack(newsgroups_df['map'].values),
    document_df = newsgroups_df[['post', 'newsgroup']],
    topic_df = topic_df,
    cluster_labels = cluster_labels,
    embedding_model = model,
)
topic_db
[5]:
<thematic_search.topicdatabase.TopicDatabase at 0x1d186bc7b60>

Example Queries

Finally, all that’s left to do is query the dataset!

First, we can query by topic name; we will query for “documents inside the topic named rec.sport”.

[6]:
topic_db.q.topic_name('rec.sport').inside().metadata()
[6]:
post newsgroup
0 \n\nI am sure some bashers of Pens fans are pr... rec.sport.hockey
7 \n[stuff deleted]\n\nOk, here's the solution t... rec.sport.hockey
8 \n\n\nYeah, it's the second one.  And I believ... rec.sport.hockey
24 I don't know the exact coverage in the states.... rec.sport.hockey
33 \nBe patient. He has a sore shoulder from cras... rec.sport.baseball
... ... ...
18132 Can someone send me ticket ordering informatio... rec.sport.baseball
18135 \n\n\n\nSaku isn't that small any longer I gue... rec.sport.hockey
18152 \n \n Well, I'm a Wings fan and I think the F... rec.sport.hockey
18154 \n\n\n\n\n   Anaheim. rec.sport.baseball
18161 \nAnd won't they have to change their name to ... rec.sport.hockey

1927 rows × 2 columns
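
Because the topics in this dataset are named by dotted prefixes, the result of an "inside" query can be cross-checked directly with pandas. The sketch below is only a sanity check that relies on the newsgroup naming convention. The helper `inside` and the toy dataframe are hypothetical, and this is not how thematic_search evaluates the query.

```python
import pandas as pd

# Toy stand-in for the document dataframe.
docs = pd.DataFrame({
    "post": ["a", "b", "c", "d"],
    "newsgroup": [
        "rec.sport.hockey",
        "rec.autos",
        "rec.sport.baseball",
        "sci.space",
    ],
})


def inside(df, topic):
    """Rows whose newsgroup is the topic itself or one of its descendants."""
    groups = df["newsgroup"]
    mask = (groups == topic) | groups.str.startswith(topic + ".")
    return df[mask]


print(inside(docs, "rec.sport"))
```

Appending the dot before matching keeps a topic such as rec.sport from accidentally matching a sibling whose name merely starts with the same characters.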

Next, let’s query by semantic search. We can query for “information about the theme of the documents semantically close to the string ‘Recent advancements in space exploration’”.

[7]:
topic_db.q.neighbours("Recent advancements in space exploration").neighbours().theme().info()
[7]:
name layer cluster
index
14 sci.space 0 14
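
For intuition, semantic nearest-neighbour search over embeddings can be sketched as a cosine-similarity ranking. This is a minimal illustration with made-up two-dimensional vectors. The function `nearest_neighbours` is hypothetical, and whether thematic_search uses exact cosine similarity or some approximate index is not stated in this notebook.

```python
import numpy as np


def nearest_neighbours(query_vec, doc_vecs, k=3):
    """Indices of the k documents with the highest cosine similarity."""
    doc_unit = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    scores = doc_unit @ query_unit
    # Sort by descending similarity and keep the top k.
    return np.argsort(-scores)[:k]


# Toy "embeddings": documents 0 and 1 point roughly the same way as the query.
doc_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
query_vec = np.array([1.0, 0.0])

print(nearest_neighbours(query_vec, doc_vecs, k=2))
```

In the real notebook the query string would first be embedded with the same sentence-transformers model as the documents, so that both live in the same vector space.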

Let’s do one more example of searching for a theme. We will pick out a handful of documents from rec.sport.hockey and rec.sport.baseball, and ask what their theme is:

[8]:
print(topic_db.q.from_docs([0,7,8,33,18132]).metadata())

topic_db.q.from_docs([0,7,8,33,18132]).theme().info()
                                                    post           newsgroup
0      \n\nI am sure some bashers of Pens fans are pr...    rec.sport.hockey
7      \n[stuff deleted]\n\nOk, here's the solution t...    rec.sport.hockey
8      \n\n\nYeah, it's the second one.  And I believ...    rec.sport.hockey
33     \nBe patient. He has a sore shoulder from cras...  rec.sport.baseball
18132  Can someone send me ticket ordering informatio...  rec.sport.baseball
[8]:
name layer cluster
index
26 rec.sport 1 6

As you probably expected, the theme is rec.sport; this is the least upper bound of our query, that is, the lowest topic in the tree that contains every document we selected.
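
For this dataset, the least upper bound can be reproduced by hand from the newsgroup names alone, since a topic's ancestors are exactly its dotted prefixes. The sketch below is a hedged illustration of that idea. The function `common_ancestor` is hypothetical, and the library presumably works on the cluster tree rather than on names.

```python
def common_ancestor(names, root="root"):
    """Longest dotted prefix shared by every name, or the root if none."""
    split = [name.split(".") for name in names]
    prefix = []
    # Walk the name components in parallel until they disagree.
    for parts in zip(*split):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return ".".join(prefix) if prefix else root


print(common_ancestor(["rec.sport.hockey", "rec.sport.baseball"]))
print(common_ancestor(["rec.sport.hockey", "rec.autos"]))
print(common_ancestor(["rec.autos", "sci.space"]))
```

Adding a post from rec.autos to the query would lift the shared prefix to rec, and mixing in a post from sci.space would leave only the root in common.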