Thematic Search with Unsupervised Topic Modeling

In this page, we will create a TopicDatabase of the United Nations General Debate Corpus so we can do Thematic Search.

This corpus does not come with any topic metadata, so our first step is to generate a topic hierarchy and assign topic strengths to each document in the dataset. To do this, we will be using a typical unsupervised topic modeling pipeline: vectorize the documents, dimension reduce, cluster, and then represent the clusters with topic metadata.

Setting up the Topic Database

The first steps in the topic modeling pipeline is to chunk our documents and embed the chunks. As this is standard practice, we’re going to grab a copy of the dataset from HuggingFace that has pre-computed chunks, embedding vectors and UMAP reduced vectors. The dataset as well as more details on its creation is available here.

[1]:
import pandas as pd
import numpy as np

ungdc_df = pd.read_parquet("https://huggingface.co/datasets/kalebr/un-general-debate-corpus-chunked/resolve/main/ungdc-all-chunked.parquet")


embedding_vectors = np.stack(ungdc_df['embedding'].values)
reduced_vectors = np.stack(ungdc_df['reduced'].values)
text = ungdc_df['chunk_text'].to_numpy()

ungdc_df.head(1)
[1]:
original_index chunk_text token_count embedding reduced session year country information_weight
0 0 It is indeed a pleasure for me and the member... 107 [0.0061351038, 0.020647585, 0.01789057, 0.0222... [6.317007541656494, 7.351022243499756] 44 1989 MDV 39.726044

Computing a Topic Hierarchy

Next, we will use Toponymy to generate a topic hierarchy and assign topic labels to the chunks in our dataset. Toponymy first uses HDBSCAN to generate a set of cluster layers that are organized into a tree, and assigns each vector a cluster label for each layer of the hierarchy (possibly -1 for noise). Then it uses an LLM to generate topic names for each cluster by passing it context about the cluster, such as keyphrases and example snippets.

First we need to initialize our LLM wrapper - I am using Claude 3 Haiku - and then we need to set up our Toponymy parameters.

[2]:
## Toponymy LLM Wrapper Setup
from toponymy.llm_wrappers import AnthropicNamer

llm = AnthropicNamer(
    api_key="your-api-key-here",
    model="claude-3-haiku-20240307",  # Fast and cost-effective
    llm_specific_instructions="Generate coherent, descriptive names"
)

## Toponymy settings
model_name = 'all-MiniLM-L6-v2' # embedding model.

toponymy_object_description = "excerpts from a speech"
toponymy_corpus_description = "United Nations General Debate Transcripts"

toponymy_exemplar_method = "central"
toponymy_keyphrase_method = "information_weighted"
toponymy_subtopic_method = "facility_location"

clusterer_params = {
    'min_clusters':4,
    'base_min_cluster_size':25,
    'verbose':True
}
[3]:
import os
print(os.getcwd())

C:\Users\User\Documents\topicdb

Next, we can run Toponymy. Since I’ve already run it myself, I am going to load the serialized output instead:

[4]:
from toponymy.serialization import TopicModel
from sentence_transformers import SentenceTransformer
import torch

if False:
    from toponymy import Toponymy, ToponymyClusterer

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f"Using device: {device}")

    model = SentenceTransformer(model_name, device=device)
    print(f"Loaded model: {model_name}")

    clusterer = ToponymyClusterer(**clusterer_params)
    clusterer.fit(reduced_vectors, embedding_vectors)
    toponymy_params = {
        'llm_wrapper':llm,
        'text_embedding_model':model,
        'clusterer':clusterer,
        'object_description':"excerpts from a speech",
        'corpus_description':"United Nations General Debate Transcripts",
        'exemplar_delimiters':["<EXAMPLE_TRANSCRIPT>\n","\n</EXAMPLE_TRANSCRIPT>\n\n"],
    }
    toponymy_fit_params = {
        'exemplar_method':toponymy_exemplar_method,
        'keyphrase_method':toponymy_keyphrase_method,
        'subtopic_method':toponymy_subtopic_method,
    }

    toponymy = Toponymy(**toponymy_params)
    toponymy.fit(text, embedding_vectors, reduced_vectors, **toponymy_fit_params)

    topicmodel = TopicModel.from_toponymy(toponymy, document_df=df)

if True:
    from pathlib import Path

    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    if not Path('ungdc-full-topic-model.tm.zip').is_file():
        !curl -O https://huggingface.co/datasets/kalebr/un-general-debate-corpus-chunked/resolve/main/ungdc-full-topic-model.tm.zip

    model = SentenceTransformer("all-mpnet-base-v2", device=device)
    print(f"Loaded model: all-mpnet-base-v2")

    topicmodel = TopicModel.from_file('ungdc-full-topic-model.tm.zip')
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
MPNetModel LOAD REPORT from: sentence-transformers/all-mpnet-base-v2
Key                     | Status     |  |
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  |

Notes:
- UNEXPECTED    :can be ignored when loading from different task/architecture; not ok if you expect identical arch.
Loaded model: all-mpnet-base-v2

Toponymy’s TopicModel class contains the following data from Toponymy:

  • The cluster_tree topicmodel.cluster_tree

  • The topic representation metadata topicmodel.topic_df

  • The embedding vectors topicmodel.embedding_vectors

  • The reduced vectors topicmodel.reduced_vectors

  • The cluster inclusion matrices topicmodel.cluster_layers

Conveniently, the TopicModel contains everything you need to build a TopicDatabase. For example, here is the topic metadata:

[5]:
topicmodel.topic_df
[5]:
uid layer cluster name keyphrases
0 AAAB 0 0 Conflict and political reconciliation in North... [people of northern ireland, british and irish...
1 AAAC 0 1 Antarctica's status as the common heritage of ... [question of antarctica, antarctic treaty, con...
2 AAAD 0 2 Implementation of South Tyrol Autonomy Agreeme... [autonomy of south tyrol, austria and italy, r...
3 AAAE 0 3 Decolonization and self-determination of New C... [people of new caledonia, south pacific forum,...
4 AAAF 0 4 Decolonization and self-determination of East ... [people of east timor, leadership of fretilin,...
... ... ... ... ... ...
613 AAwE 3 3 Latin American Integration and Conflicts [central american countries, government of gua...
614 AAwF 3 4 Congratulations and Tributes to UN Leadership [election to the presidency, mr president, sir...
615 AAwG 3 5 Nuclear Disarmament and European Security [non-proliferation of nuclear weapons, nuclear...
616 AAwH 3 6 UN Reform and Global Order [disarmament, nuclear weapons, developing coun...
617 AAwI 3 7 Reforming Global Economic Order [developing countries, terms of trade, new int...

618 rows × 5 columns

Propagating Labels to Noise Points

Although we could make a TopicDatabase with what we have now, I want to first assign topic strengths to all the noise points by using scikit-learn’s LabelSpreading algorithm to predict soft inclusion strengths for each topic to each noise point, using the Topoynymy labels as training data.

[6]:
from sklearn.semi_supervised import LabelSpreading

spreader = LabelSpreading(kernel='knn')

soft_layers = []
for matrix in topicmodel.cluster_layers:
    labels = np.asarray(np.argmax(matrix,axis=1))
    spreader.fit(topicmodel.embedding_vectors, labels.flatten())
    soft_layers.append(spreader.label_distributions_)

Initializing the Topic Database

Now we have everything we need to initialize the topic database:

[10]:
from thematic_search import TopicDatabase, SoftClusterTree

topicdb = TopicDatabase(
    soft_cluster_tree=SoftClusterTree(
        soft_layers,
        topicmodel.cluster_tree,
    ),
    embedding_vectors=topicmodel.embedding_vectors,
    reduced_vectors=topicmodel.reduced_vectors,
    sample_df=ungdc_df.drop(columns=['embedding', 'reduced']),
    topic_df=topicmodel.topic_df,
    embedding_model=model
)
topicdb
[10]:
<thematic_search.topicdatabase.TopicDatabase at 0x2c4a16d4690>

Searching and Visualizing

Now with our chunk database and speech database both ready to go, lets go through some example usage.

First, lets query for high-level (layer>=3) topics in the speeches given by the Canadian delegation, with strength at least 0.75:

[12]:
topicdb.q.samples_where("country=='CAN'").topics(0.9).where('layer>=3').metadata()
[12]:
uid layer cluster name keyphrases
610 AAwB 3 0 Southeast Asian Geopolitics [government of democratic kampuchea, vietnames...
611 AAwC 3 1 Middle East Conflicts [palestinian people, israel, occupied arab ter...
612 AAwD 3 2 Decolonization and Anti-Apartheid [people of south africa, south african governm...
613 AAwE 3 3 Latin American Integration and Conflicts [central american countries, government of gua...
615 AAwG 3 5 Nuclear Disarmament and European Security [non-proliferation of nuclear weapons, nuclear...
616 AAwH 3 6 UN Reform and Global Order [disarmament, nuclear weapons, developing coun...
617 AAwI 3 7 Reforming Global Economic Order [developing countries, terms of trade, new int...

Next, we will find the 25 nearest neighbours of the embedding of “war between Iran and Iraq”, ask for the theme of that set and print it’s name.

Then we can take that theme and find all the documents in the theme with strength at least 0.9.

[13]:
topic = topicdb.q.neighbours("Soviet union invading Afghanistan",k=25).theme().metadata()
topic
[13]:
uid layer cluster name keyphrases
462 AAQO 1 13 Withdrawal of Soviet Troops and Political Sett... [people of afghanistan, pakistan and iran, gen...
[17]:
topicdb.q.topic_idx(topic.index.values[0]).samples().metadata().head()
[17]:
original_index chunk_text token_count session year country information_weight
7154 811 Guido de Marco for the very able manner in whi... 177 46 1991 JPN 46.473844
7204 814 Before elaborating on this, I should like to e... 209 46 1991 ROU 48.322461
9001 939 Whatever the particular economic, political or... 121 41 1986 SOM 39.224018
12373 1320 207.\tThere is a clamor that rises from the de... 225 30 1975 ECU 57.556131
17114 2112 A united response is most urgently needed for ... 138 43 1988 CYP 37.285434

For a bit of visualization, we can colour a scatter plot of the samples by their inclusion strength in a topic of choice. We can also compare the topology of thematic and semantic search:

[28]:
import matplotlib.pyplot as plt

topic = (3,3)

topic_info = topicdb.q.topic(*topic).info()

interior_index = topicdb.q.topic(*topic).samples(0.95).indices
bdry_index = topicdb.q.topic(*topic).samples(0.01).indices
strengths = topicdb.soft_cluster_tree.strengths(topic_info.index[0], bdry_index)

fig,(ax1,ax2)=plt.subplots(1,2, figsize=(12,5))
ax1.set_title(f"Thematic search for '{topic_info.name.values[0]}'")
ax1.scatter(
    topicdb.reduced_vectors[:,0], topicdb.reduced_vectors[:,1],
    s=1, c='k', alpha=0.5
)
sc=ax1.scatter(
    *topicdb.reduced_vectors[bdry_index].T,
    s=5, c=strengths, cmap='RdYlGn',
    alpha=0.5
)

plt.colorbar(sc, label="Cluster inclusion strength")
ax1.set_xticks([])
ax1.set_yticks([])
ax2.scatter(
    *topicdb.reduced_vectors.T, s=1, c='k', alpha=0.5
)

colours = [
    'red','orange','yellow','green'
]
ax2.set_title(f"Semantic search for '{topic_info.name.values[0]}'")
for i, k in enumerate([2500,1000,500,100]):
    nn_idx = topicdb.q.neighbours(topic_info.name.values[0], k=k).indices
    vectors = topicdb.reduced_vectors
    ax2.scatter(
        vectors[nn_idx, 0],
        vectors[nn_idx, 1],
        s=5,
        c=colours[i],
        alpha=0.5,
        label=f"k={k}"
    )
ax2.legend()
ax2.set_xticks([])
ax2.set_yticks([])
plt.show()


_images/un-toponymy_19_0.png