LDA vs SVD: Building Text Topics
It is currently July 2023 and Large Language Models have been all the rage for the past several months. So why dive into the world of “traditional” NLP now? There are several reasons, but let me focus on just three of them:
Interpretability: “traditional” topic modeling techniques like LDA and SVD assign explicit weights (in LDA’s case, probabilities) to words for each topic, allowing for a clear understanding of the underlying themes. This is crucial in many sensitive and regulated industries
Resource Efficiency: training and scoring topic models is cheap in terms of computational resource consumption compared to LLMs
Domain-Specific Modeling: with topic modeling it is straightforward to tailor the model to a specific domain or dataset by incorporating domain knowledge. This helps capture nuances and produce more relevant topics
With that out of the way, let us take a look at two approaches to text topic modeling: Latent Dirichlet Allocation (LDA) and Singular Value Decomposition (SVD). The goal of this blog post is to introduce these two algorithms and give you some guidance on when to use each. Along with this blog post I will also make a SAS Custom Step available that makes trying out these two algorithms super easy.
Intro to LDA
LDA assumes each document is a mixture of topics, and each topic is a mixture of words. It models documents as being generated from a combination of topics and the distribution of words within those topics. This leads to very easily interpretable results, as the contribution of each word to the assignment of a document to a topic can be displayed directly. You can order the results to see which words contribute the most to each topic.
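To make this generative story concrete, here is the standard textbook formulation of LDA (not SAS-specific), where α and β are the Dirichlet hyperparameters that come up again in the conclusion:

```latex
\begin{align*}
\theta_d  &\sim \mathrm{Dirichlet}(\alpha)                 && \text{topic mixture of document } d \\
\varphi_k &\sim \mathrm{Dirichlet}(\beta)                  && \text{word distribution of topic } k \\
z_{d,n}   &\sim \mathrm{Multinomial}(\theta_d)             && \text{topic of the } n\text{-th word in } d \\
w_{d,n}   &\sim \mathrm{Multinomial}(\varphi_{z_{d,n}})    && \text{the observed word itself}
\end{align*}
```

Training inverts this process: given only the observed words, it estimates the hidden document-topic mixtures and topic-word distributions.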
Because LDA uses, by default, a Bag-of-Words approach, it is imperative to apply a stop word list to remove very common words (like and, or, to, by, …) from the text, as those will otherwise completely outrank other words in their topic contributions.
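If you want to supply your own stop words instead of the SAS default list, a plain one-column CAS table is enough. Here is a minimal sketch; the table and column names are my own choice, and the column name `term` in particular is an assumption, so check the ldaTrain documentation for the expected schema:

```sas
/* Assumes an active CAS session; assign a libref to the 'public'
   caslib so the data step can write directly to CAS. */
libname public cas caslib='public';

/* Minimal custom stop word list - one term per row. */
data public.news_data_stopList_LDA;
   length term $32;
   input term $;
   datalines;
and
or
to
by
;
run;
```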
Because LDA models topics as distributions over the full vocabulary, you can also take a look at how different topics relate to each other. This can be done by checking which high-weight words the topics share.
Intro to SVD
SVD is a linear algebra technique that is not dedicated to topic modeling. Applied to text, SVD decomposes the term-document matrix (which records how often each term occurs in each document) into three components: a matrix of term-topic weights, a diagonal matrix of singular values, and a matrix of topic-document weights. It aims to find the most important underlying latent factors (topics) that explain the variability in the data.
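In matrix notation, with A as the m × n term-document matrix and k as the chosen number of topics, the truncated decomposition is:

```latex
A \approx U_k \, \Sigma_k \, V_k^{\top},
\qquad
U_k \in \mathbb{R}^{m \times k} \ (\text{term-topic}),\quad
\Sigma_k \in \mathbb{R}^{k \times k} \ (\text{singular values}),\quad
V_k^{\top} \in \mathbb{R}^{k \times n} \ (\text{topic-document})
```

Keeping only the k largest singular values is exactly the dimensionality reduction described next.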
The topics are generated by reducing the number of dimensions to capture the most important factors. Analyzing and interpreting the results of SVD is a bit more involved, as you have to work with the term-topic matrix: each row represents a term, and the corresponding values represent the term’s association with each topic.
Thanks to this dimensionality reduction, SVD scales really well, even to very large datasets.
LDA vs SVD
While LDA focuses on probabilistic modeling, SVD emphasizes dimensionality reduction and is deterministic in nature: running it twice on the same data produces the same factorization, whereas LDA’s estimation involves randomness.
LDA estimates the document-topic distribution, which represents the probability of each topic appearing in each document. SVD does not explicitly model document-topic distributions, but it provides a document-topic matrix where each entry captures how strongly a document is associated with a topic.
In general, LDA is considered to be more easily interpretable than SVD, but it is computationally more expensive than SVD. And with a little bit of work, SVD can also be made as interpretable as LDA.
Let’s take a look at an example implementation of LDA and SVD in SAS. We have two CAS Action Sets to look at (for a full syntax reference, please refer to the linked SAS documentation in the sources section). First we will take a look at the ldaTopic Action Set:
```sas
proc cas;
   ldaTopic.ldaTrain result = resLT /
      table     = {caslib = 'public', name = 'news_data'}
      docID     = 'documentID'
      text      = {'headline_text'}
      k         = 10
      stopWords = {caslib = 'public', name = 'news_data_stopList_LDA'}
      casOut    = {caslib = 'public', name = 'news_data_ldaTopicDis', replace = True};
   print resLT;
run; quit;
```
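If you want to sanity-check the output, a quick fetch of the scored table works. A minimal sketch; the table name simply matches the casOut above:

```sas
proc cas;
   /* Fetch the first few rows of the document-topic output
      to verify the training run produced results. */
   table.fetch / table = {caslib = 'public', name = 'news_data_ldaTopicDis'} to = 5;
run; quit;
```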
I decided to use the SAS default stop word list and aim for 10 topics on the data set (also linked in the sources). Here is a sample of the data, which consists of Australian news headlines:
Obs | publish_date | headline_text |
---|---|---|
1 | 20170717 | canberra formal free charity |
2 | 20170717 | can geelong ease cost of housing pressures in melbourne |
3 | 20170717 | cat survives poisoning vodka treatment |
4 | 20170717 | china gdp economic data |
5 | 20170717 | citizen naturalists explore outback for undocumented species |
And here are the 10 topics that LDA generated for me with default settings (note: I did no further hyperparameter tuning at all):
TopicID | topicTerms |
---|---|
0 | police, man, new, says, interview |
1 | police, new, man, says, court |
2 | police, man, new, says, court |
3 | new, police, says, man, australia |
4 | police, man, new, court, says |
5 | police, new, man, says, fire |
6 | police, new, man, says, interview |
7 | police, new, man, says, court |
8 | police, new, man, says, nsw |
9 | police, new, man, says, nsw |
Next up is the textMining Action Set, where we will use the same data, specify 10 topics, and apply the SAS default stop word list as well:
```sas
proc cas;
   textMining.tmMine result = resTM /
      documents = {caslib = 'public', name = 'news_data'}
      docID     = 'documentID'
      text      = 'headline_text'
      k         = 10
      stopList  = {caslib = 'public', name = 'news_data_stoplist'}
      topics    = {caslib = 'public', name = 'news_data_svdTopics', replace = True};
   print resTM;
run; quit;
```
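Again, a quick fetch lets you inspect the generated topics table. A minimal sketch; the table name matches the topics output above:

```sas
proc cas;
   /* Fetch the generated topics table to review the topic terms. */
   table.fetch / table = {caslib = 'public', name = 'news_data_svdTopics'} to = 10;
run; quit;
```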
The resulting topics look like this:
TopicId | Name |
---|---|
1 | +man, +charge, +murder, +court, +stab |
2 | +say, not, australia, +need, +man |
3 | +hour, country, +country hour, nsw, tas |
4 | +police, +investigate, +death, +probe, +officer |
5 | rural, news, national, national rural news, abc |
6 | +interview, extended, +extended interview, michael, nrl |
7 | +crash, +kill, +car, +die, +woman |
8 | +new, +year, zealand, +man, +law |
9 | +council, govt, +urge, +plan, +fire |
10 | australia, +court, +day, +face, +world |
For SVD, too, I did not apply any hyperparameter tuning.
Conclusion
This simple example already shows something I did not dive deeply into in the comparison itself: in my experience, SVD yields pretty good topic modeling results very easily, while with LDA you have to bring prior subject knowledge with you. With LDA it also pays off a lot to tune the Alpha (document-topic distribution) and Beta (topic-word distribution) hyperparameters to achieve good results.
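As a sketch of what such tuning could look like, here is the earlier ldaTrain call with explicit prior hyperparameters. Note that the option names alpha and beta are my assumption here; please check the linked SAS documentation for the exact spelling your release supports:

```sas
proc cas;
   /* Hypothetical tuning sketch: 'alpha' and 'beta' are assumed option
      names for the Dirichlet priors - verify them against the ldaTrain
      documentation before use. Smaller values yield sparser mixtures. */
   ldaTopic.ldaTrain result = resLT /
      table  = {caslib = 'public', name = 'news_data'}
      docID  = 'documentID'
      text   = {'headline_text'}
      k      = 10
      alpha  = 0.1   /* document-topic prior: lower = fewer topics per document */
      beta   = 0.01  /* topic-word prior: lower = fewer dominant words per topic */
      casOut = {caslib = 'public', name = 'news_data_ldaTopicDis', replace = True};
run; quit;
```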
The example data used here is of course also a special case, consisting mostly of single-sentence texts with only around 7 tokens per sentence. LDA usually seems to fare better on medium-to-long texts than on super short ones like these. But this data also reflects my experience well, where SVD proves very easy and powerful to use.
You can find the full source code for the examples demonstrated in this blog post over on my GitHub Repository.
Sources
Here you can find a list of all the articles, papers, videos, and documentation I used during the process of writing this article: