What is Latent Dirichlet Allocation?
The problem we would like to discuss here is how to best separate documents according to their topics. Such a separation can be used for assigning a topic to a document (which is called topic labeling) or, more generally, for extracting a compressed characterization of a huge text data set.
Latent Dirichlet Allocation (LDA) is an unsupervised machine learning method and a state-of-the-art approach to this kind of problem. It is built on the probabilistic concept of Bayesian inference, and despite the demanding theory behind it, in the end it is pretty simple to use. Bayesian inference is a way to get sharper predictions from your data. It is particularly useful when you don't have as much data as you would like and want to squeeze every last bit of predictive strength from it.
There are various possible applications of LDA: finding questions on Stack Overflow that are similar to each other, news flow aggregation and analysis, or any document categorization based on a collection of text data. The method was in fact first used in biology, on multilocus genotype data for population inference.
Latent Dirichlet Allocation - model description
No human investigation can be called real science
if it cannot be demonstrated mathematically.
Leonardo da Vinci
Latent Dirichlet Allocation is a generative probabilistic model built on Bayesian inference. It describes the joint distribution of observed and latent variables, where the only observed values (grayed out in the graph below) are the words. This approach requires only a few assumptions to be made. One of them is the number of topics 𝐾 we want to distinguish. The others are 𝛼 and 𝛽, which we will discuss later.
LDA has a Python implementation in the scikit-learn package (sklearn.decomposition.LatentDirichletAllocation). The number of topics is passed through the n_components parameter.
In [79]:
# Example distributions
As we search for these distributions, we already have a general idea of how they should look.
This prior knowledge added to the model is the core of Bayesian statistics and improves performance. For the distributions 𝜙𝑘 we expect that some words will occur often in the 𝑘-th topic while others will be very rare, therefore skewed distributions 𝜙𝑘 are appropriate. For the distributions 𝜃𝑑 we usually assume that the topics are uniformly distributed across the 𝑑-th document. Both situations can be reflected well using the Dirichlet distribution, although with different parameters:
𝜃∼𝐷𝑖𝑟(𝛼) ,
𝜙∼𝐷𝑖𝑟(𝛽) ,
where:
- 𝜃=(𝜃1,𝜃2,...,𝜃𝑑,...,𝜃𝑀) ,
- 𝜙=(𝜙1,𝜙2,...𝜙𝑘,...,𝜙𝐾) ,
- 𝑀 - number of all documents,
- 𝐾 - number of topics
It's worth highlighting that the random variables drawn from 𝐷𝑖𝑟(𝛼) and 𝐷𝑖𝑟(𝛽) are matrices with 𝑀 and 𝐾 rows respectively. They are called "priors" and in sklearn can be provided through the doc_topic_prior and topic_word_prior arguments. Of course, if we expect some topics to be more popular, we can adjust the 𝛼 parameter to bias the model toward those topics.
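To get a feel for these priors, here is a small sketch (using numpy; not part of the original notebook) that draws a single topic-word distribution with a small concentration parameter, giving a skewed result with a few dominant words, and a document-topic distribution with a larger one, giving a flatter result:
In [ ]:
import numpy as np

rng = np.random.default_rng(0)
V, K = 20, 5                           # toy vocabulary size and number of topics
beta, alpha = 0.1, 1.0                 # small beta -> skewed phi_k, alpha = 1 -> flatter theta_d

phi_k = rng.dirichlet([beta] * V)      # word distribution of one topic: a few large entries
theta_d = rng.dirichlet([alpha] * K)   # topic distribution of one document: more even spread

print(np.round(phi_k, 3))
print(np.round(theta_d, 3))
In sklearn both priors are passed as single (symmetric) values, e.g. LatentDirichletAllocation(n_components=5, doc_topic_prior=1.0, topic_word_prior=0.1).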
The iteration runs several times over all the documents in the corpus, recalculating the topic probabilities each time and taking the output of the previous pass as the input for the next one. For the 𝑛-th word of document 𝑑 the topic is resampled according to
𝑃(𝑧𝑑𝑛 = 𝑘 | 𝑧−(𝑑𝑛), 𝑤) ∝ (𝑛𝑑𝑘 + 𝛼) ⋅ (𝑣𝑘𝑤 + 𝛽) / (∑𝑤′ 𝑣𝑘𝑤′ + 𝑉𝛽) ,
where (all counts are computed with the current assignment excluded):
- 𝑛𝑑𝑘 is the number of times document 𝑑 uses topic 𝑘,
- 𝑣𝑘𝑤 is the number of times topic 𝑘 uses the given word 𝑤,
- 𝑧𝑑𝑛 is the topic assignment of the 𝑛-th word in document 𝑑 and
- 𝑧−(𝑑𝑛) denotes all the other topic assignments.
In the above equation, we assume conditional independence. This procedure is called Gibbs sampling.
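For illustration only, below is a minimal sketch of collapsed Gibbs sampling written with numpy. It is not what sklearn uses internally (sklearn's LatentDirichletAllocation relies on online variational Bayes), and the function name gibbs_lda and the representation of docs as lists of word ids are our own assumptions:
In [ ]:
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, n_iter=50, seed=0):
    """Collapsed Gibbs sampling sketch. docs is a list of lists of word ids in range(V)."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))   # how often document d uses topic k
    v_kw = np.zeros((K, V))           # how often topic k uses word w
    n_k = np.zeros(K)                 # total number of words assigned to each topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]  # random initial topic assignments
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; v_kw[k, w] += 1; n_k[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                n_dk[d, k] -= 1; v_kw[k, w] -= 1; n_k[k] -= 1   # exclude the current assignment
                p = (n_dk[d] + alpha) * (v_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())                # resample the topic
                z[d][i] = k
                n_dk[d, k] += 1; v_kw[k, w] += 1; n_k[k] += 1
    return n_dk, v_kw

# usage: n_dk, v_kw = gibbs_lda([[0, 1, 2, 1], [3, 4, 3]], V=5, K=2)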
It is worth pointing out that 𝛼 and 𝛽 are in fact 𝐾- and 𝑉-dimensional vectors (or single values for symmetric priors), 𝜃 is an 𝑀×𝐾 matrix and 𝜙 is a 𝐾×𝑉 matrix (𝑉 stands for the size of the vocabulary).
Using the idea of the generative process, we assume that each document is a mixture of topics and each topic is characterized by a distribution over all the words. That way we define
topic 𝑧𝑑𝑗∼𝑀𝑢𝑙𝑡𝑖𝑛𝑜𝑚𝑖𝑎𝑙(𝜃𝑑),
word 𝑤𝑑𝑗∼𝑀𝑢𝑙𝑡𝑖𝑛𝑜𝑚𝑖𝑎𝑙(𝜙𝑧𝑑𝑗)
for each of the word positions 𝑑,𝑗, where 𝑑∈{1,…,𝑀} and 𝑗∈{1,…,𝑁𝑑}, 𝑀 is the number of documents and 𝑁𝑑 is the length of document 𝑑. The random variables 𝑤𝑑𝑗 are the observed words; the topic assignments 𝑧, together with 𝜃 and 𝜙, remain latent and have to be inferred.
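This generative story can be simulated directly. Here is a small sketch (numpy again, with toy sizes chosen arbitrarily) that draws 𝜃 and 𝜙 and then generates a synthetic corpus of word ids:
In [ ]:
import numpy as np

rng = np.random.default_rng(0)
M, K, V, N_d = 4, 3, 10, 8                  # documents, topics, vocabulary size, words per document

theta = rng.dirichlet([1.0] * K, size=M)    # theta_d ~ Dir(alpha), one row per document (M x K)
phi = rng.dirichlet([0.1] * V, size=K)      # phi_k ~ Dir(beta), one row per topic (K x V)

corpus_ids = []
for d in range(M):
    words = []
    for _ in range(N_d):
        z = rng.choice(K, p=theta[d])          # topic z_dj ~ Multinomial(theta_d)
        words.append(rng.choice(V, p=phi[z]))  # word w_dj ~ Multinomial(phi_z)
    corpus_ids.append(words)

print(corpus_ids[0])                        # word ids of the first synthetic document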
LDA in Python
I use `Python 3.7` for the computations below and `jupyter notebook` as my IDE. You can find detailed requirements in our GitHub repository.
Before we start coding, let us set basic notation.
- Corpus is a complete set of documents. In our case, it will be the whole list of BBC news articles.
- A document is a text that has a certain topic; for us, it is just a particular article.
- Tokenization is the task of chopping up a document, defined as a character sequence, into pieces called tokens, perhaps throwing away certain characters at the same time.
- Stemming is a process of retrieving the root of each individual word of the document.
- Stopwords are words that do not carry any information that would be helpful in determining the document topic.
We will use the following external libraries:
In [ ]:
import glob
import nltk
import sklearn
import pandas
import bokeh
Data
The data was downloaded from Kaggle Datasets. We will read all the downloaded articles into one list, corpus. We will use the folder names as a list of labels - this will be handy later for validating the effectiveness of LDA. If you create a Python script in the same directory as the downloaded data, you can use the code below to import it.
In [59]:
import glob
import os
base_dir = "BBC News Summary/News Articles"
# read news
business_file_list = glob.glob(os.path.join(os.getcwd(), base_dir, "business", "*.txt"))
entertainment_file_list = glob.glob(os.path.join(os.getcwd(), base_dir, "entertainment", "*.txt"))
politics_file_list = glob.glob(os.path.join(os.getcwd(), base_dir, "politics", "*.txt"))
sport_file_list = glob.glob(os.path.join(os.getcwd(), base_dir, "sport", "*.txt"))
tech_file_list = glob.glob(os.path.join(os.getcwd(), base_dir, "tech", "*.txt"))
labels = []
corpus = []
for file_list in [
    business_file_list, entertainment_file_list, politics_file_list, sport_file_list, tech_file_list
]:
    for file_path in file_list:
        with open(file_path, encoding="utf8", errors='ignore') as f_input:
            corpus.append(f_input.read())
            labels.append(file_path.split('/')[-2])
Data preprocessing
Before we can use our data to build a model, we need to bring it into a form that the LDA algorithm understands.
This is done by building a matrix that has documents as rows and the bag of words (the list of all words from the corpus) as columns, defining how often each word occurs in each document. To achieve better performance we will also clean the input data, so that the maximum amount of information can be represented in the smallest possible matrix.
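To make the document-word matrix concrete, here is a tiny sketch on a made-up three-document corpus using sklearn's CountVectorizer (for the real data we will later use TfidfVectorizer instead):
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["the match ended in a draw",
              "the bank raised interest rates",
              "the club won the cup match"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(toy_corpus)   # rows: documents, columns: bag of words

print(vectorizer.get_feature_names())           # the bag of words (column names)
print(counts.toarray())                         # word counts per document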
We will start with tokenization, which will split our documents into sentences and then into single words.
Tokenization
Since we imported the corpus, we will now tokenize each document. We will also lowercase all the words and remove non-alphabetic characters from them. We do that so that later we can count the occurrences of each word in each document; rare word variants do not carry relevant information, as they would only form the tail of our distribution.
In [60]:
import nltk
import re
regex = re.compile('[^a-zA-Z]')
def tokenize(text):
    tokens = []
    for sent in nltk.sent_tokenize(text):       # split the document into sentences
        for word in nltk.word_tokenize(sent):   # split each sentence into words
            clean_word = regex.sub('', word)    # strip non-alphabetic characters
            tokens.append(clean_word.lower())
    return tokens
Let's see how this looks.
In [61]:
tokenized = tokenize(corpus[500])
tokenized[:10]
Out[61]:
['fiat',
'mulls',
'ferrari',
'market',
'listing',
'ferrari',
'could',
'be',
'listed',
'on']
In [78]:
# Example documents word counts
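As a quick sanity check (a small sketch of our own, reusing the tokenized list from above), we can count the most frequent tokens of that document:
In [ ]:
from collections import Counter

word_counts = Counter(token for token in tokenized if token)  # skip empty strings left by the regex
word_counts.most_common(10)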
Stopwords
Now we will remove stopwords. Using the same logic we discussed while tokenizing, we now remove all the words that frequently appear in all documents. Their value for topic labeling is meaningless.
In [62]:
from nltk.corpus import stopwords as sw
stopwords = sw.words('english')
cleaned = [word for word in tokenized if word not in stopwords and word != '']
cleaned[:10]
Out[62]:
['fiat',
'mulls',
'ferrari',
'market',
'listing',
'ferrari',
'could',
'listed',
'stock',
'market']
Stemming
The next step is to stem the remaining words. Here we are standardizing words to their root.
In [63]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
def stem(word):
    return stemmer.stem(word).strip()
Let's take a look at how it went.
In [64]:
stemmed = [stem(word) for word in cleaned]
stemmed[:10]
Out[64]:
['fiat',
'mull',
'ferrari',
'market',
'list',
'ferrari',
'could',
'list',
'stock',
'market']
Perfect, the data is preprocessed and ready for LDA, but first we would like to apply a small improvement called Term Frequency-Inverse Document Frequency.
TF-IDF
The idea of Term Frequency - Inverse Document Frequency is to penalize common words by assigning them lower weights, while giving importance to words that are rare in the entire corpus but appear frequently in a few documents.
In the classic approach to LDA the input is simply a matrix of word occurrences:
𝜔𝑡,𝑑 = 𝑛𝑡,𝑑 ,
where 𝑛𝑡,𝑑 is the number of occurrences of word 𝑡 in document 𝑑. TF-IDF instead is a product of two factors:
𝜔𝑡,𝑑 = 𝑛𝑡,𝑑 ⋅ log(𝑁 / 𝑑𝑓𝑡) ,
where:
- 𝑛𝑡,𝑑 - number of occurrences of word 𝑡 in document 𝑑, which gives the weight of each word within the document,
- 𝑑𝑓𝑡 - number of documents containing word 𝑡,
- 𝑁 - number of documents.
The second factor calculates the inverted weight of a word in the context of all documents, promoting rare words. The resulting matrix
has the highest values for words that are frequent in a given document but not common in the corpus.
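Here is a small numeric sketch of this weighting with plain numpy on a made-up count matrix (sklearn's TfidfVectorizer additionally smooths and normalizes the result, so its numbers will differ):
In [ ]:
import numpy as np

counts = np.array([[3, 0, 1],          # n_td: rows are documents, columns are words
                   [0, 2, 1],
                   [1, 1, 1]])

N = counts.shape[0]                    # number of documents
df = (counts > 0).sum(axis=0)          # df_t: number of documents containing each word
tfidf = counts * np.log(N / df)        # omega_td = n_td * log(N / df_t)
print(np.round(tfidf, 2))
Note that the last word, which appears in every document, gets a weight of zero.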
Both the LDA input and TF-IDF matrices are sparse, which can cause some computational problems. This is recognized as a disadvantage, but it cannot really be attributed to the method itself.
LDA application
Let us now obtain the desired matrix; basically, all we need is the TfidfVectorizer class from the sklearn library. We will plug our previous work in as its parameters.
In [65]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=10000,
                                   min_df=0.05, stop_words=stopwords,
                                   use_idf=True, tokenizer=tokenize,
                                   lowercase=True, preprocessor=stem)
For more details and an explanation of all parameters, please see the sklearn documentation.
In [67]:
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus);
Latent Dirichlet Allocation
Applying Latent Dirichlet Allocation to the prepared matrix:
In [68]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(tfidf_matrix);
We will check the 10 words with the highest probability for each topic.
In [69]:
for i, topic in enumerate(lda.components_):
    print(f'Topic #{i}:')
    print([tfidf_vectorizer.get_feature_names()[j] for j in topic.argsort()[-10:]])
    print('\n')
Topic #0:
['award', 'awards', 'party', 'brown', 'election', 'show', 'labour', 'best', 'mr', 'film']
Topic #1:
['sales', 'economy', 'bank', 'growth', 'firm', 'market', 'year', 'company', 'us', 'bn']
Topic #2:
['first', 'injury', 'players', 'match', 'cup', 'club', 'england', 'nt', 'win', 'game']
Topic #3:
['court', 'minister', 'secretary', 'scotland', 'people', 'law', 'would', 'government', 'wales', 'mr']
Topic #4:
['net', 'computer', 'phone', 'digital', 'music', 'software', 'users', 'mobile', 'technology', 'people']
Looks like the topics could be: movies or entertainment in general, economy, sport, politics and technology.
In [70]:
topic_values = lda.transform(tfidf_matrix)
doc_num, topic_num = topic_values.shape
Let us wrap it into a Pandas Dataframe so we can analyze it more efficiently.
In [71]:
import pandas as pd
df = pd.DataFrame({'document': corpus, 'label': labels, 'lda': topic_values.argmax(axis=1)})
df.groupby(['label', 'lda']).count().unstack()
Out[71]:
Comparing "lda" results to "labels" that were taken from the data set, we can see that almost all of the articles were correctly specified. Only politics news were divided into entertainment and politics, but still, around 40% more were assigned correctly. That uncertainty is understandable, we saw that these two topic shared the same words.
Visualisation with bokeh
Let's look at the matrix of probabilities for each document to better understand the retrieved results.
In [72]:
prob_matrix = lda.transform(tfidf_matrix)
Matrix transformation
In [73]:
from math import pi
import pandas as pd
from bokeh.io import show
from bokeh.models import LinearColorMapper, BasicTicker, PrintfTickFormatter, ColorBar
from bokeh.plotting import figure
prob_matrix_df = pd.DataFrame(data=prob_matrix[0:, 0:],
                              index=list(range(len(prob_matrix))),
                              columns=list(range(topic_num)))
prob_matrix_df
Out[73]:
2225 rows × 5 columns
In [74]:
prob_matrix_df['doc'] = list(range(doc_num))
prob_matrix_df = prob_matrix_df.set_index('doc')
prob_matrix_df.columns.name = 'topic'
df = pd.DataFrame(prob_matrix_df.stack(), columns=['rate']).reset_index()
Sample data
In [75]:
import random
rand = random.sample(range(doc_num), 100)  # sample 100 of the documents in the corpus
df_rand = df.loc[df['doc'].isin(rand)]
df_rand.to_pickle("./df_rand.pkl")
df = pd.read_pickle("./df_rand.pkl")
df['doc'] = df['doc'].astype(str)
df['topic'] = df['topic'].astype(str)
df['rate'] = df['rate']*100
docs = list(set(df_rand['doc']))
docs.sort()
docs = [str(i) for i in docs]
docs;
topics = ['0','1','2','3','4']
Document - topic probability plot
In [76]:
from bokeh.io import output_notebook
In [81]:
from math import pi
import pandas as pd
from bokeh.io import show, output_file
from bokeh.models import LinearColorMapper, BasicTicker, PrintfTickFormatter, ColorBar
from bokeh.plotting import figure
output_file("lda_output_visualisation.html")
colors = ['#084594', '#2171b5', '#4292c6', '#6baed6', '#9ecae1', '#c6dbef', '#deebf7', '#f7fbff']
colors.reverse()
mapper = LinearColorMapper(palette=colors, low=df.rate.min(), high=df.rate.max())
TOOLS = "hover,save,pan,box_zoom,reset,wheel_zoom"
p = figure(title="Topic probability per document",
           x_range=topics, y_range=docs,
           x_axis_location="above", plot_width=400, plot_height=1200,
           tools=TOOLS, toolbar_location='below',
           tooltips=[('rate', '@rate%')])
p.grid.grid_line_color = None
p.axis.axis_line_color = None
p.axis.major_tick_line_color = None
p.axis.major_label_text_font_size = "5pt"
p.axis.major_label_standoff = 0
p.xaxis.major_label_orientation = pi / 3
p.rect(x="topic", y="doc", width=1, height=1,
source=df,
fill_color={'field': 'rate', 'transform': mapper},
line_color=None)
color_bar = ColorBar(color_mapper=mapper, major_label_text_font_size="5pt",
ticker=BasicTicker(desired_num_ticks=len(colors)),
formatter=PrintfTickFormatter(format="%d%%"),
label_standoff=6, border_line_color=None, location=(0, 0))
p.add_layout(color_bar, 'right')
output_notebook()
show(p)
Is it worth using the Latent Dirichlet Allocation method?
The big advantage of this method over others is that it gives a probability score for each topic, not just a single topic indicator, which helps in assessing how correct the results actually are. This is quite important if we are training our model on unlabeled data.
Latent Dirichlet Allocation is pretty effective and really simple to use in Python, which essentially puts this machine learning capability in the hands of everyone who needs it. You can find all the code used here in our GitHub repository.
Bibliography
[1] "Python Machine Learning" - Sebastian Raschka, Vahid Mirjalili
[2] NLTK documentation
[3] Scikit learn documentation
[4] "Latent Dirichlet Allocation explained"