In this post, we introduce a parallel LDA (Latent Dirichlet Allocation) library, available on GitHub. The library is written in Scala with Akka actors (hence the name Reactive). The underlying algorithm is based on vanilla Gibbs sampling. We will show that this approach scales much better than training LDA with a collapsed Gibbs sampler, because it is straightforward to parallelize.
Traditionally, training of the Latent Dirichlet Allocation model falls into two categories: training with a collapsed Gibbs sampler and training with variational Bayes. In this blog post, we limit our discussion to MCMC approaches, mainly because: 1) they are much simpler to implement and maintain; 2) Griffiths & Steyvers (2004) report that the collapsed Gibbs sampler converges to a ground-truth model faster than variational Bayes.
We need to point out that an online variational Bayes algorithm exists (Hoffman et al. 2010). Hoffman provides a Python implementation, which is used in the Gensim library. Empirical results show that our implementation can generate comparable (or better) models in comparable time. Mahout also has an implementation based on mean field variational Bayes. Note that all variational approaches have to modify the original graphical model to simplify dependencies.
Recap of LDA and the Gibbs Sampler
In the LDA model, each document d has a topic vector θd, which is a multinomial distribution over topics; each topic has a topic-word vector, which is a multinomial distribution over words. The model assumes that each word in a document is generated independently in two steps: 1) sample a topic from θd; 2) sample a word from that topic. The objective of LDA learning is: given the words in the corpus, infer the topic-word distributions.
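The two-step generative process can be sketched in a few lines. This is an illustrative Python snippet, not part of the library (which is written in Scala); the function name and toy numbers are ours:

```python
import numpy as np

def generate_document(theta_d, beta, n_words, rng):
    """Generate one document under the LDA model.

    theta_d : length-T topic distribution for this document
    beta    : T x V matrix; row t is the word distribution of topic t
    """
    words = []
    for _ in range(n_words):
        t = rng.choice(len(theta_d), p=theta_d)   # step 1: sample a topic
        w = rng.choice(beta.shape[1], p=beta[t])  # step 2: sample a word from it
        words.append(w)
    return words

rng = np.random.default_rng(0)
theta_d = np.array([0.7, 0.3])      # 2 topics
beta = np.array([[0.9, 0.1, 0.0],   # topic 0 favors word 0
                 [0.0, 0.1, 0.9]])  # topic 1 favors word 2
doc = generate_document(theta_d, beta, 10, rng)
```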
We now describe the Gibbs sampling algorithm, following the notation of Blei's original 2003 LDA paper. Let θi be the document-topic vector for document i, and let β be the topic-word matrix, where βt,w is the conditional probability p(word = w | topic = t). Let wi,j denote the j-th word in the i-th document, and let zi,j be the hidden variable (topic) associated with wi,j. α and η are the hyperparameters of the Dirichlet priors.
The vanilla Gibbs sampling algorithm is (cf. Kevin Murphy's machine learning book):
Note that the rows with z’s can be sampled independently, as suggested by the updating formula. Once the first N rows are sampled, we can start sampling the β’s.
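A minimal single-threaded Python sketch of one sweep may help make this concrete. It is illustrative only (the library itself is in Scala with Akka); the function and variable names are ours:

```python
import numpy as np

def gibbs_iteration(docs, theta, beta, alpha, eta, rng):
    """One sweep of vanilla (uncollapsed) Gibbs sampling for LDA.

    docs  : list of documents, each a list of word ids
    theta : N x T document-topic matrix (resampled in place)
    beta  : T x V topic-word matrix (resampled in place)
    alpha, eta : Dirichlet hyperparameter vectors (lengths T and V)
    """
    T, V = beta.shape
    word_counts = np.zeros((T, V))  # C[t, w]: times word w was drawn with topic t
    for i, doc in enumerate(docs):
        topic_counts = np.zeros(T)
        for w in doc:
            # z depends only on theta[i] and beta, so documents can be
            # processed independently (and hence in parallel)
            p = theta[i] * beta[:, w]
            z = rng.choice(T, p=p / p.sum())
            topic_counts[z] += 1
            word_counts[z, w] += 1
        # resample theta_i from its Dirichlet full conditional
        theta[i] = rng.dirichlet(alpha + topic_counts)
    # once all z's are drawn, resample each topic's word distribution
    for t in range(T):
        beta[t] = rng.dirichlet(eta + word_counts[t])

rng = np.random.default_rng(0)
docs = [[0, 1, 1], [2, 2, 0]]
T, V = 2, 3
theta = rng.dirichlet(np.ones(T), size=len(docs))
beta = rng.dirichlet(np.ones(V), size=T)
for _ in range(5):
    gibbs_iteration(docs, theta, beta, np.ones(T), np.ones(V), rng)
```

The inner loop over documents is exactly the part the library farms out to parallel samplers.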
The more traditional approach to LDA training is collapsed Gibbs sampling, in which one samples only the z's, integrating out θ and β. This saves some time in sampling, but we lose the ability to go parallel. (Integrating out θ and β introduces dependencies among the z's.) Every time we sample a z, we need to update a counting matrix, then sample another z, update the counting matrix, and so on. This sequential fashion is undesirable in the large-scale setting.
There are a few approximate collapsed Gibbs sampling algorithms for LDA. Most rely on the assumption that the dependencies among the z's are weak, so one simply samples the z's in parallel and hopes this does not degrade the model.
Gibbs Sampling with Akka Actors
In this section, we describe some implementation details. The actor system is illustrated below:
There are three types of actors:
- DocIter: this is used for fetching documents. In our implementation, this can be in-memory or off-memory (reading from disk in batches).
- Sampler: this is responsible for sampling θi and z’s.
- Master: it connects to several samplers via an Akka router (e.g. RoundRobinRouter). Master stores θ and β. It is responsible for sampling β when a whole batch finishes.
Now we describe the interactions among actors:
- Master to DocIter: sends a MiniBatchRequest message. In return, DocIter sends back a MiniBatchDocs message, which contains a set of documents.
- Master to Sampler:
- During the initialization stage, it sends a uniform sampling request along with a document.
- In all subsequent iterations, it sends a sampling request containing (a reference to) document i, θi, and β.
- Sampler to Master: sends sampled θi and z’s to Master.
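The protocol above can be summarized as a handful of message types. The sketch below is in Python for illustration (the actual library defines Scala case classes); MiniBatchRequest and MiniBatchDocs are named in the text above, while the two sampling message names are hypothetical stand-ins:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MiniBatchRequest:      # Master -> DocIter
    size: int                # number of documents to fetch

@dataclass
class MiniBatchDocs:         # DocIter -> Master
    docs: List[List[int]]    # each document is a list of word ids

@dataclass
class SamplingRequest:       # Master -> Sampler (hypothetical name)
    doc_id: int
    doc: List[int]
    theta_i: List[float]     # current document-topic vector
    # beta is shared by reference in the real implementation

@dataclass
class SamplingResult:        # Sampler -> Master (hypothetical name)
    doc_id: int
    theta_i: List[float]     # freshly sampled theta_i
    z: List[int] = field(default_factory=list)  # sampled topic assignments
```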
Note that this has a strong Map-Reduce flavor. In the future, it would be interesting to see how this implementation generalizes to remote Akka actors.
Memory consumption and performance
In the current implementation, we hold the following variables in memory: θ, β, and a counting matrix C, where Ct,w is the number of times word w is assigned to topic t. Note that we do not store the corpus or the z's in memory (which would dominate memory usage). Thus, with N documents, T topics, and vocabulary size V, the memory requirement is roughly 4·N·T bytes for θ (using floats), 4·T·V for β (floats), and 4·T·V for the counting matrix. For example, with N = 3M, T = 500, and V = 100K, these matrices alone require about 6.4 GB, and JVM overhead pushes the practical requirement higher. In the future, we might change the implementation to use sparse vectors to reduce the memory footprint.
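The estimate can be reproduced with a few lines (the function name and the 4-bytes-per-cell assumption are ours):

```python
def lda_memory_bytes(n_docs, n_topics, vocab_size, bytes_per_cell=4):
    """Rough dense-matrix memory estimate for the three in-memory arrays."""
    theta = bytes_per_cell * n_docs * n_topics       # document-topic matrix
    beta = bytes_per_cell * n_topics * vocab_size    # topic-word matrix
    counts = bytes_per_cell * n_topics * vocab_size  # counting matrix C
    return theta + beta + counts

# N = 3M docs, T = 500 topics, V = 100K words
print(lda_memory_bytes(3_000_000, 500, 100_000) / 1e9)  # -> 6.4 (GB)
```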
We have run some experiments on English Wikipedia: 3M documents, a 100K vocabulary, and 512 topics. On average, one iteration of LDA takes roughly 10 minutes on an Amazon instance with 32 virtual CPUs. In practice, we get a reasonably good model after 50 iterations, i.e., in less than half a day.
Using a vanilla Gibbs sampler instead of a collapsed Gibbs sampler may sound like a step backwards. However, it gives us much better scalability. You can check out our library on GitHub. The library is at a very early stage, so please feel free to leave comments, ask questions, or, better yet, contribute.
We wrote this post while working on Kifi — tools that help people outsmart information overload together. Learn more.