Reactive LDA Library



In this post, we introduce a parallel LDA (Latent Dirichlet Allocation) library. It’s available on Github. The library is written in Scala with Akka actors (hence the name Reactive). The underlying algorithm is based on vanilla Gibbs sampling. We will show that this approach scales much better than training LDA with collapsed Gibbs sampler, due to the easiness of parallelism.


Traditionally, training of the Latent Dirichlet Allocation model falls into two categories: training with collapsed Gibbs sampler and training with variational Bayes. In this blog post, we limit our discussions on MCMC approaches, mainly because: 1) they are much simpler to implement and maintain; 2) In Griffiths & Steyvers (2004), it is reported that collapsed Gibbs sampler converges to a ground-truth model faster than variational Bayes.

We need to point out that an online variational Bayes algorithm exisits (Hoffman et al 2010). Hoffman provides a Python implementation, and it is used in the Gensim library. Empirical results show that our implementation can generate comparable (or better) models in comparable time. Mahout also has an implementation based on mean field variational Bayes. Note that all variational approaches have to modify the original graphical model to simply dependencies.

Recap of LDA and the Gibbs Sampler

In LDA model, each document d has a topic vector θd, which is a multinomial distribution over topics; each topic has a topic-word vector, which is a multinomial distribution over words. It models that each word in a document is independently generated in two steps: 1) sample a topic from θd; 2) sample a word from that topic. The objective of LDA learning is: given words in the corpus, infer topic-word distributions.

We now describe the Gibbs sampling algorithm. We follow the notations in Blei’s original LDA paper in 2003. Let θi be the document-topic vector for document i, let β be the topic-word matrix, where βt,w is the conditional probability p(word = w | topic = t). Let wi,j denote the j-th word in the i-th document, and zi,j be the hidden variable (topic) associated with wi,j . α and η are hyperparameters for Dirichlet distributions.

The vanilla Gibbs sampling algorithm is (c.f. Kevin Murphy’s machine learning book):


We can sample the variables in the following order (row by row, left to right):

Note that the rows with z’s can be sampled independently, as suggested by the updating formula. Once the first N rows are sampled, we can start sampling the β’s.

The more traditional approach for LDA training is to use collapsed Gibbs sampling. In that way, one only samples the z’s, by integrating out θ and β. This saves us some time in sampling, but we lose the ability to go parallel. (Integrating out θ and β introduces dependencies among the z’s.) Every time we sample a z, we need to update a counting matrix, then sample another z, update the counting matrix, and so on. This sequential fashion is not desirable in large scale case.

There are a few approximated collapsed Gibbs sampling algorithms for LDA, mostly relying on the hypothesis that the dependencies among the z’s are weak, so one just sample the z’s in parallel, hoping this would not degrade the model.

Gibbs Sampling with Akka Actors

In this section, we describe some implementation details. The actor system is illustrated below:



There are three types of actors:

  • DocIter: this is used for fetching documents. It our implementation, this could be in-memory or off-memory (reads from disk in batch).
  • Sampler: this is responsible for sampling θi and z’s.
  • Master: it connects to several samplers via an Akka router (e.g. RoundRobinRouter). Master stores θ and β. It is responsible for sampling β when a whole batch finishes.

Now we describe the interactions among actors:

  • Master to DocIter: sends a MiniBatchRequest message. In return, DocIter sends back a MiniBatchDocs message, which contains a set of documents.
  • Master to Sampler:
    • During initialization stage, it sends a uniform sampling request, with a doc.
    • In all subsequent iterations, it sends a sampling request, which contains (the reference of) doc i, θi , β.
  • Sampler to Master: sends sampled θi and z’s to Master.

Note that this has a strong Map-Reduce fashion. In the future, it would be interesting to see how this implementation generalizes to remote Akka actors.

 Memory consumption and performance

In current implementation, we hold the following variables in memory: θ, β, and a counting matrix C where Ct,w gives number of times word w is labeled as t. Note that we don’t store the corpus and the z’s in memory (which would dominate memory usage). Thus, if we have N documents, T topics, V vocabulary size, the memory requirement is roughly: 4*N*T for θ (if we use floats), 4*T*V for β (with floats), 4*T*V for the counting matrix. For example, if N = 3M, T = 500, V = 100K, the minimal memory requirement would be 10G. In the future, we might change the implementation to use sparse vectors to reduce memory footprint.

We have done some experiments on English wikipedia: 3M documents, 100K vocabulary size, 512 topics. On average, running 1 iteration of LDA takes roughly 10 minutes on an Amazon instance with 32 virtual CPUs. In practice, we can get a reasonably good model after 50 iterations, which is less than half a day.


Using vanilla Gibbs sampler instead of collapsed Gibbs sampler sounds like a step backwards. However, this gives us much better scalability. You can check out our library on Github. The library is in very early stage. Please feel free to make comments, ask questions, or better yet, make contributions.


We wrote this post while working on Kifi — tools that help people outsmart information overload together. Learn more.

Day 1: Finish the First Feature!


It was my first day at Kifi, and first thing in the morning…
… as I walked up the stairs to the apartment-like office, our CTO, Eishay, met me at the top. He had a beaming grin.

“So your job today is to finish a complete production cycle.”
“By the end of today, you should have a small feature ready to be committed, pushed and deployed onto our website.”
Uhhh… whoa, already? I haven’t even entered the front door yet!

It’s been about a week since I started here at Kifi. On the first day, I was also welcomed by Tamila, the rest of the team, and a small pot containing a bamboo plant. I was already feeling a bit jittery as I arrived at my desk space. I had some work to do, so I couldn’t wait to rev my engines and start getting my hands dirty!

As a newly hired engineer, by the end of the first day, I was expected to get a small feature committed to the company repositories and deployed on our website as the entire production cycle. But Jeez-Laweez, within hours of walking into the door, whatever I created would be seen by our entire user base! Or, if I’m terrabad, I might end up as that new guy who breaks builds on the first day. So uh yea, better not screw up!

Don't be THAT new guy!
Of course, the feature is small and relatively easy. But even so, I got some substantial exposure to many parts of the codebase, especially the extensive back-end that’s filled with Scala/Play black magic. Prior to starting at Kifi, I didn’t have any experience in this mythical Scala language – but even after re-reading a couple Scala snippets, they still look like straight-up transcribed Parsel-tongue / Dark Arts Magic.

Thankfully, the Kifi team is extremely supportive and approachable, so on my first day, I was able to deploy my first feature! Within a few more days, the codebase didn’t look so voodoo anymore and I built a fully refined feature (Exporting Keeps), with a complete UI, and extensive testings. And I really mean extensive – unit tests, database tests, integration tests, you name it, we got em all!

Something I really enjoy about Kifi is how detailed-oriented everybody is. Here at Kifi, you have a couple of all-star engineers who are super legit and busy cranking out some really advanced technology. But, even so, they’re still happy to stop and answer my Scala/Angular noob questions and provide ample feedback during code reviews. It’s kind of cool thinking about how the feedback you get is coming from a couple of industry wizards and later, you get to work side-by-side with them. Playing with this all-star team is like gaming alongside NBA veterans, like Kobe Bryant or Tim Duncan, as well as talented youngins, like Steph Curry (#GoDubs). Whether you’ve got a lot of experience or you’re transitioning from a different engineering field like me, there’s plenty of room to learn and everybody’s willing to help.

How far can we go together and how big can we get? I’d say it’s still way too early to tell, but I trust that we will continue to work hard and more importantly, think smart.

As for me…

still long ways to go
My Defense against the Dark Arts is slowly getting there… but probably still long ways to go?

We wrote this post while working on Kifi — tools that help people outsmart information overload together. Learn more.

Dynamic sizing content in Keep cards

For our latest design of, we’ve moved to a graphical, content-centric view for keeps. Each keep appears as a card that contains the title, description, primary image, and metadata about the keep itself. We wanted to design an experience where the user could quickly digest the content of the keep, so wanted to optimize the card presentation as much as possible.

Occasionally, we have either no description or a very short one — but we do have a large image. For cases like this, we display the image full sized:

Other times, we have no good image, but do have a textual description. This time, the description is full sized:

For times when we have both a description and image, we wanted an experience that would balance the image and description as best as possible, to create a newspaper / magazine style layout. This gets a bit tricky: the description and image can be any* size (strictly speaking, descriptions are between 20 and ~500 words, and images are larger than 50×50).

Since clients may have different window sizes, we needed to dynamically size the images and text. There’s a few ways to approach this. Initially, I attempted to size the content to minimize the total height of the card. Unfortunately, this gave a large bias towards smaller images, and most of my keeps cards looked the same.

Here’s are a few layout options for the same keep:

We agreed that even though the third design minimized the keep height, the second design looked the best. Remember that this is completely dependent on: a) the keep card size (i.e., my window size), the description length, and the image size. A rough 50/50 split between text and image doesn’t always work. Consider these keeps:

The visually optimal sizes, based on manual experimentation, seemed to use smaller images for these specific keeps. In fact, the “optimal” sizing seemed to vary quite a bit — we liked a 50/50 split when possible, but often 60/40, 70/30, or 80/20 was better. And on edge cases, 90/10 was sometimes best.

We decided that what we most wanted was to minimize wasted white space between the bottom of the image and the text.

Unfortunately, this is difficult to calculate directly. We know the keep card width, the description string, and the image dimensions, but it’s not simple to calculate the delta between the image height and the rendered text height (wasted space) since we don’t know how the text will be drawn. Using character and word counts for an approximation yielded very inconsistent results.

    var $sizer = angular.element('.kf-keep-description-sizer');
    var cardWidth = element.find('.kf-keep-contents')[0].offsetWidth;
    function calcHeightDelta(guessWidth) {
      function tryWidth(width) {
        $sizer[0].style.width = width + 'px';
        return $sizer[0].offsetHeight;
      var imageWidth = cardWidth - guessWidth;
      var imageHeight = imageWidth / (img.w / img.h);
      var textHeight = tryWidth(guessWidth);
      var delta = textHeight - imageHeight;
      return { guess: guessWidth, delta: Math.abs(delta), textHeight: Math.ceil(textHeight), imageHeight: imageHeight};

By rendering a keep card off screen hidden from the user, we can directly measure these. We can avoid a full reflow by absolutely positioning the card, so just pay the cost of the browser rendering the card, and us measuring the delta. My initial approach was to set the image width, then binary search for the minimum delta: take two samples, determine the better delta, try again with a better image width guesses. Unfortunately, the delta doesn’t follow a smooth curve.

This shouldn’t be too surprising: words wrap, causing jumps in the delta. Thus, simply binary searching can’t be trusted to find the actual minimum. Furthermore, in other examples, there were two primary local minimum (two different image widths that yielded local minimums).

Instead of complicating the binary search, we simplified it: we’d take a sample every 40px, and choose the best. This most often yielded very close to the optimal image width, with far fewer samples.

    var i = 0;
    var low = 200, high = cardWidth - 80; // text must be minimum 200px wide, max total-80
    var guess = (high - low) / 2 + low;
    var res = calcHeightDelta(guess);
    var bestRes = res;
    while(low + i < high && > 20) {
      res = calcHeightDelta(low + i);
      if ( > {
        bestRes = res;
      i += 40;

We then sized the card accordingly, taking care to size the text height so that we don’t cut a line in two horizontally. This yielded quite nice results, and was fast enough. After playing with this with tens of keeps, we found several examples where the image size did not change the delta significantly: for a large range of image sizes, the delta was roughly the same. This would give slightly arbitrary minimums, so we added a scoring mechanism so we could give a bias for sizes that yielded roughly a 50/50 split.

    var optimalWidth = Math.floor(cardWidth * 0.50);
    // ...
    var score = Math.abs(delta) + 0.3 * Math.abs(optimalWidth - imageWidth); // 30% penalty for distance away from optimal width

We can also tweak the score for image sizes that are larger than their originals, since it looks bad:

    if (imageHeight > img.h) {
      score += (imageHeight - img.h);
    if (imageWidth > img.w) {
      score += (imageWidth - img.w);

This delta minimization + scoring mechanism gave us a very pleasant range of keep card variations, with some small images, some large, and several in between.



We wrote this post while working on Kifi — tools that help people outsmart information overload together. Learn more.

A Spotlight Effect for Interactive Guides in HTML

For decades, video game designers have built tutorials or getting-started guides into their games to help new players ramp up quickly. Nowadays it’s common to see helpful overlays introducing me to a user interface when I try a new app on my phone or sign up for a new service online. The best guides or tutorials allow me to actually use the interface they’re teaching me about, since people learn best by doing.

One technique often used to call attention to part of a web page or application is masking. Since a hole can’t be cut through a DOM element to permit interaction with an element behind it, four elements (rectangles) are typically assembled around the focal area to mask out and deemphasize everything else.

Element arrangement for a traditional rectangular mask

Element arrangement for a traditional rectangular mask
Demo: Traditional Rectangular Masking

I recently worked on a new Kifi user guide, one of several exciting new product updates. The design called for a round spotlight mask. Creating a circle or ellipse with positive space is simple (border-radius: 50%), but creating one with negative space (a hole) can be tricky. My first idea was to use transparency in radial gradients on a traditional four-element mask to create the illusion of a single round hole. The smallest circle we can form this way circumscribes the rectangular hole.

Radial gradients applied to the elements of a traditional rectangular mask

Radial gradients applied to the elements of a traditional rectangular mask
Demo: Round Spotlight Formed Using Radial Gradients

This radial gradient approach mostly works, but has several problems. A floating point precision issue in Chrome results in unpredictable misalignment and distracting notches in the edge of the spotlight. No major browser consistently renders a smooth, antialiased boundary between the transparent and opaque portions of the radial gradient. In addition, the frame rate can be suboptimal when animating a spotlight formed from radial gradients since the browser must recalculate every pixel color value in the entire mask for every frame, even if the spotlight is relatively small or only making a small movement from one frame to the next.

To address these issues, I ditched the radial gradients and instead constructed the mask from four HTML canvas elements, which allow for precise control of every pixel and smooth antialiasing.

Demo: Round Spotlight Drawn on Canvas Elements

I encountered a few complications using canvas elements. First, when a canvas is resized, it is automatically cleared. Any content must be fully redrawn into the canvas. To avoid incurring this cost for every animation frame, I initialize each canvas element to be the full viewport size, allowing portions of each canvas to extend beyond the viewport. This way, all that needs to be done to move the spotlight from one frame to the next is:

for each of the four canvas elements:
1. translate the canvas so that one side forms an edge of the new rectangular hole
2. black out the portion of the canvas that contained part of the spotlight in the previous frame
3. draw the spotlight in the canvas at its new position, with its new size

A nice property of this approach is that most of the pixels in the drawing surface of each canvas element are untouched from one frame to the next. As a result, the browser can achieve high frame rates (smooth motion), especially when the spotlight is only covering a small portion of the screen. Note: In situations where you know a priori that a spotlight will only ever move within a small portion of the viewport, you can consume less memory by using smaller canvas elements.

Two possible arrangements of viewport-sized canvas elements, the second avoiding overlap

Overlapping arrangement of viewport-sized canvas elementsNo-overlap arrangement of viewport-sized canvas elements

A second issue I encountered using canvas elements is that browsers currently disagree on how a canvas should behave on high-resolution displays. Some upscale the logical canvas to the high-resolution display, which can create aliasing effects such as banding along the edges of the four canvas elements comprising the mask (a deal breaker for this effect). Fortunately, the upscaling can be detected and avoided on browsers that do it.

I hope you found this description and demo of our spotlight effect interesting and helpful. For more implementation details, please dive into the JavaScript source code included with this post in your browser’s developer tools (spotlight_figures.js, spotlight_rect.js, spotlight_gradient.js, spotlight_canvas.js). We also welcome questions in the comments area below.

In a future post, I’ll describe how we built the animated curved arrow that our Guide to Kifi uses in conjunction with the spotlight effect.

We wrote this post while working on Kifi — tools that help people outsmart information overload together. Learn more.

Typesafe case study on Kifi

We were happy to participate in a case study that Typesafe wrote on Kifi. Specifically, they were interested in how we used the Typesafe Platform to build the reactive and scalable platform that powers Kifi.

The service’s unique approach to search is based on users easily tagging and keeping anything found online – articles, videos, pictures, email – allowing these saved items to be easily found on top of favorite search engine results from that point forward. It’s like a higher powered bookmarking system that includes the best elements of search and social.

Read it here. Big thanks to Laura Masterson and Alex Salkever for doing this.

We wrote this post while working on Kifi — tools that help people outsmart information overload together. Learn more.

Scala macro annotations: a real-world example

TL;DR: Also, sign up for Kifi.

Scala macros can be a great tool to reduce boilerplate code. For example, Play provides macros for easily creating Json serializers. Given a case class:

case class A(field1: Int, field2: String)

it is easy to define a serializer inside the companion object:

object A {
  implicit val format: Format[A] = Json.format[A]

Here, Json.format[A] is a macro that will transform the definition of format at compile time into the following full-fledged serializer:

implicit val format: Format[A] = (
  (__ \ ‘field1).read[Int]) and
  (__ \ ‘field2).read[String])
)(A.apply, unlift(A.unapply))

This implicit format can then be used by the methods Json.toJson and Json.fromJson to serialize/deserialize instances of class A.

While this already greatly simplifies the creation of serializers, here are two small caveats:

  1. In our production code for Kifi, we have many different data models represented by case classes. In most cases we also want a Json format, so every time we need to create a companion object whose sole purpose is to contain the format definition. Can we make that even simpler?
  2. Sometimes the case class has only one field, i.e. it is just a wrapper around another type such as String or Int. For example:
    case class Age(value: Int)

    In that case, Play’s default serialization will be {"value": 34}, which is fine but we’d rather have just 34, since it’s shorter and there is no ambiguity on the field. So we need to come up with our own format, which is something like:

    object Age {
      implicit val format: Format[Age] = Format([Int].map(s, Age(s)),
        new Writes[Age] { def writes(o: Age) = Json.toJson(o.value) }

    It would be nice if this could be simplified.

  3. What should we do?


It turns out these two problems can be elegantly solved with a single Scala macro annotation. Macro annotations are a new experimental feature of Scala macros, only available in macro paradise for the moment. Before we go through any technical details, here is what the result looks like:

@json case class A(field1: Int, field2: String)

That’s it! No need to explicitly create a companion object, and if the case class has only one field, we will get our “short” serializer instead of Play’s default one. The only thing we need to do is to add the @json annotation in front of the class definition. How did that magic happen?

During compilation, the @json annotation triggers the execution of a macro on the class definition. This way the class definition can be modified as desired. Even better, this also allows the companion object to be modified, or created if it doesn’t already exist. Here’s how the definition of the annotation looks:

class json extends StaticAnnotation {
  def macroTransform(annottees: Any*) = macro impl
  private def impl(c: Context)(annottees: c.Expr[Any]*): c.Expr[Any] = {
    import c.universe._
    def modifiedClass(classDecl: ClassDef, compDeclOpt: Option[ModuleDef]) = {
      // Class modification logic goes here
    } match {
      case (classDecl: ClassDef) :: Nil => modifiedClass(c, classDecl)
      case (classDecl: ClassDef) :: (compDecl: ModuleDef) :: Nil => modifiedClass(c, classDecl, Some(compDecl))
      case _ => c.abort(c.enclosingPosition, "Invalid annottee")

The annotation is created by extending StaticAnnotation, and providing a macro implementation for the macroTransform method (using the keyword macro). The implementation itself inspects the annottees (i.e. the syntax tree nodes corresponding to the code under the annotation). It extracts the class declaration (classDecl), and also the companion object declaration (compDecl) if available.

The helper method modifiedClass is where the black magic takes place. The goal is to create an Abstract Syntax Tree (= a piece of code, but in a form that the compiler understands) corresponding to the class + companion object declaration that you want to produce. Without going too much into the specifics, here are the main steps of the transformation (the complete source code can be found here):

  • Use unquoting to extract the name of the case class and its fields (we need to know how many fields the case class has in order to generate the correct Json formatter):
    val (className, fields) = try {
      val q"case class $className(..$fields) extends ..$bases { ..$body }" = classDecl
      (className, fields)
    } catch {
      case _: MatchError => c.abort(c.enclosingPosition, "Annotation is only supported on case class")

    Note the use of c.abort for aborting the execution of the macro with an error message.

  • Generate the appropriate json format. We’re using unquoting again, but this time it’s the other way around since we’re putting things in quasiquotes instead of extracting them:
    val format = fields.length match {
      case 0 => c.abort(c.enclosingPosition, "Cannot create json formatter for case class with no fields")
      case 1 =>
        // Only one field, use the serializer for the field
          implicit val jsonAnnotationFormat = {
            import play.api.libs.json._
    [${fields.head.tpt}].map(s => ${className.toTermName}(s)),
              new Writes[$className] { def writes(o: $className) = Json.toJson(o.${}) }
      case _ =>
        // More than one field, use Play's Json.format[T] macro
        q"implicit val jsonAnnotationFormat = play.api.libs.json.Json.format[$className]"
  • Use unquoting both ways (for extraction and insertion) to insert the format in the companion object if it already exists:
    val q"object $obj extends ..$bases { ..$body }" = compDecl
      object $obj extends ..$bases {

A more detailed explanation of unquoting is available in the documentation.

Finally, remember that using macros requires compilation to happen in two steps: first, compile the macros, then compile the code where the macros are used. This is necessary so that your macros can be run before the rest of your code is compiled. For example if you use SBT, you can configure Build.scala to use two modules, a “macros” module containing your macros, and a “root” module that depends on the “macros” module.

Sounds great, can I use it?

Yes! But keep in mind that macro annotations are still experimental. If you use Play (version 2.1 or higher) you should just need to add the following two settings to your build configuration:

libraryDependencies += "com.kifi" %% "json-annotation" % "0.1"

addCompilerPlugin("org.scalamacros" % "paradise" % "2.0.1" cross CrossVersion.full)

Then import com.kifi.macros.json whenever you want to use the @json annotation. More information can be found at along with the source code, in case you want to start writing your own macro annotations. Happy hacking! :)

We wrote this post while working on Kifi — tools that help people outsmart information overload together. Learn more.

Reactive Scala Wrapper for Amazon SQS


The backend for Kifi is built as a fully Service Oriented Architecture. In the context of that, we run into two issues:

  1. The System needs to be able to deal with the loss of any machine so long as there are others of the same type. In other words, a machine may be shut down rather abruptly at any time. This is not just to deal with failure but also to allow us to do Continuous Deployment in a seamless fashion. Real time inter service calls can be fairly easily made robust by smart request routing (a subject for another post), but we still needed a way of resuming longer running tasks that are in progress when a machine goes down.
  2. We frequently move large amounts of data from one service to another. We don’t want one service requesting or sending a lot of data to be able to overwhelm another, i.e. we needed a mechanism that would allow the producer of some data to produce as fast as it want to and the consumer to consume as fast as it wants to.

Since we run on top of AWS, we chose Amazons Simple Queue Service (SQS) to deal with these two issues. SQS is basically a managed persistent message passing system.

The fact that it’s persistent (for up to 2 weeks) pretty much deals with issue number 2 above, i.e. the producer can just put data into the queue as fast and as much as it wants to and it will happily sit there for the consumer to catch up.

SQS also supports long polling for messages, which makes it easy to reactively read from the queue.

Issue number 1 above is solved by the fact that SQS messages are read from the queue and deleted separately. When a message is read from the queue it is not actually removed, but becomes invisible to clients for a certain (configurable) amount of time. The client who read the message can then process it and once done delete the message. If there is an error (e.g. because the service was shut down in the middle of processing) the message will reappear in the queue and another service can process it.


Amazon provides a very nice Java library for dealing with SQS (which already has reactive elements), but since we run on Scala/Play we wanted something that is easier to use from Scala, as reactive as possible and plays nicely with the Play Framework.

We ended up writing a reactive Scala wrapper for the Amazon Java Library that, in addition to lower level queue operation also lets you have typed queues, so long as the type has a Play! style implicit Json serializer.

For example, suppose you have some type Data with an implicit Play Json serializer (a Format[Data]) sending an object of that type to the queue with the native Java Client looks something like this

val sqs = new AmazonSQSAsyncClient()
val request = new SendMessageRequest
   new AsyncHandler[SendMessageRequest,SendMessageResult]{
       def onError(exception: Exception) = ???
       def onSuccess(req: SendMessageRequest, res: SendMessageResult) = ???

Using Franz it looks like this

val sqs = SimpleSQSClient(, Regions.US_WEST_1)
val queue = sqs.formatted[Data]()

where the send call simply returns a future which will resolve successfully iff the message has been successfully sent and contain an exception otherwise.

You can find the actual library (MIT licensed) as well as technical documentation here: It’s available from Maven central, so if you are using sbt, just add

"com.kifi" % "franz_2.10" % "0.3.2"

to your dependencies and you should be good to go. All classes are in “com.kifi.franz”.

It’s still in early stages of development, but the existing feature set is quite robust. We make heavy use of it in Production. Feedback in the form of issues, pull requests, etc. very welcome.


We wrote this post while working on Kifi — tools that help people outsmart information overload together. Learn more.

Reflections on Companion Type Systems

TLDR: When manipulating a sealed trait and its subclasses, it can be useful to define a dual type hierarchy on their companion objects. We can use reflection to test this “Companion Type System” and safely gather all of its members.

That was short, now keep this post so that you can read it later!

Example: Polymorphic Serialization in a Persistent Event Queue

In a previous post, we described how database updates had to be propagated through our system in order to be indexed with Lucene.

Now, in a similar way, we want to use a persistent event queue in order to propagate some updates from one storage system (e.g. database) to others (e.g. graph, feature extraction engine…). The queue needs to support several types of events:

sealed trait Event
case class AddUser(userId: Long) extends Event
case class AddFriendship(userId: Long, friendId: Long) extends Event
case class ForgetFriendship(userId: Long, friendId: Long) extends Event

Assuming we use Json to get our events in and out of the queue, we have to define a Json formatter for Event. For it to work in a polymorphic fashion, some information about the event subtype has to be serialized together with the actual instance. In this way, at deserialization, we can read that type information before picking up a deserializer that will instantiate the proper event subtype.

In short, we need two components for each subtype:
- a Json formatter able to translate a subclass instance to/from a serialized value.
- some sort of header that serves as a serializable identifier of the subtype.

Companion objects are a natural home for these helpers and here’s what our solution looks like:

import play.api.libs.json._
sealed trait EventCompanion[E <: Event] {
  def format: Format[E]
  def header: String
object EventCompanion {
  val all: Set[EventCompanion[_ <: Event]] = Set(AddUser, AddFriendship, ForgetFriendship)
  val byHeader: Map[String, EventCompanion[_ <: Event]] = {
    require( == all.size, "Type headers must be unique.") { companion => companion.header -> companion }.toMap
sealed trait Event { self =>
  type E >: self.type <: Event
  def companion: EventCompanion[E]
  def instance: E = self
object Event {
  val format: Format[Event] = new Format[Event] {
    def writes(event: Event): JsValue = Json.obj(
      "header" -> event.companion.header, 
      "value" -> event.companion.format.writes(event.instance)
    def reads(json: JsValue): JsResult[Event] = (json  "header").validate[String].flatMap { header =>
      EventCompanion.byHeader(header).format.reads(json  "value")
case class AddUser(userId: Long) extends Event {
  type E = AddUser
  def companion = AddUser
case object AddUser extends EventCompanion[AddUser] {
  val format = Json.format[AddUser]
  val header = "add_user"
case class AddFriendship(userId: Long, friendId: Long) extends Event {
  type E = AddFriendship
  def companion = AddFriendship
case object AddFriendship extends EventCompanion[AddFriendship] {
  val format = Json.format[AddFriendship]
  val header = "add_friendship"
case class ForgetFriendship(userId: Long, friendId: Long) extends Event {
  type E = ForgetFriendship
  def companion = ForgetFriendship
case object ForgetFriendship extends EventCompanion[ForgetFriendship] {
  val format = Json.format[ForgetFriendship]
  val header = "forget_friendship"

In order to write our polymorphic Event formatter, each event subtype’s companion object had to define the proper serialization helpers. Thus we ended up creating an EventCompanion trait that every companion inherits from.

Let’s take a step back and review this pattern.

Introducing Companion Type Systems

Scala’s OOP features make it possible for instances of different classes to share some common interface. Simply have each of these classes be a subtype of some common superclass (or trait) that defines the interface – classic stuff.

Now, it is sometimes desirable for the companion objects of these classes to share some common interface as well. We can achieve that in a similar fashion, having each companion object extend some common superclass. This set of typed companion objects is referred to as the “Companion Type System” of the class (yes, we have just made that up).

In short, the Companion Type System makes it possible to take advantage of polymorphism not only when manipulating dynamic instances (thanks to the main class hierarchy), but also when no instance is around (thanks to the static companion hierarchy).

This situation often arises in the context of serialization / deserialization of some class hierarchy. At FortyTwo, we rely on this pattern in several places: persistent event queues, storage and retrieval of vertex and edge data in our graph, conversion of typed database ids to global ids…

Companion Type Systems even use F-Bounded Types

So far, we have overlooked the use of an F-bounded type E in our example, both in the sealed trait Event and the companion trait EventCompanion. In short, an F-bounded type is used to refer to an unknown subclass in the context of some superclass. It provides us with finer reasoning assumptions and extra type-safety as the compiler can better check the consistency of the class API.

F-bounded types are also known as self-recursive types, though while this name makes sense for the main trait (Event), it does not for the companion trait (EventCompanion). Indeed in both cases, E refers to the subclasses of Event, thus we can only qualify it as “self-recursive” in Event. So we’ll stick with “F-bounded type” here, plus it sounds smart enough.

An F-bounded type can be introduced either as a type parameter (cf. EventCompanion) or as an abstract type member (cf. Event). We won’t discuss the advantages of one over the other in general. However, we found it hard to make the compiler happy when both F-bounded types are type parameters, mainly because of mutually dependent wildcards showing up here and there. In order to avoid this, it seems best to have the self-recursive type be a type member, like in our example. Besides, it is usually useful to make the F-bounded type of the companion trait a type parameter for it to be a part of the companions’ type signatures, as you may end up passing them around implicitly (again, like in our example).

Final Reflections

Finally, here is how the Companion Type System pattern lays out in our example:

Companion Type System.001.jpg.001

Now let’s say someone wants to define a new event AddHostility? There are two main risks:
- not setting up F-bounded type E properly in the AddHostility subclass or its companion object
- forgetting to explicitly register AddHostility in the set of all companion objects in EventCompanion

Thanks to the bounds on E and the companion accessor defined in the Event trait, the compiler should actually catch most issues related to the first point. Regarding the second point, we would like to use reflection in order to gather all companion objects in EventCompanion and rewrite it as follows:

object EventCompanion {
  val all: Set[EventCompanion[_ <: Event] = CompanionTypeSystem[Event, EventCompanion[_ <: Event]]
  def byHeader(header: String): Map[String, EventCompanion[_ <: Event]] = {
    require( == all.size, "Type headers must be unique.") { companion => companion.header -> companion }.toMap

Now as we reflect on the class hierarchy to collect each companion object, we may as well check a few assumptions on our companion type system and make sure that all F-bounded types have been set up correctly. These tests can provide useful specifications anytime we design a new companion type system.

Our implementation of CompanionTypeSystem is a good opportunity to explore the Scala reflection API.

Let’s outline the logic of:

 def apply[SealedClass: TypeTag, Companion: TypeTag](fBoundedType: String): Set[Companion]

As a reminder, in our example, this method is called with SealedClass = Event, Companion = EventCompanion[_ <: Event] and fBoundedType = "E".

In a first step, we reflect on the top level classes (traits) SealedClass and Companion in order to figure out whether fBoundedType is introduced as a type parameter or as an abstract type member. In both cases, we want to check assumptions on its upper bound and on whether or not it has been set up as a self-recursive type (only expected in SealedClass). This happens in:

def getTypeWithTypeParameterOrElseCheckTypeMember(owner: Type, fBoundedType: TypeName, upperBound: Type, isSelfRecursive: Boolean): Option[Type => Type]

In a second step, for each of SealedClass‘s child, we must check that fBoundedType has been assigned with the subclass’ own type, both in the subclass itself and in its companion object. While checking the value of an abstract type member is pretty straightforward, checking the value of a type parameter reduces to checking inheritance after setting the type argument. This is why getTypeWithTypeParameterOrElseCheckTypeMember returns a function if fBoundedType actually is a type parameter. That function enables us to substitute type parameter fBoundedType with a valid type argument and use the resulting type in inheritance checks.

We want to iterate over all the children of SealedClass. This is where sealing the main class/trait is coming handy, even though our experience suggests that this is something that you want anyway wherever a companion type system seems like a good idea.

Note that nothing requires the Companion trait to be sealed, even though EventCompanion is in our example and each object is a case object. This comes for free since a companion object has to be declared next to its class anyway. It provides extra convenience in pattern matching as we find ourselves passing these companions around, regarding them as rich type representatives of their class at runtime.

And that’s it, we test our assumptions on abstract type members and/or type parameters for each subclass and collect all the companion objects, which end up forming our Companion Type System.

Editor’s Note: Any convenient bending of theoretical concepts in the context of this post is the author’s sole responsibility and does not necessarily represent the views of FortyTwo. Check out Kifi!

We wrote this post while working on Kifi — tools that help people outsmart information overload together. Learn more.

From word2vec to doc2vec: an approach driven by Chinese restaurant process

Google’s word2vec project has created lots of interests in the text mining community. It’s a neural network language model that is “both supervised and unsupervised”. Unsupervised in the sense that you only have to provide a big corpus, say English wiki. Supervised in the sense that the model cleverly generates supervised learning tasks from the corpus. How? Two approaches, known as Continuous Bag of Words (CBOW) and Skip-Gram (See Figure 1 in this paper). CBOW forces the neural net to predict current word by surrounding words, and Skip-Gram forces the neural net to predict surrounding words of the current word. Training is essentially a classic back-propagation method with a few optimization and approximation tricks (e.g. hierarchical softmax).

Word vectors generated by the neural net have nice semantic and syntactic behaviors. Semantically, “iOS” is close to “Android”. Syntactically, “boys” minus “boy” is close to “girls” minus “girl”. One can checkout more examples here.

Although this provides high quality word vectors, there is still no clear way to combine them into a high quality document vector. In this article, we discuss one possible heuristic, inspired by a stochastic process called Chinese Restaurant Process (CRP). Basic idea is to use CRP to drive a clustering process and summing word vectors in the right cluster.

Imagine we have an document about chicken recipe. It contains words like “chicken”, “pepper”, “salt”, “cheese”. It also contains words like “use”, “buy”, “definitely”, “my”, “the”. The word2vec model gives us a vector for each word. One could naively sum up every word vector as the doc vector. This clearly introduces lots of noise. A better heuristic is to use a weighted sum, based on other information like idf or Part of Speech (POS) tag.

The question is: could we be more selective when adding terms? If this is a chicken recipe document, I shouldn’t even consider words like “definitely”, “use”, “my” in the summation. One can argue that idf based weights can significantly reduce noise of boring words like “the” and “is”. However, for words like “definitely”, “overwhelming”, the idfs are not necessarily small as you would hope.

It’s natural to think that if we can first group words into clusters, words like “chicken”, “pepper” may stay in one cluster, along with other clusters of “junk” words. If we can identify the “relevant” clusters, and only summing up word vectors from relevant clusters, we should have a good doc vector.

This boils down to clustering the words in the document. One can of course use off-the-shelf algorithms like K-means, but most these algorithms require a distance metric. Word2vec behaves nicely by cosine similarity, this doesn’t necessarily mean it behaves as well under Eucledian distance (even after projection to unit sphere, it’s perhaps best to use geodesic distance.)

It would be nice if we can directly work with cosine similarity. We have done a quick experiment on clustering words driven by CRP-like stochastic process. It worked surprisingly well — so far.

Chinese Restaurant Process

Now let’s explain CRP. Imagine you go to a (Chinese) restaurant. There are already n tables with different number of peoples. There is also an empty table. CRP has a hyperparamter r > 0, which can be regarded as the “imagined” number of people on the empty table. You go to one of the (n+1) tables with probability proportional to existing number of people on the table. (For the empty table, the number is r). If you go to one of the n existing tables, you are done. If you decide to sit down at the empty table, the Chinese restaurant will automatically create a new empty table. In that case, the next customer comes in will choose from (n+2) tables (including the new empty table).

Inspired by CRP, we tried the following variations of CRP to include the similarity factor. Common setup is the following: we are given M vectors to be clustered. We maintain two things: cluster sum (not centroid!), and vectors in clusters. We iterate through vectors. For current vector V, suppose we have n clusters already. Now we find the cluster C whose cluster sum is most similar to current vector. Call this score sim(V, C).

Variant 1: v creates a new cluster with probability 1/(1 + n). Otherwise v goes to cluster C.

Variant 2: If sim(V, C) > 1/(1 + n), goes to cluster C. Otherwise with probability 1/(1+n) it creates a new cluster and with probability n/(1+n) it goes to C.

In any of the two variants, if v goes to a cluster, we update cluster sum and cluster membership.

There is one distinct difference to traditional CRP: if we don’t go to empty table, we deterministically go to the “most similar” table.

In practice, we find these variants create similar results. One difference is that variant 1 tend to have more clusters and smaller clusters, variant 2 tend to have fewer but larger clusters. The examples below are from variant 2.

For example, for a chicken recipe document, the clusters look like this:

  • ‘cayenne’, ‘taste’, ‘rating’, ‘blue’, ‘cheese’, ‘raved’, ‘recipe’, ‘powdered’, ‘recipe’, ‘dressing’, ‘blue’, ‘spicier’, ‘spoon’, ‘cup’, ‘cheese’, ‘cheese’, ‘blue’, ‘blue’, ‘dip’, ‘bake’, ‘cheese’, ‘dip’, ‘cup’, ‘blue’, ‘adding’, ‘mix’, ‘crumbled’, ‘pepper’, ‘oven’, ‘temper’, ‘cream’, ‘bleu’, ……
  • ‘the’, ‘a’, ‘that’, ‘in’, ‘a’, ‘use’, ‘this’, ‘if’, ‘scant’, ‘print’, ‘was’, ‘leftovers’, ‘bring’, ‘a’, ‘next’, ‘leftovers’, ‘with’, ‘people’, ‘the’, ‘made’, ‘to’, ‘the’, ‘by’, ‘because’, ‘before’, ‘the’, ‘has’, ‘as’, ‘amount’, ‘is’, ……
  • ‘stars’, ‘big’, ‘super’, ‘page’, ‘oct’, ‘see’, ‘jack’, ‘photos’, ‘extras’, ‘see’, ‘video’, ‘one’, ‘page’, ‘f’, ‘jun’, ‘stars’, ‘night’, ‘jul’, ……

Apparently, the first cluster is most relevant. Now let’s take the cluster sum vector (which is the sum of all vectors from this cluster), and test if it really preserves semantic. Below is a snippet of python console. We trained word vector using the c implementation on a fraction of English Wiki, and read the model file using python library gensim.model.word2vec. c[0] below denotes the cluster 0.

>>> similar(c[0], model["chicken"])


>>> similar(c[0], model["recipe"] + model["chicken"])


>>> similar(c[0], model["recipe"] + model["fish"])


>>> similar(c[0], model["computer"])


>>> similar(c[0], model["scala"])


Looks like the semantic is preserved well. It’s convincing that we can use this as the doc vector.

The recipe document seems easy. Now let’s try something more challenging, like a news article. News articles tend to tell stories, and thus has less concentrated “topic words”. We tried the clustering on this article, titled “Signals on Radar Puzzle Officials in Hunt for Malaysian Jet”. We got 4 clusters:

  • ‘have’, ‘when’, ‘time’, ‘at’, ‘when’, ‘part’, ‘from’, ‘from’, ‘in’, ‘show’, ‘may’, ‘or’, ‘now’, ‘on’, ‘in’, ‘back’, ‘be’, ‘turned’, ‘for’, ‘on’, ‘location’, ‘mainly’, ‘which’, ‘to’,, ‘also’, ‘from’, ‘including’, ‘as’, ‘to’, ‘had’, ‘was’ ……
  • ‘radar’, ‘northwest’, ‘radar’, ‘sends’, ‘signal’, ‘signals’, ‘aircraft’, ‘data’, ‘plane’, ‘search’, ‘radar’, ‘saturated’, ‘handles’, ‘search’, ‘controlled’, ‘detection’, ‘data’, ‘nautical’, ‘patrol’, ‘detection’, ‘detected’, ‘floating’, ‘blips’, ‘plane’, ‘objects’, ‘jets’, ‘kinds’, ‘signals’, ‘air’, ‘plane’, ‘aircraft’, ‘radar’, ‘passengers’, ‘signal’, ‘plane’, ‘unidentified’, ‘aviation’, ‘pilots’, ‘ships’, ‘signals’, ‘satellite’, ‘radar’, ‘blip’, ‘signals’, ‘radar’, ‘signals’ ……
  • ‘of’, ‘the’, ‘of’, ‘of’, ‘of’, ‘the’, ‘a’, ‘the’, ‘senior’, ‘the’, ‘the’, ‘the’, ‘the’, ‘the’, ‘the’, ‘a’, ‘the’, ‘the’, ‘the’, ‘the’, ‘the’, ‘of’, ‘the’, ‘of’, ‘a’, ‘the’, ‘the’, ‘the’, ‘the’, ‘the’, ‘the’, ‘its’, ……
  • ‘we’, ‘authorities’, ‘prompted’, ‘reason’, ‘local’, ‘local’, ‘increasing’, ‘military’, ‘inaccurate’, ‘military’, ‘identifying’, ‘force’, ‘mistaken’, ‘expanded’, ‘significance’, ‘military’, ‘vastly’, ‘significance’, ‘force’, ‘surfaced’, ‘military’, ‘quoted’, ‘showed’, ‘military’, ‘fueled’, ‘repeatedly’, ‘acknowledged’, ‘declined’, ‘authorities’, ‘emerged’, ‘heavily’, ‘statements’, ‘announced’, ‘authorities’, ‘chief’, ‘stopped’, ‘expanding’, ‘failing’, ‘expanded’, ‘progress’, ‘recent’, ……

Again, looks decent. Note that this is a simple 1-pass clustering process and we don’t have to specify number of clusters! Could be very helpful for latency sensitive services.

There is still a missing step: how to find out the relevant cluster(s)? We haven’t yet done extensive experiments on this part. A few heuristics to consider:

  • idf weight
  • POS tags. We don’t have to tag every single word in the document. Empirically, word2vecs tend to group syntactically as well. So We only have to sample a few tags from each cluster.
  • compare cluster sum vectors with title vector (although this depends on the quality of title)

There are other problems to think about: 1) how do we merge clusters? Based on similarity among cluster sum vectors? Or averaging similarity between cluster members? 2) what is the minimal set of words that can reconstruct cluster sum vector (in the sense of cosine similarity)? This could be used as a semantic keyword extraction method.

Conclusion: Google’s word2vec provides powerful word vectors. We are interested in using these vectors to generate high quality document vectors in an efficient way. We tried a strategy based on a variant of Chinese Restaurant Process and obtained interesting results. There are some open problems to explore, and we would like to hear what you think.

Appendix: python style pseudo-code for similarity driven CRP
# vecs: an array of real vectors
def crp(vecs):
    clusterVec = []         # tracks sum of vectors in a cluster
    clusterIdx = []         # array of index arrays. e.g. [[1, 3, 5], [2, 4, 6]]
    ncluster = 0
    # probablity to create a new table if new customer
    # is not strongly "similar" to any existing table
    pnew = 1.0/ (1 + ncluster)  
    N = len(vecs)
    rands = random.rand(N)         # N rand variables sampled from U(0, 1)
    for i in range(N):
        maxSim = -Inf
        maxIdx = 0
        v = vecs[i]
        for j in range(ncluster):
            sim = cosine_similarity(v, clusterVec[j])
            if sim < maxSim:
                maxIdx = j
                maxSim = sim
            if maxSim < pnew:
                if rands(i) < pnew:
                    clusterVec[ncluster] = v
                    clusterIdx[ncluster] = [i]
                    ncluster += 1
                    pnew = 1.0 / (1 + ncluster)
        clusterVec[maxIdx] = clusterVec[maxIdx] + v
    return clusterIdx

We wrote this post while working on Kifi — tools that help people outsmart information overload together. Learn more.

Hooking in to Play! Framework’s sbt plugin lifecycle

In the Play documentation, we see how to hook in to the Play application’s life cycle with Global’s onStart, onStop, etc. However, sometimes you want to hook into the Play sbt plugin lifecycle for development or build purposes. Most usefully, you may want to run another task or process every time Play compiles, starts up (pre-Global), or stops (post-Global).

If you want to simply hook into the sbt compile step, you can do so before or after:

// Define your task
val yourCompileTask = TaskKey[Unit]("Your compile task")
val yourProject = play.Project("yourApp", "1.0").settings(
  // Example using a sbt task:
  yourCompileTask := {
    println("++++ sbt compile start!")
  (compile in Compile) &lt;&lt;= (compile in Compile) dependsOn (yourCompileTask),
  // Or just running Scala code:
  (compile in Compile) &lt;&lt;= (compile in Compile) map { result =&gt;
    println(s"++++ sbt compile end: $result")

More powerful, however, is to create a PlayRunHook:

object SimplePlayRunHook {
  import play.PlayRunHook
  def apply(base: File): PlayRunHook = {
    new PlayRunHook {
      override def beforeStarted(): Unit = {
        println(s"++++ simplePlayRunHook.beforeStarted: $base")
      override def afterStarted(address: InetSocketAddress): Unit = {
        println(s"++++ simplePlayRunHook.afterStarted: $address, $base")
      override def afterStopped(): Unit = {
        println("++++ simplePlayRunHook.afterStopped")
val yourProject = play.Project("yourApp", "1.0").settings(
  playRunHooks &lt;+= =&gt; SimplePlayRunHook(base))

Here’s what it looks like with both:

[kifi-backend] $ compile
++++ sbt compile start!
++++ sbt compile: Analysis: 3 Scala sources, 8 classes, 24 external source dependencies, 5 binary dependencies
[success] Total time: 8 s, completed Mar 3, 2014 11:51:16 AM
[kifi-backend] $ run

++++ simplePlayRunHook.beforeStarted: /Users/andrew/Documents/workspace/kifi/kifi-backend
--- (Running the application from SBT, auto-reloading is enabled) ---

[info] play - Listening for HTTP on /0:0:0:0:0:0:0:0:9000
++++ simplePlayRunHook.afterStarted: /0:0:0:0:0:0:0:0:9000, /Users/andrew/Documents/workspace/kifi/kifi-backend

(Server started, use Ctrl+D to stop and go back to the console...)

++++ simplePlayRunHook.afterStopped

[success] Total time: 12 s, completed Mar 3, 2014 11:51:32 AM

A common use case of these hooks is to trigger an auxiliary build process, like for a compiles-to-JS language or build tool such as Gulp/Grunt.

We wrote this post while working on Kifi — tools that help people outsmart information overload together. Learn more.