## Applying word2vec to Recommenders and Advertising

In this article, I wanted to share about a trend that’s occurred over the past few years of using the word2vec model on not just natural language tasks, but on recommender systems as well.

The key principle behind word2vec is the notion that the meaning of a word can be inferred from it’s context--what words tend to be around it. To abstract that a bit, text is really just a sequence of words, and the meaning of a word can be extracted from what words tend to be just before and just after it in the sequence.

What researchers and companies are finding is that the time series of online user activity offers the same opportunity for inferring meaning from context. That is, as a user browses around and interacts with different content, the abstract qualities of a piece of content can be inferred from what content the user interacts with before and after. This allows ML teams to apply word vector models to learn good vector representations for products, content, and ads.

### So What?

word2vec (and other word vector models) have revolutionized Natural Language Processing by providing much better vector representations for words than past approaches. In the same way that word embeddings revolutionized NLP, item embeddings are revolutionizing recommendations.

User activity around an item encodes many abstract qualities of that item which are difficult to capture by more direct means. For instance, how do you encode qualities like "architecture, style and feel" of an Airbnb listing?

The word2vec approach has proven successful in extracting these hidden insights, and being able to compare, search, and categorize items on these abstract dimensions opens up a lot of opportunities for smarter, better recommendations. Commercially, Yahoo saw a 9% lift in CTR when applying this technique to their advertisements, and AirBNB saw a 21% lift in CTR on their Similar Listing carousel, a product that drives 99% of bookings along with search ranking.

### Four Production Examples

Every e-commerce, social media, or content sharing site has this kind of user activity, so this approach appears to have very wide applicability. In the following sections I’ll summarize four publicized use cases of this technique which have been put into production:

• Music recommendations at Spotify and Anghami
• Listing recommendations at Airbnb
• Product recommendations in Yahoo Mail

### Music Recommendations

Erik Bernhardsson, who developed the approximate nearest neighbors library "Annoy" while at Spotify, has mentioned (in a meet up talk here and in a blog post here) that Spotify experimented with this technique while he was there. The most informative discussion I’ve read of this technique, however, comes from Ramzi Karam at Anghami, a popular music streaming service in the Middle East.

In his well illustrated article, Ramzi explains how a user’s listening queue can be used to learn song vectors in the same way that a string of words in text is used to learn word vectors. The assumption is that users will tend to listen to similar tracks in sequence.

Ramzi says Anghami streams over 700 million songs a month. Each song listen is like a single word in a text dataset, so Anghami’s listening data is comparable in scale to the billion-word Google News dataset that Google used to train their published word2vec model.

Ramzi also shares some helpful insights into how these song vectors are then used. One use is to create a kind of "music taste" vector for a user by averaging together the vectors for songs that a user likes to listen to. This taste vector can then become the query for a similarity search to find songs which are similar to the user’s taste vector. The article includes a nice animation to illustrate how this works.

### Listing Recommendations at Airbnb

Another highly informative article on this subject comes from Mihajlo Grbovic at Airbnb (you can follow him on Twitter

Imagine you are looking for an apartment to rent for your vacation to Paris. As you browse through the available homes, it’s likely that you will investigate a number of listings which fit your preferences and are comparable in features like amenities and design taste.

Here the user activity data takes the form of click data, specifically the sequence of listings that a user viewed. Airbnb is able to learn vector representations of their listings by applying the word2vec approach to this data.

Airbnb made a few interesting adaptations to the general approach in order to apply it to their site.

An important piece of the word2vec training algorithm is that for each word that we train on, we select a random handful of words (which are not in the context of that word) to use as negative samples. Airbnb found that in order to learn vectors which could distinguish listings within Paris (and not just distinguish Paris listings from New York listings), it was important that the negative samples be drawn from within Paris.

Grbovic also explains their solution to the cold start problem (how to learn vectors for new listings for which there isn’t user activity data)--they simply averaged the vectors of the geographically closest three listings to the new listing to create an initial vector.

These listing vectors are used to identify homes to show in the "similar listings" panel on their site, which Grbovic says is an important driver of bookings on their site.

This was not the first time Grbovic has applied the word2vec model to recommendations--he worked on the next two examples as well while at Yahoo Labs.

### Product Recommendations in Yahoo Mail

Yahoo published a research paper detailing how they use purchase receipts in a user’s email inbox to form a purchase activity, allowing them in turn to learn product feature vectors which can be used to make product recommendations. Like the other applications presented here, it's worth noting that this approach has gone into an actual production system.

Since online shoppers receive e-mail receipts for their purchases, mail clients are in a unique position to see user purchasing activity across many different e-commerce sites. Even without this advantage, though, Yahoo’s approach seems applicable and potentially valuable to online retailers as well.

Here, the user activity is their sequence of product purchases extracted from the e-mail receipts. The word2vec approach is able to learn embeddings for the products based on the assumption that shoppers often buy related items in sequence. For instance, perhaps they are purchasing a number of items around the same time because they go together (like a camera and a lens), or because they are all part of a project they are working on, or because they are preparing for a trip they are taking. A user's sequence of purchases may also reflect shopper taste--shoppers who like item A also tend to like item B and are interested in buying both.

This may seem a little strange given that the products we purchase aren’t always related. But remember that the model is governed by the statistics of user behavior--if two products are related, they are going to be purchased together by many users (thereby occurring together more in the training data, and having more influence on the model), while any two unrelated products will be purchased together much less often and have little influence on the model.

Yahoo augmented the word2vec approach with a few notable innovations. The most interesting to me was their use of clustering to promote diversity in their recommendations. After learning vectors for all the products in their database, they clustered these vectors into groups. When making recommendations for a user based on a product the user just purchased, they don’t recommend products from within the same cluster. Instead, they identify which other clusters users most often purchase from after purchasing from the current cluster, and they recommend products from those other clusters instead.

After purchasing a product from the Nerf cluster, customers are most likely to purchase a product from either the Xbox, Pokemon, or LEGO clusters. (Note that the paper doesn’t provide example cluster contents, so the ones above are entirely fabricated).

I imagine this clustering technique helps eliminate the problem of recommending the equivalent of product "synonyms". That is, imagine if the system only ever recommended different versions of the same item you just bought--"You just bought Energizer AA batteries? Check out these Duracell AA batteries!" Not very helpful.

To choose specific products in those clusters to recommend, in each cluster they find the ‘k’ most similar products to the purchased product, and recommend all of these to the user. In their experiments, they identified 20 products to recommend to the user per day. Each time a user interacts with the Yahoo Mail client, the client displays another one of the product recommendations in the form of another ad.

The paper also describes a technique called "bagging" where they assign additional weight to products which appear on the same purchase receipt.

### Matching Ads to Search Queries

In this application (published here), the goal is to learn vector representations for search queries and for advertisements in the same "embedding space", so that a given search query can be matched against available advertisements in order to find the most relevant ads to show the user.

This is a particularly fascinating application given the number of daunting challenges this problem poses (and yet they solved them and put this into production!). How do you learn vectors for queries and ads at the same time? How could you possibly learn a vector for search queries given the seemingly infinite number of possible queries out there? And how do you apply this when new queries and new ads are appearing daily?

The solution to learning vectors for queries and ads at the same time is fairly straightforward. The training data consists of user "search sessions" which consist of search queries entered, advertisements clicked, and search result links clicked. The sequence of these user actions is treated like the words in a sentence, and vectors are learned for each of them based on their context--the actions that tend to occur around them. If users often click a particular ad after entering a particular query, then we’ll learn similar vectors for the ad and the query. All three types of actions (searches entered, ads clicked, links clicked) are part of a single "vocabulary" for the model, such that the model doesn’t really distinguish these from one another.

Even though this application is focused on matching search queries and advertisements, the addition of the link clicks provides additional context to inform the other embeddings. (Though it’s not the subject of their research paper, you have to wonder whether they might use these link embeddings to help in ranking search results...)

What about the scale of this problem? Surely there are too many unique queries, links, and ads for this to be feasible? This is at least partially addressed by only keeping items in the vocabulary which occur more than 10 times in the training data; they imply that this filters out a large portion of unique searches. Still, the remaining vocabulary is massive--their training set consisted of "126.2 million unique queries, 42.9 million unique ads, and 131.7 million unique links", a total of 300.8 million items.

They point out that, for 300-dimensional embeddings, a vocabulary of 200 million items would require 480GB of RAM, so training this model on a single server is out of the question. Instead, they detail a distributed approach that made this training feasible.

So they’ve tackled the scale problem, but what about the severe cold start problem caused by the daily addition of new search queries and ads? I found the details of their methods here difficult to follow, but the general strategy appears to be matching words in the new material to words in old material, and leveraging those existing embeddings to create an initial embedding for the new content.

This method was placed into production, but apparently not all advertisements are chosen this way--they say that their approach is "currently serving more than 30% of all broad matches", where "broad matches" refers to attempts to match queries to ads based on the intent of the queries rather than on simple keyword matching.

### Similarity Search

This approach of learning content vectors based on user activity data is an exciting development for Nearist, since the most common operation with these vectors is to perform a similarity search to make content recommendations.

Given the large amount of vectorized content and the large number of users for which recommendations need to be made, this similarity search becomes a challenging engineering problem. Companies like Spotify and Facebook have invested a great deal of effort into developing approximation techniques to make this search feasible at scale, but with a cost to accuracy and clearly a large cost in engineering effort. And when the quality of product recommendations directly affects revenue, accuracy matters!

At Nearist, we believe the efforts of machine learning teams are better spent on improving the quality of your vector representations than on solving similarity search at scale, and we’ve made it our goal to provide as much search capability as you need with minimal engineering effort.

## k-NN Benchmarks Part I – Wikipedia

This article is the first in a series comparing different available methods for accelerating large-scale k-Nearest Neighbor searches on high-dimensional vectors (i.e., 100 components or more). The emphasis here is on practicality versus novelty–that is, we’re focusing on solutions which are readily available and can be used in production applications with minimal engineering effort.

At Nearist, we have developed a product specifically for accelerating large-scale k-NN search, and have included its performance in these benchmarks to demonstrate it’s strengths. That said, we’ve tried to give each of the platforms a thorough evaluation, and hopefully you will find our results and experiences informative no matter what platform you choose.

This post currently contains the preliminary results of our experiments, and will be updated in the coming months with more data.

## The Benchmark

### Concept Search on Wikipedia

Each of our benchmark articles will look at performance on a different k-NN based task. In this article, we will be doing a k-NN search on roughly 4.2 million topic model vectors generated from all of the articles in English Wikipedia.

This is an example of a technique known as concept search–you take a document, generate a feature vector from it, and then compare it against the feature vectors for a repository of documents (in this case, Wikipedia articles) to find “conceptually similar” documents.

For example, if we use the Wikipedia article for Personal computer as our query for a k-NN search, we get back the results Homebuilt computerDesktop computer, and Macintosh 128k (the original Macintosh).

Parsing all of Wikipedia and generating feature vectors for the text is a big project in its own right. Fortunately, Radim Řehůřek has already done exactly this and shared his implementation in the gensim package in Python. I previously shared a post on our experience with this, and also shared a modified and commented version of the code.

The ‘gensim’ package supports multiple topic models for converting text into vectors. For these benchmarks, we chose to use Latent Dirichlet Analysis (LDA) with 128 topics (that is, 128 features in the feature vectors).

### Comparison criteria

Over this series of benchmarks, there are multiple aspects to k-NN performance that we are interested in and will evaluate:

• Latency for a single search
• Throughput for a batch of searches
• Memory load time for the GPU and Nearist hardware.
• Accuracy (as impacted by approximation techniques)
• Engineering effort

This first benchmarking post will focus on just the first two–latency and throughput.

## The Contenders

We evaluated three different k-NN implementations:

• k-NN in scikit-learn on a desktop PC (as a baseline)
• GPU acceleration with a Tesla K80
• A prototype version of Nearist’s Vector Search Accelerator (VSX), which uses a custom hardware architecture.

### Desktop CPU

The first platform was simply a brute-force search using the k-NN implementation in scikit-learn in Python, running on a desktop PC, which has a 4th generation Intel i7 4770 (from around 2013) with 16GB of RAM. This was intended merely to serve as a baseline for comparison of the other two platforms.

### Nvidia Tesla K80 GPU

#### knn-cuda library

There don’t appear to be many publicly available kNN GPU implementations; of those, knn-cuda appears to be the most established option.

Like the Nearist VSX, knn-cuda uses a brute-force method–that is, it calculates and sorts all of the distance values. It accelerates k-NN by leveraging the GPU’s ability to greatly accelerate the matrix-multiplication step at the center of the distance calculations. It also accelerates the sorting of the results.

A note on engineering effort with knn-cuda: While we were able to perform our experiments with the knn-cuda library as-is, the authors note that it “was written in the context of a Ph.D. thesis” and that it “should be adapted to the problem and to the specifications of the GPU used.” That is to say, knn-cuda doesn’t necessarily provide an off-the-shelf implementation of k-NN that is ready for a production application–some further work by a CUDA engineer is required to make best use of the GPU.

#### Amazon P2 instance

Amazon started offering the P2 instances containing Tesla K80s at the end of 2016. GPU users will note that the K80 is now several generations old, and newer Tesla cards would achieve higher performance in our benchmark. However, we wanted to choose the platform with the lowest barrier to entry, and Amazon instances are trusted and widely used in production applications. Amazon instances have the added benefit of scalability–the ability to increase or decrease the number of GPUs based on demand.

The Tesla K80 actually includes 2 GPU chips (two NVIDIA GK210s) on a single board. CUDA does not automatically divide the computation across the two GPUs for you, however. In order to leverage both GPUs on the K80, you must divide your task across them and then consolidate the results. For this article, we have chosen to simply use a single GK210. In future articles, we plan to publish results using multiple GK210s and even multiple K80 cards.

### Nearist Vector Search Accelerator (VSX)

The VSX Orion is a rack-mounted server appliance that fits in a 4U space. Inside the server are PCIe cards which contain Nearist’s custom hardware accelerators.

The Orion hardware architecture takes advantage of the easy parallelism in k-NN and divides the search task across hundreds of Vector Comparison Units (VCUs) contained on the PCIe cards in the server. Each VCU operates on a different chunk of the dataset, and then results are aggregated and consolidated by a CPU.

The VSX appliance is still in development, and today we have a much smaller scale prototype. Our prototype consists of a single PCIe card with only about 1/10th of the VCUs that the production version will have. Even this limited prototype outperformed both the CPU and GPU in our benchmark, however.

## Benchmark Results

### Latency Measurements

The following table shows the time taken (in seconds) to perform a k-NN search against the 4.2 million wikipedia articles with only a single query vector. The measurements are all averaged over 5 runs.

Here again is a summary of the platforms:

• Scikit-learn - Python library running on an Intel i7 4770
• GPU - A single GK210 GPU in a Tesla K80
• Nearist - A 1/10th scale prototype Nearist board

These results are preliminary, and we will update them in the coming months. Note that in these current results we are only using half of the compute power on the K80 (since we are only using one GK210 chip), and that the prototype Nearist board only has about 1/10th of the compute power of the production version.

Latency in seconds

 scikit-learn GPU Nearist K = 1 0.60 3.87 0.17 K = 10 1.23 3.80 0.17 K = 100 1.85 3.86 0.17

Notice that the GPU is actually slower than the CPU when only presented with one query at a time. One reason is because GPUs are more efficient at performing matrix-matrix multiplication (multiple query vectors against the dataset) than they are at performing vector-matrix multiplication (a single query vector against the dataset).

Also notice how the GPU implementation is relatively unaffected by the value of k, whereas scikit-learn performance decreases with larger values of k. The scikit-learn behavior suggests that they might be optimizing the sorting technique based on the value of k.

The Nearist hardware also is relatively unaffected by the value of k; this is due to efficient parallelization of the sorting process in the Nearist architecture. Increasing k does increase the Nearist latency, but the effect is only on the order of a few milliseconds, and so it isn’t visible here.

### Throughput Measurements

The next table shows the average throughput (in queries per second) achieved by each of the different platforms when presented with an input batch of 100 query vectors.

The throughput tends to be higher than just 1 / latency. This is because the PC, GPU, and Nearist hardware all have the ability to process multiple queries in parallel to varying degrees. The throughput is reported as the average number of queries per second (calculated as 100 queries / total processing time).

The measurements are all averaged over 5 runs.

Throughput in queries / second

 scikit-learn GPU Nearist K = 1 26.15 25.74 94.12 K = 10 10.51 26.32 94.12 K = 100 6.61 20.40 94.12

The throughput numbers show the same behaviors surrounding the value of k, with the CPU performance being the most impacted by the value of k.

This is also where we start to see the GPU outperforming the CPU, when it is able to operate on more query vectors simultaneously.

## Conclusions

The Nearist prototype board only has 1/10th of the performance of the production version, but already shows a more than 10x improvement over the CPU in throughput and latency for k-NN with k=100. Against the GPU, the Nearist prototype board has more than 20x better latency, and more than 4x greater throughput at k=100.

In the coming months, we will update this post with GPU measurements using both chips on the Tesla K80 card (which should give a roughly 2x improvement), and we’ll also update the Nearist results using the full-performance production card (which should give roughly a 10x improvement).

In our next benchmarking post, we will look at a classification task which will allow us to compare accuracy in addition to the throughput and latency. We will also add results using the Annoy library in Python, which uses approximate nearest neighbor techniques.

## Concept Search on Wikipedia

I recently created a project on GitHub called wiki-sim-search where I used gensim to perform concept searches on English Wikipedia.

gensim includes a script, make_wikicorpus.py, which converts all of Wikipedia into vectors. They’ve also got a nice tutorial on using it here.

I started from this gensim script and modified it heavily to comment and organize it, and achieve some more insight into each of the steps. I also included a few additional steps, like training and applying the LSI model, and performing similarity searches by article title.

## What it takes

Building a corpus from Wikipedia is a pretty lengthy process–about 12 hours from Wikipedia dump to a set of LSI vectors. It also uses a lot of hard disk space, as you can imagine.

Here’s a breakdown of the steps involved in my version of make_wikicorpus.py, along with the files generated and their sizes.

These numbers are from running on my desktop PC, which has an Intel Core i7 4770, 16GB of RAM, and an SSD.

 # Step Time (h:m) Output File File Size 0 Download Wikipedia Dump -- enwiki-latest-pages-articles.xml.bz2 12.6 GB 1 Parse Wikipedia & Build Dictionary 3:12 dictionary.txt.bz2 769 KB 2 Convert articles to bag-of-words vectors 3:32 bow.mm 9.44 GB 2a. Store article titles -- bow.mm.metadata.cpickle 152 MB 3 Learn tf-idf model from document statistics 0:47 tfidf.tfidf_model 4.01 MB 4 Convert articles to tf-idf 1:40 corpus_tfidf.mm 17.9 GB 5 Learn LSI model with 300 topics 2:07 lsi.lsi_model 3.46 MB lsi.lsi_model.projection 3 KB lsi.lsi_model.projection.u.npy 228 MB 6 Convert articles to LSI 0:58 lsi_index.mm 1 KB lsi_index.mm.index.npy 4.69 GB TOTALS 12:16 45 GB

The original gensim script stops after step 4.

Notice how it parses Wikipedia twice–in steps 1 and steps 2. The alternative would be that, on the first pass, you write out the extracted tokens to another file (there’s no way you could keep them all in memory). If you want to save a little bit of time, I included my compiled dictionary in the project, so that you can skip over step 1.

## Insights into Wikipedia

It’s fun to look at the statistics on Wikipedia.

These statistics are from the Wikipedia dump that I pulled down on 1/18/17.

 17,180,273 Total number of articles (without any filtering) 4,198,780 Number of articles after filtering out "article redirects" and "short stubs" 2,355,066,808 Total number of tokens in all articles (without any filtering) 2,292,505,314 Total number of tokens after filtering articles 8,746,676 Total number of unique words found in all articles (*after* filtering articles)

In summary:

• There are ~4.2M Wikipedia articles with real content.
• There are ~2.3B words across all of these articles, which means the average article length is 762 words.
• There are 8.75M unique words in Wikipedia.

Once you have the LSI vectors, you can search wikipedia to find the most similar articles to a specified article.

As a fun example, I searched for the top 10 articles conceptually similar to Topic model.

The results look pretty reasonable to me:

It’s interesting, I didn’t know about most of these related articles. I had never heard of “Online content analysis”, and wouldn’t have thought to look at “Semantic similarity” in researching topic modeling. A concept search like this seems pretty helpful for researching topics.

## gensim Web App

The gensim guys turned this concept search of Wikipedia functionality into a cool little web app here.

It uses an approximate nearest neighbor library called Annoy to deliver fast results.

If you search by the article ‘Topic model’ using their web app, the results don’t seem nearly as good:

## Top words per topic

It’s interesting to look at the top words per topic. You can see the top 10 words for each of the 300 learned topics here.

Some of them make sense:

• Topic #37: democratic, republican, trump, hillary, racing, airport, pt, huckabee, obama, clinton,
• Topic #51: ef, tornado, tropical, airport, church, damage, utc, storm, url, fc,
• Topic #114: forests, stm, broadleaf, shrublands, estero, subtropical, palearctic, grasslands, moist, utc,

But I’d say most of the topics are pretty confusing. There are a lot of words that show up which don’t seem like they should be nearly so important. For example:

• ‘saleen’ (a car manufacturer) appears in 15 topics
• ‘astragalus’ (a species of plant) appears in 7
• ‘southampton’ (a city in England) appears in 15 topics

There are also a lot of short words that seem like they may be HTML or stlying tags that are somehow not getting parsed out. For example, ‘http’, ‘ft’, ‘bgcolor’, ‘alt’.

I’m hoping to dig into the parsing code a bit and see if I can’t improve the results.

## Memory Concerns

Performing similarity searches on Wikipedia gets interesting because of the matrix size.

I used LSI with 300 topics, so the matrix is:

That’s a lot of RAM!

For my implementation, I used gensim.similarities.MatrixSimilarity, which loads the entire set of LSI vectors into memory.

However, gensim also includes a class gensim.similarities.Similarity that allows you to “shard” the matrix so that you can keep it on disk and process it in chunks.

It would be interesting to try this and compare search performance.

## Getting Started with mlpack

I’ve recently needed to perform a benchmarking experiment with k-NN in C++, so I found mlpack as what appears to be a popular and high-performance machine learning library in C++.

I’m not a very strong Linux user (though I’m working on it!), so I actually had a lot of trouble getting up and going with mlpack, despite their documentation.

In this guide, I’ll cover the steps needed to get up and running, but also offer some explanation of what’s going in each. So if you’re already an expert at working with C++ libraries in Linux, you may find this post pretty boring :).

I’m currently running Ubuntu 16.04, so some of this may be Ubuntu-specific.

The Ubuntu package manager helps you get mlpack installed as well as any dependencies (and it appears that mlpack has a lot of dependencies on, e.g., other linear algebra libraries).

Note that the package manager is different from the “Ubuntu Software Center”. The software center is more like an app-store, and you won’t find mlpack there.

The name of the package is ‘libmlpack-dev’. This is going to install the mlpack libraries and header files for you–it does not include the source code for mlpack, which you don’t need if you’re just planning to reference the libraries. It also does not include any source examples. They provide a couple code examples as texton their website; to run these you would create your own .cpp file and paste in the code (but you also need to supply your own dataset! 0_o). More on example code later.

I found the package name a little confusing (why isn’t it just “mlpack”?), so here are some clarifications on the “lib” and “-dev” parts of the package name:

• Dynamically-linked libraries like mlpack all have ‘lib’ prepended to their package name (like liblapack, libblas, etc.).
• The Dynamic Linker in Linux (called “ld”) requires dynamically-linked libraries to have the form lib.so (Reference).
• “.so” stands for “Shared Object”, and it’s analogous to DLLs on Windows.
• The “-dev” suffix on the package name is a convention that indicates that this package contains libraries and header files that you can compile against, as opposed to just executable binaries. (Reference)

Another thing that confused me–how would you know the name of the package you want to install if all you know is that the project is called “mlpack”?

This page provides a nice tutorial (with a lot of detail) about how you can find packages and learn about them. Here’s the command that I would have found most helpful, though: apt-cache search 'mlpack'. Those single quotes around mlpack are actually wildcards–they allow it to match any package with mlpack anywhere in the name.

Here’s what each of those packages provides.

• libmlpack-dev - If you are going to write code that references the mlpack libraries, this is what you need.
• libmlpack2 - If you’re not programming with mlpack, but you’re using an application that uses the mlpack libraries, then you’d just need this package with the “runtime library” (the dynamically-linked library).
• mlpack-bin - The mlpack project actually includes command line tool versions of some of the machine learning algorithms it implements. So, for example, you could run k-means clustering on a dataset from the command line without doing any programming. This package contains those binaries.
• mlpack-doc - Documentation for the libraries.

So to write our own code using the mlpack libraries, we just need libmlpack-dev. Grab it with the APT (Advanced Packaging Tool) package manager with the following command:

This will install mlpack and all of the libraries it depends on. Except one, apparently–you’ll also need to install Boost:

Maybe Boost was left out of the dependency list because it’s so commonly used? I don’t know.

# Install location

Something that left me pretty confused from the installation was that I had no idea where mlpack was installed to. (Mainly, I wanted to know this because I assumed it would have installed some example source files for me somewhere, but I learned later that it doesn’t include any.)

To list out all of the files installed by mlpack, use this command:

There are some default locations for libraries in Linux, and that’s where you’ll find mlpack:

• It installs lots of headers under /usr/include/mlpack/.
• It installs the library file to /usr/lib/x86_64-linux-gnu/libmlpack.so

These default locations are already part of the path for gcc / g++, so you’re all set to #include the mlpack header files in your code!

What's "/usr/"? Is that like my User directory on Windows? (Answer: No.) Linux usually starts you out in your ‘home’ directory, e.g. /home/chrismcc/. This is where you find your personal files (documents, desktop, pictures, etc.). You can also refer to your home directory by just tilde ‘~’. This used to confuse me because I assumed ~ was the symbol for the root of the file system--it’s not! just ‘/’ is the root directory. /usr/ is a directory under root where installed software lives.

# Compiling and Running an Example

As a first example, we’ll use the sample code from the mlpack site for doing a nearest neighbor search.

This very simple example takes a dataset of vectors and finds the nearest neighbor for each data point. It uses the dataset both as the reference and the query vectors.

I’ve taken their original example and added some detailed comments to explain what’s going on.

Relevant Documentation:

Save the following source code in a file called knn_example.cpp:

And save this toy dataset as data.csv:

To compile the example, you’ll use g++ (the C++ equivalent of gcc).

• knn_example.cpp - The code to compile.
• -o knn_example - The binary (executable) to output.
• -std=c++11 - mlpack documentation says you need to set this flag.
• -larmadillo -lmlpack -lboost_serialization - The “-l” flag tells the linker to look for these libraries.
• armadillo is a linear algebra library that mlpack uses.

Finally, to run the example, execute the binary:

Don't leave out the "./"! In Windows, you can just type the name of an executable in the current directory and hit enter and it will run. In Linux, if you want to do the same you need to prepend the "./".

And you should see the following output:

Nearest neighbor of point 0 is point 7 and the distance is 1.
Nearest neighbor of point 1 is point 2 and the distance is 0.
Nearest neighbor of point 2 is point 1 and the distance is 0.
Nearest neighbor of point 3 is point 10 and the distance is 1.
Nearest neighbor of point 4 is point 11 and the distance is 1.
Nearest neighbor of point 5 is point 12 and the distance is 1.
Nearest neighbor of point 6 is point 12 and the distance is 1.
Nearest neighbor of point 7 is point 10 and the distance is 1.
Nearest neighbor of point 8 is point 9 and the distance is 0.
Nearest neighbor of point 9 is point 8 and the distance is 0.
Nearest neighbor of point 10 is point 9 and the distance is 1.
Nearest neighbor of point 11 is point 4 and the distance is 1.
Nearest neighbor of point 12 is point 9 and the distance is 1.


You’re up and running!

## Word2Vec Tutorial Part 2 – Negative Sampling

In part 2 of the word2vec tutorial (here’s part 1), I’ll cover a few additional modifications to the basic skip-gram model which are important for actually making it feasible to train.

When you read the tutorial on the skip-gram model for Word2Vec, you may have noticed something–it’s a huge neural network!

In the example I gave, we had word vectors with 300 components, and a vocabulary of 10,000 words. Recall that the neural network had two weight matrices–a hidden layer and output layer. Both of these layers would have a weight matrix with 300 x 10,000 = 3 million weights each!

Running gradient descent on a neural network that large is going to be slow. And to make matters worse, you need a huge amount of training data in order to tune that many weights and avoid over-fitting. millions of weights times billions of training samples means that training this model is going to be a beast.

The authors of Word2Vec addressed these issues in their second paper.

There are three innovations in this second paper:

1. Treating common word pairs or phrases as single “words” in their model.
2. Subsampling frequent words to decrease the number of training examples.
3. Modifying the optimization objective with a technique they called “Negative Sampling”, which causes each training sample to update only a small percentage of the model’s weights.

It’s worth noting that subsampling frequent words and applying Negative Sampling not only reduced the compute burden of the training process, but also improved the quality of their resulting word vectors as well.

# Word Pairs and “Phrases”

The authors pointed out that a word pair like “Boston Globe” (a newspaper) has a much different meaning than the individual words “Boston” and “Globe”. So it makes sense to treat “Boston Globe”, wherever it occurs in the text, as a single word with its own word vector representation.

You can see the results in their published model, which was trained on 100 billion words from a Google News dataset. The addition of phrases to the model swelled the vocabulary size to 3 million words!

If you’re interested in their resulting vocabulary, I poked around it a bit and published a post on it here. You can also just browse their vocabulary here.

Phrase detection is covered in the “Learning Phrases” section of their paper. They shared their implementation in word2phrase.c–I’ve shared a commented (but otherwise unaltered) copy of this code here.

I don’t think their phrase detection approach is a key contribution of their paper, but I’ll share a little about it anyway since it’s pretty straightforward.

Each pass of their tool only looks at combinations of 2 words, but you can run it multiple times to get longer phrases. So, the first pass will pick up the phrase “New_York”, and then running it again will pick up “New_York_City” as a combination of “New_York” and “City”.

The tool counts the number of times each combination of two words appears in the training text, and then these counts are used in an equation to determine which word combinations to turn into phrases. The equation is designed to make phrases out of words which occur together often relative to the number of individual occurrences. It also favors phrases made of infrequent words in order to avoid making phrases out of common words like “and the” or “this is”.

You can see more details about their equation in my code comments here.

One thought I had for an alternate phrase recognition strategy would be to use the titles of all Wikipedia articles as your vocabulary.

# Subsampling Frequent Words

In part 1 of this tutorial, I showed how training samples were created from the source text, but I’ll repeat it here. The below example shows some of the training samples (word pairs) we would take from the sentence “The quick brown fox jumps over the lazy dog.” I’ve used a small window size of 2 just for the example. The word highlighted in blue is the input word.

There are two “problems” with common words like “the”:

1. When looking at word pairs, (“fox”, “the”) doesn’t tell us much about the meaning of “fox”. “the” appears in the context of pretty much every word.
2. We will have many more samples of (“the”, …) than we need to learn a good vector for “the”.

Word2Vec implements a “subsampling” scheme to address this. For each word we encounter in our training text, there is a chance that we will effectively delete it from the text. The probability that we cut the word is related to the word’s frequency.

If we have a window size of 10, and we remove a specific instance of “the” from our text:

1. As we train on the remaining words, “the” will not appear in any of their context windows.
2. We’ll have 10 fewer training samples where “the” is the input word.

Note how these two effects help address the two problems stated above.

### Sampling rate

The word2vec C code implements an equation for calculating a probability with which to keep a given word in the vocabulary.

wiwi is the word, z(wi)z(wi) is the fraction of the total words in the corpus that are that word. For example, if the word “peanut” occurs 1,000 times in a 1 billion word corpus, then z(‘peanut’) = 1E-6.

There is also a parameter in the code named ‘sample’ which controls how much subsampling occurs, and the default value is 0.001. Smaller values of ‘sample’ mean words are less likely to be kept.

P(wi)P(wi) is the probability of keeping the word:

P(wi)=(z(wi)0.001‾‾‾‾‾‾√+1)0.001z(wi)P(wi)=(z(wi)0.001+1)⋅0.001z(wi)

You can plot this quickly in Google to see the shape.

No single word should be a very large percentage of the corpus, so we want to look at pretty small values on the x-axis.

Here are some interesting points in this function (again this is using the default sample value of 0.001).

• P(wi)=1.0P(wi)=1.0 (100% chance of being kept) when z(wi)<=0.0026z(wi)<=0.0026.
• This means that only words which represent more than 0.26% of the total words will be subsampled.
• P(wi)=0.5P(wi)=0.5 (50% chance of being kept) when z(wi)=0.00746z(wi)=0.00746.
• P(wi)=0.033P(wi)=0.033 (3.3% chance of being kept) when z(wi)=1.0z(wi)=1.0.
• That is, if the corpus consisted entirely of word wiwi, which of course is ridiculous.
You may notice that the paper defines this function a little differently than what's implemented in the C code, but I figure the C implementation is the more authoritative version.

# Negative Sampling

Training a neural network means taking a training example and adjusting all of the neuron weights slightly so that it predicts that training sample more accurately. In other words, each training sample will tweak all of the weights in the neural network.

As we discussed above, the size of our word vocabulary means that our skip-gram neural network has a tremendous number of weights, all of which would be updated slightly by every one of our billions of training samples!

Negative sampling addresses this by having each training sample only modify a small percentage of the weights, rather than all of them. Here’s how it works.

When training the network on the word pair (“fox”, “quick”), recall that the “label” or “correct output” of the network is a one-hot vector. That is, for the output neuron corresponding to “quick” to output a 1, and for all of the other thousands of output neurons to output a 0.

With negative sampling, we are instead going to randomly select just a small number of “negative” words (let’s say 5) to update the weights for. (In this context, a “negative” word is one for which we want the network to output a 0 for). We will also still update the weights for our “positive” word (which is the word “quick” in our current example).

The paper says that selecting 5-20 words works well for smaller datasets, and you can get away with only 2-5 words for large datasets.

Recall that the output layer of our model has a weight matrix that’s 300 x 10,000. So we will just be updating the weights for our positive word (“quick”), plus the weights for 5 other words that we want to output 0. That’s a total of 6 output neurons, and 1,800 weight values total. That’s only 0.06% of the 3M weights in the output layer!

In the hidden layer, only the weights for the input word are updated (this is true whether you’re using Negative Sampling or not).

### Selecting Negative Samples

The “negative samples” (that is, the 5 output words that we’ll train to output 0) are chosen using a “unigram distribution”.

Essentially, the probability for selecting a word as a negative sample is related to its frequency, with more frequent words being more likely to be selected as negative samples.

In the word2vec C implementation, you can see the equation for this probability. Each word is given a weight equal to it’s frequency (word count) raised to the 3/4 power. The probability for a selecting a word is just it’s weight divided by the sum of weights for all words.

P(wi)=f(wi)3/4nj=0(f(wj)3/4)P(wi)=f(wi)3/4∑j=0n(f(wj)3/4)

The decision to raise the frequency to the 3/4 power appears to be empirical; in their paper they say it outperformed other functions. You can look at the shape of the function–just type this into Google: “plot y = x^(3/4) and y = x” and then zoom in on the range x = [0, 1]. It has a slight curve that increases the value a little.

The way this selection is implemented in the C code is interesting. They have a large array with 100M elements (which they refer to as the unigram table). They fill this table with the index of each word in the vocabulary multiple times, and the number of times a word’s index appears in the table is given by P(wi)P(wi) * table_size. Then, to actually select a negative sample, you just generate a random integer between 0 and 100M, and use the word at that index in the table. Since the higher probability words occur more times in the table, you’re more likely to pick those.

## Other Resources

For the most detailed and accurate explanation of word2vec, you should check out the C code. I’ve published an extensively commented (but otherwise unaltered) version of the code here.

I’ve also created a post with links to and descriptions of other word2vec tutorials, papers, and implementations.

### Cite

McCormick, C. (2017, January 11). Word2Vec Tutorial Part 2 - Negative Sampling. Retrieved from http://www.mccormickml.com