k-NN Benchmarks Part I – Wikipedia

Posted on October 11th, 2017 in Uncategorized

This article is the first in a series comparing available methods for accelerating large-scale k-Nearest Neighbor searches on high-dimensional vectors (i.e., 100 components or more). The emphasis here is on practicality over novelty: we’re focusing on solutions that are readily available and can be used in production applications with minimal engineering effort. At Nearist, […]


Concept Search on Wikipedia

Posted on February 22nd, 2017 in Uncategorized

I recently created a project on GitHub called wiki-sim-search, where I used gensim to perform concept searches on English Wikipedia. gensim includes a script, make_wikicorpus.py, which converts all of Wikipedia into vectors, and it also has a nice tutorial on using it. I started from this gensim script and modified it heavily to comment and organize it, and achieve some […]


Getting Started with mlpack

Posted on February 12th, 2017 in Uncategorized

I recently needed to run a benchmarking experiment with k-NN in C++, and I found mlpack, which appears to be a popular, high-performance machine learning library for C++. I’m not a very strong Linux user (though I’m working on it!), so I actually had a lot of trouble getting up and running with mlpack, despite […]


Word2Vec Tutorial Part 2 – Negative Sampling

Posted on January 12th, 2017 in Uncategorized

In part 2 of the word2vec tutorial (here’s part 1), I’ll cover a few additional modifications to the basic skip-gram model which are important for actually making it feasible to train. When you read the tutorial on the skip-gram model for Word2Vec, you may have noticed something: it’s a huge neural network! In the example I gave, […]


DBSCAN Clustering

Posted on November 8th, 2016 in Uncategorized

DBSCAN is a popular clustering algorithm which is fundamentally very different from k-means. In k-means clustering, each cluster is represented by a centroid, and points are assigned to whichever centroid they are closest to. In DBSCAN, there are no centroids, and clusters are formed by linking nearby points to one another. k-means requires specifying the […]
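To make the contrast concrete, here is a minimal sketch in plain Python. It is not code from the full post: the function names `assign_to_centroids` and `dbscan`, the toy points, and the chosen values of `eps` and `min_pts` are all assumptions made for illustration. The first function does a k-means-style assignment to the nearest centroid; the second grows clusters by linking each point to its neighbors within distance `eps`, and marks points with too few neighbors as noise.

```python
import math

def assign_to_centroids(points, centroids):
    """k-means-style step: label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]

def dbscan(points, eps, min_pts):
    """Simple O(n^2) DBSCAN sketch. Returns one label per point; -1 means noise."""
    NOISE = -1
    labels = [None] * len(points)  # None = not yet visited

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but does not expand
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nearby = neighbors(j)
            if len(nearby) >= min_pts:  # core point: keep linking outward
                queue.extend(nearby)
    return labels

# Toy data (made up): two tight groups and one far-away outlier.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]

kmeans_labels = assign_to_centroids(points, centroids=[(0, 0), (10, 10)])
dbscan_labels = dbscan(points, eps=1.5, min_pts=3)
print(kmeans_labels)
print(dbscan_labels)
```

On this toy data the centroid-based assignment has to put the outlier at (50, 50) into one of the two clusters, while the DBSCAN sketch leaves it unassigned as noise.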
