DBSCAN is a popular clustering algorithm which is fundamentally very different from k-means.
In k-means clustering, each cluster is represented by a centroid, and points are assigned to whichever centroid they are closest to. In DBSCAN, there are no centroids, and clusters are formed by linking nearby points to one another.
k-means requires specifying the number of clusters, ‘k’. DBSCAN does not, but does require specifying two parameters which influence the decision of whether two nearby points should be linked into the same cluster. These two parameters are a distance threshold, εε (epsilon), and “MinPts” (minimum number of points), to be explained.
k-means runs over many iterations to converge on a good set of clusters, and cluster assignments can change on each iteration. DBSCAN makes only a single pass through the data, and once a point has been assigned to a particular cluster, it never changes.
I like the language of trees for describing cluster growth in DBSCAN. It starts with an arbitrary seed point which has at least MinPts points nearby within a distance (or “radius”) of εε. We do a breadth-first search along each of these nearby points. For a given nearby point, we check how many points it has within its radius. If it has fewer than MinPts neighbors, this point becomes a leaf–we don’t continue to grow the cluster from it. If it does have at least MinPts, however, then it’s a branch, and we add all of its neighbors to the FIFO queue of our breadth-first search.
Once the breadth-first search is complete, we’re done with that cluster, and we never revisit any of the points in it. We pick a new arbitrary seed point (which isn’t already part of another cluster), and grow the next cluster. This continues until all of the points have been assigned.
There is one other novel aspect of DBSCAN which affects the algorithm. If a point has fewer than MinPts neighbors, AND it’s not a leaf node of another cluster, then it’s labeled as a “Noise” point that doesn’t belong to any cluster.
Noise points are identified as part of the process of selecting a new seed–if a particular seed point doesn’t have enough neighbors, it’s labeled as a Noise point. This label is often temporary, however–these Noise points are often picked up by some cluster as a leaf node.
Naftali Harris has created a great web-based visualization of running DBSCAN on a 2-dimensional dataset. Try clicking on the “Smiley” dataset and hitting the GO button. (That’s where the image from this post came from). Very cool!
Algorithm in Python
To fully understand the algorithm, I think it’s best to just look at some code.
Below is a working implementation in Python. Note that the emphasis in this implementation is on illustrating the algorithm… the distance calculations, for example, could be optimized significantly.
You can also find this code (along with an example that validates it’s correctness) on GitHub here.
Nearist has recently benchmarked exhaustive (or “brute force”) k-NN search on a dataset of 1 billion image descriptors (the deep1b dataset). A single server containing Nearist’s Vector Search Accelerator (VSX) cards was able to find Read more…
In part 2 of the word2vec tutorial (Part I below), I’ll cover a few additional modifications to the basic skip-gram model which are important for actually making it feasible to train. When you read the Read more…
This tutorial covers the skip gram neural network architecture for Word2Vec. My intention with this tutorial was to skip over the usual introductory and abstract insights about Word2Vec, and get into more of the details. Read more…