
tl;dr: The faiss library makes it possible to perform nearest neighbor search efficiently, scaling to several million dense vectors. Go straight to the example code!

A common procedure used in information retrieval and machine learning is to represent entities with low-dimensional dense vectors, also known as embeddings. These vectors typically have between 25 and 1000 dimensions (we call them dense because the vast majority of their components are non-zero, so they are not sparse).

Researchers have devised ways to compute vector embeddings for different kinds of entities: nowadays embeddings can be constructed for words, entire text documents, entire images, local features in images, nodes in graphs, and more (as evidenced by the existence of very many “2vec” models).

Such vector representations have interesting properties (think of word2vec’s famous analogies). One basic property is that similar vectors (i.e. vectors close to each other according to a given metric, like cosine similarity or Euclidean distance) represent entities that are somehow closely related to each other. This property is important as it unlocks the possibility of searching for entities relevant to a given query entity.

For example, let’s consider words. If we want to look up words that are related to a given word (e.g. synonyms), we can

  1. Compute appropriate vector representations xb for all $N$ words in a vocabulary $S$;
  2. Look up the vector representation xq of the query word;
  3. Select the vectors that are closest to xq (according to the given metric), among all those computed in step 1;
  4. Retrieve all words represented by the vectors selected in step 3 (those are the final results).
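
As a toy sketch of this procedure (assuming a small, hypothetical dictionary embeddings mapping each word of the vocabulary to its vector):

import numpy as np

# hypothetical toy vocabulary with 3-dimensional embeddings (step 1)
embeddings = {"cat": np.array([0.9, 0.1, 0.0]),
              "kitten": np.array([0.8, 0.2, 0.1]),
              "car": np.array([0.0, 0.9, 0.4])}

words = list(embeddings)
xb = np.stack([embeddings[w] for w in words])

xq = embeddings["cat"]                   # step 2: vector of the query word
dists = np.linalg.norm(xb - xq, axis=1)  # step 3: distances to all vectors
closest = np.argsort(dists)[1:3]         # skip the query word itself
print([words[i] for i in closest])       # step 4: related words, here ['kitten', 'car']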

One could carry out a similar procedure using documents (Elasticsearch now supports this to retrieve similar documents), images (search engines for images can work this way) or small details in images (as done in feature matching).

If we can compute a “good” vector representation for entities (step 1 above) and have a viable algorithm to identify vectors close to a query vector (step 3, also known as k-nearest neighbors search, or k-NN search), we can build an efficient search engine.

In what follows we assume that step 1 has been solved for us and concentrate on how to carry out step 3, using Python and focusing on the case of large $N$. Solving k-NN search has great industrial relevance: companies such as Spotify and Facebook have been developing libraries to solve this problem efficiently.

k-nearest neighbors search in Python

Given a set $S$ of $N$ $d$-dimensional vectors xb (the search space) and a query vector xq, how can we find its nearest neighbors in $S$ using Python?
If $N$ is large, the computation can be expensive, so it’s beneficial to leverage some level of optimization offered by dedicated numerical libraries.

In what follows we’ll analyze solutions using numpy, scikit-learn and finally faiss, which can search among several million dense vectors. We will use the Euclidean distance as the metric to compare vectors (the code could be modified to use other metrics).

Linear search using numpy

One simple strategy is to compute the distance from xq to every other vector in $S$, and identify the $k$ vectors that have minimum distance. As all the possible matches are evaluated, this is also called brute force search. This operation requires the computation of $N$ distances and then finding the bottom $k$ values.

Once we have all the components of the vectors in $S$ stored in a numpy array xb, we can compute the indices of the k-nearest neighbors with

import numpy as np

N = 1000000
d = 100
k = 5

# create an array of N d-dimensional vectors (our search space)
xb = np.random.random((N, d)).astype('float32')
# create a random d-dimensional query vector
xq = np.random.random(d)
# compute distances
distances = np.linalg.norm(xb - xq, axis=1)
# select indices of vectors having the lowest distances from the query vector (sorted!)
neighbors = np.argpartition(distances, range(0, k))[:k]

This is fairly straightforward, but it’s interesting to note that we’re using numpy.argpartition to select the indices of the vectors with the lowest distances (numpy.argpartition does the job efficiently on a CPU as it uses the introselect algorithm). Mind that we’re passing range(0, k) as the second argument: if we passed the scalar k instead, the k returned indices would not be sorted by distance from xq!
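
To see the difference on a toy array:

a = np.array([7., 2., 9., 1., 5.])
print(np.argpartition(a, 2)[:3])            # indices of the 3 smallest values, in no guaranteed order
print(np.argpartition(a, range(0, 3))[:3])  # indices of the 3 smallest values, sorted by value: [3 1 4]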

This strategy is simple (the search itself takes 2 lines of code!) but it requires the entire matrix of vectors to be stored in memory. This means that if the memory taken by xb exceeds the available RAM, the code won’t run!
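
As a back-of-envelope check, the footprint of the search space above is easy to estimate (or to query directly from numpy):

# N * d * 4 bytes for float32 vectors: 10**6 * 100 * 4 bytes, i.e. about 0.4 GB
print(xb.nbytes / 1024**3)  # ~0.37 GiB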

Search with scikit-learn

scikit-learn provides a dedicated class to solve the problem, sklearn.neighbors.NearestNeighbors. Searching in this case would look like

from sklearn.neighbors import NearestNeighbors

# set desired number of neighbors
neigh = NearestNeighbors(n_neighbors=k)
neigh.fit(xb)
# select indices of k nearest neighbors of the vectors in the input list
neighbors = neigh.kneighbors([xq], return_distance=False)

Using sklearn offers some advantages, as it automatically employs heuristics to determine whether it should resort to computational tricks (like k-d trees) to reduce the number of distance calculations. A variety of metrics can be chosen to pick neighbors, and searching with multiple query vectors can be done by adding more vectors to the list passed to neigh.kneighbors().
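
For instance, a sketch reusing the xb, xq and k defined above, with the cosine metric and two queries at once:

from sklearn.neighbors import NearestNeighbors

# cosine similarity instead of the default Euclidean distance
neigh_cos = NearestNeighbors(n_neighbors=k, metric="cosine")
neigh_cos.fit(xb)
# two query vectors searched in a single call; the result has shape (2, k)
neighbors = neigh_cos.kneighbors([xq, xb[0]], return_distance=False)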

One can also save neigh to disk for later reuse with

from joblib import dump, load
dump(neigh, "my_fitted_nn_estimator")

and load it again with

neigh = load("my_fitted_nn_estimator")

Last but not least, the sklearn-based code is arguably more readable and the use of a dedicated library can help avoid bugs (see e.g. the numpy.argpartition caveat above) that may be inadvertently introduced in the code.

However, if the search space is large (say, several million vectors), both the time needed to compute the nearest neighbors and the RAM needed to carry out the search can become prohibitive. We thus need additional tricks to solve the problem!


Search with faiss, and scale beyond RAM constraints

One library that offers a more sophisticated bag of tricks to perform the search is faiss. From its wiki on GitHub: “Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM”. The possibility of searching in vector sets that do not fit in RAM is useful, and as far as I know numpy, sklearn and even Spotify’s annoy can’t do that.

Exact brute-force searches like the one done above with numpy can be replicated with the syntax:

import faiss

index = faiss.index_factory(d, "Flat")  # a "Flat" index performs exact, exhaustive search
index.train(xb)                         # a no-op for a Flat index, kept for uniformity with other index types
index.add(xb)                           # add the search space to the index
distances, neighbors = index.search(xq.reshape(1, -1).astype(np.float32), k)  # faiss expects float32 row vectors

So, why the fuss? Well, what’s cool about faiss is that it lets you strike a balance between accuracy (i.e. not returning all the true k-nearest neighbors, but just “good guesses”), speed and RAM requirements when dealing with large $N$. This is possible thanks to the precomputation of data structures (the indices) that store vectors in a clever way, and by tweaking their parameters. The library is also designed to take advantage of GPU architectures to speed up search.

One class of tricks used to speed up search is the pruning of $S$, i.e. dividing up $S$ into “buckets” (Voronoi cells in $d$ dimensions) and probing for nearest neighbors only some number nprobe of such buckets. While this procedure can miss some of the true nearest neighbors, it can greatly accelerate the search.
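
For example, a minimal sketch of this idea, reusing the random xb, xq, d and k from the numpy example (the choice of 100 cells and nprobe = 10 is arbitrary):

# "IVF100,Flat" = partition the space into 100 Voronoi cells, storing exact vectors in each cell
index_ivf = faiss.index_factory(d, "IVF100,Flat")
index_ivf.train(xb)    # run k-means to learn the cell centroids
index_ivf.add(xb)
index_ivf.nprobe = 10  # visit only 10 of the 100 cells at search time
distances, neighbors = index_ivf.search(xq.reshape(1, -1).astype(np.float32), k)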

faiss also implements compression strategies to speed up the distance computation and reduce memory use. By applying methods like product quantization (PQ) it is possible to obtain distances in an approximate (but faster) way, using table lookups instead of direct computation.
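
A sketch of such a compressed index, under the same assumptions as above ("PQ20" splits each 100-dimensional vector into 20 sub-vectors, each encoded with 8 bits, i.e. 20 bytes per vector):

# combine pruning (IVF) and compression (PQ)
index_pq = faiss.index_factory(d, "IVF100,PQ20")
index_pq.train(xb)     # learns both the coarse centroids and the PQ codebooks
index_pq.add(xb)       # vectors are stored as compact 20-byte codes
index_pq.nprobe = 10
distances, neighbors = index_pq.search(xq.reshape(1, -1).astype(np.float32), k)  # approximate distances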

A more concrete case: searching in a 1M dataset with faiss

The choice of which “tricks” to use for a specific problem depends on considerations about the raw input (dataset size, the spatial organization of the vectors), the available hardware (CPU vs GPU, available RAM, single node vs distributed processing), desired accuracy, speed, and total number of searches we need to perform. The faiss wiki on GitHub can help evaluate the different options.

Let’s examine more in detail a case in which:

  • $N \approx 10^6$;
  • search is performed in a Docker container running on CPU (single machine) and very few GBs of RAM are available; we can instead rely on a machine with more RAM to build the index;
  • accuracy is more important than speed: ideally we’d like to have exact results;
  • we plan to perform several searches (>10000) in the lifetime of an index.

To play with a realistic dataset, let’s use the GIST 1M vector dataset (GIST vectors can be used in computer vision to represent entire images). We can download and inflate the dataset with a Linux shell using wget and tar:

wget ftp://ftp.irisa.fr/local/texmex/corpus/gist.tar.gz
tar -xzvf gist.tar.gz

Moving to a Python shell, with an appropriate helper function (one possible implementation is sketched right below), one can read the file gist_base.fvecs (3.57 GB in size) into a numpy array xb of shape (1000000, 960):

xb = fvecs_read("./gist/gist_base.fvecs")
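
fvecs_read is not provided by numpy or faiss; a possible implementation, following the .fvecs/.ivecs format of the TEXMEX datasets (each vector is stored as a 4-byte integer holding its dimension, followed by its components), is:

import numpy as np

def ivecs_read(filename):
    # each record: an int32 with the dimension d, followed by d int32 components
    a = np.fromfile(filename, dtype="int32")
    d = a[0]
    return a.reshape(-1, d + 1)[:, 1:].copy()

def fvecs_read(filename):
    # same layout as .ivecs, but the components are float32
    return ivecs_read(filename).view("float32")

The ivecs_read variant will come in handy later to read the ground truth file shipped with the dataset.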

As we plan to perform several searches (see above), we can assume that precomputing an index will be helpful. As we are limited by RAM but the dataset is not huge, we use pruning but not compression. We thus opt for an IVF4000,Flat index that organizes the vectors into 4000 buckets in $d=960$ dimensional space:

index = faiss.index_factory(xb.shape[1], "IVF4000,Flat")

We then need to train the index so as to cluster the vectors that are added to it. Instead of proceeding as above (i.e. with index.train(xb) and index.add(xb)), we train the index with a subset of the vectors xb[0:batch_size] and then add vectors to sub-blocks of the index in batches of 100000 vectors each, as this allows us to limit RAM consumption at search time (see also demo_ondisk_ivf.py in the faiss GitHub repository):

batch_size = 100000

index.train(xb[0:batch_size])  # train (i.e. cluster) the index using the first batch only
faiss.write_index(index, "trained.index")

n_batches = xb.shape[0] // batch_size
for i in range(n_batches):
    index = faiss.read_index("trained.index")
    # vector ids must be 64-bit integers
    ids = np.arange(i * batch_size, (i + 1) * batch_size).astype(np.int64)
    index.add_with_ids(xb[i * batch_size:(i + 1) * batch_size], ids)
    faiss.write_index(index, f"block_{i}.index")

Finally, we can merge the blocks into a single index whose inverted lists are stored on disk, and save it with:

from faiss.contrib.ondisk import merge_ondisk

index = faiss.read_index("trained.index")
block_fnames = [f"block_{b}.index" for b in range(n_batches)]

merge_ondisk(index, block_fnames, "merged_index.ivfdata")
faiss.write_index(index, "populated.index")

While the above construction is a bit cumbersome and slow, it can pay off if we perform enough searches later down the line.

To re-read the index from disk we can use

index = faiss.read_index("populated.index")

Should we need to recover the i-th vector in xb, we can use the syntax

i = 42
index.make_direct_map()  # build the mapping from ids back to stored vectors (needed for IVF indices)
xi = index.reconstruct(i).reshape(1, -1).astype(np.float32)

Finally, we can perform the search for a set of 1000 query vectors xq, probing a limited number nprobe of cells:

xq = fvecs_read("./gist/gist_query.fvecs")

index.nprobe = 80
distances, neighbors = index.search(xq, k)
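
To check the quality of the results, we can compare them with the ground truth neighbors shipped with the dataset (a sketch assuming the ivecs_read helper defined earlier):

# ground truth: for each query, the indices of its true nearest neighbors
gt = ivecs_read("./gist/gist_groundtruth.ivecs")
# fraction of queries for which the true 1st nearest neighbor is ranked first
recall_at_1 = (neighbors[:, 0] == gt[:, 0]).mean()
print(recall_at_1)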

The code above retrieves the correct result for the 1st nearest neighbor in 95% of the cases (better accuracy can be obtained by setting higher values of nprobe). Memory consumption can be kept at bay: the search succeeds within a Docker container with RAM capped at 2GB, as shown by running the code with mprof run:

(figure: memory profile of the faiss-based search, recorded with mprof)

The equivalent numpy code needs more than 8GB to run (it crashes otherwise!), and runs slower:

(figure: memory profile of the equivalent numpy-based search, recorded with mprof)

The GIST dataset is not huge, but the example above shows that faiss can be helpful to tackle cases in which numpy or sklearn struggle, and can be modified (e.g. by using other indices) to handle even larger vector sets.

That’s it! The GIST example above can be reproduced with the code at https://github.com/davidefiocco/faiss-on-disk-example.

Search on

  • The faiss documentation lives on its GitHub wiki (which also contains references to the research work at the foundation of the library).
  • An introductory talk about faiss by its core devs can be found on YouTube, and a high-level intro is also available in a Facebook Engineering blog post.
  • More code examples are available in the faiss GitHub repository.
  • The website ann-benchmarks.com contains the results of benchmarks run with different libraries for approximate nearest neighbors search (including faiss).
  • Another helpful GitHub repository containing faiss usage tips is https://github.com/matsui528/faiss_tips.
  • A nice tutorial on the library can be found at https://www.pinecone.io/learn/faiss-tutorial/ too. Enjoy!
