The VIBE benchmark includes datasets with in-distribution queries as well as datasets with out-of-distribution queries.

The following table reports the size and dimensionality of each dataset, along with download links.

The plot below shows the first two principal components of the data and queries for each dataset. Selecting a dataset in the table above updates the visualization.
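As a rough illustration of how such a projection can be obtained, the sketch below fits a PCA on the data and projects both data and queries onto the first two components. It assumes that the principal directions are computed from the data only and then reused for the queries, which this page does not state. The function name `pca_2d` and the synthetic arrays are hypothetical.

```python
import numpy as np

def pca_2d(data, queries):
    """Project data and queries onto the first two principal components of the data."""
    # Assumption: the PCA basis is fitted on the data only.
    mean = data.mean(axis=0)
    centered = data - mean
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:2]
    return centered @ components.T, (queries - mean) @ components.T

# Synthetic stand-ins for a dataset and a shifted query set.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 16))
queries = rng.normal(loc=0.5, size=(50, 16))
data_2d, queries_2d = pca_2d(data, queries)
```

Plotting `data_2d` and `queries_2d` on the same axes gives a quick visual check of whether the queries fall inside the region occupied by the data.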

Along with the PCA plot, we display two distributions of Mahalanobis distances: from the data points to the data, and from the query points to the data.
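One reading of the distance between a point and the data is the Mahalanobis distance from the point to the data distribution, computed from the data mean and covariance. The sketch below follows that reading, which is an assumption rather than the benchmark's documented procedure. The function name `mahalanobis_to_data` and the synthetic arrays are hypothetical.

```python
import numpy as np

def mahalanobis_to_data(points, data):
    """Mahalanobis distance of each row of `points` to the distribution of `data`."""
    mean = data.mean(axis=0)
    cov = np.cov(data, rowvar=False)
    # The pseudo-inverse guards against a singular covariance matrix.
    cov_inv = np.linalg.pinv(cov)
    diff = points - mean
    # Quadratic form diff_i^T cov_inv diff_i for every row i.
    sq = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(sq, 0.0))

# Synthetic stand-ins: queries are shifted away from the data.
rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 8))
queries = rng.normal(loc=3.0, size=(100, 8))
d_data = mahalanobis_to_data(data, data)
d_queries = mahalanobis_to_data(queries, data)
```

Comparing histograms of `d_data` and `d_queries` then shows how far the queries sit from the bulk of the data: for out-of-distribution queries the second histogram is shifted toward larger distances.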

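To characterize the difficulty of queries, and hence of the workloads associated with each dataset, we consider the relative contrast, defined as \[ RC_k = \frac{d_{avg}}{d_k} \] where \(d_{avg}\) is the average distance from the query to the data points, and \(d_k\) is the distance from the query to its \(k\)-th nearest neighbor. A relative contrast close to 1 means that the \(k\)-th nearest neighbor is barely closer than an average point, so lower values indicate harder queries.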
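The relative contrast of a single query, taken as the ratio of its average distance to the data over the distance to its \(k\)-th nearest neighbor, can be sketched as follows. The example assumes Euclidean distance and a query that is not itself part of the data. The function name `relative_contrast` is hypothetical.

```python
import numpy as np

def relative_contrast(query, data, k):
    """Ratio of the average distance to the distance of the k-th nearest neighbor."""
    # Assumption: Euclidean distance; other metrics would change this line.
    dists = np.linalg.norm(data - query, axis=1)
    d_avg = dists.mean()
    # np.partition places the k-th smallest distance at index k - 1.
    d_k = np.partition(dists, k - 1)[k - 1]
    return d_avg / d_k

# Toy example: distances are 1, 2, 3, 4, so d_avg = 2.5 and d_2 = 2.
toy_data = np.array([[1.0], [2.0], [3.0], [4.0]])
toy_query = np.array([0.0])
rc = relative_contrast(toy_query, toy_data, k=2)  # 2.5 / 2.0 = 1.25
```

Applying the function to every query of a workload yields the per-dataset distribution of relative contrasts.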

The plot below reports the distribution of relative contrasts at \(k=100\) for the datasets¹ in the benchmark, with datasets arranged in increasing order of difficulty from top to bottom.

Footnotes

  1. Datasets with inner product similarity are omitted from the plot: the inner product is not a metric, and the relative contrast is not well-defined for non-metric distances.