BiocNeighbors 1.21.1
The BiocNeighbors package provides several algorithms for approximate neighbor searches:
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM
argument in findKNN
and queryKNN
.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam()
.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 7559 5935 8059 9426 7168 6496 2132 5010 358 9600
## [2,] 3164 5111 5786 9400 988 2749 4123 3123 47 3283
## [3,] 3269 234 1066 4141 6388 1599 892 4870 7685 1377
## [4,] 107 6858 2713 9039 8619 8281 256 3699 8485 8091
## [5,] 8680 3816 6206 6724 2243 9144 4790 7240 5611 9304
## [6,] 8527 5441 4604 5636 3526 9468 4983 166 5145 370
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.8748013 0.8784494 0.9146183 0.9651812 0.9675552 0.9809601 1.0058558
## [2,] 0.7777827 0.9232483 0.9332730 0.9369563 0.9586143 0.9701444 0.9826863
## [3,] 0.8029577 0.8947443 0.9213590 0.9820651 0.9929726 1.0189129 1.0276979
## [4,] 0.8442335 0.9839671 1.0046166 1.1175454 1.1252342 1.1258414 1.1351572
## [5,] 0.8322262 0.9139543 0.9483393 1.0050924 1.0294942 1.0403750 1.0404503
## [6,] 0.7573267 0.9418661 0.9548041 0.9669248 0.9707791 0.9936019 1.0068357
## [,8] [,9] [,10]
## [1,] 1.0104431 1.013420 1.026168
## [2,] 0.9998363 1.016317 1.053129
## [3,] 1.0491771 1.066496 1.100463
## [4,] 1.1381686 1.143911 1.148446
## [5,] 1.0497428 1.060210 1.061107
## [6,] 1.0075078 1.017039 1.026355
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 7606 567 7112 4336 1756
## [2,] 3018 8262 500 3674 5626
## [3,] 845 1020 1663 5786 3551
## [4,] 1966 3106 845 535 5334
## [5,] 8913 6765 1890 196 198
## [6,] 2086 7346 1899 6545 4045
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.8578582 0.8884054 1.0041189 1.041651 1.105526
## [2,] 0.9961907 1.0888259 1.1145264 1.114570 1.130750
## [3,] 0.8934948 1.0029445 1.0453464 1.055168 1.083129
## [4,] 0.9385177 0.9388769 0.9829513 0.987389 1.010831
## [5,] 0.9597651 0.9951296 1.0010412 1.002832 1.020078
## [6,] 1.0127226 1.1398181 1.1507374 1.187975 1.195542
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam()
.
Most of the options described for the exact methods are also applicable here. For example:
subset
to identify neighbors for a subset of points.get.distance
to avoid retrieving distances when unnecessary.BPPARAM
to parallelize the calculations across multiple workers.BNINDEX
to build the forest once for a given data set and re-use it across calls.The use of a pre-built BNINDEX
is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan"
in AnnoyParam()
or HnswParam()
.
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively -
that are saved to file when calling buildIndex()
.
By default, this file is located in tempdir()
1 On HPC file systems, you can change TEMPDIR
to a location that is more amenable to concurrent access. and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "/tmp/RtmpqUjH5S/file1800e532bd9bb1.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex
.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex()
, HnswIndex()
.
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R Under development (unstable) (2023-11-11 r85510)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.21.1 knitr_1.45 BiocStyle_2.31.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.1 rlang_1.1.2 xfun_0.41
## [4] jsonlite_1.8.7 S4Vectors_0.41.2 htmltools_0.5.7
## [7] stats4_4.4.0 sass_0.4.7 rmarkdown_2.25
## [10] grid_4.4.0 evaluate_0.23 jquerylib_0.1.4
## [13] fastmap_1.1.1 yaml_2.3.7 lifecycle_1.0.4
## [16] bookdown_0.36 BiocManager_1.30.22 compiler_4.4.0
## [19] codetools_0.2-19 Rcpp_1.0.11 BiocParallel_1.37.0
## [22] lattice_0.22-5 digest_0.6.33 R6_2.5.1
## [25] parallel_4.4.0 bslib_0.6.1 Matrix_1.6-4
## [28] tools_4.4.0 BiocGenerics_0.49.1 cachem_1.0.8