BiocNeighbors 1.21.2
The BiocNeighbors package provides several algorithms for approximate neighbor searches; this document covers the Annoy (Approximate Nearest Neighbors Oh Yeah) and HNSW (Hierarchical Navigable Small World) methods.
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by changing the BNPARAM argument in findKNN and queryKNN.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam().
library(BiocNeighbors)
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 3015 4384 4696 3450 1505 3955 1962 3551 5277 7093
## [2,] 397 5322 2177 3703 9722 6857 3372 828 5178 2914
## [3,] 5397 5818 6738 9131 5548 8078 8678 5437 6376 4808
## [4,] 3463 8749 6409 7166 1714 4153 3234 7073 5505 837
## [5,] 9744 6985 8646 2594 5398 4185 7497 5317 1666 9247
## [6,] 59 696 3283 1202 5466 9526 1198 3694 7064 3665
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.9341923 0.9527923 0.9647438 0.9769915 0.9836699 0.9871061 1.0030521
## [2,] 0.8119189 0.8634843 0.8650023 0.9214678 0.9227217 0.9279625 0.9305518
## [3,] 0.7730708 0.8406392 0.9079637 0.9366356 0.9759469 1.0036047 1.0226767
## [4,] 1.0311872 1.0526839 1.0882136 1.1277175 1.1401669 1.1420376 1.1517972
## [5,] 0.9501674 0.9572392 0.9841415 1.0178292 1.0300089 1.0811498 1.0919371
## [6,] 0.9044073 0.9194801 0.9752643 0.9846420 0.9871848 1.0118369 1.0177381
## [,8] [,9] [,10]
## [1,] 1.0335310 1.033556 1.0336419
## [2,] 0.9337065 0.947928 0.9490606
## [3,] 1.0251433 1.034085 1.0367147
## [4,] 1.1562366 1.169030 1.1857411
## [5,] 1.1265150 1.140092 1.1520551
## [6,] 1.0479219 1.067483 1.0843154
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 5418 3697 1072 3493 219
## [2,] 7131 9200 3955 8575 372
## [3,] 4945 5286 5409 7741 382
## [4,] 9611 9366 9124 5016 3401
## [5,] 7495 3555 8239 5163 4434
## [6,] 896 1822 6077 7666 891
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9707904 1.0662892 1.0664977 1.0710768 1.0729846
## [2,] 0.9964840 0.9966896 1.0007871 1.0154825 1.0182202
## [3,] 0.9536263 0.9846544 0.9889048 1.0100797 1.0276216
## [4,] 0.8351753 0.8431612 0.8885603 0.9292682 1.0337896
## [5,] 0.9119824 0.9192331 0.9487250 0.9638794 0.9803079
## [6,] 0.8751767 0.9048310 0.9093602 0.9205429 0.9367550
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam().
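Because both methods are approximate, it can be useful to check how many of the true neighbors they recover. The following is a minimal sketch of such a check, not part of the package: knn_recall and exact_knn_index are hypothetical helpers written in base R, and the brute-force search is only practical for small inputs. The BiocNeighbors calls run only if the package is installed.

```r
# Fraction of each point's exact neighbors that also appear in the approximate
# result, averaged over all points. Both inputs are index matrices with one
# row per point, such as the 'index' element returned by findKNN().
# (Hypothetical helper, not part of BiocNeighbors.)
knn_recall <- function(approx_index, exact_index) {
    stopifnot(identical(dim(approx_index), dim(exact_index)))
    hits <- vapply(seq_len(nrow(exact_index)), function(i) {
        sum(approx_index[i, ] %in% exact_index[i, ])
    }, numeric(1))
    mean(hits / ncol(exact_index))
}

# Brute-force exact search in base R (hypothetical helper).
exact_knn_index <- function(x, k) {
    d <- as.matrix(dist(x))   # all pairwise Euclidean distances
    diag(d) <- Inf            # a point is not its own neighbor
    idx <- lapply(seq_len(nrow(d)), function(i) order(d[i, ])[seq_len(k)])
    matrix(unlist(idx), ncol = k, byrow = TRUE)
}

set.seed(100)
small <- matrix(runif(500 * 20), ncol = 20)
exact <- exact_knn_index(small, k = 10)

if (requireNamespace("BiocNeighbors", quietly = TRUE)) {
    library(BiocNeighbors)
    # Switching algorithms only requires a different BNPARAM object.
    for (param in list(AnnoyParam(), HnswParam())) {
        approx <- findKNN(small, k = 10, BNPARAM = param)$index
        cat(class(param), "recall:", knn_recall(approx, exact), "\n")
    }
}
```

A recall of 1 means the approximate search returned exactly the same neighbor sets as the brute-force search; lower values indicate missed neighbors.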
Most of the options described for the exact methods are also applicable here. For example:
- subset to identify neighbors for a subset of points.
- get.distance to avoid retrieving distances when unnecessary.
- BPPARAM to parallelize the calculations across multiple workers.
- BNINDEX to build the index once for a given data set and re-use it across calls.
The use of a pre-built BNINDEX is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan" in AnnoyParam() or HnswParam().
Users are referred to the documentation of each function for specific details on the available arguments.
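To make the difference between the two distances concrete, the sketch below computes both for a pair of points in base R and then, if BiocNeighbors is installed, runs an Annoy search with distance="Manhattan" as described above. The toy points and the variable names are illustrative assumptions only.

```r
# Two points in three dimensions.
a <- c(0, 0, 0)
b <- c(1, 2, 2)

euclidean <- sqrt(sum((a - b)^2))   # straight-line distance
manhattan <- sum(abs(a - b))        # sum of absolute per-coordinate differences

# Optional: approximate search under the Manhattan distance.
if (requireNamespace("BiocNeighbors", quietly = TRUE)) {
    library(BiocNeighbors)
    set.seed(1)
    data <- matrix(runif(1000 * 20), ncol = 20)
    mout <- findKNN(data, k = 5, BNPARAM = AnnoyParam(distance = "Manhattan"))

    # The reported distance to the first neighbor of point 1 should agree
    # approximately with a direct base R calculation.
    j <- mout$index[1, 1]
    print(c(reported = mout$distance[1, 1],
            recomputed = sum(abs(data[1, ] - data[j, ]))))
}
```

The choice of distance can change which points are reported as neighbors, so results from the two settings are not interchangeable.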
Both Annoy and HNSW generate indexing structures (a forest of trees and a series of graphs, respectively) that are saved to file when calling buildIndex().
By default, this file is located in tempdir() and will be removed when the session finishes.
On HPC file systems, you can set the TMPDIR environment variable before starting R so that tempdir() points to a location that is more amenable to concurrent access.
AnnoyIndex_path(pre)
## [1] "/tmp/RtmpkbI6iO/file13e4c51b84413.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex(), HnswIndex().
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R version 4.4.0 Patched (2024-04-24 r86482)
## Platform: aarch64-apple-darwin20
## Running under: macOS Ventura 13.6.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.21.2 knitr_1.46 BiocStyle_2.31.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.2 rlang_1.1.3 xfun_0.43
## [4] jsonlite_1.8.8 S4Vectors_0.41.7 htmltools_0.5.8.1
## [7] stats4_4.4.0 sass_0.4.9 rmarkdown_2.26
## [10] grid_4.4.0 evaluate_0.23 jquerylib_0.1.4
## [13] fastmap_1.1.1 yaml_2.3.8 lifecycle_1.0.4
## [16] bookdown_0.39 BiocManager_1.30.22 compiler_4.4.0
## [19] codetools_0.2-20 Rcpp_1.0.12 BiocParallel_1.37.1
## [22] lattice_0.22-6 digest_0.6.35 R6_2.5.1
## [25] parallel_4.4.0 bslib_0.7.0 Matrix_1.7-0
## [28] tools_4.4.0 BiocGenerics_0.49.1 cachem_1.0.8