# 1 Introduction

The SharedObject package is designed for sharing data across multiple R processes, so that all processes can read the data from the same memory location. This sharing mechanism can reduce memory usage and the overhead of data transmission in parallel computing. The need for the package arises in many data-science fields, such as high-throughput gene data analysis, where the data is very large and parallel computing is desirable. Blindly exporting data to all R processes via functions such as clusterExport duplicates the data for each process, which is unnecessary if the other processes only need to read it. The SharedObject package shares the data without duplication and can thereby reduce the time cost. It uses ALTREP, a new set of R APIs, to provide a seamless experience when sharing an object.

# 2 Quick example

We first demonstrate the package with an example. In this example, we create a cluster with a single worker process and share an n-by-n matrix A. We use the function share to create a shared object shared_A and call the function clusterExport to export it:

library(parallel)
library(SharedObject)
## Initiate the cluster
cl <- makeCluster(1)
## create data
n <- 3
A <- matrix(runif(n^2), n, n)
## create a shared object
shared_A <- share(A)
## export the shared object
clusterExport(cl,"shared_A")

stopCluster(cl)

As the code above shows, exporting a shared object to other R processes works the same way as exporting a regular R object, except that we replace the matrix A with the shared object shared_A. Notably, there is no visible difference between the matrix A and the shared object shared_A. The shared object shared_A is neither an S3 nor an S4 object and behaves exactly like the matrix A, so existing code does not need to change to work with the shared object. We can verify this through:

## check the data
A
#>           [,1]      [,2]      [,3]
#> [1,] 0.4633210 0.6513251 0.8640785
#> [2,] 0.8590686 0.8625660 0.2799991
#> [3,] 0.2806280 0.3378297 0.8260773
shared_A
#>           [,1]      [,2]      [,3]
#> [1,] 0.4633210 0.6513251 0.8640785
#> [2,] 0.8590686 0.8625660 0.2799991
#> [3,] 0.2806280 0.3378297 0.8260773
## check the class
class(A)
#> [1] "matrix" "array"
class(shared_A)
#> [1] "matrix" "array"
## identical
identical(A, shared_A)
#> [1] TRUE
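As a small illustration of how a worker might use the exported object, the sketch below exports a shared matrix and lets the worker compute its column sums. The names B, shared_B and res are made up for this example, and it assumes that SharedObject is installed on the machine running the worker.

```r
library(parallel)
library(SharedObject)

## a cluster with a single worker, as in the example above
cl <- makeCluster(1)

## hypothetical data used only for this illustration
B <- matrix(runif(9), 3, 3)
shared_B <- share(B)

## export the shared object and let the worker read from it
clusterExport(cl, "shared_B")
res <- clusterEvalQ(cl, colSums(shared_B))

## compare the worker's result with the local computation
all.equal(res[[1]], colSums(B))

stopCluster(cl)
```

The worker only reads shared_B here, which is the use case the package is designed for.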

Users can treat the shared object shared_A as a matrix and operate on it as usual. To avoid creating shared objects unnecessarily, a subset of a shared object is a regular R object. Users can verify this by calling is.shared:

## shared_A is a shared object
is.shared(shared_A)
#> [1] TRUE

## The subset of shared_A is not
is.shared(shared_A[1:2])
#> [1] FALSE

This behavior, however, can be altered via the argument sharedSubset. If a shared object shared_A is created by share(A, sharedSubset = TRUE), then all subsets of shared_A are shared objects automatically.
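A minimal sketch of the sharedSubset argument follows. The names B and shared_B are made up for this example; according to the description above, the subset of an object created this way should itself be reported as shared.

```r
library(SharedObject)

## hypothetical data used only for this illustration
B <- matrix(runif(9), 3, 3)

## request that subsets of the shared object are shared as well
shared_B <- share(B, sharedSubset = TRUE)

## check the object and one of its subsets
is.shared(shared_B)
is.shared(shared_B[1:2])
```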

# 3 Supported data types and structures

Currently, the package supports the raw, logical, integer and double data types; character is not supported. The data structures atomic (aka vector), matrix, data.frame and list can be shared. The function share is an S4 generic, so developers can define their own S4 share method to support additional data structures.
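The sketch below shows one way a developer might add a share method for their own class. The class MyContainer and its slot values are hypothetical, and the method assumes that the generic has the signature share(x, ...); please check the function documentation of your installed version before relying on it.

```r
library(SharedObject)

## hypothetical S4 class that keeps a numeric matrix in one slot
setClass("MyContainer", representation(values = "matrix"))

## assumption: the generic is defined as share(x, ...)
## the method delegates to the existing share method for matrices
setMethod("share", "MyContainer", function(x, ...) {
    x@values <- share(x@values, ...)
    x
})

obj <- new("MyContainer", values = matrix(runif(4), 2, 2))
shared_obj <- share(obj)

## check whether the slot now holds a shared object
is.shared(shared_obj@values)
```

Delegating to the existing methods keeps the lifecycle of the shared memory under the control of the package.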

Please note that sharing a list object does not share the list itself; each element of the list is shared instead. Therefore, adding or replacing an element of a shared list in the main process will not implicitly change the shared list in the other processes. Since a data frame is fundamentally a list object, sharing a data frame follows the same principle as sharing a list.
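To make the element-wise behavior concrete, the following sketch shares a small data frame with two numeric columns. The names df and shared_df are made up for this example; following the principle described above, is.shared is expected to report on each column separately.

```r
library(SharedObject)

## hypothetical data frame with sharable (integer and double) columns
df <- data.frame(x = 1:3, y = c(0.1, 0.2, 0.3))

## each column is shared individually
shared_df <- share(df)

## one logical value per column
is.shared(shared_df)
```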

When a list consists of both sharable and non-sharable objects, the noError argument of the share function can be used to share the sharable elements and keep the non-sharable elements unchanged (otherwise an error is raised). Alternatively, the function tryShare is a shortcut for share(..., noError = TRUE):

## the element a is sharable and b is not
mydata <- list(a = 1:3, b = letters[1:3])

## Will get an error if we directly share the object
## share(mydata)

## Use the noError argument to suppress the error message
sharedList1 <- share(mydata, noError = TRUE)

## Use the function tryShare
## this is equivalent to share(mydata, noError = TRUE)
sharedList2 <- tryShare(mydata)

## Only the element a is a shared object
is.shared(sharedList1)
#> $a
#> [1] TRUE
#>
#> $b
#> [1] FALSE

# 4 Check object information

To distinguish a shared object from a regular one, the package provides the is.shared function:

## Check if an object is a shared object
## This works for both vector and data.frame
is.shared(A)
#> [1] FALSE
is.shared(shared_A)
#> [1] TRUE

For an atomic object, is.shared returns a logical value indicating whether the object is a shared object. For a list object, it returns a list of logical values with each element representing the corresponding element in the list object.

Several properties are associated with a shared object. One can check them via:

## get a summary report
getSharedObjectProperty(shared_A)
#> $dataId
#> [1] 1
#>
#> $length
#> [1] 9
#>
#> $totalSize
#> [1] 72
#>
#> $dataType
#> [1] 14
#>
#> $ownData
#> [1] TRUE
#>
#> $copyOnWrite
#> [1] TRUE
#>
#> $sharedSubset
#> [1] FALSE
#>
#> $sharedCopy
#> [1] FALSE

## Internal function to check the properties
## get the individual properties
getCopyOnWrite(shared_A)
#> [1] TRUE
getSharedSubset(shared_A)
#> [1] FALSE
getSharedCopy(shared_A)
#> [1] FALSE

Please see the advanced topic for the meaning of these properties and how to set them properly.

# 5 Global options

Some options control the default behavior of a shared object. You can view them via:

getSharedObjectOptions()
#> $copyOnWrite
#> [1] TRUE
#>
#> $sharedSubset
#> [1] FALSE
#>
#> $noError
#> [1] FALSE
#>
#> $sharedCopy
#> [1] FALSE

As mentioned before, the option sharedSubset controls whether a subset of a shared object is still a shared object. The option noError suppresses the error when the function share encounters a non-sharable object and forces the function to return the object unchanged. We will talk about the options copyOnWrite and sharedCopy in the advanced section, but most users should not need to change these two options. The global settings can be modified via setSharedObjectOptions:

## change the default setting
setSharedObjectOptions(noError = TRUE)
## Check if the change is made
getSharedObjectOptions("noError")
#> [1] TRUE
#>
#> $sharedCopy
#> [1] TRUE

## Changing the value of shared_A will not
## result in a regular R object
shared_A2 <- shared_A
shared_A[1,1] <- 10
is.shared(shared_A)
#> [1] TRUE

Please note that sharedCopy is only available when copyOnWrite = TRUE.

# 7 Developing package based upon SharedObject

The package offers three levels of APIs to help package developers build their own shared memory objects.

## 7.1 user API

The simplest and recommended way to make your own shared object is to define an S4 function share in your own package, where you can rely on the existing share functions to quickly add support for an S4 class that is not provided by SharedObject. We recommend this method for the simple reason that developers do not have to control the lifecycle of the shared memory. The package will automatically destroy the shared memory after usage.

## 7.2 R memory management APIs

It is a common request to have low-level control over the shared memory. To achieve that, the package exports some R APIs for developers who want fine control of their shared objects. These functions are allocateSharedMemory, allocateNamedSharedMemory, mapSharedMemory, unmapSharedMemory, freeSharedMemory, hasSharedMemory and getSharedMemorySize. Note that developers are responsible for destroying the shared memory after usage. Please see the function documentation for more information.

## 7.3 C++ memory management APIs

The most sophisticated package developers might be more comfortable using the C++ APIs rather than the R APIs. All the R APIs mentioned in the previous section can be found at the C++ level. Here I borrow the instructions from the Rhtslib package, with slight changes, to show how to use the C++ APIs in your package.

### 7.3.1 Step 1

To use the C++ APIs, you must add SharedObject to the LinkingTo field of the DESCRIPTION file, e.g.,

LinkingTo: SharedObject

### 7.3.2 Step 2

In C++ code files, include the header of the shared object:

#include "SharedObject/sharedMemory.h"

### 7.3.3 Step 3

To compile and link your package successfully against the SharedObject C++ library, you must include a src/Makevars file.

SHARED_OBJECT_LIBS = $(shell echo 'SharedObject:::pkgconfig("PKG_LIBS")'|\
    "${R_HOME}/bin/R" --vanilla --slave)
SHARED_OBJECT_CPPFLAGS = $(shell echo 'SharedObject:::pkgconfig("PKG_CPPFLAGS")'|\
    "${R_HOME}/bin/R" --vanilla --slave)
PKG_LIBS := $(PKG_LIBS) $(SHARED_OBJECT_LIBS)
PKG_CPPFLAGS := $(PKG_CPPFLAGS) $(SHARED_OBJECT_CPPFLAGS)

Note that $(shell ...) is GNU make syntax, so you should add GNU make to the SystemRequirements field of the DESCRIPTION file of your package, e.g.,

SystemRequirements: GNU make

You can find short descriptions on how to use the C++ API in the header file.

# 8 Session Information

sessionInfo()
#> R version 4.0.0 alpha (2020-03-31 r78116)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.4 LTS
#>
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
#>
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] parallel  stats     graphics  grDevices utils     datasets  methods
#> [8] base
#>
#> other attached packages:
#> [1] SharedObject_1.1.2 BiocStyle_2.15.6
#>
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.4          bookdown_0.18       digest_0.6.25
#>  [4] magrittr_1.5        evaluate_0.14       rlang_0.4.5
#>  [7] stringi_1.4.6       rmarkdown_2.1       tools_4.0.0
#> [10] stringr_1.4.0       xfun_0.12           yaml_2.2.1
#> [13] compiler_4.0.0      BiocGenerics_0.33.3 BiocManager_1.30.10
#> [16] htmltools_0.4.0     knitr_1.28