--- title: "Describe CWL Tools/Workflows in R and Execution" output: BiocStyle::html_document: toc: true toc_depth: 4 number_sections: false highlight: haddock css: style.css --- ## Describe Tools in R The main interface provided by `sevenbridges` package is `Tool` function, it's basically a R interface similar to Seven Bridges's graphic user interface to describe tools, which I also highly recommend because it's very easy. For R users who want to script in everything here is alternative solution. I also did some work trying to make it simpler to use, any suggestions are welcomed. So I highly recommend user go over documentation [The Tool Editor](http://docs.cancergenomicscloud.org/docs/the-tool-editor) chapter for cancer genomic cloud to understand how it works, and even try it on the platform with the GUI. This will help use our R interface better and easier. Or you just want to create tools quickly, please keep reading. ### Example: give me some randome number Let's start from the most simple case - Use exiting docker container - directly execute from command line using `Rscript` Some basic arguments used in `Tool` function. - baseCommand: Specifies the program to execute. - stdout: Capture the command's standard output stream to a file written to the designated output directory. - inputs: inputs for your command line - outputs: outputs you want to collect For more accepted parameter, please check `help(CommandLineTool)` and `help(Tool)` for more details. Again we basically conform to CWL standard, you can always check their manual. #### Requirement and hints In short, hints are not _required_ for execution. We now accept following requirement items `cpu`, `mem`, `docker`, `fileDef`; and you can easily construct them via `requirements()` constructor. The downside of this is clear, you have much less control over your script and command line interface. #### Specify inputs and outpus Most likely your command line interface accept extra arguments, for example, - file - string - enum - int - float - boolean So to specify that in your tool, you can use `input` function, then pass it to the `inputs` arguments as a list or single item. You can even construct them as data.frame, but in that way, you usually have to provide equal arguments, less typing but less flexible. `input()` require arguments `id` and `type`. `output()` require arguments `id` because `type` by default is file. #### Specia type: ItemArray and enum The type could be an array of single type, the most common case is that if your input is a list of files, you can do something like `type = ItemArray("File")` or as simple as `type = "File..."` to diffenciate from a single file input. We also provide an enum type, when you specify the enum, please pass the required name and symbols like this `type = enum("format", c("pdf", "html"))` then in the interface you will be poped with drop down when you execute the task. For complete example, please check the "Make the best of bioconductor workflow" tutorial, or if you are advanced user, you can simply check `inst/docker` folder or check the github example [here](https://github.com/sbg/sevenbridges-r/tree/master/inst/docker/rnaseqGene) #### Using existing docker image and command If you already have a docker image in mind that provide the functionality you need, you can just use it. The `baseCommand` is the command line you want to execute in that container. `stdout` specify the output file you want to capture the standard output and collect it on the platform. ```{r} library(sevenbridges) rbx <- Tool(id = "runif", label = "runif", hints = requirements(docker(pull = "rocker/r-base"), cpu(1), mem(2000)), baseCommand = "Rscript -e 'runif(100)'", stdout = "output.txt", outputs = output(id = "random", glob = "*.txt")) rbx rbx$toJSON() ``` By default the tool object shows YAML, but you can simply convert it to JSON and copy it to your seven bridges platform graphic editor by importing JSON. ```{r} rbx$toJSON() rbx$toJSON(pretty = TRUE) rbx$toYAML() ``` #### Add-on script If you want to create simple script based on existing image, use `fileDef`. ```{r} ## Make a new file fd <- fileDef(name = "runif.R", content = "set.seed(1) runif(100)") ## or simply readLines .srcfile <- system.file("docker/sevenbridges/src/runif.R", package = "sevenbridges") fd <- fileDef(name = "runif.R", content = paste(readLines(.srcfile), collapse = "\n")) ## or read via reader library(readr) fd <- fileDef(name = "runif.R", content = read_file(.srcfile)) rbx <- Tool(id = "runif", label = "runif", hints = requirements(docker(pull = "rocker/r-base"), cpu(1), mem(2000)), requirements = requirements(fd), baseCommand = "Rscript runif.R", stdout = "output.txt", outputs = output(id = "random", glob = "*.txt")) ``` How bout multiple script? ```{r} ## or simply readLines .srcfile <- system.file("docker/sevenbridges/src/runif.R", package = "sevenbridges") fd1 <- fileDef(name = "runif.R", content = paste(readLines(.srcfile), collapse = "\n")) fd2 <- fileDef(name = "runif2.R", content = "set.seed(1) runif(100)") rbx <- Tool(id = "runif_twoscript", label = "runif_twoscript", hints = requirements(docker(pull = "rocker/r-base"), cpu(1), mem(2000)), requirements = requirements(fd1, fd2), baseCommand = "Rscript runif.R", stdout = "output.txt", outputs = output(id = "random", glob = "*.txt")) ``` #### Create formal input/output interface ```{r} ## pass a input list in.lst <- list(input(id = "number", description = "number of observations", type = "integer", label = "number", prefix = "--n", default = 1, required = TRUE, cmdInclude = TRUE), input(id = "min", description = "lower limits of the distribution", type = "float", label = "min", prefix = "--min", default = 0), input(id = "max", description = "upper limits of the distribution", type = "float", label = "max", prefix = "--max", default = 1), input(id = "seed", description = "seed with set.seed", type = "float", label = "seed", prefix = "--seed", default = 1)) ## the same method for outputs out.lst <- list(output(id = "random", type = "file", label = "output", description = "random number file", glob = "*.txt"), output(id = "report", type = "file", label = "report", glob = "*.html")) rbx <- Tool(id = "runif", label = "Random number generator", hints = requirements(docker(pull = "tengfei/runif"), cpu(1), mem(2000)), baseCommand = "runif.R", inputs = in.lst, ## or ins.df outputs = out.lst) ``` Here I use data.frame as example for input and output. ```{r} in.df <- data.frame(id = c("number", "min", "max", "seed"), description = c("number of observation", "lower limits of the distribution", "upper limits of the distribution", "seed with set.seed"), type = c("integer", "float", "float", "float"), label = c("number" ,"min", "max", "seed"), prefix = c("--n", "--min", "--max", "--seed"), default = c(1, 0, 10, 123), required = c(TRUE, FALSE, FALSE, FALSE)) out.df <- data.frame(id = c("random", "report"), type = c("file", "file"), glob = c("*.txt", "*.html")) rbx <- Tool(id = "runif", label = "Random number generator", hints = requirements(docker(pull = "tengfei/runif"), cpu(1), mem(2000)), baseCommand = "runif.R", inputs = in.df, ## or ins.df outputs = out.df) ``` #### Quick command line interface with commandArgs (position and named args) For advanced users, please read another tutorial "Creating Your Docker Container and Command Line Interface (with docopt)", "docopt" is more formal way to construct your command line interface, but there is a quick way to make command line interface here using just `commandArgs` Suppose I already have a R script like this using position mapping the arguments 1. numbers 2. min 3. max ```{r, eval = TRUE, comment=''} fl <- system.file("docker/sevenbridges/src", "runif2spin.R", package = "sevenbridges") cat(readLines(fl), sep = '\n') ``` Ignore the comment part, I will introduce spin/stich later. My base command will be somethine like ``` Rscript runif2spin.R 10 30 50 ``` I just describe my tool in this way ```{r} library(readr) fd <- fileDef(name = "runif.R", content = read_file(fl)) rbx <- Tool(id = "runif", label = "runif", hints = requirements(docker(pull = "rocker/r-base"), cpu(1), mem(2000)), requirements = requirements(fd), baseCommand = "Rscript runif.R", stdout = "output.txt", inputs = list(input(id = "number", type = "integer", position = 1), input(id = "min", type = "float", position = 2), input(id = "max", type = "float", position = 3)), outputs = output(id = "random", glob = "output.txt")) ``` Now copy-paste the json into your project app and run it in the cloud to test it How about named argumentments? I will still recommend use "docopt" package, but for simple way. ```{r, eval = TRUE, comment=''} fl <- system.file("docker/sevenbridges/src", "runif_args.R", package = "sevenbridges") cat(readLines(fl), sep = '\n') ``` ``` Rscript runif_args.R --n=10 --min=30 --max=50 ``` I just describe my tool in this way, note, I use `separate=FALSE` and add `=` to my prefix as a hack. ```{r} library(readr) fd <- fileDef(name = "runif.R", content = read_file(fl)) rbx <- Tool(id = "runif", label = "runif", hints = requirements(docker(pull = "rocker/r-base"), cpu(1), mem(2000)), requirements = requirements(fd), baseCommand = "Rscript runif.R", stdout = "output.txt", inputs = list(input(id = "number", type = "integer", separate = FALSE, prefix = "--n="), input(id = "min", type = "float", separate = FALSE, prefix = "--min="), input(id = "max", type = "float", separate = FALSE, prefix = "--max=")), outputs = output(id = "random", glob = "output.txt")) ``` #### Quick report: Spin and Stich Alternative, you can use spin/stich from knitr to generate report directly from a Rscript with special format. For example, let's use above example ```{r, eval = TRUE, comment=''} fl <- system.file("docker/sevenbridges/src", "runif_args.R", package = "sevenbridges") cat(readLines(fl), sep = '\n') ``` You command is something like this ``` Rscript -e "rmarkdown::render(knitr::spin('runif_args.R', FALSE))" --args --n=100 --min=30 --max=50 ``` And so I describe my tool like this with docker image `rocker/hadleyverse` this contians knitr and rmarkdown package. ```{r} library(readr) fd <- fileDef(name = "runif.R", content = read_file(fl)) rbx <- Tool(id = "runif", label = "runif", hints = requirements(docker(pull = "rocker/hadleyverse"), cpu(1), mem(2000)), requirements = requirements(fd), baseCommand = "Rscript -e \"rmarkdown::render(knitr::spin('runif.R', FALSE))\" --args", stdout = "output.txt", inputs = list(input(id = "number", type = "integer", separate = FALSE, prefix = "--n="), input(id = "min", type = "float", separate = FALSE, prefix = "--min="), input(id = "max", type = "float", separate = FALSE, prefix = "--max=")), outputs = list(output(id = "stdout", type = "file", glob = "output.txt"), output(id = "random", type = "file", glob = "*.csv"), output(id = "report", type = "file", glob = "*.html"))) ``` You will get a report in the end #### Inherit metadata and additional metadata Sometimes if you want your output files inherit from particular input file, just use `inheritMetadataFrom` in your output() call and pass the input file id. If you want to add additional metadata, you could pass `metadata` a list in your output() function call. For example, I want my output report inherit all metadata from my "bam_file" input node (which I don't have in this example though) with two additional metadata fields. ```{r} out.lst <- list(output(id = "random", type = "file", label = "output", description = "random number file", glob = "*.txt"), output(id = "report", type = "file", label = "report", glob = "*.html", inheritMetadataFrom = "bam_file", metadata = list(author = "tengfei", sample = "random"))) out.lst ``` #### Execute the tool in the cloud With API function, you can directly load your Tool into the account. Run a task, for "how-to", please check the API complete guide Following section, please for now skip. ```{r, eval = FALSE} a <- Auth(platform = "cgc", username = "tengfei") p <- a$project("demo") app.runif <- p$app_add("runif555", rbx) aid <- app.runif$id p$task_add(name = "Draft runif simple", description = "Description for runif", app = aid, inputs = list(min = 1, max = 10)) ## confirm, show all task status is draft (tsk <- p$task(status = "draft")) tsk$run() tsk$download("~/Downloads/") ``` #### Execute the tool in Rabix - test locally **1. from CLI** While developing tools it is useful to test them locally first. For that we can use rabix - reproducible analyses for bioinformatics, https://github.com/rabix. To test your tool with latest implementation of rabix in Java (called **bunny**) you could use docker image **tengfei/testenv**: ``` docker pull tengfei/testenv ``` Dump your rabix tool as json into dir which also contains input data. `write(rbx$toJSON, file="/.json")`. Make **inputs.json** file to declare input parameters in the same directory (you can use relative paths from inputs.json to data). Create container: ``` docker run --privileged --name bunny -v :/bunny_data -dit tengfei/testenv ``` Execute tool ``` docker exec bunny bash -c 'cd /opt/bunny && ./rabix.sh -e /bunny_data /bunny_data/.json /bunny_data/inputs.json' ``` You'll see running logs from within container, and also output dir inside in home system. NOTE: tengfei/testenv has R, python, Java... so many tools can work without docker requirement set. If you however set docker requirement you need to pull image inside container first to run docker container inside running bunny docker. NOTE: inputs.json can also be inputs.yaml if you find it easier to declare inputs in YAML. **2. from R** ```{r, eval=F} library(sevenbridges) in.df <- data.frame(id = c("number", "min", "max", "seed"), description = c("number of observation", "lower limits of the distribution", "upper limits of the distribution", "seed with set.seed"), type = c("integer", "float", "float", "float"), label = c("number" ,"min", "max", "seed"), prefix = c("--n", "--min", "--max", "--seed"), default = c(1, 0, 10, 123), required = c(TRUE, FALSE, FALSE, FALSE)) out.df <- data.frame(id = c("random", "report"), type = c("file", "file"), glob = c("*.txt", "*.html")) rbx <- Tool(id = "runif", label = "Random number generator", hints = requirements(docker(pull = "tengfei/runif"), cpu(1), mem(2000)), baseCommand = "runif.R", inputs = in.df, ## or ins.df outputs = out.df) params <- list(number=3, max=5) set_test_env("tengfei/testenv", "mount_dir") test_tool(rbx, params) ``` ## Describe Wokrflow in R __Graphic User Interface on Seven Bridges Platform is way more conventient__ ### Introduction To create a workflow, we provide simple interface to pipe your tool into a single workflow, it works under situation like - Simple linear tool connection - Output and input match at lest have one match, then we will connect it automatically. __Note__ for complicated workflow construction, I highly recommend using our graphical interface to do it, there is no better way. ### Example Here I will give a quick example 1. Tool 1 output 1000 random number 2. Tool 2 take log on it 3. Tool 3 do a mean calculation of everything Here are methods we support - Maximize the flexibility please use `Flow` constructor to construct your flow, then pass steps using `+` sign to connect your object. `+` only outputs `StepList` and support operation - Tool + Tool - StepList + Tool - StepList + StepList - Alternatively `%>>%` will output `Flow` not `StepList`, if you connect tools with `%>>%` we will create id for you. it always output `Workflow` - Tool %>>% Tool - Workflow %>>% Tool - Workflow %>>% Tool #### Construct tools first ```{r} library(sevenbridges) ## A tool that generate a 100 random number t1 <- Tool(id = "runif new test 3", label = "random number", hints = requirements(docker(pull = "rocker/r-base")), baseCommand = "Rscript -e 'x = runif(100); write.csv(x, file = 'random.txt', row.names = FALSE)'", outputs = output(id = "random", type = "file", glob = "random.txt")) ## A tool that take log fd <- fileDef(name = "log.R", content = "args = commandArgs(TRUE) x = read.table(args[1], header = TRUE)[,'x'] x = log(x) write.csv(x, file = 'random_log.txt', row.names = FALSE) ") t2 <- Tool(id = "log new test 3", label = "get log", hints = requirements(docker(pull = "rocker/r-base")), requirements = requirements(fd), baseCommand = "Rscript log.R", inputs = input(id = "number", type = "file"), outputs = output(id = "log", type = "file", glob = "*.txt")) ## A tool that do a mean fd <- fileDef(name = "mean.R", content = "args = commandArgs(TRUE) x = read.table(args[1], header = TRUE)[,'x'] x = mean(x) write.csv(x, file = 'random_mean.txt', row.names = FALSE) ") t3 <- Tool(id = "mean new test 3", label = "get mean", hints = requirements(docker(pull = "rocker/r-base")), requirements = requirements(fd), baseCommand = "Rscript mean.R", inputs = input(id = "number", type = "file"), outputs = output(id = "mean", type = "file", glob = "*.txt")) steplist <- t1 + t2 + t3 steplist ``` #### Connect tools to a flow To create a Flow we suggest you using `Flow` function, so that you can pass id and label to it. ```{r} f <- Flow(id = "Random-log-mean-new-test-2", label = "random log mean new test", steps = steplist) f$toJSON() f$toJSON(pretty = TRUE) ``` Or use %>>% ```{r} f <- t1 %>>% t2 %>>% t3 f$toJSON() ``` If it's the first time you upload those tools, following script will push both tools and the workflow to your project. ```{r, eval = FALSE} ## need id, full, sbg:id library(sevenbridges) a <- Auth(platform = "cgc", username = "tengfei") p <- a$project(id = "tengfei/helloworld") app.runif <- p$app_add("new_flow", f) aid <- app.runif$id p$task_add(name = "Draft flow", description = "Flow test", app = aid) ## confirm, show all task status is draft (tsk <- p$task('Draft flow')) tsk$run() ``` You can get App from your repos and connect it in R to build flow. ```{r, eval = FALSE} app1 <- p$app(id = "tengfei/helloworld/runif_new_test/0") app2 <- p$app(id = "tengfei/quickstart/log_new_test_3/0") app3 <- p$app(id = "tengfei/quickstart/log_new_test_3/0") f <- app1 %>>% app2 %>>% app3 f$id <- "randome number test flow 2" f$"sbg:id" <- "tengfei/helloworld/testtesttest" f$toJSON(pretty = TRUE) ``` ** Important ** This flow construction method is easy for linear flow, your best choice is always our graphic user interface and editor.