Features

Overview

Day-to-day data analysis easily turns into routine work, and fully reproducible code is often hard to achieve. Can you say for sure that you could redo your whole analysis given only the raw data and your code? wBuild is designed to reduce the time you spend on publishing the output of your scripts, declaring the required input files, running Python code as part of the pipeline, using placeholders to structure your Snakemake jobs, mapping your project’s scripts together, and much more.

Demo project

We highly recommend looking through the examples in the demo project, which show each feature in use. The demo also contains additional documentation that explains the features and how to work with them.

Command-line interface

The command-line interface of wBuild is responsible only for preparing a project directory to be processed by Snakemake and wBuild. There are three commands, also briefly documented under wbuild -h:

wbuild demo: Run demo project.

wbuild init: Initialize wBuild in an already existing project. This command prepares all important wrappers and files for Snakemake.
wbuild update: To be called on an already initialized project. Updates the .wbuild directory to the latest version using the installed Python wbuild package.

All these commands should be executed from the root directory of the project.

Snakemake CLI

Most of the job of building your project is done by Snakemake, as explained here. There are also several special Snakemake rules that wBuild provides. The most important include:

snakemake mapScripts: Do script mapping
snakemake publish: Publish your html output pages to your projectWebDir
snakemake clean: Deletes html output, generated dependencies file and Python cache.

snakemake restoreModDate: Restore the previous modification dates of all files. This comes in handy after pulling changes from a VCS, which changes all the modification dates.

See more about this down the page.

Parsing YAML headers

In the following, we present a basic YAML header:

#'---
#' title: Basic Input Demo
#' author: Leonhard Wachutka
#' wb:
#'  input:
#'  - iris: "Data/{wbP}/iris.RDS"
#'  output:
#'  - pca: " {wbPD_P}/pca.RDS"
#' type: script
#'---

wBuild requires you to describe each script in an RMarkdown-style YAML header. wBuild scans this header and generates rules for Snakemake from it. The wb block is specific to wBuild. The important tags here are input and output: they are used to construct the Snakemake pipeline and to render the script to HTML.

Tags that can be provided mainly follow the logic of Snakemake and partially that of wbuild.

Please note: YAML tags have a strict format that they should follow - e.g. there should be no tabs, only spaces! You can read more about the YAML syntax.
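To make the header layout concrete, here is a minimal Python sketch of how the YAML text could be pulled out of a script. It is not wBuild's actual parser: the function name extract_header is hypothetical, and the sketch only assumes what the example above shows, namely that the header sits between two #'--- lines and that every header line starts with #'.

```python
def extract_header(script_text):
    """Return the YAML text between the two #'--- delimiter lines.

    Hypothetical helper for illustration; not part of wBuild.
    """
    yaml_lines = []
    inside = False
    for line in script_text.splitlines():
        if line.strip() == "#'---":
            if inside:
                # Closing delimiter: the header is complete.
                return "\n".join(yaml_lines)
            inside = True
            continue
        if inside and line.startswith("#'"):
            # Drop the "#'" marker and at most one following space, so the
            # relative indentation of nested YAML keys is preserved.
            content = line[2:]
            yaml_lines.append(content[1:] if content.startswith(" ") else content)
    raise ValueError("no complete #'--- header found")


demo_script = "\n".join([
    "#'---",
    "#' title: Basic Input Demo",
    "#' wb:",
    "#'  input:",
    "#'  - iris: \"Data/{wbP}/iris.RDS\"",
    "#'---",
    "print('analysis code starts here')",
])

header_text = extract_header(demo_script)
header_lines = header_text.splitlines()
```

The returned text can then be handed to any YAML parser. Because only the comment marker is stripped, the spaces-only indentation rule mentioned above still applies to the extracted text.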

If you want to access information from the header of a script from within the script itself (code self-reflection), you need to source .wBuild/wBuildParser.R and call parseWBHeader() with the path to your script as an argument.

Tags

To make working with R projects even more comfortable, there are a few additional YAML tags that wBuild provides. They are:

input: Specify any input files you would like to use. You can later access them from the R code using snakemake@input[[<input_file_var>]].
output: The same as input - accessed using snakemake@output.

py: This tag allows you to run some Python code during parsing of the header - a good example of how this feature can be extremely helpful is in the demo. Don’t forget the YAML pipe operator for the proper functionality!
type: Tag describing the type of the file. Can be: script for R Scripts, noindex for Markdown and empty for the rest.

The information stated under these tags is later synchronised with Snakemake.

You can also state Snakemake options in the “wb” block of the YAML header and refer to them later in the R script using snakemake@. Here, we declare that this script uses 10 threads:

#' wb:
#'  input:
#'  - iris: "Data/iris_downloaded.data"
#'  threads: 10

The specified threads value can then be referred to by name in the R script: snakemake@threads

Snakemake special features

Use the following additional options with the Snakemake CLI:

--dag

Construct the directed acyclic graph of the current snakemake workflow and display as svg.

There are also some special rules that are not getting executed as a part of the usual workflow which can be run separately. Consult .wBuild/wBuild.snakefile in your project to find out more.

Publishing the output

Snakemake renders your project, including the script text and its outputs, to a browsable structure of HTML files. You can change the output path by setting the htmlOutputPath value in the configuration file found in the root directory of your wBuild-initialized project. By default, the HTML is written to Output/html.

You can also copy your output to a web server automatically: running snakemake publish copies the whole HTML output directory to the directory specified by the projectWebDir parameter in the configuration file.
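The copy step can be pictured with the following Python sketch. It is an illustration only, not the code behind the publish rule: the function publish and the dictionary standing in for the configuration file are hypothetical, and the sketch assumes the contents of the HTML directory are copied directly into projectWebDir.

```python
import os
import shutil
import tempfile


def publish(config):
    """Copy the HTML output directory into the web directory.

    Hypothetical stand-in for what `snakemake publish` achieves.
    """
    # dirs_exist_ok (Python 3.8+) lets a repeated run overwrite older pages.
    shutil.copytree(config["htmlOutputPath"], config["projectWebDir"],
                    dirs_exist_ok=True)


with tempfile.TemporaryDirectory() as root:
    # Stand-in for the two relevant keys of the configuration file.
    config = {
        "htmlOutputPath": os.path.join(root, "Output", "html"),
        "projectWebDir": os.path.join(root, "www"),
    }
    os.makedirs(config["htmlOutputPath"])
    with open(os.path.join(config["htmlOutputPath"], "index.html"), "w") as handle:
        handle.write("<html></html>\n")

    publish(config)
    published = os.path.exists(os.path.join(config["projectWebDir"], "index.html"))
```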

Markdown

There is no need to create a separate Markdown file to describe the analysis: with wBuild you can write the description directly into your rendered output by starting a line with #' and then using the usual Markdown syntax.

Configuration file

The wbuild.yaml file in the root directory of the project is the configuration file of wBuild. In this file you can adjust various properties of the wBuild workflow:

htmlOutputPath: This value specifies the relative path where your HTML output will land. More precisely, it is a prefix to output file of any Snakemake rule that is generated by wBuild. Default is Output/html.
processedDataPath: Relative path to the data output directory. Default is Output/ProcessedData
scriptsPath: Relative path to the root Scripts directory.
projectWebDir: Path to the output directory for snakemake publish.

IMPORTANT: Please, do not remove any key-value pairs from it or move this file unless you know what you are doing.

Placeholders

Placeholders let you refer to your current position in the project's directory structure with a short name instead of an absolute or relative path. This is best shown with an example:

#' wb:
#'  input:
#'  - iris: "Data/{wbP}/iris.RDS"
#'  output:
#'  - pca: " {wbPD_P}/pca.RDS"

Here, we use wbP for the name of the current project (say, Analysis1) and wbPD_P for the output directory for processed data followed by the project name, say Output/ProcessedData/Analysis1.

Here is the concise list of the placeholders:

wbPD: <output directory for processed data>, e.g. Output/ProcessedData
wbP: <current project>, e.g. Analysis1
wbPP: <subfolder name>, e.g. 020_InputOutput
wbPD_P: <output directory for processed data>/<current project>, e.g. Output/ProcessedData/Analysis1
wbPD_PP: <output directory for processed data>/<current project>/<subfolder name>, e.g. Output/ProcessedData/Analysis1/020_InputOutput
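As a rough illustration of how the list above fits together, the following Python sketch builds the five values for one script and substitutes them into header paths. It is not wBuild's implementation: the helpers placeholder_values and expand are hypothetical, and the sketch assumes a script layout of <scriptsPath>/<project>/<subfolder>/<file> with a scripts directory named Scripts.

```python
from pathlib import PurePosixPath


def placeholder_values(script_path, scripts_path="Scripts",
                       processed_data_path="Output/ProcessedData"):
    """Build the placeholder table for one script (hypothetical helper)."""
    # Assumption: scripts live at <scripts_path>/<project>/<subfolder>/<file>.
    relative = PurePosixPath(script_path).relative_to(scripts_path)
    project, subfolder = relative.parts[0], relative.parts[1]
    return {
        "wbPD": processed_data_path,
        "wbP": project,
        "wbPP": subfolder,
        "wbPD_P": f"{processed_data_path}/{project}",
        "wbPD_PP": f"{processed_data_path}/{project}/{subfolder}",
    }


def expand(template, values):
    """Substitute {name} placeholders in a path taken from the header."""
    return template.format(**values)


values = placeholder_values("Scripts/Analysis1/020_InputOutput/script.R")
input_path = expand("Data/{wbP}/iris.RDS", values)
output_path = expand("{wbPD_P}/pca.RDS", values)
```

With these assumptions, input_path becomes Data/Analysis1/iris.RDS and output_path becomes Output/ProcessedData/Analysis1/pca.RDS. Note that str.format raises a KeyError for any other braced name, so a template that also contains Snakemake wildcards would need different handling.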

Script mapping

This advanced feature allows you to use the same script to analyse similarly structured data as part of several subprojects.

It all begins with a configuration file, scriptsMapping.wb, in the root directory of your project. There, you put a YAML list of YAML-formatted dictionaries with two keys:

src: A YAML list of file paths to create links from.
dst: A YAML list of directory paths to put the file links into.

Running snakemake mapScripts then creates symbolic links to all the ‘src’ files in each of the ‘dst’ directories.

Below is an example of a proper scriptsMapping.wb file:

- src:
  - _Template/preprocessData.R
  - _Template/PCAoutliers.R
  dst:
  - Principal_Analysis/allIntensities
  - Principal_Analysis/withoutFamilies
  - Principal_Analysis/withoutReplicates
  - Principal_Analysis/withoutReplicatesAndFamilies

Here, we map two scripts, preprocessData.R and PCAoutliers.R, into each of the four subprojects of Principal_Analysis. Because placeholders are resolved relative to the location of each link, every linked copy then reads from and writes to the ProcessedData sub-directory of its own subprojects.
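The effect of the mapping can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code behind the mapScripts rule: the function map_scripts is hypothetical, the mapping is written as a Python literal instead of being read from scriptsMapping.wb, and both src and dst paths are assumed to be relative to one common root directory.

```python
import os
import tempfile


def map_scripts(mapping, root):
    """Symlink every 'src' file into every 'dst' directory under root.

    Hypothetical helper; paths are assumed to be relative to `root`.
    """
    created = []
    for entry in mapping:
        for dst in entry["dst"]:
            dst_dir = os.path.join(root, dst)
            os.makedirs(dst_dir, exist_ok=True)
            for src in entry["src"]:
                link = os.path.join(dst_dir, os.path.basename(src))
                if os.path.islink(link):
                    # Replace a stale link left over from an earlier run.
                    os.remove(link)
                os.symlink(os.path.abspath(os.path.join(root, src)), link)
                created.append(link)
    return created


# Same structure as the YAML example above, shortened to two destinations.
mapping = [{
    "src": ["_Template/preprocessData.R", "_Template/PCAoutliers.R"],
    "dst": ["Principal_Analysis/allIntensities",
            "Principal_Analysis/withoutFamilies"],
}]

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "_Template"))
    for name in ("preprocessData.R", "PCAoutliers.R"):
        with open(os.path.join(root, "_Template", name), "w") as handle:
            handle.write("# placeholder script\n")

    links = map_scripts(mapping, root)
    link_count = len(links)
    all_are_links = all(os.path.islink(link) for link in links)
```

Two source files and two destination directories give four links. Only existing symbolic links are replaced; if a regular file with the same name already exists in a destination directory, os.symlink raises FileExistsError instead of overwriting it.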

HTML Subindex

For subdirectories under the Scripts/ directory you can also create a separate HTML index file. This is particularly useful when you have a larger, more modular workflow and you want to view the results of one module as soon as they have successfully finished.

In order to create a subindex, you need to create a new rule in your Snakefile.

Note

The subdirectory path has to be within the script directory so that all HTML pages get rendered correctly.

Here is an example from the Demo project.

from wbuild.createIndex import createIndexRule, ci

subdir = "Scripts/Analysis1/010_BasicInput/"
index_name = "Analysis1_BasicInput"
input, index_file, graph_file, _ = createIndexRule(scriptsPath=subdir, index_name=index_name)

rule subIndex:
    input: input
    output:
        index = index_file,
        graph = graph_file
    run:
        # 1. create the index file
        ci(subdir, index_name)
        # 2. create the dependency graph
        shell("snakemake --rulegraph {output.index} | dot -Tsvg -Grankdir=LR > {output.graph}")

The wbuild.createIndex.createIndexRule() function takes in the relative subdirectory path and an index name, which is prepended to the index HTML file. In this example, the HTML index file is called Analysis1_BasicInput_index.html under the htmlOutputPath. The function returns a list of all HTML output files, the index file name, the dependency graph file name and the readme HTML file name.

Using this information, you can assemble your rule, where the HTML file list is the input and the index file name is the output. You need to call the wbuild.createIndex.ci() function to write the index HTML file. You should also include the instructions to generate your dependency graph file. The standard way is to use the snakemake option --rulegraph to create a graph of all dependencies of the index file. This gives you Graphviz output that you can pipe through dot into the dependency graph file that you obtained from wbuild.createIndex.createIndexRule(). Optionally, you can use the --dag option instead, which gives you the complete job graph.