Features¶
Overview¶
Day-to-day data analysis easily turns into routine work, and fully reproducible code is hard to maintain. Can you say for sure that you could redo your whole analysis given only the raw data and your code? wBuild is designed to reduce the time you spend on publishing the output of your scripts, declaring the needed input files, running Python code as part of the pipeline, using placeholders to structure your Snakemake jobs, mapping your project’s scripts together, and more.
Demo project¶
We highly recommend looking at the examples in the demo project, which cover all of these features. The demo also contains additional documentation that explains the features and how to work with them.
Command-line interface¶
The command-line interface of wBuild is only responsible for preparing a project directory to be processed by Snakemake and wBuild. There are three commands, which are also briefly documented under wbuild -h:
wbuild demo
- Run demo project.
wbuild init
- Initialize wBuild in an already existing project. This command prepares all important wrappers and files for Snakemake.
wbuild update
- To be called on an already initialized project. Updates the .wBuild directory to the latest version using the installed Python wbuild package.
All these commands should be executed from the root directory of the project.
Snakemake CLI¶
Most of the job of building your project is done by Snakemake, as explained here. There are also several special Snakemake rules that wBuild provides. The most important include:
- snakemake mapScripts
- Do script mapping
- snakemake publish
- Publish your HTML output pages to your projectWebDir
- snakemake clean
- Delete the HTML output, the generated dependencies file and the Python cache.
- snakemake restoreModDate
- Restore the previous modification dates of all files. This comes in handy after pulling changes from a VCS, which changes all modification dates.
See more about this down the page.
Parsing YAML headers¶
Below is a basic YAML header:
#'---
#' title: Basic Input Demo
#' author: Leonhard Wachutka
#' wb:
#'  input:
#'  - iris: "Data/{wbP}/iris.RDS"
#'  output:
#'  - pca: " {wbPD_P}/pca.RDS"
#'  type: script
#'---
wBuild requires you to describe each script in an RMarkdown-style YAML header.
wBuild scans this header and generates rules for Snakemake from it. The wb block is wBuild's own.
The important tags here are input and output: they are used to construct the Snakemake pipeline
and to render the script to HTML.
Tags that can be provided mainly follow the logic of Snakemake and partially that of wbuild.
Please note: YAML tags have a strict format that they should follow - e.g. there should be no tabs, only spaces! You can read more about the YAML syntax.
If you want to access information from the header of a script from within the script itself (code self-reflection), you need to source .wBuild/wBuildParser.R and call parseWBHeader() with the path to your script as an argument.
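To illustrate what "scanning the header" involves, here is a minimal Python sketch that pulls the `#'`-prefixed header lines out of an R script. It is not wBuild's own parser: the function name `extract_header_lines` is hypothetical, and it assumes the header is delimited by two `#'---` lines as in the example above.

```python
def extract_header_lines(script_text):
    """Return the header lines of an R script with the "#'" prefix removed.

    Assumption: the header starts and ends with a "#'---" line.
    Returns an empty list if no complete header is found.
    """
    header = []
    inside = False
    for line in script_text.splitlines():
        if line.strip() == "#'---":
            if inside:
                return header  # closing delimiter reached
            inside = True
            continue
        if inside and line.startswith("#'"):
            body = line[2:]
            # Drop only the single space after "#'" so YAML indentation survives.
            header.append(body[1:] if body.startswith(" ") else body)
    return []


script = """#'---
#' title: Basic Input Demo
#' wb:
#'  input:
#'  - iris: "Data/{wbP}/iris.RDS"
#'---
x <- 1
"""

header_lines = extract_header_lines(script)
print("\n".join(header_lines))
```

The resulting lines form plain YAML, which could then be handed to any YAML parser.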
Tags¶
To make working with R projects even more comfortable, there are a few additional YAML tags that wBuild provides. They are:
- input
- Specify any input files you would like to use. You can later access them from the R code using snakemake@input[[<input_file_var>]].
- output
- The same as input; accessed using snakemake@output.
- py
- This tag allows you to run Python code while the header is parsed. A good example of how helpful this feature can be is in the demo. Don’t forget the YAML pipe operator (|), which is needed for this to work!
- type
- Tag describing the type of the file. Can be script for R scripts, noindex for Markdown and empty for the rest.
The information stated under these tags is later synchronised with Snakemake.
#' wb:
#'  input:
#'  - iris: "Data/iris_downloaded.data"
#'  threads: 10
The specified threads value can then be referred to by name in the R script: snakemake@threads
Snakemake special features¶
Use the following additional option with the snakemake CLI:
- --dag
- Construct the directed acyclic graph of the current Snakemake workflow and display it as SVG.
There are also some special rules that are not executed as part of the usual workflow but can be run separately. Consult .wBuild/wBuild.snakefile in your project to find out more.
Publishing the output¶
Snakemake renders your project, including the script text and its outputs, to a neatly viewable structure of HTML files. You can
specify the output path by changing the htmlOutputPath value in the configuration file found
in the root directory of your wBuild-initialized project. By default, your HTML is written to Output/html.
You can also copy your output to a web server automatically: running snakemake publish copies the whole HTML output directory
to the directory specified by the projectWebDir parameter in the configuration file.
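Conceptually, publishing is a recursive directory copy. The following Python sketch shows that idea in a throwaway temporary directory. It is not the implementation of the publish rule: the function name `publish` and the file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def publish(html_output_path, project_web_dir):
    """Copy the whole HTML output directory into the web directory."""
    # dirs_exist_ok=True (Python 3.8+) allows publishing repeatedly into the same target.
    shutil.copytree(html_output_path, project_web_dir, dirs_exist_ok=True)
    return Path(project_web_dir)


with tempfile.TemporaryDirectory() as tmp:
    html_dir = Path(tmp) / "Output" / "html"   # default htmlOutputPath
    web_dir = Path(tmp) / "webdir"             # stands in for projectWebDir
    html_dir.mkdir(parents=True)
    (html_dir / "index.html").write_text("<html></html>")

    publish(html_dir, web_dir)
    published = sorted(p.name for p in web_dir.iterdir())
    print(published)
```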
Markdown¶
No need to create a separate Markdown file to describe the analysis: with wBuild you can do it right in your script by putting #' at the beginning of a line and then using the usual Markdown syntax. The text then appears in the rendered output.
Configuration file¶
The wbuild.yaml file in the root directory of the project is the configuration file of wBuild.
In this file you can adjust various properties of the wBuild workflow:
- htmlOutputPath
- This value specifies the relative path where your HTML output will land. More precisely, it is a prefix to the output file of any Snakemake rule that is generated by wBuild. Default is Output/html.
- processedDataPath
- Relative path to the data output directory. Default is Output/ProcessedData.
- scriptsPath
- Relative path to the root Scripts directory.
- projectWebDir
- Path to the output directory for snakemake publish.
IMPORTANT: Please, do not remove any key-value pairs from it or move this file unless you know what you are doing.
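Because missing keys can break the workflow, a quick sanity check on the loaded configuration can help. The sketch below assumes the configuration has already been read into a Python dictionary. The helper `missing_keys` is hypothetical and only checks the four keys listed above, while the real file may contain more. The values for scriptsPath and projectWebDir are illustrative.

```python
# Keys documented above; the actual wbuild.yaml may define additional ones.
REQUIRED_KEYS = ("htmlOutputPath", "processedDataPath", "scriptsPath", "projectWebDir")


def missing_keys(config):
    """Return the documented keys that are absent from the configuration."""
    return [key for key in REQUIRED_KEYS if key not in config]


config = {
    "htmlOutputPath": "Output/html",              # documented default
    "processedDataPath": "Output/ProcessedData",  # documented default
    "scriptsPath": "Scripts",                     # illustrative value
    "projectWebDir": "/var/www/myproject",        # hypothetical path
}

print(missing_keys(config))                          # nothing missing
print(missing_keys({"htmlOutputPath": "Output/html"}))
```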
Placeholders¶
Placeholders let you refer to locations derived from your script's position in the project tree with a short abbreviation instead of spelling out absolute or relative paths. This is best shown in an example:
#' wb:
#'  input:
#'  - iris: "Data/{wbP}/iris.RDS"
#'  output:
#'  - pca: " {wbPD_P}/pca.RDS"
Here, we use wbP for the name of the current project (say, Analysis1) and wbPD_P for the output directory for processed data followed by the project name, say Output/ProcessedData/Analysis1.
Here is the concise list of the placeholders:
- wbPD
- <output directory for processed data>, e.g.
Output/ProcessedData
- wbP
- <current project>, e.g.
Analysis1
- wbPP
- <subfolder name>, e.g.
020_InputOutput
- wbPD_P
- <output directory for processed data>/<current project>, e.g.
Output/ProcessedData/Analysis1
- wbPD_PP
- <output directory for processed data>/<current project>/<subfolder name>, e.g.
Output/ProcessedData/Analysis1/020_InputOutput
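The list above can be read as a small substitution table. The Python sketch below derives the five values from a script path and expands them in path templates. How wBuild actually computes the values is not stated here. The sketch assumes the project is the first directory below scriptsPath and the subfolder is the second, and the function name `placeholder_values` and the script file name are hypothetical.

```python
from pathlib import PurePosixPath


def placeholder_values(script_path, scripts_path="Scripts",
                       processed_data_path="Output/ProcessedData"):
    """Build the placeholder table for one script.

    Assumption: scripts live at <scripts_path>/<project>/<subfolder>/<file>.
    """
    relative = PurePosixPath(script_path).relative_to(scripts_path)
    project, subfolder = relative.parts[0], relative.parts[1]
    return {
        "wbPD": processed_data_path,
        "wbP": project,
        "wbPP": subfolder,
        "wbPD_P": f"{processed_data_path}/{project}",
        "wbPD_PP": f"{processed_data_path}/{project}/{subfolder}",
    }


values = placeholder_values("Scripts/Analysis1/020_InputOutput/script.R")

# Expand the placeholders in path templates like those used in a header.
input_path = "Data/{wbP}/iris.RDS".format(**values)
output_path = "{wbPD_P}/pca.RDS".format(**values)
print(input_path)   # Data/Analysis1/iris.RDS
print(output_path)  # Output/ProcessedData/Analysis1/pca.RDS
```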
Script mapping¶
This advanced feature allows you to use the same script to analyse similarly structured data in several subprojects.
It all begins with a configuration file, scriptsMapping.wb, in the root directory of your project. There, you put a YAML list of dictionaries, each with two keys:
- src
- A YAML list of file paths to create links from.
- dst
- A YAML list of directory paths to put the file links into.
Running snakemake mapScripts then creates symbolic links to all the ‘src’ files in each of the ‘dst’ directories.
Below is an example of a proper scriptsMapping.wb file:
- src:
    - _Template/preprocessData.R
    - _Template/PCAoutliers.R
  dst:
    - Principal_Analysis/allIntensities
    - Principal_Analysis/withoutFamilies
    - Principal_Analysis/withoutReplicates
    - Principal_Analysis/withoutReplicatesAndFamilies
Here, we map two scripts, preprocessData.R and PCAoutliers.R, into each of the four subprojects of Principal_Analysis. Placeholders then resolve to the right ProcessedData subdirectories, based on the current subproject.
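The linking step can be pictured as a pair of nested loops over the mapping. The following Python sketch reproduces that idea in a temporary directory. It is not the code behind snakemake mapScripts: the function name `map_scripts` is hypothetical, and it assumes all paths are resolved relative to a single root directory.

```python
import os
import tempfile
from pathlib import Path


def map_scripts(mapping, root):
    """Link every 'src' file into every 'dst' directory of each mapping entry."""
    root = Path(root)
    links = []
    for entry in mapping:
        for dst in entry["dst"]:
            dst_dir = root / dst
            dst_dir.mkdir(parents=True, exist_ok=True)
            for src in entry["src"]:
                link = dst_dir / Path(src).name
                # Create the link only if nothing is there yet, so reruns are harmless.
                if not link.is_symlink() and not link.exists():
                    os.symlink((root / src).resolve(), link)
                links.append(link)
    return links


# Same structure as a parsed scriptsMapping.wb, shortened to one script and two targets.
mapping = [{
    "src": ["_Template/preprocessData.R"],
    "dst": ["Principal_Analysis/allIntensities", "Principal_Analysis/withoutFamilies"],
}]

with tempfile.TemporaryDirectory() as tmp:
    template_dir = Path(tmp) / "_Template"
    template_dir.mkdir()
    (template_dir / "preprocessData.R").write_text("# placeholder script\n")

    links = map_scripts(mapping, tmp)
    link_names = [link.relative_to(tmp).as_posix() for link in links]
    all_symlinks = all(link.is_symlink() for link in links)
    print(link_names)
```

Creating symbolic links may require extra privileges on some platforms, such as Windows.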
HTML Subindex¶
For subdirectories under the Scripts/ directory you can also create a separate HTML index file.
This is particularly useful when you have a larger, more modular workflow and want to view the results of one module as soon as it has finished successfully.
To create a subindex, you need to add a new rule to your Snakefile.
Note
The subdirectory path has to be within the script directory so that all HTML pages get rendered correctly.
Here is an example from the demo project:
from wbuild.createIndex import createIndexRule, ci

subdir = "Scripts/Analysis1/010_BasicInput/"
index_name = "Analysis1_BasicInput"
input, index_file, graph_file, _ = createIndexRule(scriptsPath=subdir, index_name=index_name)

rule subIndex:
    input: input
    output:
        index = index_file,
        graph = graph_file
    run:
        # 1. create the index file
        ci(subdir, index_name)
        # 2. create the dependency graph
        shell("snakemake --rulegraph {output.index} | dot -Tsvg -Grankdir=LR > {output.graph}")
The wbuild.createIndex.createIndexRule() function takes the relative subdirectory path and an index name, which is prepended to the name of the index HTML file.
In this example, the HTML index file is called Analysis1_BasicInput_index.html and is located under the htmlOutputPath.
The function returns a list of all HTML output files, the index file name, the dependency graph file name and the readme HTML file name.
Using this information, you can assemble your rule, where the HTML file list is the input and the index file name is the output.
You need to call the wbuild.createIndex.ci() function to write the index HTML file.
You should also include the instructions to generate your dependency graph file.
The standard way is to use the snakemake option --rulegraph to create a graph of all dependencies of the index file.
This gives you graphviz output that you can pipe through dot into the dependency graph file whose name you obtained from wbuild.createIndex.createIndexRule().
Optionally, you can also use the --dag option, which gives you the complete job graph.