Command Program
Please read the Quick Start section on the Home page first.
Pipelines are built with multiple Programs. Program is the abstract type that contains CmdProgram and JuliaProgram.
A CmdProgram contains a command template, lists of dependencies/inputs/outputs, and predefined methods to prepare the run-time environment, validate dependencies/inputs/outputs, and so on.
An Example
We will go through an example to illustrate how to write a CmdProgram.
The example is a robust Bowtie2 mapping program. It shows every feature of CmdProgram, but in practice we rarely need that much validation.
Define the Name
program_bowtie2 = CmdProgram(
    name = "Bowtie2 Mapping",  # The name of CmdProgram
    id_file = ".bowtie2"       # When the job completes, a file ".bowtie2.xxxxxx" is
                               # created to mark the job as finished and
                               # avoid re-running it.
)
Main Command (Required)
As a Bash script, the main code of the example is:
REF=/path/to/reference_genome
FASTQ=/path/to/input_fastq_file
BAM=/path/to/output_bam_file
NTHREADS=8
bowtie2 --threads $NTHREADS -x $REF -q $FASTQ | samtools sort -O bam -o $BAM
The equivalent version using CmdProgram:
program_bowtie2 = CmdProgram(
    ...,
    inputs = [
        "FASTQ" => String,
        "REF" => "human_genome_hg38.fa" => String,
        :NTHREADS => Int => 8
    ],
    outputs = "BAM" => String,
    cmd = pipeline(`bowtie2 --threads NTHREADS -x REF -q FASTQ`, `samtools sort -O bam -o BAM`)
)
Now, the code can be run by invoking run(program_bowtie2; FASTQ = "x", REF = "y", NTHREADS = 8, BAM = "z"), but to illustrate all features, we will add more to make it robust and easy to use.
Normally an argument name should be a String. However, if the argument does not affect results (such as the number of threads), its name needs to be a Symbol. Symbol arguments are ignored when generating the unique run IDs that are used to avoid re-running a program. Arguments will be converted to Arg objects.
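To see why a Symbol-named argument leaves the run ID unchanged, here is a minimal sketch of the idea in plain Julia. It is not the Pipelines.jl implementation: the helper sketch_run_id and the hashing scheme are assumptions made up for illustration.

```julia
# Hypothetical helper for illustration only; not part of Pipelines.jl.
# Builds an ID from String-named arguments and skips Symbol-named ones.
function sketch_run_id(args::Dict)
    kept = [k => string(v) for (k, v) in args if k isa String]
    sort!(kept; by = first)  # make the ID independent of Dict iteration order
    return string(hash(kept); base = 16)
end

id_8_threads = sketch_run_id(Dict{Any,Any}("FASTQ" => "x.fq", "REF" => "hg38.fa", :NTHREADS => 8))
id_2_threads = sketch_run_id(Dict{Any,Any}("FASTQ" => "x.fq", "REF" => "hg38.fa", :NTHREADS => 2))
id_other_ref = sketch_run_id(Dict{Any,Any}("FASTQ" => "x.fq", "REF" => "mm10.fa", :NTHREADS => 8))

println(id_8_threads == id_2_threads)  # the thread count does not change the ID
println(id_8_threads == id_other_ref)  # a different reference does
```

In this sketch, changing NTHREADS keeps the same ID, so a finished job would not be repeated, while changing REF produces a new one.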
Command Dependency (Robustness↑)
We use samtools and bowtie2 (two bioinformatics programs) as command dependencies. They can be wrapped in CmdDependency.
Infer Outputs (Convenience↑)
We can set a default method to generate outputs::Dict{String} from inputs, which allows us to run the program without specifying outputs. Elements in inputs can be directly used as variables.
program_bowtie2 = CmdProgram(
    ...,
    outputs = "BAM" => String,
    infer_outputs = quote
        Dict("BAM" => FASTQ * ".bam")
    end,
    ...
)
A quote is a piece of code of the type Expr (expression). See details in quote_expr.
Or, the following does the same job:
program_bowtie2 = CmdProgram(
    ...,
    outputs = "BAM" => "<FASTQ>.bam" => String,
    ...
)
- replaceext: replaces the file extension.
- removeext: removes the file extension.
- to_str: converts most types to String.
- to_cmd: converts most types to Cmd.
More details are in the API/Utils page.
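As background for the quote passed to infer_outputs, the following sketch uses only Base Julia to show that a quote block is stored as an Expr and does nothing until it is evaluated. How Pipelines.jl turns the expression into a function internally is not shown here; the wrapping into infer_fn is an assumption for illustration.

```julia
# A quote block is stored as an Expr; it is not executed when defined,
# so FASTQ does not need to exist yet.
infer_expr = quote
    Dict("BAM" => FASTQ * ".bam")
end
println(typeof(infer_expr))

# Illustration only: wrap the expression in a function that takes FASTQ.
infer_fn = eval(:(FASTQ -> $infer_expr))

# invokelatest avoids world-age issues when calling a freshly eval'ed function.
inferred = Base.invokelatest(infer_fn, "sample.fastq")
println(inferred["BAM"])
```

This is why variables such as FASTQ can appear in the quote before any value is known: they are only looked up when the expression is finally run with real inputs.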
Validate Inputs (Robustness↑)
To make the code robust, we can check whether the inputs exist by using validate_inputs. Elements in inputs can be directly used as variables.
program_bowtie2 = CmdProgram(
    ...,
    inputs = [
        "FASTQ" => String,
        "REF" => "human_genome_hg38.fa" => String
    ],
    validate_inputs = quote
        check_dependency_file(FASTQ) && check_dependency_file(REF)
    end,
    ...
)
Prerequisites (Robustness↑)
We may also need to prepare something (prerequisites) before running the main command, for example creating the output directory if it does not exist. Elements in inputs and outputs can be directly used as variables.
program_bowtie2 = CmdProgram(
    ...,
    prerequisites = quote
        mkpath(dirname(BAM))
    end
)
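The effect of mkpath(dirname(BAM)) can be checked on its own with Base Julia. The sketch below uses a temporary directory and a made-up output path; neither comes from the example above.

```julia
tmp = mktempdir()
BAM = joinpath(tmp, "results", "sample1", "out.bam")  # hypothetical output path

mkpath(dirname(BAM))  # creates results/sample1 and any missing parent directories
mkpath(dirname(BAM))  # calling it again is safe when the directory already exists

dir_created = isdir(dirname(BAM))
bam_exists = isfile(BAM)  # mkpath creates directories only, not the file itself
println((dir_created, bam_exists))
```

Because mkpath does not fail on an existing directory, the prerequisite can run on every invocation without extra checks.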
Validate Outputs (Robustness↑)
After running the main command, we can validate outputs by using validate_outputs. Elements in outputs can be directly used as variables.
program_bowtie2 = CmdProgram(
    ...,
    outputs = "BAM",
    validate_outputs = quote
        check_dependency_file(BAM)
    end,
    ...
)
Wrap Up (Convenience↑)
After validating outputs, we may also do something to wrap up, such as removing temporary files. Here, we build an index for the output BAM file. Elements in inputs and outputs can be directly used as variables.
program_bowtie2 = CmdProgram(
    ...,
    wrap_up = quote
        run(`samtools index $BAM`)  # the dollar sign is required inside a quote, unlike in CmdProgram(; cmd = ...), where it must not be used.
    end
)
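The difference between the two forms comes from how Julia builds command objects, which can be seen without Pipelines.jl. The sketch below only constructs the commands and inspects their arguments; nothing is executed, and the file name is made up.

```julia
BAM = "sample.bam"  # hypothetical value for illustration

with_dollar = `samtools index $BAM`     # interpolates the value of the variable BAM
without_dollar = `samtools index BAM`   # keeps the literal word "BAM"

println(with_dollar.exec)
println(without_dollar.exec)
```

Inside a quote the code runs with real variables, so the value has to be interpolated with a dollar sign. In cmd = ..., the bare word is kept as a placeholder for the program to fill in at run time, which is why the dollar sign is left out there.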
The Final Code
Putting it all together, the final program looks like this:
program_bowtie2 = CmdProgram(
    name = "Bowtie2 Mapping",
    id_file = ".bowtie2",
    inputs = [
        "FASTQ" => String,
        "REF" => "human_genome_hg38.fa" => String
    ],
    outputs = ["BAM" => String],
    infer_outputs = quote
        Dict("BAM" => FASTQ * ".bam")
    end,
    validate_inputs = quote
        check_dependency_file(FASTQ) && check_dependency_file(REF)
    end,
    prerequisites = quote
        mkpath(dirname(BAM))
    end,
    validate_outputs = quote
        check_dependency_file(BAM)
    end,
    cmd = pipeline(`bowtie2 -x REF -q FASTQ`, `samtools sort -O bam -o BAM`),  # do not use the dollar sign here.
    wrap_up = quote
        run(`samtools index $BAM`)  # unlike cmd = ..., the dollar sign is required in all quotes!
    end
)
Structure
CmdProgram can be built with this method:
CmdProgram <: Program
CmdProgram(;
    name::String = "Command Program",
    id_file::String = "",
    info_before::String = "auto",
    info_after::String = "auto",
    cmd_dependencies::Vector{CmdDependency} = Vector{CmdDependency}(),
    inputs = Vector{String}(),
    validate_inputs::Expr = do_nothing,   # vars of inputs
    infer_outputs::Expr = do_nothing,     # vars of inputs
    prerequisites::Expr = do_nothing,     # vars of inputs and outputs
    cmd::Base.AbstractCmd = ``,
    outputs = Vector{String}(),
    validate_outputs::Expr = do_nothing,  # vars of outputs
    wrap_up::Expr = do_nothing,           # vars of inputs and outputs
    arg_forward = Vector{Pair{String,Symbol}}(),
    mod::Module = Pipelines               # please change to @__MODULE__
) -> CmdProgram
In this way, all preparation and post-evaluation can be wrapped in a single CmdProgram. It is easy to maintain and use.
Run
To run a Program, use this method:
success, outputs = run(
    p::Program;
    program_kwargs...,
    dir::AbstractString="",
    check_dependencies::Bool=true,
    skip_when_done::Bool=true,
    touch_run_id_file::Bool=true,
    verbose=true,
    retry::Int=0,
    dry_run::Bool=false,
    stdout=nothing,
    stderr=nothing,
    stdlog=nothing,
    append::Bool=false
) -> (success::Bool, outputs::Dict{String})
- program_kwargs... include elements in p.inputs and p.outputs.
- Other keyword arguments are related to run. Details can be found at run.
The old methods (before v0.8) still work. They take the program's arguments in inputs::Dict{String} and outputs::Dict{String}:
success, outputs = run(p::Program, inputs, outputs; run_kwargs...)
# only usable when outputs have default values.
success, outputs = run(p::Program, inputs; run_kwargs...)
Expressions (Expr) will be evaluated to functions in mod. Please use mod = @__MODULE__ to prevent precompilation failure when defining the program within a package.
Redirection and directory changes in Julia are not thread safe, so unexpected redirection and directory changes might happen if you are running programs in different Tasks or in multi-thread mode.
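The directory part of this warning can be reproduced with Base Julia alone. The sketch below is not taken from Pipelines.jl; it only shows that the working directory belongs to the whole process, so a second Task started while another one has changed directory sees that change.

```julia
original_dir = pwd()
tmp = mktempdir()

seen_by_other_task = cd(tmp) do
    # A second task started inside the cd block sees the changed directory,
    # because the working directory is shared by the whole process.
    t = @async pwd()
    fetch(t)
end

# realpath resolves symlinks so both paths are compared in canonical form.
shared = realpath(seen_by_other_task) == realpath(tmp)
restored = pwd() == original_dir
println((shared, restored))
```

If two programs with different dir values run concurrently, each can therefore end up working in the other's directory for a moment.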
Pipelines.jl is fully compatible with JobSchedulers.jl, which is a Julia-based job scheduler and workload manager inspired by Slurm and PBS.
run(::Program, ...) can be replaced by JobSchedulers.Job(::Program, ...). The latter creates a Job, and you can submit the job to the queue by using submit!(::Job).
arg_forward (an argument of CmdProgram) is used to forward user-defined inputs/outputs to specific keyword arguments of JobSchedulers.Job(::Program, ...), including name::String, user::String, ncpu::Int, mem::Int.
The explanation of arguments is in the next section.
Workflow
1. Go to the working directory and establish redirection (dir, stdout, stderr, stdlog, append).
2. Check keyword consistency: input/output keywords should be consistent between p::CmdProgram and run(p; inputs..., outputs...).
   For example, if the inputs and outputs of p::CmdProgram are defined this way:
       p = CmdProgram(..., inputs = ["I", "J"], outputs = ["K"])
   you have to provide all of I, J and K:
       run(p; I = something, J = something, K = something)
3. Print info about starting the program.
   The content can be set by p.info_before::String.
   Disable: run(..., verbose=false)
4. Check whether the program ran successfully before. If so, return (true, outputs::Dict{String}) without running it.
   How does the program know it ran before?
   a. run(..., skip_when_done=true) skips running the program if it has been done before.
   b. The run id file stores file information, which is compared to decide whether to re-run. You can use run(..., touch_run_id_file=false) to skip creating the run id file. Details of the run id file can be found at Pipelines.create_run_id_file.
   c. p.validate_outputs(outputs) runs successfully without returning false.
5. Check command dependencies (CmdDependency).
   Disable: run(..., check_dependencies=false)
   Read the Command Dependency portion for details.
6. Remove the run id file if it exists.
7. Validate inputs. (p.validate_inputs)
8. Prepare the main command.
   If you specify run(...; stdout=something, stderr=something, append::Bool), the command (cmd) will be wrapped with pipeline(cmd; stdout=something, stderr=something, append::Bool). If cmd has its own file redirection, the outer wrapper may not work as you expect.
9. Meet prerequisites. (p.prerequisites)
   It is the last code before running the main command. For example, you can create a directory if the main command cannot create it itself.
10. Run the main command.
11. Validate outputs. (p.validate_outputs)
12. Run the wrap up code. (p.wrap_up)
    It is the last code to do after-command jobs. For example, you can delete intermediate files if necessary.
13. Create the run id file if run(..., touch_run_id_file=true). Read Step 4 for details.
14. Print info about finishing the program.
    The content can be set by p.info_after::String.
    Disable: run(..., verbose=false)
    Simple info: run(..., verbose=min)
15. Return (success::Bool, outputs::Dict{String}).

run(..., dry_run=true) will return (mature_command::AbstractCmd, run_id_file::String) instead.
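The wrapping described for the main command (pipeline(cmd; stdout=..., append=...)) is a Base Julia feature and can be tried on its own. The sketch below assumes a Unix-like system with an echo executable, and the log file name is made up; it shows what append=true does when the same wrapped command runs twice.

```julia
tmp = mktempdir()
log_file = joinpath(tmp, "stdout.log")  # hypothetical log file
cmd = `echo hello`                      # assumes `echo` is available (Unix-like systems)

# Wrap the command so that its stdout is appended to a file, then run it twice.
wrapped = pipeline(cmd; stdout = log_file, append = true)
run(wrapped)
run(wrapped)

log_content = read(log_file, String)
print(log_content)
```

With append = false the second run would overwrite the file instead, so only one line would remain.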