Advanced tutorialΒΆ

Here we have a more advanced bioinformatics pipeline that adds some new concepts. This is a simple script that takes an input file and returns the file size and the number of sequencing reads in that file. This example uses a function from from the built-in NGSTk toolkit. In particular, this toolkit contains a few handy functions that make it easy for a pipeline to accept inputs of various types. So, this pipeline can count the number of reads from files in BAM format, or fastq format, or fastq.gz format. You can also use the same functions from NGSTk to develop a pipeline to do more complicated things, and handle input of any of these types.

First, grab this pipeline. Download, make it executable (chmod 755, and then run it with ./

You can grab a few small data files in the microtest repository. Run a few of these files like this:

./ -I ~/code/microtest/data/rrbs_PE_R1.fastq.gz -O $HOME -S sample1
./ -I ~/code/microtest/data/rrbs_PE_fq_R1.fastq -O $HOME -S sample2
./ -I ~/code/microtest/data/atac-seq_SE.bam -O $HOME -S sample3

This example is a documented vignette; so just read it and run it to get an idea of how things work.

#!/usr/bin/env python

Counts reads.

__author__ = "Nathan Sheffield"
__email__ = ""
__license__ = "GPL3"
__version__ = "0.1"

from argparse import ArgumentParser
import os, re
import sys
import subprocess
import yaml
import pypiper

parser = ArgumentParser(
    description="A pipeline to count the number of reads and file size. Accepts"
    " BAM, fastq, or fastq.gz files.")

# First, add standard arguments from Pypiper.
# groups="pypiper" will add all the arguments that pypiper uses,
# and adding "common" adds arguments for --input and --sample--name
# and "output_parent". You can read more about your options for standard
# arguments in the pypiper docs (section "command-line arguments")
parser = pypiper.add_pypiper_args(parser, groups=["pypiper", "common", "ngs"],
                                    args=["output-parent", "config"],
                                    required=['sample-name', 'output-parent'])

# Add any pipeline-specific arguments if you like here.

args = parser.parse_args()

if not args.input or not args.output_parent:
    raise SystemExit

if args.single_or_paired == "paired":
    args.paired_end = True
    args.paired_end = False

# args for `output_parent` and `sample_name` were added by the standard 
# `add_pypiper_args` function. 
# A good practice is to make an output folder for each sample, housed under
# the parent output folder, like this:
outfolder = os.path.abspath(os.path.join(args.output_parent, args.sample_name))

# Create a PipelineManager object and start the pipeline
pm = pypiper.PipelineManager(name="count",

# NGSTk is a "toolkit" that comes with pypiper, providing some functions
# for dealing with genome sequence data. You can read more about toolkits in the
# documentation

# Create a ngstk object
ngstk = pypiper.NGSTk(pm=pm)

raw_folder = os.path.join(outfolder, "raw/")
fastq_folder = os.path.join(outfolder, "fastq/")

# Merge/Link sample input and Fastq conversion
# These commands merge (if multiple) or link (if single) input files,
# then convert (if necessary, for bam, fastq, or gz format) files to fastq.

# We'll start with a timestamp that will provide a division for this section
# in the log file
pm.timestamp("### Merge/link and fastq conversion: ")

# Now we'll rely on 2 NGSTk functions that can handle inputs of various types
# and convert these to fastq files.

local_input_files = ngstk.merge_or_link(
                        [args.input, args.input2],

cmd, out_fastq_pre, unaligned_fastq = ngstk.input_to_fastq(

# Now we'll use another NGSTk function to grab the file size from the input files
pm.report_result("File_mb", ngstk.get_file_size(local_input_files))

# And then count the number of reads in the file

n_input_files = len(filter(bool, local_input_files))

raw_reads = sum([int(ngstk.count_reads(input_file, args.paired_end)) 
                for input_file in local_input_files]) / n_input_files

# Finally, we use the report_result() function to print the output and 
# log the key-value pair in the standard stats.tsv file
pm.report_result("Raw_reads", str(raw_reads))

# Cleanup