API

pypiper.PipelineManager

Pypiper is a Python package with two components: 1) the PipelineManager class, and 2) other toolkits (currently just NGSTk) with functions for more specific pipeline use cases. The PipelineManager class can be used to create a procedural pipeline in Python.

class pypiper.manager.PipelineManager(name, outfolder, version=None, args=None, multi=False, manual_clean=False, recover=False, fresh=False, force_follow=False, cores=1, mem='1000', config_file=None, output_parent=None, overwrite_checkpoints=False, **kwargs)[source]

Base class for instantiating a PipelineManager object, the main class of Pypiper.

Parameters:
  • name (str) – Choose a name for your pipeline; it’s used to name the output files, flags, etc.
  • outfolder (str) – Folder in which to store the results.
  • args (argparse.Namespace) – Optional args object from ArgumentParser; Pypiper will simply record these arguments from your script
  • multi (bool) – Enables running multiple pipelines in one script or for interactive use. It simply disables the tee of the output, so you won’t get output logged to a file.
  • manual_clean (bool) – Overrides the pipeline’s clean_add() manual parameters so that intermediate files are never cleaned up automatically. Useful for debugging; all cleanup files are added to the manual cleanup script.
  • recover (bool) – Specify recover mode, to overwrite lock files. If pypiper encounters a locked target, it will ignore the lock and recompute this step. Useful to restart a failed pipeline.
  • fresh (bool) – NOT IMPLEMENTED
  • force_follow (bool) – Force run all follow functions even if the preceding command is not run. By default, following functions are only run if the preceding command is run.
  • cores (int) – number of processors to use, default 1
  • mem (str) – amount of memory to use, in Mb
  • config_file (str) – path to pipeline configuration file, optional
  • output_parent (str) – path to folder in which output folder will live
  • overwrite_checkpoints (bool) – Whether to override the stage-skipping logic provided by the checkpointing system. This is useful if calls to this manager’s run() method will come from a class that implements pypiper.Pipeline, since such a class handles checkpointing logic automatically and will set this to True. That protects against a restart that begins upstream of a stage which already has a checkpoint file but which depends on the upstream stage and thus should be rerun if its parent is rerun.
Raises:

TypeError – if start or stop point(s) are provided both directly and via args namespace, or if both stopping types (exclusive/prospective and inclusive/retrospective) are provided.
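
A minimal construction sketch (hypothetical pipeline name and paths), assuming the common pattern of creating the manager at the top of a pipeline script and stopping it at the end:

    import pypiper

    # Hypothetical name and paths, for illustration only.
    pm = pypiper.PipelineManager(
        name="example_pipeline",       # used to name output files, flags, etc.
        outfolder="/path/to/results",  # where stats, log, and status flag files go
        cores=4,
        recover=True)                  # ignore stale lock files when restarting

    # ... build and run commands with pm.run(...) ...

    pm.stop_pipeline()                 # set the 'completed' flag and record stats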

atexit_register(*args)[source]

Convenience alias to register exit functions without having to import atexit in the pipeline.

callprint(cmd, shell='guess', nofail=False, container=None, lock_name=None, errmsg=None)[source]

Prints the command, executes it, and then prints the command’s memory use and return code.

Uses Python’s subprocess.Popen() to execute the given command. The shell argument is simply passed along to Popen(). Use shell=False (the default) where possible, because this enables memory profiling. Use shell=True if you require shell features like redirects (>) or pipes (|), but this will prevent the script from monitoring memory use.

cmd can also be a series (a list) of multiple commands, which will be run in succession.

Parameters:
  • cmd (str or list) – Bash command(s) to be run.
  • shell (bool) – Whether the command should be run in its own shell. Optional. Default: “guess”, which makes a best guess about whether the command should run in a shell, based on the presence of shell characters like asterisks, pipes, or output redirects. Force one behavior or the other by specifying True or False.
  • nofail (bool) – Should the pipeline bail on a nonzero return from a process? Default: False. nofail can be used to implement non-essential parts of the pipeline; if these processes fail, they will not cause the pipeline to bail out.
  • container (str) – Named Docker container in which to execute.
  • lock_name (str) – Name of the relevant lock file.
  • errmsg (str) – Message to print if there’s an error.
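
As a sketch of the shell-guessing behavior described above (hypothetical commands; in most pipelines callprint is reached indirectly through run()):

    # No shell characters: runs as a direct subprocess, so memory can be profiled.
    pm.callprint("sort /path/to/reads.txt", lock_name="sort_reads")

    # Contains a pipe, so "guess" routes it through a shell; memory is not monitored.
    pm.callprint("samtools view /path/to/reads.bam | wc -l", lock_name="count_reads")
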
checkprint(cmd, shell='guess', nofail=False, errmsg=None)[source]

Just like callprint, but checks output, so you can get a Python variable corresponding to the output of the command you call. This is equivalent to running subprocess.check_output() instead of subprocess.call().

Parameters:
  • cmd (str or list) – Bash command(s) to be run.
  • shell (bool) – Whether the command should be run in its own shell. Optional. Default: “guess” – run() will try to guess whether the command should be run in a shell (based on the presence of a pipe (|) or redirect (>)). To force a direct subprocess, set shell to False; to force a shell, set True.
  • nofail (bool) – Should the pipeline bail on a nonzero return from a process? Default: False. nofail can be used to implement non-essential parts of the pipeline; if these processes fail, they will not cause the pipeline to bail out.
  • errmsg (str) – Message to print if there’s an error.
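
For example (hypothetical command and path), capturing a command’s output as a Python value:

    # checkprint returns the command's output, like subprocess.check_output().
    wc_output = pm.checkprint("wc -l /path/to/sample.fastq")
    n_reads = int(wc_output.strip().split()[0]) // 4   # 4 FASTQ lines per read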

clean_add(regex, conditional=False, manual=False)[source]

Add files (or regexs) to a cleanup list, to delete when this pipeline completes successfully. When making a call with run that produces intermediate files that should be deleted after the pipeline completes, you flag these files for deletion with this command. Files added with clean_add will only be deleted upon success of the pipeline.

Parameters:
  • regex (str) – A unix-style regular expression that matches files to delete (can also be a file name).
  • conditional (bool) – True means the files will only be deleted if no other pipelines are currently running; otherwise they are added to a manual cleanup script called {pipeline_name}_cleanup.sh
  • manual (bool) – True means the files will just be added to a manual cleanup script.
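
A sketch of flagging intermediate output for cleanup (hypothetical commands and file names):

    # Produce an intermediate SAM file, then mark it for deletion on pipeline success.
    pm.run("bwa mem genome.fa sample.fastq > sample.sam", target="sample.sam")
    pm.clean_add("sample.sam")

    # Glob patterns work too; conditional=True defers deletion to the
    # {pipeline_name}_cleanup.sh script while other pipelines are running.
    pm.clean_add("tmp_chunk_*.bed", conditional=True)
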
complete()[source]

Stop a completely finished pipeline.

completed

Is the managed pipeline in a completed state?

Return bool: Whether the managed pipeline is in a completed state.
fail_pipeline(e, dynamic_recover=False)[source]

If the pipeline does not complete, this function will stop the pipeline gracefully. It sets the status flag to failed and skips the normal success completion procedure.

Parameters:
  • e (Exception) – Exception to raise.
  • dynamic_recover (bool) – Whether to recover e.g. for job termination.
failed

Is the managed pipeline in a failed state?

Return bool: Whether the managed pipeline is in a failed state.
flag_file_path(status=None)[source]

Create path to flag file based on indicated or current status.

Internal variables used are the pipeline name and the designated pipeline output folder path.

Parameters: status (str) – flag file type to create, defaults to current status
Return str: path to flag file of indicated or current status.
get_stat(key)[source]

Returns a stat that was previously reported. This is necessary for reporting new stats that are derived from two stats, one of which may have been reported by an earlier run. For example, if you first use report_result to report the number of trimmed reads, and then in a later stage want to report alignment rate, that second stat (alignment rate) requires knowing the first (number of trimmed reads); however, that may not have been calculated in the current pipeline run, so it must be retrieved from the stats.tsv output file. This command will retrieve such previously reported stats if they were not already calculated in the current pipeline run.

Parameters: key – key of stat to retrieve
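
Following the example in the docstring (hypothetical stat keys and values), a later stage can derive a new stat from one reported earlier, even in a previous run:

    # 'Trimmed_reads' is read back from stats.tsv if it was reported by an earlier run.
    trimmed_reads = float(pm.get_stat("Trimmed_reads"))
    aligned_reads = 41500000.0   # hypothetical value computed in the current stage
    pm.report_result("Alignment_rate", round(aligned_reads / trimmed_reads, 4))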

halt(checkpoint=None, finished=False, raise_error=True)[source]

Stop the pipeline before completion point.

Parameters:
  • checkpoint (str) – Name of stage just reached or just completed.
  • finished (bool) – Whether the indicated stage was just finished (True), or just reached (False)
  • raise_error (bool) – Whether to raise an exception to truly halt execution.
halted

Is the managed pipeline in a paused/halted state?

Return bool: Whether the managed pipeline is in a paused/halted state.
has_exit_status

Has the managed pipeline been safely stopped?

Return bool: Whether the managed pipeline’s status indicates that it has been safely stopped.
is_running

Is the managed pipeline running?

Return bool: Whether the managed pipeline is running.
make_sure_path_exists(path)[source]

Creates all directories in a path if they do not exist.

Parameters: path (str) – Path to create.
Raises: Exception – if the path creation attempt hits an error with a code indicating a cause other than pre-existence.
report_figure(key, filename, annotation=None)[source]

Writes a string to self.pipeline_figures_file.

Parameters:
  • key (str) – name (key) of the figure
  • filename (str) – relative path to the file (relative to parent output dir)
  • annotation (str) – By default, the figures will be annotated with the pipeline name, so you can tell which pipeline records which figures. If you want, you can change this.
report_result(key, value, annotation=None)[source]

Writes a string to self.pipeline_stats_file.

Parameters:
  • key (str) – name (key) of the stat
  • value – value of the stat to report
  • annotation (str) – By default, the stats will be annotated with the pipeline name, so you can tell which pipeline records which stats. If you want, you can change this; use annotation=’shared’ if you need the stat to be used by another pipeline (using get_stat()).
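
A short sketch of reporting stats (hypothetical keys and values), including a shared stat that another pipeline can later read with get_stat():

    pm.report_result("Raw_reads", 52000000)
    # annotation='shared' makes the stat usable by other pipelines via get_stat().
    pm.report_result("Genome", "hg38", annotation="shared")
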
run(cmd, target=None, lock_name=None, shell='guess', nofail=False, errmsg=None, clean=False, follow=None, container=None)[source]

The primary workhorse function of PipelineManager, this runs a command.

This is the command execution function, which enforces race-free file locking, enables restartability, and allows multiple pipelines to produce/use the same files. The function will wait for the file lock if it exists, and (by default) will not produce new output if the target output file already exists. If the output is to be created, it will first create a lock file to prevent other calls to run (for example, in parallel pipelines) from touching the file while it is being created. It also records the memory use of the process and provides some logging output.

Parameters:
  • cmd (str or list) – Shell command(s) to be run.
  • target (str or None) – Output file to be produced. Optional.
  • lock_name (str or None) – Name of lock file. Optional.
  • shell (bool) – Whether the command should be run in its own shell. Optional. Default: “guess” – run() will try to determine whether the command requires a shell.
  • nofail (bool) – Whether the pipeline should proceed past a nonzero return from a process; default False. nofail can be used to implement non-essential parts of the pipeline; if a ‘nofail’ command fails, the pipeline is free to continue execution.
  • errmsg (str) – Message to print if there’s an error.
  • clean (bool) – True means the target file will be automatically added to an auto cleanup list. Optional.
  • follow (callable) – Function to call after executing (each) command.
  • container (str) – Name for Docker container in which to run commands.
Returns:

Return code of process. If a list of commands is passed, this is the maximum of all return codes for all commands.

Return type:

int
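
A sketch of the typical run() pattern (hypothetical commands and paths): the target prevents recomputation on restart, and the follow function runs only when the command actually executes.

    import os

    unsorted_bam = "/path/to/sample.bam"          # hypothetical input
    sorted_bam = "/path/to/sample.sorted.bam"     # hypothetical target

    def report_size():
        # Hypothetical follow function: record the output size once the command has run.
        pm.report_result("Sorted_bam_bytes", os.path.getsize(sorted_bam))

    # Skipped automatically if sorted_bam already exists; lock files guard
    # against parallel pipelines writing the same target.
    pm.run("samtools sort -o {} {}".format(sorted_bam, unsorted_bam),
           target=sorted_bam,
           follow=report_size)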

set_status_flag(status)[source]

Configure state and files on disk to match current processing status.

Parameters: status (str) – Name of new status designation for pipeline.
start_pipeline(args=None, multi=False)[source]

Initialize the pipeline: do some setup, like tee-ing the output, printing diagnostics, and creating temp files. You provide only the output directory (used for pipeline stats, log, and status flag files).

stop_pipeline(status='completed')[source]

Terminate the pipeline.

This is the “healthy”, normal pipeline completion function, to be run by the pipeline at the end of the script. It sets the status flag to completed and records some time and memory statistics in the log file.

time_elapsed(time_since)[source]

Returns the number of seconds that have elapsed since the time_since parameter.

Parameters: time_since (float) – Time as a float given by time.time().
timestamp(message='', checkpoint=None, finished=False, raise_error=True)[source]

Print message, time, and time elapsed, perhaps creating checkpoint.

This prints your given message, along with the current time, and time elapsed since the previous timestamp() call. If you specify a HEADING by beginning the message with “###”, it surrounds the message with newlines for easier readability in the log file. If a checkpoint is designated, an empty file is created corresponding to the name given. Depending on how this manager’s been configured, the value of the checkpoint, and whether this timestamp indicates initiation or completion of a group of pipeline steps, this call may stop the pipeline’s execution.

Parameters:
  • message (str) – Message to timestamp.
  • checkpoint (str, optional) – Name of checkpoint; this tends to be something that reflects the processing logic about to be or having just been completed. Provision of an argument to this parameter means that a checkpoint file will be created, facilitating arbitrary starting and stopping point for the pipeline as desired.
  • finished (bool, default False) – Whether this call represents the completion of a conceptual unit of a pipeline’s processing
  • raise_error – Whether to raise an exception if the checkpoint or current state indicates that a halt should occur.
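
A sketch (hypothetical stage name) of using timestamp() both for log headings and for checkpoints:

    pm.timestamp("### Alignment")   # '###' prefix formats the message as a log heading

    pm.timestamp("Starting alignment", checkpoint="align")
    # ... run the alignment commands here with pm.run(...) ...
    pm.timestamp("Finished alignment", checkpoint="align", finished=True)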

pypiper.NGSTk

class pypiper.ngstk.NGSTk(config_file=None, pm=None)[source]

Class to hold functions to build command strings used during pipeline runs. Object can be instantiated with a string of a path to a yaml pipeline config file. Since NGSTk inherits from AttributeDict, the passed config file and its elements will be accessible through the NGSTk object as attributes under config (e.g. NGSTk.tools.java). In case no config_file argument is passed, all commands will be returned assuming the tool is in the user’s $PATH.

Parameters:
  • config_file (str) – Path to pipeline yaml config file (optional).
  • pm (pypiper.PipelineManager) – A PipelineManager with which to associate this toolkit instance; that is, essentially a source from which to grab paths to tools, resources, etc.
Example:

    from pypiper.ngstk import NGSTk as tk
    tk = NGSTk()
    tk.samtools_index("sample.bam")
    # returns: samtools index sample.bam

    # Using a configuration file (custom executable location):
    from pypiper.ngstk import NGSTk
    tk = NGSTk("pipeline_config_file.yaml")
    tk.samtools_index("sample.bam")
    # returns: /home/.local/samtools/bin/samtools index sample.bam

bam2fastq(input_bam, output_fastq, output_fastq2=None, unpaired_fastq=None)[source]

Create command to convert BAM(s) to FASTQ(s).

Parameters:
  • input_bam (str) – Path to sequencing reads file to convert
  • output_fastq – Path to FASTQ to write
  • output_fastq2 – Path to (R2) FASTQ to write
  • unpaired_fastq – Path to unpaired FASTQ to write
Return str:

Command to convert BAM(s) to FASTQ(s)
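
Like most NGSTk methods, this builds and returns a command string rather than running anything; a typical pattern (hypothetical paths, assuming pm is a PipelineManager and tk an NGSTk instance) hands the command to run():

    cmd = tk.bam2fastq(input_bam="/path/to/sample.bam",
                       output_fastq="/path/to/sample_R1.fastq",
                       output_fastq2="/path/to/sample_R2.fastq")
    pm.run(cmd, target="/path/to/sample_R1.fastq")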

bam_conversions(bam_file, depth=True)[source]

Sort and index bam files for later use.

Parameters: depth – also calculate coverage over each position

bam_to_bigwig(input_bam, output_bigwig, genome_sizes, genome, tagmented=False, normalize=False, norm_factor=1000)[source]

Convert a BAM file to a bigWig file.

Parameters:
  • input_bam (str) – path to BAM file to convert
  • output_bigwig (str) – path to which to write file in bigwig format
  • genome_sizes (str) – path to file with chromosome size information
  • genome (str) – name of genomic assembly
  • tagmented (bool) – flag related to read-generating protocol
  • normalize (bool) – whether to normalize coverage
  • norm_factor (int) – number of bases to use for normalization
Return list[str]:
 

sequence of commands to execute

bam_to_fastq(bam_file, out_fastq_pre, paired_end)[source]

Build command to convert BAM file to FASTQ file(s) (R1/R2).

Parameters:
  • bam_file (str) – path to BAM file with sequencing reads
  • out_fastq_pre (str) – path prefix for output FASTQ file(s)
  • paired_end (bool) – whether the given file contains paired-end or single-end sequencing reads
Return str:

file conversion command, ready to run

bam_to_fastq_awk(bam_file, out_fastq_pre, paired_end)[source]

This converts a bam file to fastq files using awk. As of 2016, this is much faster than the standard way of doing this using Picard, and also much faster than the bedtools implementation; however, it does no sanity checks and assumes the reads (for paired data) are all paired (no singletons) and in the correct order.

bam_to_fastq_bedtools(bam_file, out_fastq_pre, paired_end)[source]

Converts bam to fastq; a version using bedtools.

calc_frip(input_bam, input_bed, threads=4)[source]

Calculate fraction of reads in peaks.

A file with a pool of sequencing reads and a file with peak call regions define the operation that will be performed. The thread count for samtools can be specified as well.

Parameters:
  • input_bam (str) – sequencing reads file
  • input_bed (str) – file with called peak regions
  • threads (int) – number of threads samtools may use
Returns:

fraction of reads in peaks defined in the given peaks file

Return type:

float

check_command(command)[source]

Check if command can be called.

check_fastq(input_files, output_files, paired_end)[source]

Returns a follow sanity-check function to be run after a fastq conversion. Run it following a command that will produce the fastq files.

This function will make sure any input files have the same number of reads as the output files.

check_trim(trimmed_fastq, paired_end, trimmed_fastq_R2=None, fastqc_folder=None)[source]

Build function to evaluate read trimming, and optionally run fastqc.

This is useful to construct an argument for the ‘follow’ parameter of a PipelineManager’s ‘run’ method.

Parameters:
  • trimmed_fastq (str) – Path to trimmed reads file.
  • paired_end (bool) – Whether the processing is being done with paired-end sequencing data.
  • trimmed_fastq_R2 (str) – Path to read 2 file for the paired-end case.
  • fastqc_folder – Path to folder within which to place fastqc output files; if unspecified, fastqc will not be run.
Returns:

Function to evaluate read trimming and possibly run fastqc.

Return type:

callable
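
A sketch (hypothetical paths; the trimming command itself is elided) of passing the returned callable as the follow argument of run():

    trim_cmd = "..."   # hypothetical read-trimming command built elsewhere
    follow_fn = tk.check_trim("sample_trimmed_R1.fastq", paired_end=True,
                              trimmed_fastq_R2="sample_trimmed_R2.fastq",
                              fastqc_folder="fastqc/")
    pm.run(trim_cmd, target="sample_trimmed_R1.fastq", follow=follow_fn)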

count_fail_reads(file_name, paired_end)[source]

Counts the number of reads that failed platform/vendor quality checks.

Parameters: paired_end – This parameter is ignored; samtools automatically responds correctly depending on the data in the bamfile. We leave the option here just for consistency, since all the other counting functions require the parameter. This makes it easier to swap counting functions during pipeline development.

count_flag_reads(file_name, flag, paired_end)[source]

Counts the number of reads with the specified flag.

Parameters:
  • file_name (str) – name of reads file
  • flag (str) – sam flag value to be read
  • paired_end (bool) – This parameter is ignored; samtools automatically responds correctly depending on the data in the bamfile. We leave the option here just for consistency, since all the other counting functions require the parameter. This makes it easier to swap counting functions during pipeline development.

count_lines(file_name)[source]

Uses the command-line utility wc to count the number of lines in a file.

Parameters: file_name (str) – name of file whose lines are to be counted
count_lines_zip(file_name)[source]

Uses the command-line utility wc to count the number of lines in a file. For compressed files.

Parameters: file_name – name of the (compressed) file whose lines are to be counted

count_mapped_reads(file_name, paired_end)[source]

Mapped reads are not in fastq format, so this function doesn’t need to accommodate fastq and therefore doesn’t require a paired-end parameter, because it only uses samtools view. It is therefore fine that the paired_end parameter is discarded.

Parameters:
  • file_name (str) – File for which to count mapped reads.
  • paired_end (bool) – This parameter is ignored; samtools automatically responds correctly depending on the data in the bamfile. We leave the option here just for consistency, since all the other counting functions require the parameter. This makes it easier to swap counting functions during pipeline development.
Returns: Either the return code from the samtools view command, or -1 to indicate an error state.
Return type: int
count_multimapping_reads(file_name, paired_end)[source]

Counts the number of reads that mapped to multiple locations. Warning: currently, if the alignment software includes the reads at multiple locations, this function will count those more than once. This function is for software that randomly assigns, but flags reads as multimappers.

Parameters:
  • file_name (str) – name of reads file
  • paired_end – This parameter is ignored; samtools automatically responds correctly depending on the data in the bamfile. We leave the option here just for consistency, since all the other counting functions require the parameter. This makes it easier to swap counting functions during pipeline development.

count_reads(file_name, paired_end)[source]

Count reads in a file.

Paired-end reads count as 2 in this function. For paired-end reads, this function assumes that the reads are split into 2 files, so it divides line count by 2 instead of 4. This will thus give an incorrect result if your paired-end fastq files are in only a single file (you must divide by 2 again).

Parameters:
  • file_name (str) – Name/path of file whose reads are to be counted.
  • paired_end (bool) – Whether the file contains paired-end reads.
count_unique_mapped_reads(file_name, paired_end)[source]

For a bam or sam file with paired or single-end reads, returns the number of mapped reads, counting each read only once, even if it appears mapped at multiple locations.

Parameters:
  • file_name (str) – name of reads file
  • paired_end (bool) – True/False paired end data
Returns:

Number of uniquely mapped reads.

Return type:

int

count_unique_reads(file_name, paired_end)[source]

Sometimes alignment software puts multiple locations for a single read; if you just count those reads, you will get an inaccurate count. This is _not_ the same as multimapping reads, which may or may not actually be duplicated in the bam file (depending on the alignment software). This function counts each read only once. It handles paired-end data automatically, because pairs share the same read name; in this function, a paired-end read counts as 2 reads.

count_uniquelymapping_reads(file_name, paired_end)[source]

Counts the number of reads that mapped to a unique position.

Parameters:
  • file_name (str) – name of reads file
  • paired_end (bool) – This parameter is ignored.
fastqc(file, output_dir)[source]

Create command to run fastqc on a sequencing reads file (BAM or FASTQ).

Parameters:
  • file (str) – Path to file with sequencing reads
  • output_dir (str) – Path to folder in which to place output
Return str:

Command with which to run fastqc

fastqc_rename(input_bam, output_dir, sample_name)[source]

Create pair of commands to run fastqc and organize files.

The first command returned is the one that actually runs fastqc when it’s executed; the second moves the output files to the output folder for the sample indicated.

Parameters:
  • input_bam (str) – Path to file for which to run fastqc.
  • output_dir (str) – Path to folder in which fastqc output will be written, and within which the sample’s output folder lives.
  • sample_name (str) – Sample name, which determines subfolder within output_dir for the fastqc files.
Returns:

Pair of commands, to run fastqc and then move the files to their intended destination based on sample name.

Return type:

list[str]

filter_reads(input_bam, output_bam, metrics_file, paired=False, cpus=16, Q=30)[source]

Remove duplicates, filter for mapping quality greater than Q, and remove multi-mapping reads. For paired-end reads, keep only proper pairs.

get_chrs_from_bam(file_name)[source]

Uses samtools to grab, from the header, the chromosomes contained in this bam file.

get_file_size(filenames)[source]

Get the size of all files listed in a space-separated string, in megabytes (Mb).

Parameters: filenames (str) – a space-separated string of filenames
get_frip(sample)[source]

Calculates the fraction of reads in peaks for a given sample.

Parameters: sample (pipelines.Sample) – A Sample object with the “peaks” attribute.

get_input_ext(input_file)[source]

Get the extension of the input_file. Assumes you’re using either .bam or .fastq/.fq or .fastq.gz/.fq.gz.

get_mitochondrial_reads(bam_file, output, cpus=4)[source]
get_peak_number(sample)[source]

Counts number of peaks from a sample’s peak file.

Parameters: sample (pipelines.Sample) – A Sample object with the “peaks” attribute.

get_read_type(bam_file, n=10)[source]

Gets the read type (single, paired) and read length of a bam file.

Parameters:
  • bam_file (str) – Bam file to determine read attributes.
  • n (int) – Number of lines to read from the bam file.
Returns: tuple of (read_type=string, read_length=int).
Return type: tuple

input_to_fastq(input_file, sample_name, paired_end, fastq_folder, output_file=None, multiclass=False)[source]

Builds a command to convert input file to fastq, for various inputs.

Takes either .bam, .fastq.gz, or .fastq input and returns commands that will create the .fastq file, regardless of input type. This is useful to make your pipeline easily accept any of these input types seamlessly, standardizing on fastq, which is still the most common format for adapter trimmers, etc.

It will place the output fastq file in given fastq_folder.

Parameters: input_file (str) – filename of the input you want to convert to fastq
Returns: A command (to be run with PipelineManager) that will ensure your fastq file exists.
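
A sketch (hypothetical paths) of standardizing an arbitrary input to FASTQ at the start of a pipeline:

    local_fastq = "fastq/sample1_R1.fastq"   # hypothetical standardized location
    cmd = tk.input_to_fastq(input_file="data/sample1.bam",
                            sample_name="sample1",
                            paired_end=False,
                            fastq_folder="fastq/",
                            output_file=local_fastq)
    pm.run(cmd, target=local_fastq)
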
macs2_call_peaks(treatment_bams, output_dir, sample_name, genome, control_bams=None, broad=False, paired=False, pvalue=None, qvalue=None, include_significance=None)[source]

Use MACS2 to call peaks.

Parameters:
  • treatment_bams (str | Iterable[str]) – Paths to files with data to regard as treatment.
  • output_dir (str) – Path to output folder.
  • sample_name (str) – Name for the sample involved.
  • genome (str) – Name of the genome assembly to use.
  • control_bams (str | Iterable[str]) – Paths to files with data to regard as control
  • broad (bool) – Whether to do broad peak calling.
  • paired (bool) – Whether reads are paired-end
  • pvalue (NoneType | float) – Statistical significance measure to pass as --pvalue to peak calling with MACS
  • qvalue (NoneType | float) – Statistical significance measure to pass as --qvalue to peak calling with MACS
  • include_significance (NoneType | bool) – Whether to pass a statistical significance argument to peak calling with MACS; if omitted, this will be True if the peak calling is broad or if either p-value or q-value is specified; default significance specification is a p-value of 0.001 if a significance is to be specified but no value is provided for p-value or q-value.
Returns:

Command to run.

Return type:

str
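
A sketch of calling peaks and handing the command to a PipelineManager (hypothetical paths and settings; the output file name follows the usual MACS2 naming but is an assumption here):

    cmd = tk.macs2_call_peaks(treatment_bams="mapped/sample1.filtered.bam",
                              output_dir="peaks/sample1",
                              sample_name="sample1",
                              genome="hg38",
                              qvalue=0.01)
    # Hypothetical MACS2 output name used as the restart-safe target.
    pm.run(cmd, target="peaks/sample1/sample1_peaks.narrowPeak")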

make_dir(path)[source]

Forge path to directory, creating intermediates as needed.

Parameters: path (str) – Path to create.
make_sure_path_exists(path)[source]

Alias for make_dir

merge_bams(input_bams, merged_bam, in_sorted='TRUE', tmp_dir=None)[source]

Combine multiple files into one.

The tmp_dir parameter is important because on poorly configured systems, the default can sometimes fill up.

Parameters:
  • input_bams (Iterable[str]) – Paths to files to combine
  • merged_bam (str) – Path to which to write combined result.
  • in_sorted (bool | str) – Whether the inputs are sorted
  • tmp_dir (str) – Path to temporary directory.
merge_fastq(inputs, output, run=False, remove_inputs=False)[source]

Merge FASTQ files (zipped or not) into one.

Parameters:
  • inputs (Iterable[str]) – Collection of paths to files to merge.
  • output (str) – Path to single output file.
  • run (bool) – Whether to run the command.
  • remove_inputs (bool) – Whether to remove the original input files after merging.
Return NoneType | str:
 

Null if running the command, otherwise the command itself

Raises:

ValueError – Raise ValueError if the call is such that inputs are to be deleted but command is not run.

This function standardizes various input possibilities by converting either .bam, .fastq, or .fastq.gz files into a local file; merging those if multiple files given.

Parameters:
  • input_args (list) – This is a list of arguments, each one is a class of inputs (which can in turn be a string or a list). Typically, input_args is a list with 2 elements: first a list of read1 files; second an (optional!) list of read2 files.
  • raw_folder (str) – Name/path of folder for the merge/link.
  • local_base (str) – Usually the sample name. This (plus file extension) will be the name of the local file linked (or merged) by this function.
parse_bowtie_stats(stats_file)[source]

Parses Bowtie2 stats file, returns series with values.

Parameters: stats_file (str) – Bowtie2 output file with alignment statistics.

parse_duplicate_stats(stats_file)[source]

Parses sambamba markdup output, returns series with values.

Parameters: stats_file (str) – sambamba output file with duplicate statistics.

parse_qc(qc_file)[source]

Parse phantompeakqualtools (spp) QC table and return quality metrics.

Parameters: qc_file (str) – Path to phantompeakqualtools output file, which contains sample quality measurements.
plot_atacseq_insert_sizes(bam, plot, output_csv, max_insert=1500, smallest_insert=30)[source]

Heavy inspiration from here: https://github.com/dbrg77/ATAC/blob/master/ATAC_seq_read_length_curve_fitting.ipynb

run_spp(input_bam, output, plot, cpus)[source]

Run the SPP read peak analysis tool.

Parameters:
  • input_bam (str) – Path to reads file
  • output (str) – Path to output file
  • plot (str) – Path to plot file
  • cpus (int) – Number of processors to use
Returns:

Command with which to run SPP

Return type:

str

sam_conversions(sam_file, depth=True)[source]

Convert sam files to bam files, then sort and index them for later use.

Parameters: depth – also calculate coverage over each position

samtools_index(bam_file)[source]

Index a bam file.

samtools_view(file_name, param, postpend='')[source]

Run samtools view, with flexible parameters and post-processing.

This is used internally to implement the various count_reads functions.

Parameters:
  • file_name (str) – name of the file to run samtools view on
  • param (str) – String of parameters to pass to samtools view
  • postpend (str) – String to append to the samtools command; useful to add cut, sort, wc operations to the samtools view output.
skewer(input_fastq1, output_prefix, output_fastq1, log, cpus, adapters, input_fastq2=None, output_fastq2=None)[source]

Create commands with which to run skewer.

Parameters:
  • input_fastq1 (str) – Path to input (read 1) FASTQ file
  • output_prefix (str) – Prefix for output FASTQ file names
  • output_fastq1 (str) – Path to (read 1) output FASTQ file
  • log (str) – Path to file to which to write logging information
  • cpus (int | str) – Number of processing cores to allow
  • adapters (str) – Path to file with sequencing adapters
  • input_fastq2 (str) – Path to read 2 input FASTQ file
  • output_fastq2 (str) – Path to read 2 output FASTQ file
Return list[str]:
 

Sequence of commands to run to trim reads with skewer and rename files as desired.

spp_call_peaks(treatment_bam, control_bam, treatment_name, control_name, output_dir, broad, cpus, qvalue=None)[source]

Build command for R script to call peaks with SPP.

Parameters:
  • treatment_bam (str) – Path to file with data for treatment sample.
  • control_bam (str) – Path to file with data for control sample.
  • treatment_name (str) – Name for the treatment sample.
  • control_name (str) – Name for the control sample.
  • output_dir (str) – Path to folder for output.
  • broad (str | bool) – Whether to specify broad peak calling mode.
  • cpus (int) – Number of cores the script may use.
  • qvalue (float) – FDR, as decimal value
Returns:

Command to run.

Return type:

str

validate_bam(input_bam)[source]

Wrapper for Picard’s ValidateSamFile.

Parameters: input_bam (str) – Path to file to validate.
Returns: Command to run for the validation.
Return type: str