API - class/function docs

itermae is intended to be used as a command line utility, but here’s a list of the internal functions and classes to orient y’all for debugging (and contributing to development?).

Essentially, the bin/itermae launcher CLI script reads arguments in, creates a Configuration object, tells it to configure itself with certain arguments, then tells it to start reading with the .reader() method. Internally, it creates a SeqHolder object for each input read, which handles all the sequence intermediates and outputting. There are a few other little utility functions/modules.

The below is automatically generated from the function-level docstrings:

class itermae.Configuration

This class is for configuring itermae, from YAML or CLI arguments. No arguments for initializing, it will set default values. Then you use the configuration methods.

check_reserved_name(name, reserved_names=['dummyspacer', 'input', 'id', 'description'])

This checks if the name is one of a reserved list, and raises error if so. These names are reserved for these reasons:

  • dummyspacer is so you can pop an X into your sequence as a separator delimiter for later processing

  • input is the input group, the original one

  • id is the input ID, here just as id so it’s easy to find

  • description is for mapping over the FASTQ description

Parameters

name (str) – name of group

Raises

ValueError – raised if you’re using one of the reserved names…
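As a minimal sketch of the check described above: this is a standalone stand-in, not itermae's own code. Only the signature and the reserved names come from the docs; the function body, the error message, and the return value are assumptions made for illustration.

```python
# Standalone sketch; only the signature and the reserved names come from the docs.
RESERVED_NAMES = ['dummyspacer', 'input', 'id', 'description']


def check_reserved_name(name, reserved_names=RESERVED_NAMES):
    """Raise ValueError if `name` collides with a reserved group name."""
    if name in reserved_names:
        raise ValueError(
            f"The group name '{name}' is reserved, please pick another name."
        )
    return name  # returning the name is a convenience of this sketch only


check_reserved_name('barcode')      # not reserved, passes silently
try:
    check_reserved_name('input')    # reserved, so this raises
except ValueError as error:
    print(error)
```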

close_fhs()

This is for cleaning up, and tries to close file handles at input_seqs, output_fh, failed_fh, report_fh.

config_from_args(args_copy)

Make configuration object from arguments provided. Should be the same as the config_from_yaml output, if supplied the same settings.

Parameters

args_copy (argparse object, I think) – pass in the argparse args object after collecting the startup command line arguments

Raises
  • ValueError – I failed to build the regular expression for a match

  • ValueError – The output IDs, seqs, descriptions, and filters are of unequal sizes, make them equal or only define one of each

  • ValueError – Either the supplied filter, id, seq, or description expression for a match group does not look like a python expression

config_from_file(file_path)

Tries to parse a configuration YAML file to update this configuration object. Pass in the file path as an argument. Recommend you run this config first, then config_from_args, as done in bin/itermae.

Parameters

file_path (str) – file path to configure from, expecting it to point to an appropriately formatted YAML file

Raises
  • ValueError – Failure to parse the supplied YAML

  • KeyError – You need to define a group called pattern: inside each of the list items inside of matches:

  • ValueError – Error in yaml config, you’ve repeated a group marking character to match in multiple places

  • ValueError – Error in yaml config, the pattern and marking you’ve defined are of different lengths

  • ValueError – Error in yaml config

  • KeyError – Marked group in marking: field does not have corresponding entry in marked_groups:.

  • ValueError – Either the supplied filter, id, seq, or description expression for a match group does not look like a python expression
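Two of the checks in the list above (pattern and marking of equal length, and every marking character having an entry under marked_groups:) could look roughly like the sketch below. The helper validate_match_entry is hypothetical, and so are the example pattern, marking, and error messages. This illustrates the idea and is not itermae's actual parsing code.

```python
# Hypothetical helper illustrating two of the documented checks.
def validate_match_entry(pattern, marking, marked_groups):
    """Check one entry of `matches:` the way the errors above describe."""
    if len(pattern) != len(marking):
        raise ValueError(
            "the pattern and marking you've defined are of different lengths"
        )
    for character in sorted(set(marking)):
        if character not in marked_groups:
            raise KeyError(
                f"marked group '{character}' has no entry in marked_groups"
            )
    return True


# Made-up entry: five variable bases marked 'a', then a fixed stretch marked 'B'.
pattern = 'NNNNNGTCCACGAGG'
marking = 'aaaaaBBBBBBBBBB'
validate_match_entry(pattern, marking, {'a': {}, 'B': {}})

try:
    validate_match_entry(pattern, marking[:-1], {'a': {}, 'B': {}})
except ValueError as error:
    print(error)
```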

get_input_seqs()

This calls open_input_fh() to set the input_fh attribute, then calls open_appropriate_input_format to use this and the input_format attribute to save an iterator of SeqRecords into input_seqs.

Note this is inconsistent with design of the output, will pick one or the other … later.

open_appropriate_input_format()

Uses input_format and input_fh to set iterators of SeqRecords from the appropriate inputs, in input_seqs. Tries to handle all formats known, but will try with SeqIO in case there’s one I didn’t think about.

open_input_fh()

Opens file-handle based on the configuration. Requires input to be set.

Raises

ValueError – Can’t handle gzipped inputs on STDIN.
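A rough sketch of the behaviour described for open_input_fh(), using only the standard library. The function name open_input, its arguments, and the use of the string 'STDIN' to mean standard input are assumptions for illustration. The real method reads these settings from the Configuration object instead.

```python
import gzip
import sys


# Hypothetical stand-in; the argument names are assumptions, not itermae's API.
def open_input(input_path, gzipped=False):
    """Return a text-mode handle for a file path, or sys.stdin for 'STDIN'."""
    if input_path.upper() == 'STDIN':
        if gzipped:
            # Mirrors the documented error for gzipped input on STDIN
            raise ValueError("Can't handle gzipped inputs on STDIN.")
        return sys.stdin
    if gzipped:
        return gzip.open(input_path, 'rt')  # 'rt' decodes the gzip stream to text
    return open(input_path, 'rt')


try:
    open_input('STDIN', gzipped=True)
except ValueError as error:
    print(error)
```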

open_output_fh(file_string)

Opens output file handle, which can then be written to later with a format specification.

Note this is inconsistent with design of the input, will pick one or the other … later.

Parameters

file_string (str) – file to write to, or STDOUT or STDERR

Returns

file handle for appending output

Return type

file handle returned by open()
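The output side can be sketched the same way. This assumes the special strings are matched case-insensitively and that ordinary paths are opened in append mode. Both are guesses from the description above, and open_output is a hypothetical name.

```python
import sys


# Hypothetical stand-in for the documented behaviour of open_output_fh().
def open_output(file_string):
    """Map 'STDOUT'/'STDERR' to the standard streams, else open for appending."""
    if file_string.upper() == 'STDOUT':
        return sys.stdout
    if file_string.upper() == 'STDERR':
        return sys.stderr
    return open(file_string, 'a')


handle = open_output('STDOUT')
print('>example_record', file=handle)  # a later write to the returned handle
```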

reader()

This reads inputs, calls the chop method on each one, and sorts it off to outputs. So this is called by the main function, and is mostly about handling the I/O and handing it to the chop function. Thus, this depends on the Configuration class being properly configured with all the appropriate values.

class itermae.SeqHolder(input_record, configuration)

This is the main holder of sequences, and has methods for doing matching, building contexts, filtering, etcetera. Basically there is one of these initialized per input, then each operation is done with this object, then it generates the appropriate outputs and chop actually writes them. Used in chop.

The .seqs attribute holds the sequences accessed by the matching, initialized with the input_record SeqRecord and a dummyspacer for output formatting with a separator.

Parameters
  • input_record (Bio.SeqRecord.SeqRecord) – an input SeqRecord object

  • configuration (itermae.Configuration) – the whole program’s Configuration object, with appropriate file-handles opened up and defaults set


apply_operation(match_id, input_group, regex)

This applies the given match to the SeqHolder object, and saves how it did internally.

Parameters
  • match_id (str) – what name should we call this match? This is useful for debugging reports and filtering only.

  • input_group (str) – which input group to use, by name of the group

  • regex (regex compiled regular expression object) – the regular expression to apply, complete with named groups to save for subsequent match operations

Returns

self, this is just done so it can exit early if no valid input

Return type

itermae.SeqHolder
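To illustrate how named groups from one match become inputs for the next, here is a small sketch using the standard library re module as a stand-in for the compiled regular expression object the method expects. The read, the patterns, and the group names are made up, and this version returns a bool rather than self.

```python
import re

# Sequences available for matching, keyed by group name; 'input' is the raw read.
seqs = {'input': 'TTA' + 'GGCAT' + 'GTCCACGAGGTCTCT' + 'ACGTACGT'}  # made-up read


def apply_operation(seqs, input_group, compiled_regex):
    """Search seqs[input_group]; on success store each named group in seqs."""
    if input_group not in seqs:
        return False  # no valid input for this operation, so exit early
    match = compiled_regex.search(seqs[input_group])
    if match is None:
        return False
    seqs.update(match.groupdict())
    return True


# First operation works on the original read and saves 'sample' and 'rest'.
first = re.compile(r'(?P<sample>[ACGT]{5})GTCCACGAGGTCTCT(?P<rest>[ACGT]+)')
# Second operation works on the 'rest' group captured by the first.
second = re.compile(r'(?P<barcode>[ACGT]{4})')

apply_operation(seqs, 'input', first)
apply_operation(seqs, 'rest', second)
print(seqs['sample'], seqs['barcode'])  # prints: GGCAT ACGT
```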

build_context()

This unpacks group match stats/scores into an environment that the filter can then use to … well … filter.

build_output(output_dict)

Builds the output from the SeqHolder object according to the outputs in output_dict.

Parameters

output_dict (dict) – a dictionary of outputs to form, as generated from the configuration initialization

Returns

the successfully built SeqRecord, or None if it fails

Return type

Bio.SeqRecord.SeqRecord or None

chop()

This executes the intended purpose of the SeqHolder object, and is called once. It uses the configured object to apply each match operation as best it can with the sequences it is given or can generate, then writes the outputs in the specified formats to specified places as configured.

evaluate_filter_of_output(output_dict)

This tests a user-defined filter on the ‘seq_holder’ object. This has already been compile’d, and here we just attempt to evaluate these to True, where True is passing the filter. Exceptions are blocked by using try/except so that it can fail on a single match and move on to the next match/read.

Parameters

output_dict (dict) – a dictionary of outputs to form, as generated from the configuration initialization

Returns

True if the filter passed and the output should be generated

Return type

bool
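The compile-once, evaluate-per-read pattern described above can be sketched with Python's built-in compile() and eval(). The filter expression and the names in the context dictionary are invented for this example. The real context is whatever build_context() unpacks.

```python
# Compiled once, up front; the expression and its variable names are made up.
compiled_filter = compile(
    'barcode_length == 4 and sample_errors <= 1', '<filter>', 'eval'
)


def evaluate_filter(compiled_filter, context):
    """Return True only if the filter evaluates truthy; any exception is a fail."""
    try:
        # Empty builtins keep the expression limited to names in `context`
        return bool(eval(compiled_filter, {'__builtins__': {}}, context))
    except Exception:
        # e.g. NameError when a group never matched and so is not in the context
        return False


print(evaluate_filter(compiled_filter, {'barcode_length': 4, 'sample_errors': 0}))
print(evaluate_filter(compiled_filter, {'barcode_length': 4}))  # missing name
```

Swallowing the exception is what lets a single bad read fail quietly instead of stopping the run. Evaluating expressions this way is only reasonable when the configuration comes from a trusted user.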

format_report(label, output_seq)

Formats a standard report line for the debug reporting function.

Parameters
  • label (str) – what type of report line this is, so a string describing how it went - passed? Failed?

  • output_seq (Bio.SeqRecord.SeqRecord or None) – the attempt at generating an output SeqRecord, so either one that was formed or None

Returns

the string for the report

Return type

str