This section explains the ability to clip, screen, and tag within the FAKtory software.
A Unified Framework for Clipping, Screening, and Tagging
The pipeline must always consist of an initial Input stage that imports
fragment records from the file system, an Overlap stage which computes
all overlaps between fragment sequences, and a final Assembly stage
that melds the fragments into a reconstruction of the target. Between
the Input and Overlap stages may be any number and combination of
Prescreener stages. In the design of FAKtory, we chose to develop a
single, unifying framework in which one could formulate a wide range of
criteria and recipes for clipping, screening, and tagging fragment
sequences. This framework involves a set of five types of pattern
recognizers and a small expression language for flexibly combining the
results of these recognizers.
We start by describing the
general-purpose Prescreener stage whose configuration panel presents
the user with the full power of the framework. A Prescreener stage
consists of several prescreeners, each of which can be
programmed to either cut off a 5'- or 3'-end of a fragment's sequence,
or to tag substrings of the sequence with a specifiable color and
symbolic name. The interval(s) of a fragment's sequence that will be
clipped or tagged by a prescreener are specified by an interval expression, which is essentially a pattern that matches
a set of disjoint substrings, specified as intervals of character
positions. We will describe interval expressions and the intervals they
match in a bottom-up fashion, starting with the simplest:
- An interval expression can be a recognizer.
Each recognizer matches either a single interval or a collection of
disjoint intervals of positions in the fragment sequence to which it is
being applied. There are five types of recognizers:
- An interval recognizer is just a fixed interval [I,J]. For example, [0,20] matches the first 20 bases of a sequence, and [-20,-0] matches the last 20 bases. One may also specify this in percentage terms, so [0,20]% matches the first 20% of the bases in a sequence.
- A regular expression
recognizer is a regular expression, an error tolerance, and a
designation that one wants the 3'-most, 5'-most, or all matches to the
expression. The recognizer returns an interval or intervals that match
the regular expression within the given number of errors. It is useful
for short patterns such as restriction enzyme cut sites.
- There are several signal
recognizers that match intervals of a sequence based on the
signal/noise ratio, peak-height/max-height ratio, or peak width measured in
a window of specifiable length.
- A frequency recognizer
matches intervals in which the frequency of specifiable bases
(including N's) is above or below a given level in a window of a given
length.
- An overlap recognizer matches any intervals
that overlap, within a certain match stringency, a reference sequence
from a user-specified library of such sequences. The library typically
contains things such as consensus Alu and LINE elements, and vectors
such as variants of pUC commonly used in the lab. These recognizers are
useful for tagging repeats and identifying vector sequence that needs
to be trimmed.
These base recognizers are configured in a dedicated Recognizers sub-panel of each Prescreener
panel. Each recognizer is given a
name so that it can then be referred to in an interval expression for a
prescreener.
- Any set-theoretic combination, X op Y, where X and Y are interval expressions, matches the appropriate combinations of the intervals matched by X and Y. The operator symbols used are | for union, & for intersection, and - for minus. Also !X matches the complement of X's intervals with respect to the fragment sequence.
- The expression X+c, where X is an interval expression and c is an integer constant, matches the intervals matched by X all shifted in the 3' direction by c positions. The expression X-c similarly shifts X's intervals in the 5' direction.
- The expression [X,Y] matches the interval that starts at the 5'-end of the 5'-most interval matched by X and ends at the 3'-end of the 3'-most interval matched by Y. If X doesn't match anything then the expression is equivalent to [Y,Y]; if Y doesn't match anything, to [X,X]; and if neither matches anything then the expression doesn't match anything. One may also specify open intervals at either end by replacing [ with (, and ] with ).
- The expression X ? Y : Z matches Y if X matches something and Z otherwise. Similarly, X ? Y matches Y if X matches something and doesn't match anything otherwise, and X : Z matches Z if X doesn't match anything and matches X otherwise.
- The expression X(Y) matches everything matched by X when evaluated over the substrings of the fragment's sequence matched by Y.
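The set-theoretic, shift, and span operators above can be sketched in a few lines of Python. This is only an illustration, not FAKtory's implementation: it assumes that a match set is a sorted list of disjoint half-open (start, end) pairs of 0-based positions, that shifted intervals are truncated at the ends of the fragment, and that all intervals are closed on the left. The function names are hypothetical.

```python
from typing import List, Tuple

# Assumption of this sketch: an interval is a half-open pair (start, end)
# of 0-based positions; a match set is a sorted list of disjoint pairs.
Interval = Tuple[int, int]
IntervalSet = List[Interval]


def normalize(ivs: IntervalSet) -> IntervalSet:
    """Sort, drop empty intervals, and merge overlapping or touching ones."""
    out: IntervalSet = []
    for s, e in sorted(iv for iv in ivs if iv[0] < iv[1]):
        if out and s <= out[-1][1]:
            out[-1] = (out[-1][0], max(out[-1][1], e))
        else:
            out.append((s, e))
    return out


def clip_to(ivs: IntervalSet, length: int) -> IntervalSet:
    """Restrict intervals to the fragment range [0, length)."""
    return normalize([(max(s, 0), min(e, length)) for s, e in ivs])


def union(x: IntervalSet, y: IntervalSet) -> IntervalSet:          # X | Y
    return normalize(x + y)


def intersection(x: IntervalSet, y: IntervalSet) -> IntervalSet:   # X & Y
    x, y = normalize(x), normalize(y)
    out, i, j = [], 0, 0
    while i < len(x) and j < len(y):
        s, e = max(x[i][0], y[j][0]), min(x[i][1], y[j][1])
        if s < e:
            out.append((s, e))
        # Advance whichever interval ends first.
        if x[i][1] < y[j][1]:
            i += 1
        else:
            j += 1
    return out


def complement(x: IntervalSet, length: int) -> IntervalSet:        # !X
    out, pos = [], 0
    for s, e in clip_to(x, length):
        if s > pos:
            out.append((pos, s))
        pos = e
    if pos < length:
        out.append((pos, length))
    return out


def minus(x: IntervalSet, y: IntervalSet, length: int) -> IntervalSet:  # X - Y
    return intersection(x, complement(y, length))


def shift(x: IntervalSet, c: int, length: int) -> IntervalSet:
    """X+c for c > 0 (toward the 3' end), X-c for c < 0 (toward the 5' end)."""
    return clip_to([(s + c, e + c) for s, e in x], length)


def span(x: IntervalSet, y: IntervalSet) -> IntervalSet:           # [X,Y]
    x, y = normalize(x), normalize(y)
    if not x and not y:
        return []
    x, y = (x or y), (y or x)   # fall back to [Y,Y] or [X,X]
    s, e = x[0][0], y[-1][1]
    return [(s, e)] if s < e else []
```

For instance, with X = [(0, 5), (10, 15)] and Y = [(3, 12)] on a fragment of length 20, `intersection(X, Y)` gives [(3, 5), (10, 12)] and `span(X, Y)` gives [(0, 12)].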
This simple interval expression language is sufficient to describe
quite complex clipping or tagging criteria. For example, if one wanted
to clip at the clone insertion restriction site, or at the 50th base if
such a site cannot be found because of poor signal quality in the
initial part of the read, then one can express this with the interval
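expression [ 0 , Site(Intv) : Intv ], where Intv is the interval recognizer [0,50] and Site is a regular expression recognizer for the cut site with, say, 1 mismatch allowed and optioned to return the 5'-most instance. Here Site(Intv) looks for the site only within the first 50 bases, and the fallback : Intv supplies [0,50] when no site is found.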
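To make the evaluation order of this example concrete, the following Python sketch computes the clipped interval for a single read. It is a simplification under stated assumptions: the site is matched exactly with Python's `re` module rather than with an error tolerance, intervals are half-open (start, end) pairs, and the EcoRI site GAATTC is used purely as a hypothetical cut site. The function names are illustrative and are not part of FAKtory.

```python
import re
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open (start, end), an assumption of this sketch


def site_in(seq: str, pattern: str, window: Interval) -> List[Interval]:
    """Site(Intv): the 5'-most exact match of pattern lying inside window."""
    lo, hi = window
    # Pattern.search accepts start and end positions; the match offsets it
    # reports are relative to the whole sequence, not to the window.
    m = re.compile(pattern).search(seq, lo, min(hi, len(seq)))
    return [(m.start(), m.end())] if m else []


def clip_interval(seq: str, pattern: str = "GAATTC", limit: int = 50) -> Interval:
    """Evaluate [ 0 , Site(Intv) : Intv ] with Intv = [0, limit]."""
    intv = (0, min(limit, len(seq)))
    site = site_in(seq, pattern, intv)
    chosen = site if site else [intv]   # Site(Intv) : Intv
    return (0, chosen[-1][1])           # from position 0 to the 3' end of the match


# A read with the site near the start is clipped through the site.
read_with_site = "ACGT" * 3 + "GAATTC" + "A" * 60
# A read with no site in the first 50 bases is clipped at base 50.
read_without_site = "A" * 60 + "GAATTC" + "A" * 10
```

Here `clip_interval(read_with_site)` yields (0, 18), while `clip_interval(read_without_site)` falls back to (0, 50) because the only site lies beyond the first 50 bases.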
In designing the general Prescreener stages above, we again ran up
against the problem that the desire for generality results in a
mechanism that requires significant skill to use. Often, however,
the power and concomitant complexity of the full framework is not
needed. To alleviate this problem, we set about designing simpler,
specialized interfaces called Clip, Screen, and Tag stages that are
sub-classes of prescreeners directly suited to expressing common
clipping, vector screening, and element tagging functions. We give a
quick overview of each of these special panels:
- Vector Panels:
This panel is restricted to building a set of clipping prescreeners
each of which is a single overlap recognizer. The design and layout of
the panel have been tailored for this simple subset of the prescreener
capability. In essence, the user is presented with a panel where they
select the reference vector sequences they wish to screen out.
- Tag Panels:
This panel is restricted to building a set of tagging prescreeners,
each of which is a single regular expression, frequency, or overlap
recognizer that is automatically optioned to report all matches. Like
the other panels, its design and layout are tailored to present a simple
interface to this subclass.
- Clip Panels: This panel is
restricted to building a set of 5' clipping prescreeners and a set of
3' clipping prescreeners each of which is a single interval, regular
expression, signal, or frequency recognizer that is automatically
optioned for matching the 3'- or 5'-most occurrence, respectively. The
5' clipping prescreener is guaranteed to clip from the start of the
fragment sequence to the 3' end of its recognizer's match (if any).
Symmetrically, the 3' clipping prescreener clips from the end of the
fragment to the 5' end of its recognizer's match (if any).
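The clipping rule of the Clip panels can be stated compactly in code. The sketch below is an illustration only: it assumes that the recognizers have already produced their matches as half-open (start, end) pairs, and `clip_bounds` is a hypothetical name rather than a FAKtory function.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open (start, end), an assumption of this sketch


def clip_bounds(length: int,
                five_prime_hits: List[Interval],
                three_prime_hits: List[Interval]) -> Interval:
    """Return the (start, end) range of the fragment kept after clipping."""
    # 5' clip: remove everything up to the 3' end of the 3'-most match.
    start = max((e for _, e in five_prime_hits), default=0)
    # 3' clip: remove everything from the 5' end of the 5'-most match onward.
    end = min((s for s, _ in three_prime_hits), default=length)
    # If the two clips cross, nothing of the fragment survives.
    return (start, end) if start < end else (start, start)
```

For a 60-base fragment with a 5' match at (0, 15) and a 3' match at (45, 60), `clip_bounds(60, [(0, 15)], [(45, 60)])` keeps the range (15, 45). With no matches at all the whole fragment (0, 60) is kept.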
We find that in practice these simple sub-classes suffice to express most
of the preprocessing needed on fragment sequences before computing
overlaps between them and then assembling them into contigs. Only on
occasion is the full power of interval expressions required. As a final
note, every clip, tag, and vector panel can be viewed as a general
prescreener if desired. The prescreeners therein may then be modified
using the more powerful console of the Prescreener panel. One may
always flip back to the original subclass, provided the
specification has not changed. This permits users to learn about the
Prescreener by seeing how Clip, Vector, and Tag specifications are
codified as Prescreener specifications.