Making FAKtory Support Many Sequencing Protocols
One
of the unique features of the FAKtory is its ability to automatically
generate and utilize a collection of constraints that model additional
information about the particular protocol being employed to sequence a
target DNA strand. For example, many labs sequence both ends of their
clone inserts in which case one knows that the pair of fragment
sequences are on opposite strands and should be within a certain
distance from each other in any proposed assembly. As another example,
in transposon-mapped sequencing, the two fragments primed from a given
transposon have a 4 or 5 base-pair overlap and are on opposite strands
(we say they have the opposite orientation). Moreover,
fragments from adjacent transposons are expected to overlap and again
be in the opposite orientation. The FAKII assembly kernel, upon which
the FAKtory is based, is the only software suite to provide a mechanism
for describing protocol-induced constraints on the range of possible
solutions, and to employ it, a priori, in computing its
answers. All other systems either ignore such information or tell the
users about violations of such constraints, a posteriori.
The basis of the constraint framework is that the additional protocol-dependent information can be expressed as a set of overlap, orientation, and distance constraints between pairs
of fragment sequences. While we cannot prove that every conceivable
protocol can be expressed in such a framework, all the protocols
commonly in use today can be.
The FAKtory has a Constraint Definition
panel in which a user can specify and give a symbolic name to a
constraint relationship between a pair of fragments. For example, one
might define Dual_ends to be the
constraint that a pair of fragments (1) be in the opposite orientation
and (2) have their 5' ends a distance of, say, 800 to 5000 base pairs
apart. As another example, one might define Nested_deletion to be the constraint that the 3' end of the first fragment overlap the 5' end of the second. In general, a constraint
may be any set of overlap, orientation and distance constraints that a
user might wish to simultaneously be true about the relationship
between two fragment sequences in an assembly. Overlap constraints may
be oriented, in the sense that the first fragment must be 5' of the second.
Moreover, the type of an overlap relationship can be controlled in a
detailed way, and distance relationships can be with respect to any anchor position relative to the fragments. Users must also assign a color to every constraint for later display in assemblies.
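As a rough illustration of what such a definition amounts to, the sketch below models a named constraint as a small record and checks it against two fragment placements. It is not FAKtory's internal representation: the names Placement, Constraint, satisfied_by, the field names, and the color value are all assumptions made for this example, and only the orientation and 5'-distance parts of a constraint are modeled.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Placement:
    """Hypothetical placement of one fragment in a proposed assembly."""
    five_prime: int  # coordinate of the fragment's 5' end
    forward: bool    # True if the fragment lies on the forward strand


@dataclass(frozen=True)
class Constraint:
    """Hypothetical record for a named pairwise constraint."""
    name: str
    color: str
    opposite_orientation: Optional[bool] = None    # None means unconstrained
    distance_5p: Optional[Tuple[int, int]] = None  # allowed (min, max) gap between 5' ends

    def satisfied_by(self, a: Placement, b: Placement) -> bool:
        # Orientation part: compare strands only if the constraint asks for it.
        if self.opposite_orientation is not None:
            if (a.forward != b.forward) != self.opposite_orientation:
                return False
        # Distance part: the gap between 5' ends must fall inside the range.
        if self.distance_5p is not None:
            low, high = self.distance_5p
            if not low <= abs(a.five_prime - b.five_prime) <= high:
                return False
        return True


# The Dual_ends example from the text: opposite strands, 5' ends 800 to 5000 bp apart.
dual_ends = Constraint("Dual_ends", color="red",
                       opposite_orientation=True, distance_5p=(800, 5000))
```

A pair placed on opposite strands with 5' ends 2000 base pairs apart would satisfy dual_ends, while a pair on the same strand, or one only 100 base pairs apart, would not.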
Now
that one can define different types of constraint relationships between
fragments, the problem arises as to how one designates that a
particular pair of fragments has such a constraint between them. The
FAKtory's solution is embodied in the idea of matchers that
automatically associate a constraint between two fragments on the basis
of their labels. For example, it is common practice to label sequences
obtained from opposing ends of an insert something like Aname.f and Aname.r and, if one later produces a long read of the .f-end of the insert with special chemistry or a different machine, to label that fragment something like Aname.fi. As another example, when doing nested deletion sequencing, one typically labels successive reads as, say, Dname.100, Dname.101, Dname.102,
and so on. Clearly, with such labeling schemes in place, one should be
able to build a simple pattern matching mechanism that identifies
appropriately labeled pairs of fragments and associates a given
constraint relationship between them.
Such mechanisms can be set up on the Constraint Matcher panel in the following way. One can bind any capital letter, called a variable, to a regular expression. One may then specify a binding template, which is a regular expression in which some elements may be variables between square brackets, e.g. [A]. Finally, one specifies a mating template, which is a string in which some elements may be the special forms [X], [X+i], or [X+i!]. The variable X must have been used in the binding template, i is an integer constant, and the latter two forms are permissible only if X's
regular expression matches sequences of digits. A matcher consists of a
template pair and a constraint relationship defined previously on the
Constraint Definition panel.
When it is time to infer which constraints apply to which pairs of fragments (see Section 9),
the mechanism works as follows. The binding template of a matcher is
applied to a fragment's label to see if it matches. If it does, then
the exact substring of the label matched by each variable in the
binding template is determined. A mate label is then generated by
replacing every variable reference in the mating template with the
corresponding substring from the binding match. In the case of the [X+i] and [X+i!] forms, the integer i
is added to the number denoted by the substring, and in
the latter case, leading 0's are preserved. Once the mate label has been
determined, FAKtory searches for a fragment with exactly that label. If
one is found, the matcher's constraint is associated with the pair
consisting of that fragment and the one whose label matched the binding
template.
For example, suppose A is bound to [a-zA-Z]* (matches any sequence of letters) and N is bound to [0-9]* (matches any sequence of digits). First, consider the template pair: [A].f and [A].r. The binding template matches Aname.f and the corresponding mate label is Aname.r. Next, consider the template pair: [A].[N] and [A].[N+1]. The binding template matches Dname.100 and the corresponding mate label is Dname.101. Also, the binding template matches Dname.001 and produces mate Dname.2. If one wishes to keep the leading 0's in the mate label then one should use [A].[N+1!] as the mating template, in which case the mate would be Dname.002.
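The mate-label computation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not FAKtory's implementation: the function names compile_binding, mate_label, and infer_pairs are invented here; text outside a variable reference is treated literally (a simplification, since the binding template is described as a regular expression); each variable is assumed to appear at most once in a binding template; and a numeric form applied to an empty match is assumed to yield no mate.

```python
import re

VAR_REF = re.compile(r"\[([A-Z])\]")                    # [X] in a binding template
MATE_REF = re.compile(r"\[([A-Z])(?:\+(\d+)(!)?)?\]")   # [X], [X+i] or [X+i!]


def compile_binding(template, variables):
    """Build a regex from a binding template, one named group per variable."""
    parts, pos = [], 0
    for ref in VAR_REF.finditer(template):
        parts.append(re.escape(template[pos:ref.start()]))  # literal text (assumption)
        parts.append("(?P<%s>%s)" % (ref.group(1), variables[ref.group(1)]))
        pos = ref.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("".join(parts))


def mate_label(label, binding, mating, variables):
    """Return the mate label for `label`, or None if the binding template fails."""
    match = compile_binding(binding, variables).fullmatch(label)
    if match is None:
        return None
    # Numeric forms need a digit string to add to (assumed: otherwise no mate).
    for ref in MATE_REF.finditer(mating):
        if ref.group(2) is not None and not match.group(ref.group(1)).isdigit():
            return None

    def expand(ref):
        text = match.group(ref.group(1))
        if ref.group(2) is None:                  # plain [X]: copy the substring
            return text
        value = str(int(text) + int(ref.group(2)))
        # [X+i!] pads back to the original width; [X+i] drops leading 0's.
        return value.zfill(len(text)) if ref.group(3) else value

    return MATE_REF.sub(expand, mating)


def infer_pairs(labels, binding, mating, variables):
    """List (fragment, mate) label pairs for which the mate label exists."""
    known = set(labels)
    pairs = []
    for label in labels:
        mate = mate_label(label, binding, mating, variables)
        if mate is not None and mate in known:
            pairs.append((label, mate))
    return pairs


variables = {"A": "[a-zA-Z]*", "N": "[0-9]*"}
```

With these bindings, mate_label("Dname.001", "[A].[N]", "[A].[N+1!]", variables) reproduces the Dname.002 case above, and infer_pairs applied to the labels Dname.100, Dname.101, Dname.102 with the template pair [A].[N] and [A].[N+1] pairs each read with its successor.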
While
the FAKtory's matching mechanism is not a panacea for the
problem of automatically inferring constraint relationships, it is
powerful enough to capture a wide range of existing data sets, and it is
certainly more than adequate for future projects where labeling
conventions can be chosen to permit its application.