A finite-state automaton is a device that can be in one of a finite
number of *states* . In certain conditions, it can switch to
another state. This is called a *transition* .
When the automaton starts working (when it is switched on), it can be
in one of its initial states. There is also another important subset
of states of the automaton: the final states. If the automaton is in a
final state when it stops working, it is said to *accept* its
input. The input is a sequence of symbols. The interpretation of the
symbols depends on the application; they usually represent events, or
can be interpreted as ``the event that a particular data became
available''. The symbols must come from a finite set of symbols,
called the *alphabet* . If a particular symbol in a
particular state triggers a transition from that state to another one,
that transition is *labeled* with that symbol. The labels of
transitions can contain one particular symbol that is not in the
alphabet. A transition is labeled with (not present in the
alphabet) if it can be traversed with no input symbol.

It is convenient to present automata as directed graphs. The vertices denote states. They are portrayed as small circles. The transitions form the edges - arcs with arrows pointing from the source state (the state where the transition originates) to the target state. They are labeled with symbols. Unless it is clear from the context, the initial states have short arrows that point to them from ``nowhere''. The final states are represented as two concentric circles.

**Figure 2.1:** Sample finite-state automaton

The example automaton (fig. 2.1) contains 6 states
labeled *A*, *B*, *C*, *D*, *E* and *F*. The state labeled *A* is initial. The
states *A*, *C*, and *D* are final. The automaton contains no
cycles, as it is impossible to leave a state, follow some transitions,
and return in this way to the same original state.

Starting in one of the initial states, the automaton processes a sequence of input symbols. In a given state, it checks if the next input symbol matches any of the labels of the transitions that go out of the state. If so, the automaton follows it to the target state of the transition. If is among the labels, the transition it labels is followed without reading the input. There may be more than one transition with the same label, or a transition with a matching label and a transition labeled with . In that case, the decision which path to choose is arbitrary. If there is no transition that matches the input symbol (in particular, if there is no -transition), the automaton stops and rejects the input. The input is accepted when all input is read and matched by transitions and the automaton is in a final state.

A set of sequences of symbols gathered by pursuing all paths from
initial states to the final ones is called the *
language*
accepted by the
automaton. E.g. the example automaton (fig. 2.1)
accepts the language . The symbol
, meaning an empty input, is accepted because the initial
state *A* is also the final one. *a* is accepted, because
there is a transition from *A* to *C* labeled with *a*,
and *C* is a final state. The input *bc* is accepted,
because the control in the automaton can go from *A* to *F* on
*a*, and then it can follow the -labeled transition to
*B*, and then a transition labeled *c* to *D*, which is a
final state. The sequences: {ad, ae, af, ba} are all rejected,
because *E* is not a final state. The language of the automaton
in fig. 2.1 is finite, because the automaton
contains no cycles.

An automaton that has no -labeled transitions is called
-free. An automaton that has one initial state, is
-free, and for each state there is at most one transition
labeled with the same symbol is called a *deterministic finite-state
automaton* .
The automaton from fig. 2.1 can be transformed into
the deterministic automaton from fig. 2.2.

**Figure 2.2:** Deterministic version of the sample automaton

A state is said to be *reachable*
from another state iff there is a
sequence of transitions leading to the state from that other state. An
automaton is said to be *start-useful*
iff every state is reachable from any of its
initial states. An automaton is *useful*
iff it is start-useful and for every state
there is at least one final state reachable from that state. The automaton from
the fig. 2.2 is not useful; its useful version is
presented in fig. 2.3. That version accepts the same
language.

**Figure 2.3:** Useful version of the sample deterministic automaton

An automaton is said to be *acyclic*
when it contains no cycles, i.e. it is not
possible to arrive at the same state twice when following the
transitions. All automata in figures presented in this dissertation
are acyclic.

Formally, an automaton is a 5-tuple , where Q is a set of states, is a set of initial (or starting)
states, is a set of final states, is the
alphabet , and is a partial mapping
denoting transitions. We extend the mapping to . The size of the
automaton, |*M*|, is equal to the number of states |*Q*|. We define
the language
accepted by the automaton *M* to be:

Let the mapping be the *right
language* of a state
*q* in *M* (the set of all strings over that will take one
from state *q* to any final state of M using the extended transition
relation ):

The consequence of the Myhill-Nerode theorem
([HU79]) is that among many automata that accept a
given language there is one (up to isomorphism) that has a
minimal number of states. It is called a *minimal
automaton* . The
automaton in fig. 2.3 is minimal.

The property of minimality of is defined as follows ([Wat93a], [Wat93b], [Wat95]):

We will use, however, an alternative definition of minimality of ([Wat93a], [Wat93b]):

is a property specifying that all states can be reached from the start state, and it is defined as follows:

Of particular interest for the NLP community are deterministic acyclic finite state automata (DAFSA) , because they are best suited for representing dictionaries (of various sorts). Whenever we refer to an automaton (or FSA) throughout this paper, we mean an acyclic, deterministic finite state automaton (DAFSA), unless explicitly stated otherwise. Several definitions given above can be rewritten in a simpler way: a deterministic, finite-state automaton is a 5-tuple , where Q is a set of states, is the starting state, is a set of final states, is the alphabet , and is a partial mapping denoting transitions. We extend the mapping to as in [HU79]. The definitions of the language accepted by an automaton, the right language of a state, and of the property for DAFSA become simpler:

It is possible to use alternative definition of an FSA: , where the meaning of *Q*, , *i*, and *F* is
the same as above, and *E* is a relation -
an alternative expression of the function from above. This
alternative definition may be more useful in descriptions below.

The algorithms described in this paper operate on classical
DAFSA (deterministic acyclic finite-state automata). However, our
implementation uses a different kind of automata:
*finite-state automata with final transitions (FSA-FT)*
.
We define such automaton as , where *Q* is the
finite set of states, is the alphabet , is the set of the initial states of the automaton, is a set of transitions, and
is the set of final transitions. We say that the FSA-FT accepts a word
if for each symbol from the input a transition was found, and the last
transition it traversed while recognizing the word was final.

We define a transformation of a classical finite-state
automaton into a finite-state automaton with final transitions: for every
FSA we find an FSA-FT ,
where , *i*' is the new initial state, , and .

Let to be the transitive closure of F defined by the following recursive relation:

- if then
- if then

Now we can define the language of the FSA-FT to be:

For every deterministic acyclic finite-state automaton
that does not accept empty input (written as ),
i.e. , or , we can find a deterministic acyclic finite-state automaton with
final transitions (DAFSA-FT) ,
, where . We will call this transformation *moving
markers* . It preserves the number of states and
the number of transitions. Because such a transformation exists, and
it is defined for the minimal
automata , we can write that
.

Minimal deterministic acyclic finite-state automata with final transitions
can indeed be smaller than their classical counterparts (see
fig. 2.4). The definition of
minimality for an FSA-FT is the same as for the
classical FSA, but the definition of the *right
language* of a state
in FSA-FT must change:

**Figure 2.4:** Classical minimal deterministic acyclic FSA and the
corresponding FSA with final transitions

Note that for DAFSA-FT . This is different from the
classical automata, where . That difference is very
important. It means that the recognition of a word is moved from a
final state in the classical automaton to the transition *reaching
that state* in the FSA-FT, resulting in the removal of the empty
string () from the right language of that automaton. Let be a DAFSA (a classical one), and be a DAFSA-FT obtained by the transformation
*moving markers* . We can write:

If we can find two states in the classical FSA that have the same right languages except for the empty string, we can merge the states together in a deterministic acyclic finite-state automaton with final transitions. This makes the resulting automaton smaller, i.e. it has less states and less transitions than the corresponding classical finite-state automaton.

Note that as most implementations of automata have implicit states (only transitions are represented explicitly), this new kind of automata saves space at no cost.

The automata with final transitions can be seen as Mealy's automata where the property of being final is the output label. Therefore, they can be perceived more as a notational convention than a new kind of automata. However in transducers, that property would have to be concatenated with the output labels.

Wed Jun 3 14:37:17 CEST 1998

Software at http://www.pg.gda.pl/~jandac/fsa.html