data format for fsa_morph - morphology analysis program
fsa_morph performs morphological analysis using one or more dictionaries
in the form of finite-state automata. They are built using fsa_build or
fsa_ubuild. Those programs accept data as lists of strings of characters,
one string being one line of input. It is up to fsa_morph to interpret the
data in the automaton. The strings are formatted so that the morphological
analysis can be performed. The formats depend on characteristics of the
language. There is one general format, but other formats make the resulting
automata smaller for certain classes of languages.
The general format
is word+Kending+annotations, where
- word
- is the inflected form that should be recognized and analysed when
encountered.
- +
- is the annotation separator (it can be changed to a different character
using an option of fsa_build or fsa_ubuild).
- K
- is a code that says how many characters to delete from the end of the
inflected form. A means 0, B means 1, C means 2, and so on.
- ending
- is the ending that should be appended to the inflected form after some
characters from its end have been deleted in order to obtain the base
form (the canonical form).
- +
- is the annotation separator.
- annotations
- are normally the categories of the inflected form.
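The decoding rule for the general format can be illustrated with a minimal
sketch. It is not taken from fsa_morph itself; the function name
decode_general and the sample entries are hypothetical, and the default
separator + is assumed.

```python
def decode_general(entry, sep="+"):
    """Decode one 'word+Kending+annotations' line (illustrative only)."""
    # Split into at most three fields so the annotations are kept whole.
    word, coded, annotations = entry.split(sep, 2)
    # K is the first character after the separator: A = 0, B = 1, C = 2, ...
    strip = ord(coded[0]) - ord("A")
    ending = coded[1:]
    # Delete 'strip' characters from the end, then append the ending.
    base = word[:len(word) - strip] + ending
    return word, base, annotations


# Hypothetical entries, not taken from any real dictionary.
print(decode_general("cries+Dy+verb"))
print(decode_general("walk+A+verb"))
```

In the first entry, D means that three characters are deleted from the end
of "cries", and the ending "y" is appended to what remains.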
A format without base forms can be used if fsa_morph has been compiled
with the POOR_MORPH compile option. Then you can use the -A runtime option
of fsa_morph. This format makes it possible to obtain only categories, and
not base forms, for given inflected forms. It does not contain the
information about the base form, i.e. the +Kending part, so it is only
word+annotations.
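As a rough illustration of how the two formats relate, the sketch below
drops the +Kending part from a general-format line. The function name
to_categories_only and the sample entry are hypothetical, and the default
separator + is assumed.

```python
def to_categories_only(entry, sep="+"):
    """Turn 'word+Kending+annotations' into 'word+annotations'."""
    # The middle field carries the base-form information and is discarded.
    word, _coded, annotations = entry.split(sep, 2)
    return word + sep + annotations


# Hypothetical entry, not taken from any real dictionary.
print(to_categories_only("cries+Dy+verb"))
```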
For languages with prefixes, the general format can still be used. However,
the format with coded prefixes may lead to smaller automata. The format
contains an additional character that codes the number of characters to
be deleted from the beginning of the word. The format is
word+LKending+annotations, where the items known from the general format
keep their meaning, and:
- L
- is the code, with A meaning there is nothing to be deleted at the
beginning, B meaning there is 1 character to be deleted, C meaning 2, etc.
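A minimal sketch of decoding the format with coded prefixes follows. It is
an illustration under the rules stated above, not the fsa_morph
implementation; the function name decode_prefix and the sample entry are
hypothetical.

```python
def decode_prefix(entry, sep="+"):
    """Decode one 'word+LKending+annotations' line (illustrative only)."""
    word, coded, annotations = entry.split(sep, 2)
    # L: number of characters to delete from the beginning (A = 0, B = 1, ...).
    head = ord(coded[0]) - ord("A")
    # K: number of characters to delete from the end (A = 0, B = 1, ...).
    tail = ord(coded[1]) - ord("A")
    ending = coded[2:]
    base = word[head:len(word) - tail] + ending
    return word, base, annotations


# Hypothetical entry: delete "ge" at the start and "t" at the end, append "en".
print(decode_prefix("gemacht+CBen+verb"))
```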
For languages with infixes, the general format can
still be used. However, the format with coded prefixes and infixes may
lead to smaller automata. The format contains two additional characters
that code the position of the characters to delete, counted from the
beginning of the inflected form, and how many characters to delete. The
format is word+KLMending+annotations,
where the items known from the general format keep their meaning, and
- K
- informs about the position from the beginning of the inflected form
where some characters should be deleted. A means the characters are at
the beginning of the word (it is a prefix), B means the characters to
be deleted start at the second character of the word, C - the third one,
and so on.
- L
- says how many characters to delete. A means none, B means
one, C means two, and so on.
See provided scripts: morph_infix.pl and morph_infix.awk.
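The following sketch shows one way the infix format could be decoded. It is
not derived from morph_infix.pl or morph_infix.awk. The text above does not
define M; the sketch assumes that M is the number of characters to delete
from the end of the word, coded like K in the general format. It also reads
the position code as a zero-based offset (A = 0, B = 1, and so on). The
function name decode_infix and the sample entry are hypothetical.

```python
def decode_infix(entry, sep="+"):
    """Decode one 'word+KLMending+annotations' line (illustrative only)."""
    word, coded, annotations = entry.split(sep, 2)
    # K: offset of the first character to delete (A = start of the word).
    pos = ord(coded[0]) - ord("A")
    # L: how many characters to delete at that offset (A = none).
    count = ord(coded[1]) - ord("A")
    # M: ASSUMPTION, characters to delete from the end of the word.
    tail = ord(coded[2]) - ord("A")
    ending = coded[3:]
    # Cut out the prefix or infix, trim the end, then append the ending.
    stem = word[:pos] + word[pos + count:]
    base = stem[:len(stem) - tail] + ending
    return word, base, annotations


# Hypothetical entry: remove "ge" after "auf", drop the final "t", add "en".
print(decode_infix("aufgemacht+DCBen+verb"))
```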
See also: fsa_accent(1), fsa_build(1), fsa_guess(1), fsa_guess(5),
fsa_hash(1), fsa_morph(5), fsa_prefix(1), fsa_spell(1), fsa_ubuild(1),
fsa_visual(1).