Table of Contents

NAME

data format for fsa_morph - morphology analysis program

DESCRIPTION

fsa_morph performs morphological analysis using one or more dictionaries in form of a finite-state automata. They are built using fsa_build or fsa_ubuild. Those programs accept data as lists of strings of characters, one string being one line of input. It is up to fsa_morph to interpret the data in the automaton. The strings are formated so that the morphological analysis can be performed. The formats depend on characteristics of the language. there is one general format, but other formats make resulting automata smaller for certain classes of languages.

GENERAL FORMAT

The general format is word+Kending+annotations, where
word
is the inflected form that should be recognized and analysed when encountered. + is the annotation separator (it can be changed to a different character using an option of fsa_build or fsa_ubuild).
K
is a code that says how many characters to delete from the end of the inflected form. A means 0, B means 1, C means 2, and so on.
is the ending that should be appended to the inflected form after some
characters from its end have been deleted in order to obtain the base form (the canonical form).
is the annotation separator.
are normally the categories of the inflected form.

IMPOVERISHED FORMAT

This format can be used if fsa_morph has been compiled with POOR_MORPH compile option. Then you can use -A runtime option of fsa_morph. This format makes it possible to obtain only categories, and not base forms, for given inflected forms. It does not contain the information about the base form, i.e. the +Kending part, so it is only word+annotations

FORMAT WITH ENCODED PREFIXES

For languages with prefixes, the general format can still be used. However, the format with coded prefixes may lead to smaller automata. The format contains an additional character that codes the number of characters to be deleted from the beginning of the word. The format is: word+LKending+annotations, where the items known form the general format keep their meaning, and:
L
is the code, with A meaning there is nothing to be deleted at the beginning, B meaning there is 1 character to be deleted, C - 2, etc.

FORMAT WITH ENCODED PREFIXES AND INFIXES

For languages with infixes, the general format can still be used. However, the format with coded prefixes and infixes may lead to smaller automata. The format contains two additional characters that code how many characters to delete, and the position of those characters at the beginning of the inflected form. The format is word+KLMending+annotations, where the items known form the general format keep their meaning, and
K
informs about the position from the beginning of the inflected form where some characters should be deleted. A means the characters are at the beginning of the word (it is a prefix), B means the characters to be deleted start at the second character of the word, C - the second one, and so on.
L
says how many characters to delete. A means none, B means one C means 3, and so on.

See provided scripts: morph_infix.pl and morph_infix.awk.

SEE ALSO

fsa_accent(1) , fsa_build(1) fsa_guess(1) , fsa_guess(5) , fsa_hash(1) , fsa_morph(5) , fsa_prefix(1) , fsa_spell(1) , fsa_ubuild(1) , fsa_visual(1) .


Table of Contents