If the automata are to be used to predict only categories, each line of the input to fsa_build should contain the inverted word, with the beginning of the word (its end once inverted) marked with the filler character, followed by an annotation separator and the tags. See the prep_atg.awk script available in the package. To handle such dictionaries, fsa_guess should either not be compiled with the GUESS_LEXEMES compile option, or it should be called with the -p and -g options. The standard file name extension for dictionaries prepared in this way is atg.
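As a rough illustration of the categories-only line format described above, the following Python sketch builds one input line for fsa_build. The function name atg_line, the tag string "Vpast", and the defaults '_' for the filler character and '+' for the annotation separator are assumptions made for the example; prep_atg.awk remains the reference for the real preparation step.

```python
def atg_line(word: str, tags: str, filler: str = "_", sep: str = "+") -> str:
    """Build one categories-only input line (hypothetical helper).

    The filler and separator defaults are assumptions; use whatever
    characters your dictionary is actually built with.
    """
    # Invert the word so that its ending comes first.
    inverted = word[::-1]
    # The filler marks the beginning of the original word,
    # which is the end of the inverted one.
    return f"{inverted}{filler}{sep}{tags}"


print(atg_line("walked", "Vpast"))  # deklaw_+Vpast
```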
If fsa_guess is also to guess lexemes, it must be compiled with the GUESS_LEXEMES compile option, and each line of the input to fsa_build must contain: the inflected form, an annotation separator, a code, the lexeme ending, an annotation separator, and the tags (annotations). The code specifies how many characters must be deleted from the end of the inflected form before the lexeme ending is appended to obtain the lexeme. The code is a single character. To calculate the number, take its character code and subtract 65 (the character code of 'A'), so 'A' means 0, 'B' means 1, and so on. See the prep_atl.awk script available in the package. The standard file name extension for automata prepared in this way is atl.
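The arithmetic behind the one-character code can be sketched in Python as follows. The helper names (encode_count, decode_count, make_code, apply_code) are hypothetical and not part of the package, and deriving the code from the longest common prefix of the two forms is an assumption about how such data is usually prepared; prep_atl.awk is the authoritative script.

```python
def encode_count(n: int) -> str:
    """Encode a deletion count as one character: 0 -> 'A', 1 -> 'B', ..."""
    return chr(ord("A") + n)


def decode_count(code: str) -> int:
    """Decode the character back to a count (character code minus 65)."""
    return ord(code) - 65


def make_code(inflected: str, lexeme: str) -> tuple[str, str]:
    """Return (code, lexeme ending) for an inflected form and its lexeme.

    Assumption: the shared part is the longest common prefix.
    """
    common = 0
    for a, b in zip(inflected, lexeme):
        if a != b:
            break
        common += 1
    return encode_count(len(inflected) - common), lexeme[common:]


def apply_code(inflected: str, code: str, ending: str) -> str:
    """Strip the coded number of characters, then append the lexeme ending."""
    n = decode_count(code)
    return inflected[: len(inflected) - n] + ending


code, ending = make_code("walked", "walk")
print(code, repr(ending))                   # C ''
print(apply_code("walked", code, ending))   # walk
```

Slicing with len(inflected) - n rather than a negative index keeps the case n = 0 (code 'A') correct.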
To make fsa_guess take into account information contained in prefixes, it must be compiled with GUESS_PREFIX. In data lines for fsa_build, the first annotation separator is replaced by two annotation separators for entries that do not contain prefixes. For entries that do, the prefix is deleted from the inverted inflected form, leaving the filler character, and placed between the two annotation separators. The prefix is stored as is, i.e. not inverted. The standard file name extension for automata prepared in this way is atp.
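The prefix handling can be illustrated with a small Python sketch that builds only the leading part of a data line, up to and including the second annotation separator; the code, lexeme ending, and tags would follow it. The function name atp_head, the example word, and the defaults '_' (filler) and '+' (annotation separator) are assumptions made for the example.

```python
def atp_head(word: str, prefix: str = "", filler: str = "_", sep: str = "+") -> str:
    """Build the start of a prefix-aware data line (hypothetical helper).

    With no prefix, the two separators are adjacent; otherwise the
    prefix sits between them, stored as is (not inverted).
    """
    if not word.startswith(prefix):
        raise ValueError(f"{word!r} does not start with prefix {prefix!r}")
    # Remove the prefix, then invert what is left of the inflected form.
    stem = word[len(prefix):]
    return f"{stem[::-1]}{filler}{sep}{prefix}{sep}"


print(atp_head("macht"))           # thcam_++
print(atp_head("gemacht", "ge"))   # thcam_+ge+
```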
If fsa_guess is to predict morphological descriptions for mmorph, it must be compiled with GUESS_MMORPH. To see whether fsa_guess was compiled with that option, invoke it with -v. Each line of the input to fsa_build has the format: inverted_+K1e1+K2K3K4e2+a1+categories, where the fields have the following meaning.
inverted is the inverted inflected form.
K1 is a character code describing how many characters should be deleted from the end of the inflected form (coded as explained above) in order to get the canonical form (possibly after appending an ending).
e1 is the ending that should be appended to the inflected form to obtain the lexeme (more precisely: the canonical or base form) after the ending of the inflected form has been stripped.
K2 is a character code describing how many characters should be deleted from the end of the canonical form (coded as explained above) in order to get the lexical form (possibly after appending an ending).
K3 is a character code indicating the position of an archphoneme in the lexical form. If no archphoneme is present, the code is 'A'. Otherwise the code is 'B' for the last character, 'C' for the penultimate one, and so on (counted after removal of the K2 characters).
K4 says how many characters the archphoneme replaces: 'A' means 0, 'B' means 1, etc. This code is present only when K3 is not 'A', i.e. when there is an archphoneme.
e2 is the ending of the lexical form. To obtain the lexical form, it should be appended to what is left of the canonical form after removing K2 characters from the end and, if needed, replacing some characters by an archphoneme.
a1 is the archphoneme (as specified in mmorph).
'+' is the annotation separator. It is stored in the header of a dictionary, and can be specified as an option to fsa_build.
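To make the field layout concrete, here is a hedged Python sketch that splits one such line into its parts and recovers the canonical form from K1 and e1. It does not attempt the archphoneme substitution itself. The function name parse_mmorph_line, the dictionary keys, the sample lines, and the assumptions that the filler is '_' and that e1, e2, and a1 never contain the separator are all choices made for the example, not part of the package.

```python
def parse_mmorph_line(line: str, filler: str = "_", sep: str = "+") -> dict:
    """Split 'inverted_+K1e1+K2K3K4e2+a1+categories' into named fields."""
    # Limit the split so that separators inside the categories survive.
    inverted, first, second, a1, categories = line.split(sep, 4)
    if inverted.endswith(filler):
        inverted = inverted[: -len(filler)]
    inflected = inverted[::-1]

    k1 = ord(first[0]) - 65          # characters to delete from the inflected form
    e1 = first[1:]                   # ending of the canonical form
    k2 = ord(second[0]) - 65         # characters to delete from the canonical form
    k3 = ord(second[1]) - 65         # 0: no archphoneme, 1: last character, ...
    if k3 == 0:
        k4, e2 = None, second[2:]    # K4 is absent when there is no archphoneme
    else:
        k4, e2 = ord(second[2]) - 65, second[3:]

    canonical = inflected[: len(inflected) - k1] + e1
    return {
        "inflected": inflected, "canonical": canonical,
        "k2": k2, "k3": k3, "k4": k4, "e2": e2,
        "archphoneme": a1, "categories": categories,
    }


# Made-up sample line without an archphoneme.
entry = parse_mmorph_line("deklaw_+C+AA++V")
print(entry["inflected"], entry["canonical"])  # walked walk
```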