The input data should be a list of inverted words with annotations. Each line should contain an inverted word (i.e. the first character should be the last character of the word, the second one the penultimate one, and so on). The inverted word should be followed immediately by a filler character and an annotation separator, and then by grammatical annotations. The annotations specify some morphosyntactic properties of words, such as number, gender, etc.
Assuming that a file named file contains data in three columns (inflected word, canonical form, annotations), the following incantation:
awk '{s="";for(i=1;i<=length($1);i++)s=substr($1,i,1) s;printf "%s_+%s\n",s,$3;}' file | sort -u > file.idx
prepares data for the a tergo index. The incantation should be all on one line. For more detail, see the contents of the prep_atg.awk file included in the distribution. The standard file name extension for automata prepared in this way is atg.
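As a rough illustration of what the incantation above produces, the sketch below wraps the same awk program in a shell function and runs it on a few made-up lines. The function name invert_atg and the sample words and annotations are hypothetical; the awk program is spread over several lines only for readability.

```shell
# invert_atg: read "inflected canonical annotations" lines on stdin and
# print a tergo index entries: inverted word, filler "_", separator "+",
# then the annotations from the third column.
invert_atg() {
    awk '{
        s = ""
        for (i = 1; i <= length($1); i++)
            s = substr($1, i, 1) s      # prepend each character to reverse the word
        printf "%s_+%s\n", s, $3
    }' | sort -u
}

# Hypothetical sample data, not taken from any real dictionary.
printf '%s\n' 'cats cat Npl' 'walked walk Vpast' 'cats cat Npl' | invert_atg
# prints:
#   deklaw_+Vpast
#   stac_+Npl
```

The duplicate input line yields a single output line because of sort -u.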
The format is as follows:
inverted_+Kending+annotations
where:
For more detail, see the prep_atl.awk or prep_atl.pl files included in the distribution. The standard file name extension for automata prepared in this way is atl. Note that the same format can be used for lemmatization: just omit the final annotation separator and the annotations.
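The following is a minimal sketch of how lines in the inverted_+Kending+annotations format might be generated from three-column data. It is not the prep_atl.awk script from the distribution. It assumes that K is a single letter with 'A' meaning that no characters are deleted, 'B' one character, and so on, and that ending is whatever remains of the canonical form after the longest prefix it shares with the inflected form. The function name make_atl and the sample data are hypothetical.

```shell
# make_atl: read "inflected canonical annotations" lines on stdin and print
# inverted_+Kending+annotations lines.
# Assumption: K is a letter, "A" = delete 0 characters, "B" = 1, and so on,
# so at most 25 characters can be deleted in this sketch.
make_atl() {
    awk '{
        infl = $1; canon = $2
        # n = length of the longest common prefix of the two forms
        n = 0
        while (n < length(infl) && n < length(canon) && substr(infl, n + 1, 1) == substr(canon, n + 1, 1))
            n++
        # invert the inflected form
        inv = ""
        for (i = 1; i <= length(infl); i++)
            inv = substr(infl, i, 1) inv
        # letter code for the number of characters to delete from the inflected form
        k = substr("ABCDEFGHIJKLMNOPQRSTUVWXYZ", length(infl) - n + 1, 1)
        printf "%s_+%s%s+%s\n", inv, k, substr(canon, n + 1), $3
    }' | sort -u
}

# Hypothetical sample data.
printf '%s\n' 'walked walk Vpast' 'mice mouse Npl' | make_atl
# prints:
#   deklaw_+C+Vpast
#   ecim_+Douse+Npl
```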
The format is as follows:
inverted_+prefix+Kending+annotations
where:
For more detail, see provided scripts atl_prefix.awk and atl_prefix.pl.
The standard name extension for automata prepared in this way is atp.
This format is used to predict morphological descriptions for mmorph. Each line of input to fsa_build or fsa_ubuild, when building an automaton to be used by fsa_guess, takes the format:
inverted_+K1e1+K2K3K4e2+a1+categories
where:
inverted is the inverted inflected form.
K1 is a character code describing how many characters should be deleted from the end of the inflected form (coded as explained above) in order to get the canonical form (possibly followed by an ending).
e1 is the ending that should be appended to the inflected form, after the ending of the inflected form has been stripped, to obtain the lexeme (more precisely: the canonical or base form).
K2 is a character code describing how many characters should be deleted from the end of the canonical form (coded as explained above) in order to get the lexical form (possibly followed by an ending).
K3 is a character code indicating the position of an archphoneme in the lexical form. If no archphoneme is present, the code is 'A'. Otherwise the code is 'B' for the last character, 'C' for the penultimate one, and so on (after removal of K2 characters).
K4 says how many characters the archphoneme replaces: 'A' means 0, 'B' means 1, etc. This code is present only when K3 is not 'A', i.e. when there is an archphoneme.
e2 is the ending of the lexical form. It should be appended to what is left of the canonical form after removing K2 characters from the end and, if needed, replacing some characters by an archphoneme, to obtain the lexical form.
a1 is the archphoneme (as specified in mmorph).
'+' is the annotation separator. It is stored in the header of a dictionary, and can be specified as an option to fsa_build.
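To make the K1e1 step concrete, here is a minimal sketch that applies such a code to an inflected form to recover the canonical form. It covers only the delete-and-append step; the archphoneme codes K3 and K4 are not handled. The letter coding ('A' for 0 characters, 'B' for 1, and so on) is assumed to be the same as the one given for K4, and the function name apply_code and the sample words are hypothetical.

```shell
# apply_code WORD CODE: strip from the end of WORD as many characters as the
# first letter of CODE indicates, then append the rest of CODE.
# Assumption: "A" = 0 characters, "B" = 1, "C" = 2, and so on.
apply_code() {
    awk -v w="$1" -v c="$2" 'BEGIN {
        # index() is 1-based, so subtract 1 to get the deletion count
        k = index("ABCDEFGHIJKLMNOPQRSTUVWXYZ", substr(c, 1, 1)) - 1
        print substr(w, 1, length(w) - k) substr(c, 2)
    }'
}

apply_code walked C      # prints: walk
apply_code mice Douse    # prints: mouse
```

The same function could be applied a second time with K2 and e2 to go from the canonical form to the lexical form when no archphoneme is involved (K3 equal to 'A').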