Topic: updated affix.trid.xml for Affix file + 2 variants without Russian support

jenderek

Hello trid users,

Some days I run the CCleaner cleanup tool on Windows. One of its options is
called "Unused File Extension". When I use this option it complains about the
file name suffix AFF. So I looked on my systems for such files. Unfortunately
this suffix is used by several different file types. In this session I will
only consider Affix text samples. The Ispell variant samples are typically
found in the directory /usr/lib/ispell on UNIX-like systems, the MySpell
variant samples in /usr/share/myspell, and the Hunspell variant samples in
/usr/share/hunspell. Such samples are also found beneath the directory
/usr/src/dicts. Luckily such systems have package management, so programs that
need spell checking usually get the required affix files by depending on the
spelling packages. Unfortunately Windows has no such package management, so
every program with spell checking ships its affix definitions inside its own
program directory. Software that behaves in this manner includes: Calibre,
LibreOffice, Scribus, LanguageTool, Firefox, Thunderbird, gImageReader, Emacs,
Gramps.

So I ran the trid utility on my AFF examples and related files. A few samples
are described as "Affix file" by affix.trid.xml, without a MIME type (see
appended trid-v-old.txt in output). But many examples are described as
"Unknown!" (see appended trid-v-old.txt in aff-iso/output, aff-utf8/output and
aff_other/output).

For comparison I also ran the file format identification utility DROID
(see https://sourceforge.net/projects/droid/). It does not recognize the
samples.

For comparison I also ran the file command (version 5.45 and a newer one with
ispell,v 1.9 2023/07/3) on such samples. With version 5.45 the samples
recognized by TrID are described as "text" (see appended output/file-5.45.txt)
with the generic MIME type text/plain (see appended file-i-5.45.txt in output)
and no file name suffix (see appended file-ext-5.45.txt in output). With the
newer version the samples are described as "affix definition" with the
sub-classification "for MySpell/Hunspell" (see appended output/file.txt). Now
the MIME type text/x-affix is shown (see appended output/file-i.txt) and the
correct suffix "aff" is shown (see appended output/file-ext.txt). Furthermore
it shows the first lines of the AFF samples. These start with lines like:
SET ISO8859-1

With the help of these tools I found manual pages with sections about file
formats and conventions for Ispell and for Hunspell dictionaries and affix
files. This is now expressed inside affix.trid.xml by an additional line like:
   <RefURL>https://man.archlinux.org/man/hunspell.5.en</RefURL>

According to that man page the recognized samples are not the Ispell variant
and start with a SET instruction at the beginning. So I mention these facts in
a remark line. According to the hunspell man page, SET sets the character
encoding of words and morphemes in affix and dictionary files. Possible values
are: UTF-8, ISO8859-1 - ISO8859-10, ISO8859-13 - ISO8859-15, KOI8-R, KOI8-U,
cp1251, ISCII-DEVANAGARI.
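
A minimal Python sketch of the value list above, only to make it concrete. It
is not part of the TrID definitions. The names KNOWN_ENCODINGS and
is_known_encoding are hypothetical, and reading "ISO8859-1 - ISO8859-10" as a
numeric range is an assumption.

```python
# Encoding names accepted after SET, as listed in the post (quoted from the
# hunspell man page). Assumption: "ISO8859-1 - ISO8859-10" denotes a range.
KNOWN_ENCODINGS = (
    {"UTF-8", "KOI8-R", "KOI8-U", "cp1251", "ISCII-DEVANAGARI"}
    | {f"ISO8859-{n}" for n in range(1, 11)}
    | {f"ISO8859-{n}" for n in range(13, 16)}
)


def is_known_encoding(name: str) -> bool:
    """Return True if `name` is one of the encodings listed above."""
    return name in KNOWN_ENCODINGS
```

Under that reading, ISO8859-11 and ISO8859-12 are the only gaps in the
ISO8859 series.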

Unfortunately there is no strict and unique pattern that can be used as a
magic pattern. The SET directive does not always come at the beginning.

Often the separator is one space character (0x20), but sometimes a tab
character (0x09) is used, as in
/opt/Wolfram/WolframEngine/13.1/SystemFiles/Components/SpellingData/SpellingDictionaries/ar.aff.
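
The two observations above (SET is not always the first line, and the
separator may be a space or a tab) can be sketched in Python roughly as
follows. This is an illustration only, not how TrID matches files. SET_LINE
and find_set_encoding are hypothetical names.

```python
import re
from typing import Optional

# A SET directive at the start of any line; keyword and argument may be
# separated by spaces (0x20) or tabs (0x09).
SET_LINE = re.compile(rb"^SET[ \t]+(\S+)", re.MULTILINE)


def find_set_encoding(data: bytes) -> Optional[str]:
    """Return the argument of the first SET line in `data`, or None."""
    match = SET_LINE.search(data)
    if match is None:
        return None
    return match.group(1).decode("ascii", errors="replace")
```

For example, find_set_encoding(b"# comment\nSET\tUTF-8\n") finds the directive
on the second line even though a tab is used as separator.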

Unfortunately similar-looking phrases (containing the letters SET) occur in
some Ispell affix files. In /usr/lib/ispell/ngerman.aff I found a line like
# PARTICULAR SETTINGS FOR ISPELL ARE NECESSARY !!!
and in /usr/lib/ispell/ogerman.aff I found a line starting like:
#   sS            >    -sS,SSET    #     schosS    >
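
To illustrate the problem with these two lines: a plain substring search for
SET hits both of them (inside SETTINGS and SSET), while a check that requires
SET as the first token of a line does not. The Python sketch below only
demonstrates this. It does not claim that TrID can express such anchoring, and
the helper names are hypothetical.

```python
import re

# The two Ispell lines quoted above.
ISPELL_LINES = [
    b"# PARTICULAR SETTINGS FOR ISPELL ARE NECESSARY !!!",
    b"#   sS            >    -sS,SSET    #     schosS    >",
]

# Anchored form: SET must be the first token on the line.
ANCHORED_SET = re.compile(rb"^SET[ \t]+\S+")


def naive_hit(line: bytes) -> bool:
    """Substring test: true for any line containing the letters SET."""
    return b"SET" in line


def anchored_hit(line: bytes) -> bool:
    """True only if the line begins with a SET directive."""
    return ANCHORED_SET.match(line) is not None
```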

Many samples like /usr/share/calibre/dictionaries/en-GB/en-GB.aff start with a
comment line (so the first character is the hash sign #) and the SET directive
comes later. In that case I must also explicitly check for the encoding string
in order to skip some scripts (like /bin/affixcompress, /bin/setupcon,
/bin/imdbpy2sql.py).

For many samples the SET argument is UTF-8. So I ran tridscan on such samples
to generate affix-rem.trid.xml.

That a sample starts with a remark or comment line is expressed inside the
Front Block section by an XML construct like:
   <Bytes>23</Bytes>
   <ASCII> #</ASCII>
   <Pos>0</Pos>

The UTF-8 encoding is expressed inside the Global Strings section by a line like:
   <String>SET UTF-8</String>

For many samples the SET argument is ISO8859-1. So I ran tridscan on such
samples to generate affix-rem-iso8859-1.trid.xml.
That a sample starts with a remark or comment line is expressed inside the
Front Block section by an XML construct like:
   <Bytes>23</Bytes>
   <ASCII> #</ASCII>
   <Pos>0</Pos>
The ISO8859-1 encoding is expressed inside the Global Strings section by a
line like:
   <String>SET ISO8859-1</String>
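
As a rough Python approximation of what the two definitions above check: byte
0x23 at position 0, plus the SET string somewhere in the file. This is only a
sketch. TrID's real matching and scoring are not reproduced, exact
case-sensitive matching is an assumption, and matches_rem_definition is a
hypothetical name.

```python
def matches_rem_definition(data: bytes, global_string: bytes) -> bool:
    """Approximate the two checks described above.

    1. Front block: byte 0x23 ('#') at position 0.
    2. Global string: `global_string` occurs somewhere in the data.
    """
    starts_with_hash = data[:1] == b"\x23"
    has_global_string = global_string in data
    return starts_with_hash and has_global_string
```

The same function covers both variants by passing b"SET UTF-8" or
b"SET ISO8859-1" as the second argument.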

With the updated TrID definitions many AFF samples are now described. The TrID
definitions and output are stored in the archive aff_trid.zip. I hope that my
definitions can be used in a future version of triddefs.

Because there is no strict and unique pattern, for every variant you will find
at least one example which is not matched (see appended
aff_other/output/file.txt).

According to the man page, other encodings besides UTF-8 and ISO8859-1 can be
used. I did not find such samples among my inspected examples. Instead of
UTF-8 I would also have to check for other encodings like KOI8-R, KOI8-U and
cp1251, but I am not willing to support the Cyrillic alphabet as long as
Russia wages war against Ukraine. So I did not implement the branches for
Russian and Cyrillic encodings.

Unfortunately a few Hunspell samples like 1463589.aff, 1695964.aff and
2970240.aff, found as unit tests inside the Thunderbird sources, do not
contain the typical keywords (SET, LANG). I could try to implement an
exception for each of these samples, but then I would get dozens of
definitions, each for one example that is not found on "normal" systems. I
hope this is a hint for the developers of such test affix files to simply add
a SET directive to them.

Some samples like en-GB-cal.aff (found as en-GB.aff as part of the Calibre
software) start with a Byte Order Mark (BOM=\xEF\xBB\xBF).
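
For such files the first byte is 0xEF rather than #, so a check for # at
position 0 fails unless the BOM is skipped first. A minimal Python sketch of
that step, assuming the file content is available as bytes (strip_utf8_bom is
a hypothetical helper, not something TrID provides):

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return `data` without a leading UTF-8 BOM (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```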

Some samples start neither with a SET directive nor with a comment line (like
ar.aff, tr_TR.aff). These samples start with other directives (like FLAG,
LANG).

Some samples (like bulgarian.aff, ngerman.aff, polish.aff) are not matched
because they are the Ispell variant with other keywords (like defstringtype
and suffixes followed by flag).
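
The unmatched cases above differ mainly in the first keyword after any
comment lines. A rough Python sketch of guessing the variant from that
keyword follows. It uses only the keywords named in this post, so both
keyword sets are incomplete, and the function names are hypothetical. Test
files without any of these keywords come out as "unknown".

```python
from typing import Optional

# Keywords taken from the post only; neither set is a complete list.
HUNSPELL_KEYWORDS = {"SET", "FLAG", "LANG"}
ISPELL_KEYWORDS = {"defstringtype", "suffixes"}


def first_keyword(text: str) -> Optional[str]:
    """Return the first token of the first non-blank, non-comment line."""
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            return stripped.split()[0]
    return None


def guess_variant(text: str) -> str:
    """Guess 'hunspell', 'ispell' or 'unknown' from the first keyword."""
    keyword = first_keyword(text)
    if keyword in HUNSPELL_KEYWORDS:
        return "hunspell"
    if keyword in ISPELL_KEYWORDS:
        return "ispell"
    return "unknown"
```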

With best wishes
Jörg Jenderek


Mark0

Thanks for this one. Checking it out, I think even my original/old definitions were a bit too generic to be useful, and they can't match all the files. It seems that's just not a format that lends itself to quick identification.
I'll remove it for the moment.