Author Topic: 2 cwl-aspell*.trid.xml for aspell compressed word list  (Read 712 times)

jenderek

« on: October 05, 2023, 03:43:32 AM »
Hello trid users,
   
some days ago I handled some aspell dictionary files. In this session I will
only consider aspell compressed word lists with the CWL suffix. These are used
by the aspell software (see the Wikipedia page
https://en.wikipedia.org/wiki/GNU_Aspell).

On UNIX-like systems the aspell samples are typically found inside a
directory like /usr/share/aspell. Although they are called compressed word
lists, the CWL samples I inspected had gone through a second compression step.
For my samples this was gzip, so the samples have a file name extension like
.cwl.gz.
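
For anyone who wants to strip that outer gzip layer without the gzip command, here is a minimal Python sketch. The function name gunzip_cwl and the file name in the usage line are only illustrations, not part of aspell or trid.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_cwl(src: str) -> Path:
    """Strip the outer gzip layer, e.g. 'nl.cwl.gz' -> 'nl.cwl'."""
    src_path = Path(src)
    if src_path.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {src_path.name}")
    # with_suffix("") drops only the trailing '.gz', keeping '.cwl'
    dst_path = src_path.with_suffix("")
    with gzip.open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst_path
```

Usage would be something like gunzip_cwl("nl.cwl.gz"), which writes nl.cwl next to the input. The result is still a compressed word list and not yet plain text.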

So I ran the trid utility on my CWL examples. All samples are described as
"Unknown!" without mime type and file name suffix (see appended trid-v-old.txt
in output).

For comparison I also ran the file format identification utility DROID
(see https://sourceforge.net/projects/droid/). It does not recognize the
samples.

For comparison I also ran the file command (version 5.45) on these samples.
Here the samples are also not recognized and are described generically as
"data" (see appended output/file-5.45.txt) with the generic mime type
application/octet-stream (see appended file-i-5.45.txt in output) and no file
name suffix (see appended file-ext-5.45.txt in output). The corresponding
uncompressed word lists (*.wl) are described as "text" with the generic mime
type text/plain and no file name suffix. Interestingly, some samples like
nl.wl or uk.wl are described more specifically as "Unicode text, UTF-8 text",
whereas other word lists are described as "ISO-8859 text". That is a hint that
2 variants of CWL samples exist.
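
The UTF-8 versus ISO-8859 split can be roughly reproduced with a small Python sketch. This is not the algorithm the file command uses, only an approximation: anything that contains non-ASCII bytes and does not decode as UTF-8 is assumed here to be an 8-bit ISO-8859 style encoding. The function name classify_text is made up for this example.

```python
def classify_text(data: bytes) -> str:
    """Rough guess at the text encoding of an uncompressed word list."""
    if all(b < 0x80 for b in data):
        return "ASCII text"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: non-UTF-8 input with high bytes is some 8-bit
        # encoding; the exact ISO-8859 part cannot be told apart this way.
        return "ISO-8859 text"
    return "UTF-8 text"
```

Calling it on the bytes of each *.wl file should sort the samples into the same two groups, as long as the word lists really are plain text.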

Unfortunately the aspell documentation contains no explicit file format
specification for CWL files.

The example section of the man page word-list-compress(1), which is part of
the aspell package, shows how an aspell dictionary (*.rws) is created from a
compressed word list (*.cwl) by a command like:
   word-list-compress d <words.cwl | aspell create master ./words.rws

It also describes how the original word list (*.wl) can be recreated by a
command like:
   word-list-compress d < fr-60-only.cwl  > fr-60-only.wl

So I chose this manual page as reference. That is expressed by a line like:
   <RefURL>https://linux.die.net/man/1/word-list-compress</RefURL>

Instead of the generic mime type application/octet-stream I chose a
user-defined one. That is expressed by a line like:
   <Mime>application/x-aspell-wordlist</Mime>

After running tridscan on a few examples to generate cwl-aspell.trid.xml, the
first XML construct looks like:
   <Bytes>0141</Bytes>
   <ASCII> . A</ASCII>
   <Pos>0</Pos>

Then I get a few more XML constructs like:
   <Pattern>
      <Bytes>69</Bytes>
      <ASCII> i</ASCII>
      <Pos>49</Pos>
   </Pattern>
   <Pattern>
      <Bytes>6E</Bytes>
      <ASCII> n</ASCII>
      <Pos>1722</Pos>
   </Pattern>

And in the global strings section I get lines like:
   <String>DIMENSIONNEL</String>
   <String>DIRECTIONNEL</String>
   ...
   <String>YLLE</String>
   <String>YLON</String>
   <String>YLOR</String>
   <String>YLVA</String>
   <String>YPHE</String>
   <String>YPTA</String>
   <String>YRIE</String>
   <String>ZIST</String>
   <String>ZOLI</String>

After running tridscan on more examples (6), including other, non-European
languages like ku.cwl, the lines inside the global strings section vanished
and in the front block only the first construct survived. When I tried to run
tridscan on the sample uk.cwl, the XML constructs began to shrink even more.
So I stopped this definition here.

For many CWL samples, running word-list-compress like
   word-list-compress d < nl.cwl  > nl.wl
gives the error message:
   ERROR: Corrupt Input.

As described in the man page (I was too stupid to read it carefully for one
day), if the input file is a compressed word list but you get no output file,
then it may be the newer prezip-bin(1) variant of a compressed file. If so,
try decompressing the file with prezip-bin(1) instead, because
word-list-compress accepts up to 255 text characters in the range of
{0x21...0xFF}.

So, as proposed, I ran a command like this on such "corrupt" CWL samples:
   prezip-bin -d < nl.cwl  >  nl.wl
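
The "try word-list-compress first, then prezip-bin" routine can be sketched in Python as below. It assumes both tools are on the PATH and are called exactly as in the commands above. Treating a non-zero exit status or empty output as failure is my assumption, not something the man pages promise. The function name decompress_cwl is hypothetical.

```python
import subprocess

# Commands as used in this post; both read stdin and write stdout.
DEFAULT_COMMANDS = (
    ("word-list-compress", "d"),
    ("prezip-bin", "-d"),
)


def decompress_cwl(data: bytes, commands=DEFAULT_COMMANDS):
    """Return (tool name, decoded bytes) from the first tool that works."""
    for cmd in commands:
        try:
            proc = subprocess.run(list(cmd), input=data, capture_output=True)
        except FileNotFoundError:
            continue  # tool not installed, try the next one
        # Assumption: failure shows up as non-zero exit or empty output.
        if proc.returncode == 0 and proc.stdout:
            return cmd[0], proc.stdout
    raise ValueError("no tool could decode the input")
```

The returned tool name tells you which of the two CWL variants a sample belongs to, which is handy when sorting samples before running tridscan.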

After running tridscan on 36 examples to generate cwl-aspell-prezip.trid.xml,
the first and only XML construct looks like:
   <Pattern>
      <Bytes>0200</Bytes>
      <Pos>0</Pos>
   </Pattern>

So here I chose the manual page prezip-bin(1) as reference. That is expressed
by a line like:
   <RefURL>https://manpages.debian.org/stable/aspell/prezip-bin.1.en.html</RefURL>

Luckily aspell is open source, so I looked inside the sources of aspell
version 0.60.8. In prezip.c I see a format description like:
 * Format:
 *   <data> ::= 0x02 <line>+ 0x1F 0xFF
 *   <line> ::= <prefix> <rest>*
 *   <prefix> ::= 0x00..0x1D | 0x1E 0xFF* 0x00..0xFE
 *   <rest> ::= 0x20..0xFF | <escape>
 *   <escape> ::= 0x1F 0x20..0x3F
So I know for sure that the prezip variant always starts with 0x02 and that
the last bytes are 0x1F 0xFF. Not very unique, but better than nothing.
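
As a standalone check outside of trid, both ends of the file can be tested together. The sketch below only uses what the grammar from prezip.c states: the leading 0x02, at least one line (so at least one prefix byte) and the trailing 0x1F 0xFF. It does not validate the lines in between, and the name looks_like_prezip is made up for this example.

```python
PREZIP_HEAD = b"\x02"      # <data> starts with 0x02
PREZIP_TAIL = b"\x1f\xff"  # <data> ends with 0x1F 0xFF


def looks_like_prezip(data: bytes) -> bool:
    """Cheap plausibility test for the prezip variant of CWL."""
    # Minimum size: head (1) + one line with a 1-byte prefix + tail (2).
    return (
        len(data) >= 4
        and data.startswith(PREZIP_HEAD)
        and data.endswith(PREZIP_TAIL)
    )
```

Checking the tail as well as the head removes some false hits that a test of the first bytes alone would accept, but it is still only a weak signature.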

With the two new trid definitions many of my CWL samples (except uk.cwl) are
now described. The TrID definitions, some samples and the output are stored in
the archive spell-mint.zip. I hope that my definitions can be used in a future
version of triddefs.

Unfortunately there also exist other word list/dictionary file formats
with other file name suffixes for aspell. Other spelling software like
ispell and hunspell also uses other dictionary file formats. I will try
to handle these in a future session.

With best wishes
Jörg Jenderek

Mark0

Re: 2 cwl-aspell*.trid.xml for aspell compressed word list
« Reply #1 on: October 05, 2023, 03:15:48 PM »
Thanks for the update, but I get a lot of mis-identification of other formats with these 2 new defs - the 2 bytes header is indeed a bit too generic to be reliable.