Author Topic: rws-aspell.trid.xml for aspell dictionary  (Read 707 times)

jenderek

rws-aspell.trid.xml for aspell dictionary
« on: September 24, 2023, 12:17:40 AM »
Hello trid users,

some days ago I handled some spell affix files. In this session I will only
consider spell dictionaries with the RWS suffix. These are used and created by
the aspell software (see the Wikipedia page
https://en.wikipedia.org/wiki/GNU_Aspell).

On UNIX-like systems the aspell samples are typically found inside
directories like /usr/lib/aspell or /usr/lib/aspell-0.60 and sometimes
/var/lib/aspell. In the last directory the samples are normally created from
word list files (*.wl) or compressed word list files (*.cwl*) during package
installation by the aspell command with the "create master" option.

Luckily such systems have package management. So programs that need spell
checking often pull in the needed dictionary files by depending on aspell
packages. Unfortunately Windows systems have no such package management. So
there every program with spell checking includes such dictionary files inside
its own program directory. Programs that behave in this manner are Inkscape,
Bluefish and Aspell. So on Windows systems I found such RWS samples in
directories like:
   c:\Program Files (x86)\Aspell\dict
   c:\Programme\Bluefish\lib\Aspell-0.60
   c:\Program Files\Inkscape\lib\aspell-0.60

So I ran the trid utility on my RWS examples. All samples are described as
"Unknown!" without mime type and file name suffix (see appended trid-v-old.txt
in output).

For comparison I also ran the file format identification utility DROID
(see https://sourceforge.net/projects/droid/). It does recognize the samples
but describes them wrongly as "Revit Workspace" with PUID x-fmt/448, because
recognition happens only by the RWS file name extension.

For comparison I also ran the file command (version 5.45) on such samples.
Here the samples are also not recognized and are described generically as
"data" (see appended output/file-5.45.txt) with the generic mime type
application/octet-stream (see appended file-i-5.45.txt in output) and no file
name suffix (see appended file-ext-5.45.txt in output).

Unfortunately the aspell documentation contains no explicit file format
specification of RWS files. Even the RWS suffix is rarely mentioned. The man
page aspell-autobuildhash(8), part of the dictionaries-common package on Linux
Mint, mentions the standard location directories. It also mentions that the
RWS samples are created from $lang.cwl.gz or $lang.mwl.gz, but the procedure
is not described in detail. The man page word-list-compress(1), part of the
aspell package, shows it in its example section by a command like:
   word-list-compress d <words.cwl | aspell create master ./words.rws

So I chose this manual page as reference. That is expressed by a line like:
   <RefURL>https://linux.die.net/man/1/word-list-compress</RefURL>

Instead of the generic mime type application/octet-stream I chose a user
defined one. That is expressed by a line like:
   <Mime>application/x-aspell-dictionary</Mime>

After running tridscan to generate rws-aspell.trid.xml from a few examples,
the first XML construct looks like:
   <Bytes>617370656C6C2064656661756C74207370656C6C657220726F776C20312E313000</Bytes>
   <ASCII> a s p e l l   d e f a u l t   s p e l l e r   r o w l   1 . 1 0 .</ASCII>
   <Pos>0</Pos>

Luckily aspell is open source, so I looked inside the sources of aspell
version 0.60.8. In readonly_ws.cpp I see that this is generated by the 32 byte
constant string cur_check_word, which is equal to "aspell default speller rowl
1.10". After running tridscan on more and older examples the first XML
construct becomes:
   <Bytes>617370656C6C2064656661756C74207370656C6C657220726F776C20312E</Bytes>
   <ASCII> a s p e l l   d e f a u l t   s p e l l e r   r o w l   1 .</ASCII>
   <Pos>0</Pos>
because in older variants the "version" string at offset 28 is 1.4 instead of
1.10. So I mention this in a remark line.
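
The shrinking of the pattern can be reproduced by hand. A minimal Python
sketch, assuming the two header strings are exactly the ones named above, each
followed by a nil byte; the helper name common_prefix is only illustrative:

```python
def common_prefix(a: bytes, b: bytes) -> bytes:
    """Return the longest run of leading bytes shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]

# Header strings as described in the post, each terminated by a nil byte.
new_head = b"aspell default speller rowl 1.10\x00"
old_head = b"aspell default speller rowl 1.4\x00"

prefix = common_prefix(new_head, old_head)
print(len(prefix))           # 30
print(prefix.hex().upper())  # same notation as the <Bytes> element
```

The two strings first differ in the byte after "1.", so only 30 bytes stay
constant.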

Assuming that there exist variants with a version string other than 1.x, this
is expressed more generally by an XML construct like:
   <Bytes>617370656C6C2064656661756C74207370656C6C657220726F776C</Bytes>
   <ASCII> a s p e l l   d e f a u l t   s p e l l e r   r o w l</ASCII>
   <Pos>0</Pos>
When I look inside the source data.cpp I see that the consistency check is
done by comparing this 27 byte string at the head via the strncmp function.
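
The same check is easy to mimic outside aspell. A sketch in Python, assuming
only what is said above: the data starts with the 27 byte string, followed by
a space and a version string that ends at the first nil byte. The function
name rws_version is hypothetical and not part of aspell or TrID:

```python
MAGIC = b"aspell default speller rowl"   # 27 bytes, as compared by strncmp

def rws_version(head: bytes):
    """Return the version text after the magic, or None if the magic is absent."""
    if not head.startswith(MAGIC):
        return None
    # The readable text ends at the first nil byte.
    text = head.split(b"\x00", 1)[0]
    return text[len(MAGIC):].strip().decode("ascii", errors="replace")

print(rws_version(b"aspell default speller rowl 1.10" + bytes(32)))  # 1.10
print(rws_version(b"aspell default speller rowl 1.4" + bytes(33)))   # 1.4
print(rws_version(b"some other data"))                               # None
```

For a real sample one would pass the first bytes read from the file in binary
mode instead of the synthetic headers used here.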

Looking in readonly_ws.cpp I see that the first part of the DataHead structure
is declared as character string check_word[64]. Apparently at most the first
32 bytes are used for the above mentioned check word, so the remaining 32
bytes are unused and filled with nil bytes. These observations are expressed
by a second XML construct like:

 <Bytes>0000000000000000000000000000000000000000000000000000000000000000</Bytes>
 <Pos>32</Pos>
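
How such a Bytes/Pos pair is applied to a file can be sketched in a few lines
of Python. This is only an illustration of the idea, not TrID's real matching
code; pattern_matches and the synthetic header are made up for the example:

```python
def pattern_matches(data: bytes, hex_bytes: str, pos: int) -> bool:
    """True if the bytes given in hex appear in data exactly at offset pos."""
    wanted = bytes.fromhex(hex_bytes)
    return data[pos:pos + len(wanted)] == wanted

# Synthetic 64 byte check_word field: 32 byte string, then nil padding.
check_word = b"aspell default speller rowl 1.10".ljust(64, b"\x00")

first = "617370656C6C2064656661756C74207370656C6C657220726F776C"  # Pos 0
second = "00" * 32                                                 # Pos 32

print(pattern_matches(check_word, first, 0))    # True
print(pattern_matches(check_word, second, 32))  # True
```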

After the first structure the variable section starts at offset 64 with the
endian_check variable. Its value is decimal 12345678 or 00BC614E hexadecimal.
In little endian that is the byte sequence 4e61bc00 or the string Na\274\0.
The last variable, freq_info, is at offset 144. Many of these variables are
of type u32int. Often the corresponding values are "low", so the upper bytes
are nil. That was expressed by short nil XML constructs like:
   <Pattern>
      <Bytes>0000</Bytes>
      <Pos>67</Pos>
   </Pattern>
   <Pattern>
      <Bytes>0000</Bytes>
      <Pos>71</Pos>
   </Pattern>
   ...
   <Pattern>
      <Bytes>0000000000</Bytes>
      <Pos>143</Pos>
   </Pattern>
Some variables like head_size, dict_name_size, lang_name_size will probably
always be "low" for all examples. But I assume that some variables like
word_offset, hash_offset, word_count can reach the 4 GB limit. So allowing
also for a big endian variant and for variables reaching the 4 GB limit, I
deleted the above mentioned XML constructs.
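
The byte order question can be illustrated with Python's struct module. The
sketch assumes, as described above, that endian_check is a 32 bit value
12345678 stored at offset 64; detect_byte_order is a hypothetical helper:

```python
import struct

ENDIAN_CHECK = 12345678   # 0x00BC614E
OFFSET = 64               # directly after the 64 byte check_word field

def detect_byte_order(data: bytes):
    """Return 'little', 'big' or None depending on how endian_check is stored."""
    raw = data[OFFSET:OFFSET + 4]
    if len(raw) < 4:
        return None
    if struct.unpack("<I", raw)[0] == ENDIAN_CHECK:
        return "little"
    if struct.unpack(">I", raw)[0] == ENDIAN_CHECK:
        return "big"
    return None

little = bytes(64) + struct.pack("<I", ENDIAN_CHECK)
big = bytes(64) + struct.pack(">I", ENDIAN_CHECK)
print(little[64:].hex())          # 4e61bc00
print(detect_byte_order(little))  # little
print(detect_byte_order(big))     # big

# A "low" value leaves the upper bytes nil, a large one does not.
print(struct.pack("<I", 1000).hex())        # e8030000
print(struct.pack("<I", 4000000000).hex())  # 00286bee
```

In a big endian file the nil bytes of a "low" value would sit at the start of
each variable instead of the end, so a nil pattern at a fixed position cannot
hold for both variants.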

After the header I get some short nil sequences at higher offsets like:
   <Pattern>
      <Bytes>0000000000</Bytes>
      <Pos>179</Pos>
   </Pattern>
   <Pattern>
      <Bytes>0000</Bytes>
      <Pos>186</Pos>
   </Pattern>
   ...
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>991</Pos>
   </Pattern>
I assume that these are triggered by lucky circumstances (too few
examples). So I also deleted these XML constructs.
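
Why few samples produce such patterns can be shown with a toy version of the
idea. The following Python sketch is not tridscan's algorithm, which is not
described here; it simply keeps every offset where all samples share the same
byte, using made-up sample data:

```python
def constant_offsets(samples):
    """Map offset -> byte value for offsets where all samples agree."""
    shortest = min(len(s) for s in samples)
    return {
        pos: samples[0][pos]
        for pos in range(shortest)
        if all(s[pos] == samples[0][pos] for s in samples)
    }

# Made-up data: a shared 4 byte magic followed by varying content.
a = b"rowl" + bytes([0, 7, 0, 0])
b = b"rowl" + bytes([0, 9, 0, 3])
c = b"rowl" + bytes([5, 1, 0, 2])

print(sorted(constant_offsets([a, b])))     # [0, 1, 2, 3, 4, 6]
print(sorted(constant_offsets([a, b, c])))  # [0, 1, 2, 3, 6]
```

With only two samples the nil byte at offset 4 looks constant; a third sample
removes it. The nil byte at offset 6 survives here only because the toy data
set is still too small to tell.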

With the updated trid definitions my RWS samples are now described. The TrID
definition, some samples and output are stored in the archive rws_.zip. I hope
that my definition can be used in a future version of triddefs.

Unfortunately there also exist other word list/dictionary file formats and
file name suffixes for aspell. Other spelling software like ispell and
hunspell also use other dictionary file formats. I will try to handle these
in a future session.

With best wishes
Jörg Jenderek


Mark0

Re: rws-aspell.trid.xml for aspell dictionary
« Reply #1 on: September 24, 2023, 06:33:28 PM »
Thanks!