Topic: replaced gmo.trid.xml for GNU Gettext (*.mo *.gmo) + big endian variants

jenderek · « **on:** January 10, 2018, 07:50:32 PM »

Hello,

By accident I had to handle some GNU message catalog files.
When I ran trid on hundreds of files with those extensions (*.mo and *.gmo), I
found some files, like avahi-en_CA.mo or de-diffutils.mo, that are not
recognized (see the attached output/trid-old.txt and be/output/trid-old.txt).

The examples classified as "Unknown" are recognized by the newest file(1)
command (see output/file-new.txt).

When we look at the output of the file command, we see that in most cases
translation element number 0 starts with the phrase Project-Id-Version:
followed by the name and version of the software, like "Project-Id-Version:
Filezilla 3" in the example filezilla-fr_CA.mo. But sometimes the software
name and version are missing, like "Project-Id-Version: fr_FR" in
filezilla-fr.mo. These examples are still recognized by gmo.trid.xml, which
contains this line in its global string section:
   <String>PROJECT-ID-VERSION</String>

In rare cases I found examples like avahi-en_CA.mo that do not contain
this phrase. This example starts with another phrase,
"Report-Msgid-Bugs-To:", so it is not recognized by trid.

It is like good coding style in C sources. The source of a gmo file typically
contains meta information, which is then described in the global string
section with lines like:
   <String>PO-REVISION-DATE</String>
   <String>LAST-TRANSLATOR</String>

The use of such keywords is only a recommendation, not a requirement.
So we find some examples where some of these keywords are missing. I
could run tridscan to eliminate the typical keywords from the global string
section for the unidentified examples.

Finally I found 7 extreme examples, like overflow-6.mo, which contain no
strings because they contain 0 messages. With no messages, such a file
contains no translation strings, so it cannot really be used as a
translation catalog. But such files (found for example in the gettext
sources) can be used to test/verify gettext tools. They are valid GNU gettext
machine objects, and they are only recognized by trid if the definition file
no longer contains any global string section. So an updated trid definition
file would contain only 1 pattern, expressed by the XML construct:
   <Bytes>DE120495</Bytes>
   <Pos>0</Pos>
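
To make clear what this single pattern tests, here is a minimal Python sketch.
It assumes a TrID pattern simply means "these bytes at this position". The
`matches` helper and the hand-built empty catalog are made up for
illustration; this is not TrID's real matching code and not a copy of
overflow-6.mo.

```python
import struct

def matches(data, patterns):
    """Return True if every (position, bytes) pattern is found in data."""
    return all(data[pos:pos + len(raw)] == raw for pos, raw in patterns)

# The single pattern of the proposed definition: magic number at offset 0.
GMO_PATTERNS = [(0, bytes.fromhex("DE120495"))]

# Hypothetical little endian catalog with 0 messages (28 byte header only);
# all table offsets simply point to the end of the header.
empty_mo = struct.pack("<7I", 0x950412DE, 0, 0, 28, 28, 0, 28)

print(matches(empty_mo, GMO_PATTERNS))                  # True
print(matches(b"Project-Id-Version: x", GMO_PATTERNS))  # False
```

A file without any strings still matches, because nothing but the magic
number is tested.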

So I decided to create a replacement for the trid definition file. I also
added a MIME type with the line:
   <Mime>application/x-gettext-translation</Mime>

Instead of the gettext homepage URL I use the more specific section "The Format
of GNU MO Files" in the GNU gettext manual, with the line:
   <RefURL>http://www.gnu.org/software/gettext/manual/gettext.html#MO-Files</RefURL>

According to that information, the revision is stored at offset 4 as an
unsigned 32-bit value, which is split into a major and a minor part, each of
which can be 0 or 1. So 4 revisions are possible, but in real-world examples
I only found 3 (0.0, 0.1, 1.1). So in a generic trid definition at least 2 of
the revision bytes are null. But I created 2 trid definitions: gmo-v0.trid.xml
for minor revision 0, which is expressed by the pattern constructs:
   <Bytes>DE1204950000</Bytes>
   <Pos>0</Pos>
   ...
   <Bytes>00</Bytes>
   <Pos>7</Pos>

And gmo-v1.trid.xml for minor revision 1, which is expressed by the pattern
constructs:
   <Bytes>DE1204950100</Bytes>
   <Pos>0</Pos>
   ...
   <Bytes>00</Bytes>
   <Pos>7</Pos>
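
The split into major and minor part can be sketched in Python like this. It
assumes the little endian layout of the patterns above, with the minor part
in the low 16 bits and the major part in the high 16 bits. The function name
`le_revision` is made up for this example.

```python
import struct

def le_revision(header: bytes):
    """Split the 32-bit revision at offset 4 into (major, minor).

    Little endian files only: bytes 4-5 hold the minor part,
    bytes 6-7 hold the major part.
    """
    (revision,) = struct.unpack_from("<I", header, 4)
    return revision >> 16, revision & 0xFFFF

# First 8 bytes of a header: magic number followed by the revision.
print(le_revision(bytes.fromhex("DE120495" "00000000")))  # (0, 0)
print(le_revision(bytes.fromhex("DE120495" "01000000")))  # (0, 1)
print(le_revision(bytes.fromhex("DE120495" "01000100")))  # (1, 1)
```

Since each part is only 0 or 1, the bytes at positions 5 and 7 stay null in
all three cases.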

According to the gettext manual the MO header has a size of 28 (1Ch) bytes.
But that is only half the truth. The header file gmo.h in the gettext sources
says that additional variables are stored starting at offset 1Ch for minor
revision >= 1. Counting the variables and their sizes for this case gives a
header size of 48 (30h) bytes.
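
The counting can be written down as a small Python sketch. The field names
are quoted from memory of gmo.h, so treat them as an assumption and check
them against the header file. The only thing that matters for the sum is that
every field is an unsigned 32-bit value.

```python
import struct

# Fields present in every MO header (assumed names, 32 bit each).
BASE_FIELDS = [
    "magic", "revision", "nstrings", "orig_tab_offset",
    "trans_tab_offset", "hash_tab_size", "hash_tab_offset",
]
# Extra fields for minor revision >= 1 (assumed names, 32 bit each).
SYSDEP_FIELDS = [
    "n_sysdep_segments", "sysdep_segments_offset", "n_sysdep_strings",
    "orig_sysdep_tab_offset", "trans_sysdep_tab_offset",
]

def header_size(minor: int) -> int:
    """Return the header size in bytes for the given minor revision."""
    fields = BASE_FIELDS + (SYSDEP_FIELDS if minor >= 1 else [])
    return struct.calcsize("<%dI" % len(fields))

print(header_size(0), hex(header_size(0)))  # 28 0x1c
print(header_size(1), hex(header_size(1)))  # 48 0x30
```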

In all inspected examples the table with the original strings information
comes directly after the MO header. This means the variable orig_tab_offset
has the value 1Ch for revision x.0 and the value 30h for revision x.1.
This is expressed in gmo-v0.trid.xml by the additional XML construct:
   <Bytes>1C000000</Bytes>
   <Pos>12</Pos>
and in gmo-v1.trid.xml for minor revision 1 by the pattern
   <Bytes>30000000</Bytes>
   <Pos>12</Pos>
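
The same test as a short Python sketch, for little endian files only. The
function name is made up for this example. Note that it encodes an
observation about the inspected examples, not a rule of the format: a file
that puts the table somewhere else would fail this check.

```python
import struct

def orig_table_follows_header(header: bytes) -> bool:
    """Check whether orig_tab_offset (offset 12) equals the header size."""
    # Offsets 4, 8 and 12: revision, number of strings, orig_tab_offset.
    revision, _nstrings, orig_tab_offset = struct.unpack_from("<3I", header, 4)
    minor = revision & 0xFFFF
    expected = 0x30 if minor >= 1 else 0x1C
    return orig_tab_offset == expected

# Hypothetical first 16 header bytes for revision 0.0 and 0.1.
v0 = struct.pack("<4I", 0x950412DE, 0, 5, 0x1C)
v1 = struct.pack("<4I", 0x950412DE, 1, 5, 0x30)
print(orig_table_follows_header(v0))  # True
print(orig_table_follows_header(v1))  # True
```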

With the 2 new trid definition files the inspected examples mentioned above
are now recognized (see the attached output/trid-new.txt). But according to
the output of the file command and the reference, the patterns above describe
only the little endian variant. So in gmo-v0.trid.xml I changed the file type
text to "GNU Gettext Machine Object file (little endian v x.0)" and did the
same in gmo-v1.trid.xml.

So I also looked for big endian variants. I found 24 examples for minor
revision 0. None of them is recognized by gmo.trid.xml (see
be/output/trid-old.txt). On the other hand, these examples are identified
correctly by the file command as "GNU message catalog (big endian)" (see
be/output/file-new.txt). So for "GNU Gettext Machine Object file (big
endian v x.0)" I created the definition file gmo-v0-be.trid.xml with the XML
constructs:
   <Bytes>950412DE00</Bytes>
   <Pos>0</Pos>
   ...
   <Bytes>0000</Bytes>
   <Pos>6</Pos>
   ...
   <Bytes>0000001C</Bytes>
   <Pos>12</Pos>
For "GNU Gettext Machine Object file (big endian v x.1)" I found no
examples. So I created gmo-v1-be.trid.xml just by looking at the little
endian definition and swapping the byte order. This is expressed by the XML
constructs:
   <Bytes>950412DE00</Bytes>
   <Pos>0</Pos>
   ...
   <Bytes>0001</Bytes>
   <Pos>6</Pos>
   ...
   <Bytes>00000030</Bytes>
   <Pos>12</Pos>
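
The byte swapping for the v x.1 case can be checked with a small Python
sketch. It packs the same header values once little endian and once big
endian and prints the bytes at the pattern positions. The helper
`header_prefix` is made up for this example, and the number of strings is
just set to 0 as a placeholder.

```python
import struct

MO_MAGIC = 0x950412DE

def header_prefix(byte_order, major, minor, orig_tab_offset):
    """Pack the first 16 header bytes with '<' (LE) or '>' (BE) order."""
    revision = (major << 16) | minor
    nstrings = 0  # placeholder, not part of any pattern
    return struct.pack(byte_order + "4I", MO_MAGIC, revision,
                       nstrings, orig_tab_offset)

le = header_prefix("<", 0, 1, 0x30)
be = header_prefix(">", 0, 1, 0x30)

print(le[0:6].hex().upper())    # DE1204950100
print(le[12:16].hex().upper())  # 30000000
print(be[0:5].hex().upper())    # 950412DE00
print(be[6:8].hex().upper())    # 0001
print(be[12:16].hex().upper())  # 00000030
```

The printed big endian bytes agree with the constructs above, so the swapped
definition should fit a real big endian v x.1 file, if one ever turns up.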

With the third and fourth trid definition files the big endian variant files
are now also recognized (see the attached be/output/trid-new.txt). The TrID
definitions, some examples and the output are stored in the archive _mo.zip.
I hope that the 4 XML files can be used in a future version of triddefs.

With best wishes
Jörg Jenderek

Mark0 · « **Reply #1 on:** January 10, 2018, 11:48:20 PM »

Thanks Joerg for the new defs!
Maybe I'll keep just a generic LE (as an update of the current one) and BE (as a brand new def).
