Author Topic: 2 tfm-tex-0x01*.trid.xml for TeX Font Metric; variants with lh=0x01  (Read 10 times)

jenderek

Hello trid users,

Some days ago I looked at the content of an exotic CD-ROM. It also contains
samples which are misidentified. These samples have the TFM file name suffix.

There exist other variants. So in this session I will handle only the variant
with lh=0x01. I will explain later what this means. I found a few such samples
(like gbm.tfm gbmv.tfm rml.tfm rmlv.tfm) in the dvips directory (with parent
directory ptex-fonts, inside the fonts\tfm sub directory) after installing
MiKTeX version 23.12 on Windows. On Linux Mint 21.3 I found thousands of
samples with a font family name as part of the texlive-lang-japanese package,
version 2021.20220204-1. Hundreds of samples without coding scheme and font
family names I found in packages (like dvi2ps-fontdata-ja dvi2ps-fontdata-rsp
dvi2ps-fontdata-tbank dvi2ps-fontdata-three dvi2ps-fontdata-ptexfake
texlive-lang-chinese).

So I ran the trid utility on my TFM samples with lh=0x01. The samples are not
recognized. Many are wrongly described as "Adobe PhotoShop Brush" by
abr.trid.xml with file name suffix (.ABR).

For comparison I also ran the file format identification utility DROID
(see https://sourceforge.net/projects/droid/). Here no sample is recognized.

For comparison I also ran the file command (version 5.45) on such samples.
Here these TFM samples are not recognized and not described as "TeX font
metric data". They are described as "data" (see appended file-k-5.45.txt
in nonames/output and output). For these TFM samples the mime type
application/x-tex-tfm is not shown (see appended file-i-5.45.txt in
nonames/output and output). No file name suffix is shown either (see appended
file-ext-5.45.txt in nonames/output and output).

Luckily I found a page about TeX Font Metrics on the File Formats Archive Team
web site and on Wikipedia. I use the first, because the Wikipedia link is also
mentioned there and furthermore links to download samples are listed there. So
the reference URL in the new definition is expressed by a line like:
 <RefURL>http://fileformats.archiveteam.org/wiki/TeX_Font_Metrics</RefURL>

So I ran tridscan on my samples to generate tfm-tex-0x01-names.trid.xml and
tfm-tex-0x01.trid.xml. Afterwards I tried to understand the generated
constructs and checked whether these are always true. I thought it was like
the other variants with just fewer words in the data header, but unfortunately
this is less than half of the truth.

According to the mentioned specification the six-word (24-byte) file header
contains twelve unsigned 16-bit integers which describe general TFM
characteristics (the length of the file, the range of character codes
contained in the font, and the size of each of the tables). Following the
specification I patched the file command (see appended file.tmp in output and
nonames/output).

The specification lists some formulas like:
    bc-1 <= ec <= 255
    ne <= 256
    lf=6+lh+(ec-bc+1)+nw+nh+nd+ni+nl+nk+ne+np
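
To make the formulas above concrete, here is a minimal Python sketch that
decodes the 24-byte header as twelve big-endian 16-bit integers and applies
the three checks. The function names parse_tfm_header and is_consistent are my
own for illustration; they are not part of trid, file or tftopl.

```python
import struct

# Field names of the standard TFM header, in file order (2 bytes each).
FIELDS = ("lf", "lh", "bc", "ec", "nw", "nh", "nd", "ni", "nl", "nk", "ne", "np")


def parse_tfm_header(data: bytes) -> dict:
    """Decode the first 24 bytes as twelve big-endian unsigned 16-bit integers."""
    if len(data) < 24:
        raise ValueError("need at least 24 bytes")
    return dict(zip(FIELDS, struct.unpack(">12H", data[:24])))


def is_consistent(h: dict) -> bool:
    """Apply the three formulas quoted from the specification."""
    total = (6 + h["lh"] + (h["ec"] - h["bc"] + 1) + h["nw"] + h["nh"]
             + h["nd"] + h["ni"] + h["nl"] + h["nk"] + h["ne"] + h["np"])
    return (h["bc"] - 1 <= h["ec"] <= 255
            and h["ne"] <= 256
            and h["lf"] == total)
```

A header with bc=2Bh and ec=12h, as seen in the samples of this variant,
already fails the first check because ec is lower than bc-1.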

The mentioned specification is an archived version on archive.org dated about
2012. Obviously the described items do not match "newer exotic" fonts like
Japanese ones. So I assume the described items only apply fully to fonts with
character codes of 8 bits or fewer. The bc values at offset 4 in my samples
were 43 (=2Bh), 27 (=1Bh) or 33 (=21h). The ec values at offset 6 in my
samples were 18 (=12h) or 2.

Inside tfm-tex-0x01-names.trid.xml this is expressed by an XML construct that
looks like:
 <Bytes>0001002B001200000000000200020002000100000000000000090000000000A00000</Bytes>
 <ASCII> . . . +</ASCII>
 <Pos>2</Pos>

So here all values except the first one are constant in thousands of
samples. Maybe this is because the samples are part of the
texlive-lang-japanese package. So I mention the observed items inside the
remark line:

The variant with lh=1h at offset 2, bc=2Bh (*4=172 file size) at offset 4,
ec=12h at offset 6, nw=0 at offset 8, nh=0 at offset 10, nd=2 at offset 12,
ni=2 at offset 14, nl=2 at offset 16, nk=1 at offset 18, ne=0 at offset 20,
np=0 at offset 22.

Apparently the file size here is not stored in the field lf at offset 0.
Instead the value appears 4 bytes higher, at offset 4. So what is called bc
according to the documentation here contains the file size in words.
Multiplying this value (2Bh=43) by four gives the real file size of 172
bytes.
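
This observation can be checked over many samples with a small Python sketch.
It reads the 16-bit big-endian word count at offset 4, converts it to bytes
and compares it with the real length. The function names are hypothetical and
the default offset 4 is just the assumption made in this post.

```python
import struct


def stated_size_bytes(data: bytes, offset: int = 4) -> int:
    """Return the 16-bit big-endian word count at `offset`, converted to bytes."""
    (words,) = struct.unpack_from(">H", data, offset)
    return words * 4


def length_field_matches(data: bytes, offset: int = 4) -> bool:
    """True if the word count at `offset` times four equals the data length."""
    return stated_size_bytes(data, offset) == len(data)
```

For a sample read with open(name, "rb").read(), length_field_matches(data)
tests the shifted position and length_field_matches(data, 0) tests the
standard lf position.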

With the standard interpretation the value 9 in the above construct would be
interpreted as the 32-bit check sum. I believe that this is not true. With the
interpretation shifted by 4 bytes the following 4 nil bytes would be the
checksum 0. The value zero means no check is made. The next 32-bit value
00A00000 would then be the design size of the font as fix_word. I remember
that I have seen that value in other variants. So I believe this is true and
keep the above construct.
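
As a worked check of the design size: a fix_word is a signed 32-bit big-endian
number with 20 fractional bits, so 00A00000 decodes to 10.0 (a 10 point
design size). A minimal Python sketch, where the function names and the shift
parameter are my own assumptions to compare both interpretations:

```python
import struct


def fix_word_to_float(raw: bytes) -> float:
    """Convert a 4-byte big-endian fix_word (signed, 20 fractional bits)."""
    (value,) = struct.unpack(">i", raw)
    return value / (1 << 20)


def checksum_and_design_size(data: bytes, shift: int = 4):
    """Read checksum and design size, both moved up by `shift` bytes."""
    (checksum,) = struct.unpack_from(">I", data, 24 + shift)
    design_size = fix_word_to_float(data[28 + shift:32 + shift])
    return checksum, design_size
```

With shift=4 the bytes from the construct above give checksum 0 and design
size 10.0. With shift=0 they give checksum 9 and design size 0.0, which looks
less plausible.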

Inside tfm-tex-0x01.trid.xml this is expressed by XML constructs that look
like:
   <Bytes>000100</Bytes>
   <Pos>2</Pos>
   ...
   <Bytes>000200000000000200020002000100000000000000</Bytes>
   <Pos>6</Pos>
So here ec is also lower than bc.
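
To test by hand whether such Bytes/Pos pairs hold for a given sample, a small
Python sketch like the following can be used. It only compares the hex bytes
at the given position; it is not how trid itself is implemented, and the names
pattern_matches, all_match and PATTERNS are my own.

```python
# The two Bytes/Pos pairs quoted above from tfm-tex-0x01.trid.xml.
PATTERNS = [
    ("000100", 2),
    ("000200000000000200020002000100000000000000", 6),
]


def pattern_matches(data: bytes, hex_bytes: str, pos: int) -> bool:
    """True if the bytes given as a hex string occur in `data` exactly at `pos`."""
    needle = bytes.fromhex(hex_bytes)
    return data[pos:pos + len(needle)] == needle


def all_match(data: bytes, patterns=PATTERNS) -> bool:
    """True if every (hex string, position) pair matches."""
    return all(pattern_matches(data, h, p) for h, p in patterns)
```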

If I try to convert here like in other variants by running a command line tool
like:
   tftopl zu-cidjmr5-v.tfm zu-cidjmr5-v.pl
      I get output like:
This is TFtoPL, Version 3.3 (MiKTeX 24.3)
There's some extra junk at the end of the TFM file,
but I'll proceed as if it weren't there.
The header length is only 1!
Sorry, but I can't go on; are you sure this is a TFM?

In other variants the file length in words is stored as a 2-byte big-endian
integer at offset 0. Multiplying this value by 4 gives the file size in
bytes. In this variant this information is stored 4 bytes higher, at
offset 4.
      So I tried commands like:
   dd bs=1 skip=4 if=zu-cidjmr5-v.tfm of=zu-cidjmr5-v.bin
   tftopl zu-cidjmr5-v.bin zu-cidjmr5-v.pl
      Now I got output like:
This is TFtoPL, Version 3.3 (MiKTeX 24.3)
The file has fewer bytes than it claims!
Sorry, but I can't go on; are you sure this is a TFM?
      Then I tried commands like:
   cp zu-cidjmr5-v.bin zu-cidjmr5-v-mod.bin
   echo.exe -n "1234" >> zu-cidjmr5-v-mod.bin
   tftopl zu-cidjmr5-v-mod.bin  zu-cidjmr5-v.pl
      Now I got output like:
This is TFtoPL, Version 3.3 (MiKTeX 24.3)
Subfile sizes don't add up to the stated total!
Sorry, but I can't go on; are you sure this is a TFM?

Apparently the coding scheme name is also stored at a higher offset (37=33+4
compared with other variants). The coding scheme names (maximal length 39) in
my examples are (TEX KANJI TEXT) and (UNSPECIFIED). Both happen to have the
upper case letter I at offset 45. This letter and the remaining 25 padding
bytes are expressed inside tfm-tex-0x01-names.trid.xml by XML constructs like:
   <Pattern>
      <Bytes>49</Bytes>
      <ASCII> I</ASCII>
      <Pos>45</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00000000000000000000000000000000000000000000000000</Bytes>
      <Pos>51</Pos>
   </Pattern>
Assuming that other, longer coding scheme names with length 38 instead of the
maximal 39 also exist, only one padding byte survives and the construct
becomes:
   <Bytes>00</Bytes>
   <Pos>75</Pos>

Afterwards at offset 77 (4 bytes higher than in other variants) the font
family name (like OTF KANJI, JODEL or UNSPECIFIED, maximal length 19) is
stored. After the font family name the remaining 8 padding bytes are stored.
At offset 96 (92 plus 4 compared with other variants) the seven bit safe byte
with value 80h again seems to be stored. This is expressed by an XML construct
like:
 <Bytes>000000000000000080000000000000000111000000000000001000000000000000</Bytes>
 <Pos>88</Pos>
So when the fields are found here at offsets 4 higher, then what is shown here
as the ec value is probably the real data header size with value 18. That
means header[17] is the last element. If this behaves as described, this
element contains a first byte called the seven_bit_safe_flag, then two bytes
that are ignored, and a fourth byte called the face. This also means that at
offset 100 (=24+4+18*4) the next structure begins. Assuming that other, longer
font family names with length 18 instead of the maximal 19 also exist, only
one padding byte survives and the construct becomes:
 <Bytes>0080000000</Bytes>
 <Pos>95</Pos>
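
The two name fields can be dumped with a short Python sketch. It assumes that,
as in standard TFM, the byte directly before each name holds its length (so at
offsets 36 and 76 in this variant). I did not verify that assumption on all
samples, and the function names are hypothetical.

```python
def read_length_prefixed(data: bytes, offset: int, max_len: int) -> str:
    """Read one length byte at `offset`, then that many characters after it."""
    length = min(data[offset], max_len)
    text = data[offset + 1:offset + 1 + length]
    return text.decode("ascii", errors="replace")


def read_names(data: bytes, shift: int = 4):
    """Return (coding scheme, font family), both moved up by `shift` bytes."""
    coding_scheme = read_length_prefixed(data, 32 + shift, 39)
    font_family = read_length_prefixed(data, 72 + shift, 19)
    return coding_scheme, font_family
```

For the samples without names the length bytes will not be meaningful, so the
result for those is only garbage to look at, not a reliable test.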

That also means that patterns at higher offsets belong to the next structures
and are similar only because of lucky circumstances. They were expressed by
XML constructs like:
   <Pattern>
      <Bytes>0000000000</Bytes>
      <Pos>124</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00000000000000000000000000</Bytes>
      <Pos>132</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00000000001000000010000000</Bytes>
      <Pos>148</Pos>
   </Pattern>
   <Pattern>
      <Bytes>000000</Bytes>
      <Pos>162</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>168</Pos>
   </Pattern>
   <Pattern>
      <Bytes>0000</Bytes>
      <Pos>170</Pos>
   </Pattern>
So I deleted such "high" patterns.

Unfortunately there exists a variant with lh=1 and without ASCII-like strings
for the coding scheme name and font family name. Hundreds of such samples are
described by tfm-tex-0x01.trid.xml.

In the other variant (with names) the coding scheme name with maximal length
39 is stored at offset 37. This is followed at offset 77 by the font family
name with maximal length 19. So in my naive thinking I would expect nil bytes
in that area when no names exist. But apparently this is not true. This is
expressed by XML constructs like:
   <Pattern>
      <Bytes>0000000000A00000000000000111000000000000001000000000000000</Bytes>
      <Pos>28</Pos>
   </Pattern>
   <Pattern>
      <Bytes>0000000000</Bytes>
      <Pos>60</Pos>
   </Pattern>
   <Pattern>
      <Bytes>0000000000000000000000000000</Bytes>
      <Pos>68</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00000000001000000010000000</Bytes>
      <Pos>84</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00000000000000000000</Bytes>
      <Pos>98</Pos>
   </Pattern>
So I do not know or understand what is going on here. I keep these
constructs.

Compared with other variants, the 32-bit value 00A00000 at offset 32 would
mean the design size of the font as fix_word. I remember that I have seen
that value in other variants. So I believe this is true and mention this
observation in the remark line.

The significant part with lh=1 is expressed in this variant by XML construct
like:
   <Bytes>000100</Bytes>
   <Pos>2</Pos>
Apparently the file size here is also not stored in the field lf at offset
0. Instead the value appears 4 bytes higher, at offset 4. So what is called
bc according to the documentation contains the file size in words.
Multiplying this value (1Bh=27 or 21h=33) by four gives the real file size
(108 or 132) in bytes.

The next significant part is expressed by XML construct like:
   <Bytes>000200000000000200020002000100000000000000</Bytes>
   <Pos>6</Pos>
So the remaining fields in the header are also constant and in most fields I
get the same value as in the other variant. The only difference is that the ec
field at offset 6 has value 2. So I mention the observed items inside the
remark line.

As far as I can see there exist less than a dozen variants with other lh
values. In other variants the lh values change only a little. I will
handle the other variants in a future session.

With the new definitions TFM samples with lh=01 are now recognized and
described (see appended trid-v-new.txt trid-new.txt in output and
nonames/output/). The definitions are "good" in that they do not misidentify
non-TFM samples. And because of some more conditions compared with other
variants the description as "TeX Font Metric" comes first.

TrID definitions, some samples and output are stored in the archive
tfm_0x01.zip. I hope that my definitions can be used in a future version of
triddefs. As mentioned there exist other variants of TFM. I will try to handle
these in a future session.

With best wishes
Jörg Jenderek