Author Topic: tfm-tex-0x23.trid.xml for TeX Font Metric; variants with lh=0x23  (Read 423 times)

jenderek

Hello trid users,

some days ago I looked at the content of an exotic CD-ROM. It also contains
samples which are misidentified. The samples have the TFM file name suffix.

Other variants exist, so in this session I will handle only the variant with
lh=0x23. I will explain later what this means. I found no such samples after
installing MiKTeX version 23.12 on Windows. On Linux Mint 21.3 I found
hundreds of such samples with font family name OTF KANJI and encoding name
(TEX KANJI TEXT) as part of the texlive-lang-japanese package, version
2021.20220204-1.

So I ran the trid utility on my TFM samples with lh=0x23. The samples are not
recognized. Many are wrongly described as "Adobe PhotoShop Brush" by
abr.trid.xml with file name suffix (.ABR) (see appended trid-old.txt and
trid-v-old.txt in output).

For comparison I also ran the file format identification utility DROID
(see https://sourceforge.net/projects/droid/). Here no sample is recognized.

For comparison I also ran the file command (version 5.45) on such samples.
Here these TFM samples are not recognized as "TeX font metric data". They are
described as "data" (see appended file-k-5.45.txt in output). For these TFM
samples the mime type application/x-tex-tfm is not shown (see appended
file-i-5.45.txt in output), and no file name suffix is shown (see appended
file-ext-5.45.txt in output).

Luckily I found pages about TeX Font Metrics on the File Formats Archive Team
web site and on Wikipedia. I use the first, because it also mentions the
Wikipedia link and furthermore lists links to download samples. So the
reference URL in the new definition is expressed by a line like:
 <RefURL>http://fileformats.archiveteam.org/wiki/TeX_Font_Metrics</RefURL>

So I ran tridscan on my samples to generate tfm-tex-0x23.trid.xml. Afterwards
I tried to understand the generated constructs and to check whether these are
always true. I first thought this variant is like the other variants, just
with fewer words in the data header, but unfortunately this is less than half
of the truth.

According to the mentioned specification the six-word (24-byte) file header
contains twelve unsigned 16-bit integers which describe general TFM
characteristics (the length of the file, the range of character codes
contained in the font, and the size of each of the tables). According to the
specification I patched the file command (see appended file.tmp in output and
nonames/output).
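
To make the header layout concrete, here is a minimal Python sketch that
decodes the first 24 bytes as twelve big-endian 16-bit integers, using the
field names from the specification. The function name read_tfm_header and the
sample bytes are only illustrative: the bytes are built by hand from values
reported in this post, with arbitrary placeholders for the fields that vary,
and are not taken from a real file.

```python
import struct

# Field names in the order the TFM specification lists them.
FIELDS = ("lf", "lh", "bc", "ec", "nw", "nh", "nd", "ni", "nl", "nk", "ne", "np")

def read_tfm_header(data: bytes) -> dict:
    """Decode the first 24 bytes as twelve big-endian unsigned 16-bit integers."""
    if len(data) < 24:
        raise ValueError("need at least 24 bytes")
    return dict(zip(FIELDS, struct.unpack(">12H", data[:24])))

# Hand-built example (assumption): lh=23h, bc=7Eh, ec=12h as reported above;
# the values used for lf, nh, nd and ne are placeholders.
sample = bytes.fromhex("0000 0023 007e 0012 0000 0002 0002 0002 0002 0001 0003 0001")
header = read_tfm_header(sample)
print(header["lh"], header["bc"], header["ec"])
```

Reading the fields under their standard names is what makes the oddity of
this variant visible: bc comes out larger than ec.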

The specification lists some formulas like:
    bc-1 <= ec <= 255
    ne <= 256
    lf=6+lh+(ec-bc+1)+nw+nh+nd+ni+nl+nk+ne+np
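
These three conditions can be checked mechanically. The following Python
sketch is only an illustration: the function name check_tfm_header and both
example headers are made up for this purpose (the second one just swaps in
bc=126 and ec=18, the kind of values seen in this variant).

```python
def check_tfm_header(h: dict) -> list:
    """Return the formulas from the specification that the header violates."""
    problems = []
    if not (h["bc"] - 1 <= h["ec"] <= 255):
        problems.append("bc-1 <= ec <= 255")
    if not h["ne"] <= 256:
        problems.append("ne <= 256")
    expected = (6 + h["lh"] + (h["ec"] - h["bc"] + 1) + h["nw"] + h["nh"]
                + h["nd"] + h["ni"] + h["nl"] + h["nk"] + h["ne"] + h["np"])
    if h["lf"] != expected:
        problems.append("lf = 6+lh+(ec-bc+1)+nw+nh+nd+ni+nl+nk+ne+np")
    return problems

# Made-up header that satisfies all three formulas.
ok = dict(lf=186, lh=18, bc=0, ec=127, nw=10, nh=5, nd=5,
          ni=2, nl=3, nk=2, ne=0, np=7)
# Same header with bc and ec replaced by values like those in this variant.
odd = dict(ok, bc=126, ec=18)

print(check_tfm_header(ok))
print(check_tfm_header(odd))
```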

The mentioned specification is an archived version on archive.org dated about
2012. Obviously the described items do not match "newer exotic" fonts like
Japanese ones. So I assume the described items only apply in full to fonts
with character codes of 8 bits or fewer. The bc values at offset 4 in my
samples were 126 (=7Eh), 144 (=90h) or 146 (=92h). The ec value at offset 6 in
my samples was 18 (=12h).

Inside tfm-tex-0x23.trid.xml this is expressed by an XML construct that looks
like:
   <Bytes>0012000000</Bytes>
   <Pos>6</Pos>

So here all values in the header except the first one (lf), nh, nd and ne are
constant across hundreds of samples. Maybe this is because the samples are all
part of the texlive-lang-japanese package. So I mention the observed items
inside the remark line:
The variant with lh=23h at offset 2, bc=2Bh (*4=504 576 584 file size) at
offset 4, ec=12h at offset 6, nw=0 at offset 8, nh<256 at offset 10, nd<256 at
offset 12, ni=2 at offset 14, nl=2 at offset 16, nk=1 at offset 18, ne<256 at
offset 20, np=0 at offset 22.

This is expressed by XML constructs like:
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>0</Pos>
   </Pattern>
   <Pattern>
      <Bytes>0012000000</Bytes>
      <Pos>6</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>12</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00020002000100</Bytes>
      <Pos>14</Pos>
   </Pattern>
   <Pattern>
      <Bytes>000100</Bytes>
      <Pos>22</Pos>
   </Pattern>
So here ec is also lower than bc.
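
To check such constructs by hand against a sample, a small Python sketch like
the following can be used. It assumes that each <Pattern> simply means "these
bytes must appear at exactly this position"; the function name matches and the
25-byte test input are made up for illustration and do not come from a real
file.

```python
# (position, hex bytes) pairs copied from the constructs above.
PATTERNS = [
    (0, "00"),
    (6, "0012000000"),
    (12, "00"),
    (14, "00020002000100"),
    (22, "000100"),
]

def matches(data: bytes, patterns=PATTERNS) -> bool:
    """True if every byte string occurs at exactly its stated offset."""
    return all(data.startswith(bytes.fromhex(hx), pos) for pos, hx in patterns)

# Hand-built 25 bytes (assumption): placeholders wherever no pattern applies.
sample = bytes.fromhex(
    "0000 0023 007e 0012 0000 0002 0002 0002 0002 0001 0003 0001 00")
print(matches(sample))

# Changing the value at offset 6 breaks the second pattern.
changed = sample[:6] + b"\x00\xff" + sample[8:]
print(matches(changed))
```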

Apparently the file size here is not stored in field lf at offset 0. Instead
the value appears 4 bytes higher, at offset 4. So the field that the
documentation calls bc contains here the file size in words. Multiplying this
value (7Eh, 90h, 92h) by four gives the real file size (504, 576, 584) in
bytes.
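
The arithmetic can be sketched in Python as follows. The helper names
size_from_offset and length_field_matches are made up for illustration, and
the test input is a zero-filled stand-in of the right size, not a real sample.

```python
import struct

def size_from_offset(data: bytes, offset: int) -> int:
    """Read a big-endian 16-bit word count at `offset` and convert it to bytes."""
    (words,) = struct.unpack_from(">H", data, offset)
    return words * 4

def length_field_matches(data: bytes, offset: int) -> bool:
    """True if the word count stored at `offset` equals the real data length."""
    return size_from_offset(data, offset) == len(data)

# The three word counts seen at offset 4 in the samples.
for words in (0x7E, 0x90, 0x92):
    print(hex(words), "->", words * 4, "bytes")

# Zero-filled stand-in: 504 bytes with the word count 7Eh stored at offset 4.
fake = bytearray(504)
fake[4:6] = (0x7E).to_bytes(2, "big")
print(length_field_matches(bytes(fake), 4), length_field_matches(bytes(fake), 0))
```

For the stand-in the check succeeds at offset 4 and fails at offset 0, which
is the behaviour described above for this variant.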

With the standard interpretation the value 9 at offset 27 in the seventh
construct would be part of the 32-bit checksum. I believe that this is not
true. With the 4-byte-shifted interpretation the following 4 nil bytes would
be the checksum 0. The value zero means no check is made. Then the following
32-bit value 00A00000 would be the design size of the font as a fix_word. I
remember having seen that value in other variants, so I believe this is true
and keep the above constructs.
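
A fix_word in TFM is a signed 32-bit big-endian number with the binary point
20 bits from the right, so the raw value has to be divided by 2^20. A minimal
Python sketch (the function name fix_word_to_float is made up) shows what
00A00000 stands for:

```python
def fix_word_to_float(raw: bytes) -> float:
    """Interpret 4 bytes as a TFM fix_word: signed 32-bit big-endian / 2**20."""
    if len(raw) != 4:
        raise ValueError("a fix_word is exactly 4 bytes")
    return int.from_bytes(raw, "big", signed=True) / 2**20

design_size = fix_word_to_float(bytes.fromhex("00A00000"))
print(design_size)  # 10.0
```

A design size of 10 units is a plausible value for a font, which supports the
shifted reading.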

If I try to convert here as in other variants by running a command line tool like:
   tftopl rubyminr-v.tfm rubyminr-v.pl
      I get output like:
This is TFtoPL, Version 3.3 (MiKTeX 24.3)
There's some extra junk at the end of the TFM file,
but I'll proceed as if it weren't there.
The character code range 126..18 is illegal!
Sorry, but I can't go on; are you sure this is a TFM?

In other variants the file length in words is stored as a 2-byte big-endian
integer at offset 0. Multiplying this value by 4 gives the file size in bytes.
In this variant this information is stored 4 bytes higher.
      So I tried commands like:
   dd bs=1 skip=4 if=rubyminr-v.tfm of=rubyminr-v.bin
   tftopl rubyminr-v.bin rubyminr-v.pl
      Now I got output like:
This is TFtoPL, Version 3.3 (MiKTeX 24.3)
The file has fewer bytes than it claims!
Sorry, but I can't go on; are you sure this is a TFM?
      Then I tried commands like:
   cp rubyminr-v.bin rubyminr-v-mod.bin
   echo -n "1234" >> rubyminr-v-mod.bin
   tftopl rubyminr-v-mod.bin rubyminr-v-mod.pl
      Now I got output like:
This is TFtoPL, Version 3.3 (MiKTeX 24.3)
Subfile sizes don't add up to the stated total!
Sorry, but I can't go on; are you sure this is a TFM?

Apparently the coding scheme name is also stored at a higher offset (37=33+4
compared with other variants). After the coding scheme name (maximal 39
characters) like (TEX KANJI TEXT) with string length 14 (=Eh) the remaining 25
padding bytes in my examples are nil. Apparently the font family name is also
stored at a higher offset (77=73+4 compared with other variants). After the
family name (OTF KANJI, maximal 19 characters) with string length 9 the
remaining 10 padding bytes in my examples are nil. At offset 96 (92 plus 4
compared with other variants) the seven-bit-safe byte again seems to be
stored, with value 80h. These observations are expressed by XML constructs
like:

<Bytes>00090000000000A00000
0E544558204B414E4A49205445585400000000000000000000000000000000000000000000000000
094F5446204B414E4A49000000000000000000008000000000000000212200</Bytes>
<ASCII> . . . . . . . . . .
. T E X   K A N J I   T E X T
. . . . . . . . . . . . . . . . . . . . . . . . .
. O T F   K A N J I . . . . . . . . . . . . . . . . . . ! "</ASCII>
<Pos>26</Pos>

Assuming the 4-byte-shifted interpretation, the design size 00A00000 is
probably stored at offset 32 and the 32-bit checksum 0 before it, preceded by
a byte with value 9. So I delete the bytes before the "checksum" part.
Assuming the value 18 at offset 6 (ec in the standard layout) is the real data
header size in words, header[17] is the last word of the data header. At
offset 96 (92 plus 4 compared with other variants) the seven-bit-safe byte
again seems to be stored, with value 80h. After this byte come 2 unused bytes
(apparently nil) followed by the face byte. So the next structure starts at
offset 100, and I delete the last bytes after the "face" byte. The above
construct then becomes:

<Bytes>0000000000A00000
0E544558204B414E4A49205445585400000000000000000000000000000000000000000000000000
094F5446204B414E4A490000000000000000000080000000</Bytes>
<ASCII> . . . . . . . .
. T E X   K A N J I   T E X T . . . . . . . . . . . . . . . . . . . . . . . . .
. O T F   K A N J I . . . . . . . . . . . . . .</ASCII>
<Pos>28</Pos>
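
The shifted offsets can be cross-checked with a short Python sketch. It builds
a 100-byte stand-in from the field values described above (the first 28 bytes
are just zero filler, not real header data) and reads the two length-prefixed
strings back at offsets 36 and 76. The helper name pascal_string is made up
for illustration.

```python
def pascal_string(data: bytes, offset: int, max_len: int) -> str:
    """Read a string stored as one length byte followed by that many characters."""
    n = data[offset]
    if n > max_len:
        raise ValueError(f"length {n} exceeds field limit {max_len}")
    return data[offset + 1:offset + 1 + n].decode("ascii")

# Data header as described in this post, assumed to start at offset 28.
header = (
    bytes(4)                                        # checksum 0
    + bytes.fromhex("00A00000")                     # design size
    + bytes([14]) + b"TEX KANJI TEXT" + bytes(25)   # coding scheme, 40 bytes
    + bytes([9]) + b"OTF KANJI" + bytes(10)         # family name, 20 bytes
    + bytes.fromhex("80000000")                     # 80h, 2 unused bytes, face
)
data = bytes(28) + header  # zero filler for offsets 0..27 (assumption)

print(pascal_string(data, 36, 39))
print(pascal_string(data, 76, 19))
print(hex(data[96]), len(data))
```

The 18 header words take 72 bytes, so a header starting at offset 28 ends
right before offset 100, matching the positions used in the construct.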

That also means that patterns at higher offsets belong to the following
structures and are similar only because of lucky circumstances. Those were
expressed by XML constructs like:
   <Pattern>
      <Bytes>21230004212400</Bytes>
      <ASCII> ! # . . ! $</ASCII>
      <Pos>108</Pos>
   </Pattern>
   <Pattern>
      <Bytes>2125000421260003212700032128000321290006212A0006213D000521440005
      <ASCII> ! % . . ! . . . ! ' . . ! ( . . ! ) . . ! * . . ! = . . ! D . .
      <Pos>116</Pos>
   </Pattern>
   <Pattern>
      <Bytes>110100</Bytes>
      <Pos>241</Pos>
   </Pattern>
   <Pattern>
      <Bytes>110102</Bytes>
      <Pos>245</Pos>
   </Pattern>
   <Pattern>
      <Bytes>110103</Bytes>
      <Pos>249</Pos>
   </Pattern>
   <Pattern>
      <Bytes>1101</Bytes>
      <Pos>253</Pos>
   </Pattern>
   <Pattern>
      <Bytes>011101</Bytes>
      <Pos>256</Pos>
   </Pattern>
   <Pattern>
      <Bytes>1101</Bytes>
      <Pos>261</Pos>
   </Pattern>
   <Pattern>
      <Bytes>1101</Bytes>
      <Pos>265</Pos>
   </Pattern>
   ...
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>500</Pos>
   </Pattern>
   <Pattern>
      <Bytes>0000</Bytes>
      <Pos>502</Pos>
   </Pattern>
So I delete such "high" patterns.

As far as I can see there exist fewer than a dozen variants with other lh
values. In other variants the lh values change only a little bit. I will
handle the other variants in a future session.

With the new definition, TFM samples with lh=23h are now recognized and
described (see appended trid-v-new.txt and trid-new.txt in output). The
definition is "good" in that it does not misidentify non-TFM samples. And
because of some more conditions compared with other variants, the description
as "TeX Font Metric" comes first.

TrID definitions, some samples and output are stored in archive
tfm_0x23.zip. I hope that my definition can be used in a future version of
triddefs. As mentioned, there exist other variants of TFM. I will try to
handle these in a future session.

With best wishes
Jörg Jenderek

Mark0

Re: tfm-tex-0x23.trid.xml for TeX Font Metric; variants with lh=0x23
« Reply #1 on: May 01, 2024, 07:36:04 PM »
Thanks!