Author Topic: tfm-tex-0x78.trid.xml for TeX Font Metric; variant with lh=0x78

jenderek

Hello trid users,

Some days ago I looked at the contents of an exotic CD-ROM. It also contains
samples which are misidentified. These samples have the TFM file name suffix.

Other variants exist, so in this session I will handle only the variant with
lh=0x78. I will explain later what this means. I found a few dozen such
samples (like nmin10.tfm in the nmin-ngoth directory with parent directory
ptex-fonts inside the fonts sub directory tfm) after installing MiKTeX version
23.12 on Windows. On Linux Mint 21.2 I found such samples as part of the
texlive-lang-japanese package, version 2021.20220204-1.

So I ran the trid utility on my TFM samples with lh=0x78. The samples are not
recognized. Many are wrongly described as "Adobe PhotoShop Brush" by
abr.trid.xml with file name suffix (.ABR). Some real ABR samples are described
as "TTComp archive compressed (bin-4K)" by ark-ttcomp-bin-4k.trid.xml (see
appended trid-v-old.txt in output).

It took some time to find a few non-TFM samples (like *.gds) that are
identified the same way as the misidentified TFM samples.

For comparison I also ran the file format identification utility DROID
(see https://sourceforge.net/projects/droid/). Here no sample is recognized.
The sample with the bin file name suffix is therefore described as "Binary File"
by PUID fmt/208.

For comparison I also ran the file command (version 5.45) on these
samples. The TFM samples are not recognized: they are not described as "TeX
font metric data" but only as "data". The ABR samples are not recognized
either. Many are described first as "GDSII Stream file", sometimes with
obviously wrong, high version numbers. Many are also described as "TTComp
archive data, binary, 4K dictionary" (see appended file-k-5.45.txt in
output). For these TFM samples the mime type application/x-tex-tfm is not
shown (see appended file-i-5.45.txt in output), and no file name suffix is
shown (see appended file-ext-5.45.txt in output).

Luckily I found pages about TeX Font Metrics on the File Formats Archive Team
web site and on Wikipedia. I use the former, because it also mentions the
Wikipedia link and furthermore lists links to download samples. So the
reference URL in the new definition is expressed by a line like:
 <RefURL>http://fileformats.archiveteam.org/wiki/TeX_Font_Metrics</RefURL>

So I ran tridscan on my samples to generate tfm-tex-0x78.trid.xml. Afterwards
I tried to understand the generated constructs and to check whether they are
always true. I first thought this variant is like the others, with just some
more words in the header data, but unfortunately this is less than half of the
truth.

According to the mentioned specification the six-word (24-byte) file header
contains twelve unsigned 16-bit integers which describe general TFM
characteristics (the length of the file, the range of character codes
contained in the font, and the size of each of the tables). Following the
specification I patched the file command (see appended file.tmp in output).

The specification lists some formulas like:
    bc-1 <= ec <= 255
    ne <= 256
    lf = 6+lh+(ec-bc+1)+nw+nh+nd+ni+nl+nk+ne+np
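
These formulas can be checked mechanically. The following is a minimal Python
sketch, assuming the classic layout in which the twelve 16-bit big-endian
fields start at offset 0 in the order lf, lh, bc, ec, nw, nh, nd, ni, nl, nk,
ne, np. The function names are made up for illustration, and this is not the
patch applied to the file command.

```python
import struct

# Field order of the classic 24-byte TFM header (twelve big-endian 16-bit words).
FIELDS = ("lf", "lh", "bc", "ec", "nw", "nh", "nd", "ni", "nl", "nk", "ne", "np")


def parse_tfm_header(data: bytes) -> dict:
    """Unpack the twelve header fields, assuming the classic layout."""
    if len(data) < 24:
        raise ValueError("need at least 24 bytes")
    return dict(zip(FIELDS, struct.unpack(">12H", data[:24])))


def header_problems(h: dict) -> list:
    """Return the formulas that do not hold (an empty list means consistent)."""
    problems = []
    if not (h["bc"] - 1 <= h["ec"] <= 255):
        problems.append("bc-1 <= ec <= 255")
    if h["ne"] > 256:
        problems.append("ne <= 256")
    tables = sum(h[k] for k in ("nw", "nh", "nd", "ni", "nl", "nk", "ne", "np"))
    if h["lf"] != 6 + h["lh"] + (h["ec"] - h["bc"] + 1) + tables:
        problems.append("lf = 6+lh+(ec-bc+1)+nw+nh+nd+ni+nl+nk+ne+np")
    return problems
```

Passing the first 24 bytes of a file to parse_tfm_header and the result to
header_problems lists the formulas that are violated when the file is read
this way.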

The mentioned specification is an archived version on archive.org dated about
2012. Obviously the described items do not match "newer exotic" fonts like
Japanese ones. So I assume the description applies in full only to fonts with
character codes of 8 bits or fewer. Read according to the mentioned
specification, the first character code (bc) is "too high" (like 276 or 299),
that is, above 255. The ec value in my samples was 18 (12 hexadecimal). So
here ec is lower than bc.

If I try to convert, as with other variants, by running a command line tool like:
   tftopl goth10.tfm goth10.pl

I get output like:
This is TFtoPL, Version 3.3 (MiKTeX 24.3)
There's some extra junk at the end of the TFM file,
but I'll proceed as if it weren't there.
The character code range 299..18 is illegal!
Sorry, but I can't go on; are you sure this is a TFM?

In other variants the file length in words is stored as a 2-byte big-endian
integer at offset 0. Multiplying this value by 4 gives the file size in
bytes. In this variant this information is stored 4 bytes higher, at
offset 4.

Apparently the coding scheme name is also stored at a higher offset (37=33+4
compared with other variants). So the 15 bytes of the coding scheme in my
examples (length byte 0Eh, maximal 39, followed by the 14 characters
TEX KANJI TEXT) are expressed by an XML construct like:
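
The size check described above can be sketched in Python as follows. This is
only an illustration under the assumptions stated in the post (word count as a
big-endian 16-bit value at offset 0 or 4, one word = 4 bytes); the function
names are hypothetical.

```python
import struct


def size_from_word_count(data: bytes, offset: int) -> int:
    """Read a big-endian 16-bit word count at `offset` and convert it to bytes."""
    (words,) = struct.unpack_from(">H", data, offset)
    return words * 4


def length_offset(data: bytes):
    """Return the offset (0 or 4) whose word count matches the real size, else None."""
    for offset in (0, 4):
        if len(data) >= offset + 2 and size_from_word_count(data, offset) == len(data):
            return offset
    return None
```

For a real sample the bytes would come from something like
open("goth10.tfm", "rb").read(). A result of 4 would correspond to the variant
handled here, a result of 0 to the other variants.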

 <Bytes>00000E544558204B414E4A492054455854
 0000000000000000000000000000000000000000000000000006</Bytes>
 <ASCII> . . . T E X   K A N J I   T E X T</ASCII>
 <Pos>34</Pos>

Afterwards at offset 77 (4 bytes higher than in other variants) the 6-byte
font family name (like MINCHO or GOTHIC, maximal 19) is stored. At offset 96
(92 plus 4 compared with other variants) the seven-bit-safe byte again seems
to be stored, with value 80h. That is expressed by an XML construct like:
   <Bytes>00000000000000000000000000800000</Bytes>
   <Pos>83</Pos>
But I do not know if this is always true. So I keep it for the moment and
mention my observations in a remark line.
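
The two names and the byte at offset 96 can be read with a few lines of
Python. This is a sketch based only on the offsets observed above (length byte
of the coding scheme at 36, length byte of the family name at 76, flag byte at
96); the helper names are invented and the sample bytes are synthetic, not
taken from a real file.

```python
def read_counted_string(data: bytes, offset: int, max_len: int) -> str:
    """Read one length byte at `offset`, then that many ASCII characters."""
    length = data[offset]
    if length > max_len:
        raise ValueError(f"length {length} at offset {offset} exceeds {max_len}")
    return data[offset + 1:offset + 1 + length].decode("ascii")


def describe_0x78_header(data: bytes) -> dict:
    """Collect the fields discussed in the post (offsets are observations, not spec)."""
    return {
        "coding_scheme": read_counted_string(data, 36, 39),
        "family": read_counted_string(data, 76, 19),
        "byte_96": data[96],
    }


# Synthetic stand-in for the first 100 bytes of a sample.
sample = (bytes(36) + b"\x0e" + b"TEX KANJI TEXT" + bytes(25)
          + b"\x06" + b"MINCHO" + bytes(13) + b"\x80" + bytes(3))
print(describe_0x78_header(sample))
```

Running this over all samples would show whether the byte at offset 96 is
really always 80h.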

Now comes the interesting part. At offset 2 the length of the header data is
stored in word units. For some dozens of my inspected TFM samples this value
is 120 (=0x78). The samples in this session all have this value. Together with
the neighbouring bytes this is expressed by an XML construct like:
   <Bytes>000B007801</Bytes>
   <ASCII> . . . x</ASCII>
   <Pos>0</Pos>
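
To try the byte patterns quoted so far outside of trid, a simplified
positional matcher can be sketched in Python. It only models "these bytes at
this position" and is not TrID's actual matching engine; the function name is
hypothetical and the pattern list just repeats the three constructs shown in
this post.

```python
# (position, hex bytes) pairs copied from the XML constructs in this post.
PATTERNS = [
    (0, "000B007801"),
    (34, "00000E544558204B414E4A492054455854"),
    (83, "00000000000000000000000000800000"),
]


def matches_all(data: bytes, patterns=PATTERNS) -> bool:
    """Return True only if every pattern is found at its exact position."""
    for pos, hex_bytes in patterns:
        expected = bytes.fromhex(hex_bytes)
        # A slice that runs past the end of the data is shorter and cannot match.
        if data[pos:pos + len(expected)] != expected:
            return False
    return True
```

A file that only shares the first five bytes with the TFM samples, as some of
the non-TFM samples may do, fails on the later patterns in this model.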

As far as I can see there exist fewer than a dozen variants with other lh
values. In the other variants the lh value changes only a little. I will
handle the other variants in a future session.

Compared with other variants I get more patterns. Because the mentioned
specification does not fully match, I do not know exactly how to interpret
these patterns and whether they are always true. So I keep most patterns. At
higher offsets I get short nil sequences like:
      <Pattern>
         <Bytes>00</Bytes>
         <Pos>632</Pos>
      </Pattern>
      ...
      <Pattern>
         <Bytes>00</Bytes>
         <Pos>984</Pos>
      </Pattern>
      ..
      <Pattern>
         <Bytes>00</Bytes>
         <Pos>1100</Pos>
      </Pattern>
I assume that these are triggered by lucky circumstances (too few
examples). So I delete these patterns.
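
Deleting such patterns by hand is easy for a few entries. For many entries it
could be scripted, for example with the following Python sketch. It assumes
only the Pattern/Bytes/Pos nesting shown above; the function name, the
enclosing FrontBlock element in the usage example and the min_pos parameter
are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def drop_nil_patterns(xml_text: str, min_pos: int = 0) -> str:
    """Remove <Pattern> entries that consist of a single 00 byte at Pos >= min_pos."""
    root = ET.fromstring(xml_text)
    # Take a snapshot of all elements so that removing children is safe.
    for parent in list(root.iter()):
        for pattern in parent.findall("Pattern"):
            raw = bytes.fromhex(pattern.findtext("Bytes", "").strip())
            pos = int(pattern.findtext("Pos", "0"))
            if raw == b"\x00" and pos >= min_pos:
                parent.remove(pattern)
    return ET.tostring(root, encoding="unicode")


# Usage with a shortened, hypothetical definition fragment.
fragment = """<FrontBlock>
  <Pattern><Bytes>000B007801</Bytes><Pos>0</Pos></Pattern>
  <Pattern><Bytes>00</Bytes><Pos>632</Pos></Pattern>
  <Pattern><Bytes>00</Bytes><Pos>984</Pos></Pattern>
</FrontBlock>"""
cleaned = drop_nil_patterns(fragment, min_pos=100)
print(cleaned)
```

The min_pos parameter keeps single nil bytes near the start of the file, in
case those are considered meaningful.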

With the new definition all TFM samples with header size 0x78 are now
recognized and described (see appended trid-v-new.txt and trid-new.txt in
output). The definition is "good" in that it does not misidentify non-TFM
samples. And because it has some more conditions than the other variants, the
description as "TeX Font Metric" comes first.

The TrID definition, some samples and output are stored in the archive
tfm_0x78.zip. I hope that my definition can be used in a future version of
triddefs. As mentioned, there exist other variants of TFM. I will try to handle
these in a future session.

With best wishes
Jörg Jenderek


Mark0

Re: tfm-tex-0x78.trid.xml for TeX Font Metric; variant with lh=0x78
« Reply #1 on: April 12, 2024, 10:04:22 PM »
Thanks!