Topic: tfm-tex-0x02.trid.xml for TeX Font Metric; variant with lh=0x02

jenderek

Hello trid users,

Some days ago I looked at the content of an exotic CD-ROM. It also contains
samples which are misidentified. The samples have the TFM file name suffix.

There exist other variants. So in this session I will handle only the variant
with lh=0x02. I will explain later what this means.

I found such samples after installing MiKTeX version 23.12 on Windows. On
Linux Mint 21.3 I found such samples as part of packages (like texlive-base,
texlive-fonts-recommended, texlive-lang-greek, texlive-lang-other,
texlive-latex-extra, texlive-music, texlive-pictures, texlive-science).

So I ran the trid utility on my TFM samples with lh=0x02. The samples are not
recognized. Many are described as "Unknown!". For the other samples I get
many different descriptions, but all are wrong (see appended trid-v-old.txt in
output). I have many hundreds of such TFM samples. It took some time to get a
dozen non-TFM samples which match the misidentified TFM samples. Many are
described as "Adobe PhotoShop Brush" by abr.trid.xml. A few are described as
"Commodore 128 BASIC V7.0 program" by prg-c128.trid.xml or as "Commodore 128
BASIC V7.0 program (graph mode on)" by prg-c128-gfx.trid.xml. A few (like
cmman.tfm gen9.tfm) are described as "MacBinary 1" by
macbinary-1.trid.xml. A few samples (like yarborn.tfm) are described as "PC9801
rip" by m-mod.trid.xml.

For comparison I also ran the file format identification utility DROID
(see https://sourceforge.net/projects/droid/). Here most of the samples are
also not recognized. Samples with the bin suffix (like ttcomp-bin-4k.bin)
are therefore described as "Binary File" by PUID fmt/208. Samples with the m
suffix (like DIES_13.M EVE_18.M EVE_19A.M EVE_26.M) are therefore
described as "MATLAB Script File" by PUID fmt/1678.

For comparison I also ran the file command (version 5.45) on such
samples. Here all such samples are not recognized and not described as "TeX
font metric data". A few samples with the M suffix (like EVE_18.M) are
misidentified as "TeX font metric data". A few samples (like cmfibs8.tfm
fcitt12.tfm rgrbf10.tfm) are described as "executable" for the "MIPS" or
"amd 29k" architecture. A few samples (like cmrgrsl10.tfm yrcmex10.tfm) are
described as "object" for the "Tower/XP" architecture. This behaviour does not
get better when the keep going option of the file command is not used (see
appended file-5.45.txt in output). For the TFM samples no mime type
application/x-tex-tfm is shown (see appended file-i-5.45.txt in output). Also
no file name suffix is shown (see appended file-ext-5.45.txt in output).

Luckily I found a page about TeX Font Metrics on the file formats archive team
web site and on Wikipedia. I use the first because the Wikipedia link is also
mentioned there and furthermore links to download samples are listed there. So
the reference URL in the new definition is expressed by a line like:
 <RefURL>http://fileformats.archiveteam.org/wiki/TeX_Font_Metrics</RefURL>

So I ran tridscan on my samples to generate tfm-tex-0x02.trid.xml. Afterwards
I tried to understand the generated constructs and checked whether these are
always true. According to the specification the six-word (24-byte) file header
contains twelve unsigned 16-bit integers which describe general TFM
characteristics (the length of the file, the range of character codes
contained in the font, and the size of each of the tables). According to the
specification I patched the file command (see appended file.tmp in output).
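
To make the header layout concrete, here is a minimal Python sketch that decodes the twelve 16-bit fields. The function name read_tfm_header is only my own choice for illustration; the field names and their order are taken from the TFM specification.

```python
import struct

# Field names in the order used by the TFM specification.
TFM_FIELDS = ("lf", "lh", "bc", "ec", "nw", "nh",
              "nd", "ni", "nl", "nk", "ne", "np")


def read_tfm_header(data: bytes) -> dict:
    """Decode the first 24 bytes as twelve big-endian unsigned 16-bit integers."""
    if len(data) < 24:
        raise ValueError("too short for a TFM header")
    values = struct.unpack(">12H", data[:24])
    return dict(zip(TFM_FIELDS, values))
```

A call like read_tfm_header(open("yarborn.tfm", "rb").read()) then gives a dictionary with keys lf, lh, bc and so on.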

In the specification some formulas are listed like:
    bc-1 <= ec <= 255
    ne <= 256
    lf = 6+lh+(ec-bc+1)+nw+nh+nd+ni+nl+nk+ne+np
That means that at least three fields (bc, ec, ne) are at most 256 and in
practice below 256. Because the files are stored in big endian format that
means the upper byte of these fields is nil. Apparently nearly all others of
these twelve fields are below 256. So at even offsets we have nil bytes. That
is expressed by XML constructs like:
   ...
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>20</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>22</Pos>
   </Pattern>
   ...

The only exceptions are the number of words in the lig table (nl) and the file
length (lf). The first value is stored at offset 16 as field nl and is
sometimes bigger than 255. Also the file length is sometimes bigger than 255
(like casyll10.tfm cmfibs8.tfm fcbx10.tfm fcitt12.tfm mrgrsl10.tfm rgrbf10.tfm
wasysl10.tfm yrcmex10.tfm). That value is stored at offset 0 as field lf in
word units. By multiplying this value by 4 the file size in bytes can be
obtained.
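
The formulas from the specification and the file size rule can be combined into a small plausibility check. The following Python sketch is only an illustration under my own assumptions (the function name tfm_header_is_consistent is made up, and I assume a file has no extra bytes after the lf words):

```python
import struct


def tfm_header_is_consistent(data: bytes) -> bool:
    """Check the spec formulas and that lf*4 equals the size in bytes."""
    if len(data) < 24:
        return False
    # Twelve big-endian unsigned 16-bit fields; np_ avoids the common numpy alias.
    lf, lh, bc, ec, nw, nh, nd, ni, nl, nk, ne, np_ = struct.unpack(
        ">12H", data[:24])
    # bc-1 <= ec <= 255 and ne <= 256
    if not (bc - 1 <= ec <= 255 and ne <= 256):
        return False
    # lf = 6+lh+(ec-bc+1)+nw+nh+nd+ni+nl+nk+ne+np
    if lf != 6 + lh + (ec - bc + 1) + nw + nh + nd + ni + nl + nk + ne + np_:
        return False
    # lf is given in 4-byte words
    return lf * 4 == len(data)
```

Such a check is stricter than the TrID patterns, because it uses arithmetic between the fields and not only fixed bytes at fixed positions.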

Whether samples are real TFM files can be verified by running a command line
tool like:
   tftopl yarborn.tfm yarborn.pl

Now comes the interesting part. At offset 2 the length of the header data is
stored in word units. For some of my inspected TFM samples this value is 2
(=0x02). The samples in this session all have this value. Together with the
upper nil byte of bc (first character code in the font) this significant part
is expressed by an XML construct like:
   <Bytes>000200</Bytes>
   <Pos>2</Pos>
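
The same three byte test can be tried outside of TrID. This is just a small Python sketch with a function name of my own (looks_like_tfm_lh2); it checks only this one pattern and not the other nil bytes of the definition, so on its own it is a weak test.

```python
def looks_like_tfm_lh2(data: bytes) -> bool:
    """True if bytes 2..4 are 00 02 00 (lh=0x0002 plus upper byte of bc)."""
    # Offsets 2-3 hold lh as big-endian 16-bit value, offset 4 is the
    # upper byte of bc, which is nil for these samples.
    return data[2:5] == b"\x00\x02\x00"
```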

As far as I can see there exist less than a dozen variants with other lh
values. In the other variants the lh value changes only a little bit. I will
handle the other variants in a future session.

When the header size is 2 then there exist only two elements: header[0] is a
32-bit check sum and header[1] is the design size of the font (a fix_word in
units of TeX points). So in this variant there exist no header[2..11] (coding
scheme name) and no header[12..16] (font family name). So these samples
apparently contain no ASCII-like strings.
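
For illustration, the two header words of this variant can be decoded with a few lines of Python. This is only a sketch: the function name read_lh2_header_words is my own, and I rely on the TFM specification that the header data starts directly after the 24-byte preamble and that a fix_word is a signed 32-bit value with 20 fraction bits.

```python
import struct


def read_lh2_header_words(data: bytes):
    """Return (check sum, design size in TeX points) from header[0..1]."""
    if len(data) < 32:
        raise ValueError("too short for a TFM file with lh=2")
    # header[0]: unsigned 32-bit check sum; header[1]: signed fix_word
    checksum, raw_size = struct.unpack(">Ii", data[24:32])
    # A fix_word has its binary point 20 bits from the right.
    return checksum, raw_size / 2**20
```

For a font with 10pt design size the second word is 0x00A00000, which gives 10.0.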

With the new definition all TFM samples with header size 0x02 are now
recognized and described (see appended trid-v-new.txt trid-new.txt in
output). The definition is "good" in that it does not misidentify non-TFM
samples. Unfortunately for some TFM samples the description as TeX Font Metric
is not the first. The main reason is that the significant characteristic is
provided by the 16-bit lh value alone.

Luckily I found a page about audio samples with the m suffix on the file
formats archive team web site. So I use this. So the reference URL in the
definition is expressed by a line like:

 <RefURL>http://fileformats.archiveteam.org/wiki/Professional_Music_Driver_PMD</RefURL>

The TrID definitions, some samples and the output are stored in the archive
tfm_0x02.zip. I hope that my definition can be used in a future version of
triddefs. As mentioned there exist other variants of TFM. I will try to handle
these in a future session.

With best wishes
Jörg Jenderek