Topic: tfm-tex-0x15.trid.xml for TeX Font Metric; variant with lh=0x15h

jenderek

Hello trid users,

A few days ago I looked at the contents of an exotic CD-ROM. It also contains
samples which are misidentified. These samples have the TFM file name suffix.

Other variants exist, so in this session I will handle only the variant with
lh=0x15. I will explain later what this means.

So I ran the trid utility on my TFM samples with lh=0x15. The samples are not
recognized. Many are described as "Unknown!". For the remaining samples I get
many different descriptions, but all are wrong (see appended trid-v-old.txt in
output). I have many dozens of such TFM samples. It took some time to collect
a dozen non-TFM samples that match the misidentified TFM samples.

For comparison I also ran the file format identification utility DROID
(see https://sourceforge.net/projects/droid/). Here the samples are also not
recognized, but at least I got no false descriptions.

For comparison I also ran the file command (version 5.45) on these samples.
None of the samples is recognized and described as "TeX font metric data".
Many are described as "data". Unfortunately, for some samples I also get other
descriptions (see appended file-5.45.txt in output). For these TFM samples the
mime type application/x-tex-tfm is not shown (see appended file-i-5.45.txt in
output), and no file name suffix is shown (see appended file-tex-ext-5.45.txt
in output).

Luckily I found pages about TeX Font Metrics on the File Formats Archive Team
web site and on Wikipedia. I use the first because it also mentions the
Wikipedia link and furthermore lists links to download samples. So the
reference URL in the new definition is expressed by a line like:
 <RefURL>http://fileformats.archiveteam.org/wiki/TeX_Font_Metrics</RefURL>

So I ran tridscan on my samples to generate tfm-tex-0x15.trid.xml. Afterwards
I tried to understand the generated constructs and check whether they are
always true. According to the specification the six-word (24-byte) file header
contains twelve unsigned 16-bit integers which describe general TFM
characteristics (the length of the file, the range of character codes
contained in the font, and the size of each of the tables). Following the
specification I patched the file command (see appended file.tmp in output).

The specification lists some formulas like:
    bc-1 <= ec <= 255
    ne <= 256
    lf = 6+lh+(ec-bc+1)+nw+nh+nd+ni+nl+nk+ne+np
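
The relations above can be checked mechanically. The following is a minimal
Python sketch, not part of the TrID definition or of the file command patch.
The function names parse_tfm_preamble and preamble_is_consistent are made up
for this illustration. It assumes the twelve fields are stored in the order
lf, lh, bc, ec, nw, nh, nd, ni, nl, nk, ne, np as big-endian 16-bit values.

```python
import struct

# Field order of the 24-byte TFM preamble (assumed from the specification).
FIELDS = ("lf", "lh", "bc", "ec", "nw", "nh", "nd", "ni", "nl", "nk", "ne", "np")


def parse_tfm_preamble(data):
    """Return the twelve big-endian 16-bit preamble fields as a dict."""
    if len(data) < 24:
        raise ValueError("too short for a TFM preamble")
    return dict(zip(FIELDS, struct.unpack(">12H", data[:24])))


def preamble_is_consistent(p):
    """Check the three relations quoted from the specification."""
    total = (6 + p["lh"] + (p["ec"] - p["bc"] + 1) + p["nw"] + p["nh"]
             + p["nd"] + p["ni"] + p["nl"] + p["nk"] + p["ne"] + p["np"])
    return (p["bc"] - 1 <= p["ec"] <= 255
            and p["ne"] <= 256
            and p["lf"] == total)
```

A sample that fails this check is probably not a TFM file, or it is damaged.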

That means that at least three fields (bc, ec, ne) are practically always
lower than 256 (only ne may reach exactly 256). Because the files are stored
in big-endian format, the upper byte of each of these fields is nil.
Apparently nearly all of the other twelve fields are also below 256. So at
even offsets we have nil bytes. That is expressed by XML constructs like:
   ...
   <Pattern>
      <Bytes>0001</Bytes>
      <Pos>14</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>18</Pos>
   </Pattern>
   ..
The only exceptions are the file length and the number of words in the
lig/kern table. The file length is sometimes bigger than 255. That value is
stored at offset 0 as field lf in word units. Multiplying this value by 4
gives the file size in bytes.
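
As a small illustration of the size rule, here is a hedged Python sketch. The
helper names declared_size_in_bytes and size_matches are hypothetical. It only
assumes that lf is a big-endian 16-bit value at offset 0 counting 4-byte words.

```python
import struct


def declared_size_in_bytes(data):
    """File size in bytes as announced by the lf field at offset 0."""
    (lf,) = struct.unpack_from(">H", data, 0)
    return 4 * lf  # lf counts 4-byte words


def size_matches(data):
    """True if the real length equals the length declared by lf."""
    return len(data) == declared_size_in_bytes(data)
```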

At offset 10 the number of words in the height table (nh) is stored as a
2-byte integer in big-endian format. In this variant this value is always
0x10. Together with the nil upper byte of the next field (nd) that is
expressed by an XML construct that looks like:
   <Pattern>
      <Bytes>001000</Bytes>
      <Pos>10</Pos>
   </Pattern>
But I do not know if this is always true. So I keep it for the moment and
mention my observations in the remark line.

At offset 14 the number of words in the italic correction table (ni) is stored
as a 2-byte integer in big-endian format. In this variant this value is always
1. That is expressed by an XML construct that looks like:
   <Pattern>
      <Bytes>0001</Bytes>
      <Pos>14</Pos>
   </Pattern>
But I do not know if this is always true. So I keep it for the moment and
mention my observations in the remark line.

At offset 20 the number of words in the extensible character table (ne) is
stored as a 2-byte integer in big-endian format. In this variant this value is
always 0. At offset 22 the number of font parameter words (np) is stored as a
2-byte integer in big-endian format. In this variant this value is always 7.
That is expressed by an XML construct that looks like:
   <Pattern>
      <Bytes>00000007</Bytes>
      <Pos>20</Pos>
   </Pattern>
But I do not know if this is always true. So I keep it for the moment and
mention my observations in the remark line.

The only remaining construct looks like:
   <Bytes>00A00000</Bytes>
   <Pos>28</Pos>

According to the documentation this is element header[1]. That is the design
size of the font, stored as a fix_word (4 bytes) in units of TeX points. So in
these samples the value is 00A00000. In the other variant I got different
values. But I do not know if this is always true. So I keep it for the moment
and mention my observations in the remark line.
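
To see what 00A00000 means, the fix_word can be converted to a number. The
sketch below assumes the usual TFM convention that a fix_word is a signed
32-bit big-endian value with the binary point 20 bits from the right. The
function name fix_word_to_float is only chosen for this example.

```python
import struct


def fix_word_to_float(raw):
    """Convert a 4-byte big-endian TFM fix_word to a float."""
    (value,) = struct.unpack(">i", raw)  # signed 32-bit integer
    return value / 2**20                 # 20 fractional bits


design_size = fix_word_to_float(bytes.fromhex("00A00000"))
print(design_size)  # prints 10.0
```

So under this assumption the pattern at offset 28 says that all samples of
this variant have a design size of 10 points.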

Whether samples are real TFM files can be verified by running a command line
tool like:
   tftopl tri10u.tfm tri10u.pl

Now comes the interesting part. At offset 2 the length of the header data is
stored in word units. For some dozens of my inspected TFM samples this value
is 21 (=0x15). The samples in this session all have this value. Together with
the nil upper byte of bc (first character code in the font) this significant
part is expressed by an XML construct like:
   <Bytes>001500</Bytes>
   <Pos>2</Pos>

As far as I can see there exist fewer than a dozen variants with other lh
values. In the other variants the lh value changes only a little. I will
handle the other variants in a future session.

The next characteristic construct in this variant looks like:
 <Bytes>000000000000000000000000000000000000000000000000000000000000000000000000
 0948504155544F54464D00000000000000000000800000</Bytes>
 <ASCII> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
 . H P A U T O T F M</ASCII>
 <Pos>36</Pos>
At offset 33 an ASCII-like coding scheme name is stored. The maximal string
length is 39 and the length value is stored in the byte before. In this
variant, with only some dozens of samples, the coding scheme names are very
short (like 6J 8U 9T 10U, see file.tmp in output). So the longest length is 3.
That means the 36 remaining bytes of this name field are unused, so we get 36
nil bytes at offset 36. At offset 73 an ASCII-like font family name is stored.
The length of this string is always 9 in this variant, the length value is
stored in the byte before, and the font family in all samples of this variant
is HPAUTOTFM. As in the variant with lh=0x12, according to the documentation
the header[17] word contains a first byte called the seven_bit_safe_flag, then
two bytes that are ignored, and a fourth byte called the face. When looking in
file.tmp for the seven_bit_safe_flag I always get the value 0x80. Apparently
the two ignored bytes are nil. For the face byte I got different values (like
0 1 3). Assuming that samples with longer coding scheme names of up to 39
bytes may exist, the above construct becomes:
 <Bytes>0948504155544F54464D00000000000000000000800000</Bytes>
 <ASCII> . H P A U T O T F M</ASCII>
 <Pos>72</Pos>
I do not know if there exist samples with other font families or if the
HPAUTOTFM family is a characteristic of the variant with lh=0x15. So I keep
this and mention my observations in the remark line. I found samples in the
sub directories helvetica and symbol of the parent directory monotype, which
itself is found in /usr/share/texlive/texmf-dist/fonts/tfm (Linux Mint 21.2).
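
The two name fields and the header[17] word described above can be read with a
few lines of Python. This is only a sketch under the offsets given in this
post (length byte at 32 and 72, header[17] at 92). The helper names
read_prefixed_string and describe_header are hypothetical.

```python
def read_prefixed_string(data, offset, field_size):
    """Read a string whose length is stored in the byte at `offset`."""
    length = data[offset]
    if length > field_size - 1:
        raise ValueError("length byte exceeds field size")
    raw = data[offset + 1 : offset + 1 + length]
    return raw.decode("ascii", errors="replace")


def describe_header(data):
    """Extract coding scheme, family and the header[17] bytes."""
    return {
        "coding_scheme": read_prefixed_string(data, 32, 40),  # header[2..11]
        "family": read_prefixed_string(data, 72, 20),         # header[12..16]
        "seven_bit_safe_flag": data[92],                      # header[17], byte 0
        "face": data[95],                                     # header[17], byte 3
    }
```

For a sample of this variant the family entry should come out as HPAUTOTFM
and the seven_bit_safe_flag as 0x80 (128).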

Compared with the first variant the header data contains 3 more elements.
These are header[18] at offset 96, header[19] at offset 100 and header[20] at
offset 104. For these additional elements I got constant values at some
locations. That is expressed by XML constructs like:
   <Pattern>
      <Bytes>4B4E</Bytes>
      <ASCII> K N</ASCII>
      <Pos>96</Pos>
   </Pattern>
   <Pattern>
      <Bytes>0000</Bytes>
      <Pos>99</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>102</Pos>
   </Pattern>
   <Pattern>
      <Bytes>0000</Bytes>
      <Pos>104</Pos>
   </Pattern>
I do not know if this really is characteristic for this variant. So I keep it
and mention my observation in the remark line.

Then according to the documentation the next structure starts here at offset
108. That is the array char_info. The unit of this array is the char_info_word
(4 bytes). As in the 0x11 variant, parts of the char_info_word apparently are
often nil. These observations are expressed by constructs like:
   <Pattern>
      <Bytes>0000</Bytes>
      <Pos>110</Pos>
   </Pattern>
   <Pattern>
      <Bytes>0000</Bytes>
      <Pos>114</Pos>
   </Pattern>
   ...
I do not understand what and why, but this is not relevant at the moment. If I
understand the documentation right, in the worst case bc (first character
code) is equal to ec (last character code). That means this array would
contain only 1 element. It occupies offsets 108 to 111, and the next structure
would start at offset 112. So I can delete all constructs with offset 112 and
higher. So only one construct survives. It describes the first element
char_info[0]. That is done by the construct that looks like:
   <Pattern>
      <Bytes>0000</Bytes>
      <Pos>110</Pos>
   </Pattern>
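
A possible reading of the nil bytes at offset 110 follows from the layout of a
char_info_word. The Python sketch below assumes the packing described in the
TFM documentation (width index, then height and depth index in one byte, then
italic index and tag in one byte, then remainder). The function names are made
up for this illustration.

```python
def char_info_offset(lh):
    """Byte offset of char_info[0]: 24-byte preamble plus lh header words."""
    return 24 + 4 * lh


def decode_char_info_word(word):
    """Split a 4-byte char_info_word into its packed fields."""
    b0, b1, b2, b3 = word
    return {
        "width_index": b0,
        "height_index": b1 >> 4,    # upper 4 bits
        "depth_index": b1 & 0x0F,   # lower 4 bits
        "italic_index": b2 >> 2,    # upper 6 bits
        "tag": b2 & 0x03,           # lower 2 bits
        "remainder": b3,
    }
```

With lh=21 char_info_offset gives 108. Under this assumption two nil bytes at
offset 110 mean that the first character has italic index 0, tag 0 and
remainder 0, which fits the observation that ni is always 1 in this variant.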

With the new definition all TFM samples with header size 0x15 are now
recognized and described (see appended trid-v-new.txt and trid-new.txt in
output). The definition is "good" in that it does not misidentify non-TFM
samples. And because it has some more conditions than the other variants, the
description as "TeX Font Metric" comes first.
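
For readers who want to try the byte patterns without TrID, the idea can be
sketched in Python. This is not how trid works internally. It is only a naive
check that every listed byte string occurs at its offset. The list
PATTERNS_0X15 is a subset of the constructs quoted in this post, and both
names are chosen just for this example.

```python
def matches(data, patterns):
    """True if every (offset, hex string) pattern is found at its offset."""
    for pos, hex_bytes in patterns:
        expected = bytes.fromhex(hex_bytes)
        if data[pos : pos + len(expected)] != expected:
            return False
    return True


# A subset of the patterns discussed in this post.
PATTERNS_0X15 = [
    (10, "001000"),      # nh = 0x10, upper byte of nd
    (14, "0001"),        # ni = 1
    (20, "00000007"),    # ne = 0, np = 7
    (28, "00A00000"),    # header[1]
    (72, "0948504155544F54464D00000000000000000000800000"),  # family, header[17]
    (96, "4B4E"),        # start of header[18]
]
```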

The TrID definition, some samples and the output are stored in the archive
tfm_0x15.zip. I hope that my definition can be used in a future version of
triddefs. As mentioned, other variants of TFM exist. I will try to handle
these in a future session.

With best wishes
Jörg Jenderek

Mark0

Re: tfm-tex-0x15.trid.xml for TeX Font Metric; variant with lh=0x15h
« Reply #1 on: April 08, 2024, 09:09:28 PM »
Thanks!