Author Topic: tfm-tex-0x11.trid.xml for TeX Font Metric; variant with lh=0x11  (Read 368 times)

jenderek

Hello trid users,

Some days ago I looked at the content of an exotic CD-ROM. It also contains
samples which are misidentified. The samples have the file name suffix TFM.

Other variants exist, so in this session I will handle only the variant with
lh=0x11. I will explain later what this means.

So I ran the trid utility on my TFM samples with lh=0x11. The samples are not
recognized. Many are described as "Unknown!". For the remaining samples I get
many different descriptions, but all are wrong (see appended trid-v-old.txt in
output). I collected many hundreds of TFM samples. It took some time to find a
dozen non-TFM samples that match the descriptions wrongly given to the TFM
samples.

For comparison I also ran the file format identification utility DROID
(see https://sourceforge.net/projects/droid/). Here the samples are also not
recognized.

For comparison I also ran the file command (version 5.45) on these
samples. Here all samples are "recognized" and described as "TeX font metric
data". Some more details are also shown. The coding scheme name is shown in
parentheses, like (TeXBase1Encoding) (AdobeStandardEncoding)
(cochalphEncoding) (FontSpecific) (kerkisec) (see appended file-k-5.45.txt in
output). This can be seen more clearly when using only the tex magic pattern
(see appended file-tex-5.45.txt in output). Unfortunately, for most samples I
also get another description when using the keep going option -k of the file
command. Even worse, for some samples (like NewTXMI.tfm fxlzi-5letters.tfm
pplri8a.tfm rpplru.tfm rtxbmi-rev.tfm) the wrong description comes first.
This can be seen more clearly when running the file command without the keep
going option (see appended file-5.45.txt in output). For the TFM samples the
mime type application/x-tex-tfm is shown (see appended file-tex-i-5.45.txt in
output). No file name suffix is shown (see appended file-tex-ext-5.45.txt
in output).

Luckily I found pages about TeX Font Metrics on the File Formats Archive Team
web site and on Wikipedia. I use the first one, because it also mentions the
Wikipedia link and furthermore lists links to download samples. So the
reference URL in the new definition is expressed by a line like:
 <RefURL>http://fileformats.archiveteam.org/wiki/TeX_Font_Metrics</RefURL>

So I ran tridscan on my samples to generate tfm-tex-0x11.trid.xml. Afterwards
I tried to understand the generated constructs and checked whether they are
always true. According to the specification the six-word (24-byte) file header
contains twelve unsigned 16-bit integers which describe general TFM
characteristics (the length of the file, the range of character codes
contained in the font, and the size of each of the tables). I patched the file
command according to the specification (see appended file.tmp in output).

The specification lists some formulas like:
    bc-1 <= ec <= 255
    ne <= 256
    lf = 6+lh+(ec-bc+1)+nw+nh+nd+ni+nl+nk+ne+np

That means that at least two fields (bc, ec) are always below 256, and ne can
exceed 255 only in the single case ne=256. Because the fields are stored in
big endian format, the upper byte of these fields is nil. Apparently nearly
all the others of these twelve fields are also below 256. So at even offsets
we have nil bytes. That is expressed by XML constructs like:
   ...
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>12</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>14</Pos>
   </Pattern>

At offset 16 the number of words in the lig/kern table (nl) is stored as a
2-byte big endian integer. This is followed by the number of words in the
kern table (nk) and the number of words in the extensible character table
(ne). In my samples these 3 integers are 0, but I do not know if this is
always true. In the variant with lh=0x12 it is not. So I mention my
observation in the remark line. At offset 22 the number of font parameter
words (np) is stored. In my samples the value was 6, but again I do not know
if this is always true. In the variant with lh=0x12 it is not. So I mention
this observation in the remark line as well. These observations are expressed
by lines like:
   <Bytes>0000000000000006</Bytes>
   <Pos>16</Pos>

The only exception is the file length. That value is stored at offset 0 as
field lf in word units. Multiplying this value by 4 gives the file size in
bytes.
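
The header fields and formulas described so far can be checked with a few
lines of Python. This is only a minimal sketch: the function names
parse_tfm_preamble and looks_like_tfm are made up for illustration, and the
sample bytes are synthetic (built to satisfy the formulas, not taken from a
real font).

```python
import struct

# Names of the twelve 16-bit fields, in the order they appear at offset 0.
FIELDS = ("lf", "lh", "bc", "ec", "nw", "nh", "nd", "ni", "nl", "nk", "ne", "np")


def parse_tfm_preamble(data):
    """Unpack the twelve big-endian unsigned 16-bit fields at offset 0."""
    if len(data) < 24:
        raise ValueError("too short for a TFM file")
    return dict(zip(FIELDS, struct.unpack(">12H", data[:24])))


def looks_like_tfm(data):
    """Apply the consistency formulas quoted from the specification."""
    try:
        f = parse_tfm_preamble(data)
    except ValueError:
        return False
    total = (6 + f["lh"] + (f["ec"] - f["bc"] + 1) + f["nw"] + f["nh"]
             + f["nd"] + f["ni"] + f["nl"] + f["nk"] + f["ne"] + f["np"])
    return (f["bc"] - 1 <= f["ec"] <= 255
            and f["ne"] <= 256
            and f["lf"] == total          # lf counts 4-byte words
            and f["lf"] * 4 == len(data))  # so lf * 4 is the file size


# Synthetic sample: lh=0x11, one character, nl=nk=ne=0, np=6, so lf=34 words.
sample = struct.pack(">12H", 34, 0x11, 0, 0, 1, 1, 1, 1, 0, 0, 0, 6)
sample += bytes(34 * 4 - len(sample))  # pad with nil bytes to 136 bytes
print(looks_like_tfm(sample), sample[2:5].hex(), sample[16:24].hex())
```

In this synthetic sample the bytes at offsets 2 and 16 are the same ones that
the patterns in this post match on.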

Whether samples are real TFM files can be verified by running a command line
tool like:
   tftopl Cochineal-alph.tfm Cochineal-alph.pl

Now comes the interesting part. At offset 2 the length of the header data
(lh) is stored in word units. For most of my inspected TFM samples this value
is 17 (=0x11). The samples in this session all have this value. Together with
the upper nil byte of bc (first character code in the font), which follows at
offset 4, this significant part is expressed by an XML construct like:
   <Bytes>001100</Bytes>
   <Pos>2</Pos>

As far as I can see there exist fewer than a dozen variants with other lh
values. In the other variants the lh value changes only a little. I will
handle the other variants in a future session.

According to the documentation the header[17] word at offset 92 contains a
first byte called the seven_bit_safe_flag, then two bytes that are ignored,
and a fourth byte called the face. It took me some days to understand why
this does not apply to the current variant: the documentation also says that
this applies only when the word is present. The first element is header[0].
That means header[17] is element number 18 (hexadecimal 12), but in this
variant the header size is 17 (lh=0x11). So in this variant the element
header[17] does not exist, and the next structure starts at that offset.
According to the documentation this is the array char_info. The unit of this
array is the char_info_word (4 bytes). Apparently one byte of a
char_info_word is often 0. These observations are expressed by constructs
like:

   <Pattern>
      <Bytes>00</Bytes>
      <Pos>95</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>99</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>115</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>123</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>131</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>135</Pos>
   </Pattern>

I do not understand what these bytes mean and why they are nil, but this is
not relevant at the moment. If I understand the documentation right, in the
worst case bc (first character code) is equal to ec (last character code).
That means this array would contain only 1 element. It starts at offset 92,
and at offset 96 the next structure would start. So I can delete all
constructs with offset 96 and higher. So only one construct survives, which
describes the first element char_info[0]. It looks like:
   <Bytes>01</Bytes>
   <Pos>92</Pos>
I do not understand what exactly this means. I also do not know if this is
always true. So I mention my observation in the remark line.
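
The offset arithmetic in this part can be sketched as follows. The helper
names are hypothetical. The field layout of a char_info_word (width index,
height and depth index, italic index plus tag, remainder) follows the TFM
description in the tftopl documentation. Under that reading the byte 01 at
offset 92 would be a width index of 1, and the lone nil bytes at offsets 95,
99, 115, 123, 131 and 135 all fall on the fourth (remainder) byte of a word.
Treat this as one reading of the specification, not as something verified
against every sample.

```python
PREAMBLE_BYTES = 24  # twelve 16-bit fields at the start of the file
WORD = 4             # TFM tables are counted in 4-byte words


def header_word_offset(index):
    """Byte offset of header[index]."""
    return PREAMBLE_BYTES + WORD * index


def char_info_offset(lh):
    """Byte offset where char_info starts, right after lh header words."""
    return PREAMBLE_BYTES + WORD * lh


def decode_char_info_word(word):
    """Split one 4-byte char_info_word into its packed fields."""
    b0, b1, b2, b3 = word
    return {
        "width_index": b0,
        "height_index": b1 >> 4,
        "depth_index": b1 & 0x0F,
        "italic_index": b2 >> 2,
        "tag": b2 & 0x03,
        "remainder": b3,
    }


# Offsets that carry a lone 00 pattern in the tridscan output quoted above.
nil_offsets = [95, 99, 115, 123, 131, 135]
# Position of each of those bytes inside its char_info_word (0..3).
byte_in_word = [(pos - char_info_offset(0x11)) % WORD for pos in nil_offsets]

print(header_word_offset(17), char_info_offset(0x11), char_info_offset(0x12))
print(byte_in_word)
```

With lh=0x11 the char_info array starts exactly where header[17] would be,
and with lh=0x12 it starts one word later.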


The header[2..11] words, if present, contain 40 bytes that identify the
character coding scheme. The first byte, which must be between 0 and 39, is
the length of the name. Apparently in my samples the maximal length 39 for
coding scheme names was not used, so the remaining bytes are filled with nils.
That was expressed by a construct like:
   <Bytes>0000000000000000000000000000000000</Bytes>
   <Pos>55</Pos>
Assuming that samples with the longest possible coding scheme name may exist,
the above construct vanishes.

Then the only remaining construct looks like:
   <Bytes>00A00000</Bytes>
   <Pos>28</Pos>
According to the documentation this is element header[1]. That is the design
size of the font in units of TeX points, stored as a fix_word (4 bytes). In
my samples the value is 00A00000. In the other variant I got different
values. So I assume that other font sizes may also exist for the 0x11
variant, and I delete the above pattern.
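
To make the value 00A00000 readable, here is a small sketch that decodes a
fix_word. It assumes the usual TFM definition of fix_word: a signed 32-bit
big endian integer with the binary point 20 bits from the right. The function
name fix_word_to_float is made up for illustration.

```python
import struct


def fix_word_to_float(raw):
    """Convert 4 big-endian bytes holding a fix_word into a float.

    The stored signed 32-bit integer is divided by 2**20.
    """
    (value,) = struct.unpack(">i", raw)
    return value / (1 << 20)


design_size = fix_word_to_float(bytes.fromhex("00A00000"))
print(design_size)  # 10.0, that is a design size of 10 TeX points
```

Under that definition the pattern at offset 28 simply says "design size
10pt", which supports dropping it from the definition.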

At offset 33 an ASCII-like coding scheme name is stored. The maximal string
length is 39, and the length value is stored in the byte before. At offset
73 an ASCII-like font family name (like CMR ECBX DUMMYSPAC-FONTFORGE
TEX-PAGD8R-CSC UNSPECIFIED HelveNarBol) is stored. The maximal string length
is 19, and the length value is stored in the byte before.
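
Reading these two length-prefixed strings can be sketched like this. The
helper names are hypothetical, the offsets 32 and 72 are the length bytes
described above, and the demo buffer is synthetic. The offsets only make
sense when the header is long enough, which is the case for lh=0x11.

```python
def read_length_prefixed(data, offset, max_len):
    """Read one length byte at offset, then that many bytes of text."""
    length = data[offset]
    if length > max_len:
        raise ValueError("length byte %d exceeds %d" % (length, max_len))
    text = data[offset + 1 : offset + 1 + length]
    return text.decode("ascii", errors="replace")


def coding_scheme(data):
    """Coding scheme name: length byte at 32, text from 33, at most 39."""
    return read_length_prefixed(data, 32, 39)


def font_family(data):
    """Font family name: length byte at 72, text from 73, at most 19."""
    return read_length_prefixed(data, 72, 19)


# Synthetic 92-byte buffer (preamble plus 17 header words), all nil at first.
demo = bytearray(92)
scheme = b"TeXBase1Encoding"
family = b"CMR"
demo[32] = len(scheme)
demo[33 : 33 + len(scheme)] = scheme
demo[72] = len(family)
demo[73 : 73 + len(family)] = family

print(coding_scheme(demo), font_family(demo))
```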

With the new definition all TFM samples with header data size 0x11 are now
recognized and described (see appended trid-v-new.txt trid-new.txt in
output). The definition is "good" in the sense that it does not misidentify
non-TFM samples. Unfortunately for some TFM samples (like Cochineal-alph.tfm
rpplru.tfm rtxbmi-rev.tfm) the description as TeX Font Metric is not the
first. The main reason is that the only significant characteristic is the
16-bit lh value.

The TrID definition, some samples and the output are stored in the archive
tfm_0x11.zip. I hope that my definition can be used in a future version of
triddefs. As mentioned, other variants of TFM exist. I will try to handle
these in a future session.

With best wishes
Jörg Jenderek


Mark0

Re: tfm-tex-0x11.trid.xml for TeX Font Metric; variant with lh=0x11
« Reply #1 on: April 06, 2024, 03:57:48 PM »
Thanks!