TrID File Identifier / tfm-tex-0x40.trid.xml for TeX Font Metric; variant with lh=0x40
« Last post by jenderek on April 13, 2024, 06:19:39 PM »
Hello trid users,
some days ago I looked at the content of an exotic CD-ROM. It also contains
samples which are misidentified. The samples have the TFM file name suffix.
There exist other variants, so in this session I will handle only the variant
with lh=0x40. I will explain later what this means. I found a few dozen such
samples (like tgoth10.tfm in directory standard with parent directory
ptex-fonts inside fonts sub directory tfm) after installing MiKTeX version
23.12 on Windows. On Linux Mint 21.2 I found such samples as part of the
texlive-lang-japanese package with version 2021.20220204-1.
So I ran the trid utility on my TFM samples with lh=0x40. The samples are not
recognized. Many are wrongly described as "Adobe PhotoShop Brush" by
abr.trid.xml with file name suffix (.ABR). Some real ABR samples are described
as "TTComp archive compressed (bin-4K)" by ark-ttcomp-bin-4k.trid.xml (see
appended trid-v-old.txt in output).
It took some time to get a few non-TFM samples which match the misidentified
TFM samples.
For comparison I also ran the file format identification utility DROID
(see https://sourceforge.net/projects/droid/). Here no sample is recognized.
For comparison I also ran the file command (version 5.45) on such samples.
Here these TFM samples are not recognized and not described as "TeX font
metric data". They are described as "data". On the other hand the ABR samples
are also not recognized. Many are described first as "GDSII Stream file",
sometimes with obviously wrong and high version numbers. Many are also
described as "TTComp archive data, binary, 4K dictionary" (see appended
file-k-5.45.txt in output). For these TFM samples the mime type
application/x-tex-tfm is not shown (see appended file-i-5.45.txt in
output), and no file name suffix is shown (see appended file-ext-5.45.txt in
output).
Luckily I found a page about TeX Font Metrics on the File Formats Archive Team
web site and on Wikipedia. I use the first because the Wikipedia link is also
mentioned there and, furthermore, links to download samples are listed there.
So the reference URL in the new definition is expressed by a line like:
<RefURL>http://fileformats.archiveteam.org/wiki/TeX_Font_Metrics</RefURL>
So I ran tridscan on my samples to generate tfm-tex-0x40.trid.xml. Afterwards
I tried to understand the generated constructs and to check whether these are
always true. I first thought it is like the other variants with just some more
words in the data header, but unfortunately this is less than half of the
truth.
According to the mentioned specification the six-word (24-byte) file header
contains twelve unsigned 16-bit integers which describe general TFM
characteristics (the length of the file, the range of character codes
contained in the font, and the size of each of the tables). According to the
specification I patched the file command (see appended file.tmp in output).
The specification lists some formulas like:
bc-1 <= ec <= 255
ne <= 256
lf=6+lh+(ec-bc+1)+nw+nh+nd+ni+nl+nk+ne+np
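The three conditions above can be checked mechanically. The following is only
a minimal sketch, assuming the twelve header integers are big-endian and appear
in the same order as the names in the formula (lf, lh, bc, ec, nw, nh, nd, ni,
nl, nk, ne, np). The function names parse_tfm_header and check_tfm_header are
made up for this illustration and are not part of trid or the file command.

```python
import struct

# Assumed field order: same as the order of the names in the lf formula.
FIELDS = ("lf", "lh", "bc", "ec", "nw", "nh", "nd", "ni", "nl", "nk", "ne", "np")


def parse_tfm_header(data: bytes) -> dict:
    """Decode the first 24 bytes as twelve big-endian unsigned 16-bit integers."""
    if len(data) < 24:
        raise ValueError("need at least 24 bytes for the header")
    return dict(zip(FIELDS, struct.unpack(">12H", data[:24])))


def check_tfm_header(h: dict) -> list:
    """Return the conditions from the specification that are violated."""
    problems = []
    if not (h["bc"] - 1 <= h["ec"] <= 255):
        problems.append("bc-1 <= ec <= 255")
    if not h["ne"] <= 256:
        problems.append("ne <= 256")
    tables = sum(h[k] for k in ("nw", "nh", "nd", "ni", "nl", "nk", "ne", "np"))
    expected_lf = 6 + h["lh"] + (h["ec"] - h["bc"] + 1) + tables
    if h["lf"] != expected_lf:
        problems.append("lf = 6+lh+(ec-bc+1)+nw+nh+nd+ni+nl+nk+ne+np")
    return problems
```

An empty result means the header is consistent with the classic layout. A
header read with bc=214 and ec=18 fails the first condition.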
The mentioned specification is an archived version on archive.org dated about
2012. Obviously the described items do not match "newer exotic" fonts like
Japanese ones. So I assume the described items apply fully only to fonts with
character codes of 8 bits or fewer. The bc values at offset 4 in my samples
were like 108 or 214. The ec values at offset 6 in my samples were 18 (12
hexadecimal). The latter is expressed by an XML construct that looks like:
<Bytes>0012000000</Bytes>
<Pos>6</Pos>
So here ec is lower than bc.
If I try to convert such a sample, as with other variants, by running a
command line tool like:
tftopl tgoth10.tfm tgoth10.pl
I get output like:
This is TFtoPL, Version 3.3 (MiKTeX 24.3)
There's some extra junk at the end of the TFM file,
but I'll proceed as if it weren't there.
The character code range 214..18 is illegal!
Sorry, but I can't go on; are you sure this is a TFM?
In other variants the file length in words is stored as a 2-byte big-endian
integer at offset 0. Multiplying this value by 4 gives the file size in
bytes. In this variant this information is stored 4 bytes higher, at offset 4.
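The length rule can be turned into a small check that tells the two layouts
apart. This is only a sketch under the assumptions stated in the paragraph
above (2-byte big-endian word count, either at offset 0 or at offset 4). The
function names are hypothetical.

```python
import struct


def word_count_at(data: bytes, offset: int) -> int:
    """Read a big-endian unsigned 16-bit integer at the given offset."""
    return struct.unpack_from(">H", data, offset)[0]


def length_field_offset(data: bytes, candidates=(0, 4)):
    """Return the first candidate offset whose word count times 4 equals
    the real size of the data, or None if no candidate fits."""
    for offset in candidates:
        if len(data) >= offset + 2 and word_count_at(data, offset) * 4 == len(data):
            return offset
    return None
```

For a file read with open(name, "rb").read(), a result of 0 would point to the
layout of the other variants and a result of 4 to the layout described here.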
Apparently the coding scheme name is also stored at a higher offset (37=33+4
compared with other variants). After the coding scheme name (maximal 39
bytes), like (JIS X0208), (TEX KANJI TEXT) or (UNSPECIFIED), the remaining 25
padding bytes in my examples are expressed by an XML construct like:
<Bytes>00000000000000000000000000000000000000000000000000</Bytes>
<Pos>51</Pos>
Afterwards at offset 77 (4 bytes higher than in other variants) the font
family name (like MINCHO, GOTHIC, UNSPECIFIED or 'OTF KANJI', maximal 19
bytes) is stored. At offset 96 (92 plus 4 compared with other variants) the
seven-bit-safe byte with value 80h again seems to be stored. That is
expressed by an XML construct like:
<Bytes>0000000000000000800000</Bytes>
<Pos>88</Pos>
But I do not know if this is always true. So I keep it for the moment and
mention my observations in the remark line.
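To look at these fields in further samples, the offsets observed above can be
read directly. This is a minimal sketch, assuming the names are plain ASCII
and padded with nil bytes as in my examples. The offsets 37, 77 and 96 and the
maximal lengths 39 and 19 are the observations from this post, not values
taken from a specification of this variant. The function names are made up.

```python
def read_name(data: bytes, start: int, max_len: int) -> str:
    """Return the text of a fixed-size field, cut at the first nil byte."""
    field = data[start:start + max_len]
    return field.split(b"\0", 1)[0].decode("ascii", errors="replace")


def describe(data: bytes) -> dict:
    """Collect the header strings and the byte at offset 96."""
    return {
        "coding_scheme": read_name(data, 37, 39),  # observed offset 37
        "family": read_name(data, 77, 19),         # observed offset 77
        "byte_96": data[96] if len(data) > 96 else None,  # 0x80 in my samples
    }
```

Samples without such ASCII strings would show up here with empty or garbled
names.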
Now comes the interesting part. At offset 2 the length of the header data is
stored in word units. For some dozens of my inspected TFM samples this value
is 64 (=0x40). The samples in this session all have this value. Together with
other parts this is expressed by an XML construct like:
<Bytes>004000</Bytes>
<ASCII> . @</ASCII>
<Pos>2</Pos>
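To test quickly whether a sample satisfies the constructs quoted in this post,
the Bytes and Pos pairs can be compared against the file content. This is only
an all-or-nothing illustration with the four patterns shown here. It does not
reproduce how trid itself scores definitions, and the names PATTERNS and
matches_all are hypothetical.

```python
# (Pos, Bytes as hexadecimal) pairs copied from the XML constructs in this post.
PATTERNS = [
    (2, "004000"),
    (6, "0012000000"),
    (51, "00" * 25),                  # the 25 padding bytes
    (88, "0000000000000000800000"),
]


def matches_all(data: bytes, patterns=PATTERNS) -> bool:
    """True only if every pattern is found at its position."""
    for pos, hex_bytes in patterns:
        expected = bytes.fromhex(hex_bytes)
        if data[pos:pos + len(expected)] != expected:
            return False
    return True
```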
As far as I can see there exist fewer than a dozen variants with other lh
values. In other variants the lh values change only a little. I will handle
the other variants in a future session.
Compared with other variants I get more patterns. Because the mentioned
specification does not fully match, I do not know exactly how to interpret
these patterns or whether they are always true. So I keep most patterns. At
higher offsets after the data header (280=64*4+24) I get short nil sequences
like:
<Pattern>
<Bytes>00</Bytes>
<Pos>278</Pos>
</Pattern>
...
<Pattern>
<Bytes>0000</Bytes>
<Pos>430</Pos>
</Pattern>
I assume that these are triggered by lucky circumstances (too few
examples). So I delete these patterns.
Unfortunately I found a few dozen samples with lh=0x40 which do not fit my
definition. In those samples I found no ASCII strings for the coding scheme
name and font family name. Maybe the Japanese names are stored in UTF-16 or
similar. Maybe I will try to handle such samples in a future session.
With the new definition most TFM samples with header size 0x40 are now
recognized and described (see appended trid-v-new.txt trid-new.txt in
output). The definition is "good" in that it does not misidentify non-TFM
samples. And because of some more conditions compared with other variants the
description as "TeX Font Metric" comes first.
TrID definitions, some samples and output are stored in archive
tfm_0x40.zip. I hope that my definition can be used in a future version of
triddefs. As mentioned there exist other variants of TFM. I will try to handle
these in a future session.
With best wishes
Jörg Jenderek