Author Topic: tfm-tex-0x40.trid.xml for TeX Font Metric; variant with lh=0x40  (Read 470 times)

jenderek

  • Sr. Member
  • ****
  • Posts: 375
Hello trid users,

some days ago I looked at the content of an exotic CD-ROM. It also contains
samples which are misidentified. The samples have the TFM file name suffix.

Other variants exist, so in this session I will handle only the variant with
lh=0x40. I will explain later what this means. I found a few dozen such
samples (like tgoth10.tfm in the standard directory, with parent directory
ptex-fonts inside the fonts subdirectory tfm) after installing MiKTeX version
23.12 on Windows. On Linux Mint 21.2 I found such samples as part of the
texlive-lang-japanese package, version 2021.20220204-1.

So I ran the trid utility on my TFM samples with lh=0x40. The samples are not
recognized. Many are wrongly described as "Adobe PhotoShop Brush" by
abr.trid.xml with file name suffix (.ABR). Some real ABR samples are described
as "TTComp archive compressed (bin-4K)" by ark-ttcomp-bin-4k.trid.xml (see
appended trid-v-old.txt in output).

It took some time to find a few non-TFM samples which match the
misidentified TFM samples.

For comparison I also ran the file format identification utility DROID
(see https://sourceforge.net/projects/droid/). Here no sample is recognized.

For comparison I also ran the file command (version 5.45) on such
samples. Here these TFM samples are not recognized and not described as "TeX
font metric data". They are described as "data". On the other hand the ABR
samples are also not recognized. Many are described first as "GDSII
Stream file", sometimes with obviously wrong and high version numbers. Many
are also described as "TTComp archive data, binary, 4K dictionary" (see
appended file-k-5.45.txt in output). For these TFM samples the mime type
application/x-tex-tfm is not shown (see appended file-i-5.45.txt in
output), and no file name suffix is shown (see appended file-ext-5.45.txt in
output).

Luckily I found a page about TeX Font Metrics on the File Formats Archive Team
web site and on Wikipedia. I use the first because it also mentions the
Wikipedia link and furthermore lists links to download samples. So
the reference URL in the new definition is expressed by a line like:
 <RefURL>http://fileformats.archiveteam.org/wiki/TeX_Font_Metrics</RefURL>

So I ran tridscan on my samples to generate tfm-tex-0x40.trid.xml. Afterwards
I tried to understand the generated constructs and to check whether these are
always true. I thought it is like the other variants with just some more words
in the data header, but unfortunately this is less than half of the truth.

According to the mentioned specification the six-word (24-byte) file header
contains twelve unsigned 16-bit integers which describe general TFM
characteristics (the length of the file, the range of character codes
contained in the font, and the size of each of the tables). According to the
specification I patched the file command (see appended file.tmp in output).

The specification lists some formulas like:
    bc-1 <= ec <= 255
    ne <= 256
    lf=6+lh+(ec-bc+1)+nw+nh+nd+ni+nl+nk+ne+np
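
The header layout and the formulas above can be sketched in a few lines of
Python. This is only an illustration under the assumption that the twelve
fields appear in the order used in the lf formula (lf, lh, bc, ec, nw, nh, nd,
ni, nl, nk, ne, np); the function names parse_tfm_header and check_tfm_header
are made up for this example and are not part of trid or the file command.

```python
import struct

# Field order as it appears in the lf formula of the specification.
FIELDS = ("lf", "lh", "bc", "ec", "nw", "nh", "nd", "ni", "nl", "nk", "ne", "np")


def parse_tfm_header(data: bytes) -> dict:
    """Unpack the twelve big-endian unsigned 16-bit words of the 24-byte header."""
    if len(data) < 24:
        raise ValueError("a TFM header needs at least 24 bytes")
    return dict(zip(FIELDS, struct.unpack(">12H", data[:24])))


def check_tfm_header(h: dict) -> list:
    """Return the names of the violated formulas (empty list if all hold)."""
    problems = []
    if not (h["bc"] - 1 <= h["ec"] <= 255):
        problems.append("bc-1 <= ec <= 255")
    if not h["ne"] <= 256:
        problems.append("ne <= 256")
    tables = sum(h[k] for k in ("nw", "nh", "nd", "ni", "nl", "nk", "ne", "np"))
    if h["lf"] != 6 + h["lh"] + (h["ec"] - h["bc"] + 1) + tables:
        problems.append("lf formula")
    return problems
```

Feeding the first 24 bytes of a sample into parse_tfm_header and then into
check_tfm_header shows quickly which of the formulas a given variant violates.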

The mentioned specification is an archived version on archive.org dated about
2012. Obviously the described items do not match "newer exotic" fonts like
Japanese ones. So I assume the described items only apply fully to fonts
with 8 bits or fewer. The bc values at offset 4 in my samples were like 108 or
214. The ec value at offset 6 in my samples was 18 (12 hexadecimal). The latter
is expressed by an XML construct that looks like:
   <Bytes>0012000000</Bytes>
   <Pos>6</Pos>
So here ec is lower than bc.

If I try to convert a sample, as with other variants, by running a command
line tool like:
   tftopl tgoth10.tfm tgoth10.pl
I get output like:
This is TFtoPL, Version 3.3 (MiKTeX 24.3)
There's some extra junk at the end of the TFM file,
but I'll proceed as if it weren't there.
The character code range 214..18 is illegal!
Sorry, but I can't go on; are you sure this is a TFM?

In other variants the file length in words is stored as a 2-byte big-endian
integer at offset 0. Multiplying this value by 4 gives the file size in
bytes. In this variant this information is stored 4 bytes higher, at
offset 4.
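
A small Python sketch of this check, assuming only what is said above (a
16-bit big-endian word count at offset 0 for the other variants and at offset 4
for this one). The helper names declared_size and length_matches are
hypothetical and only serve this example.

```python
import struct


def declared_size(data: bytes, offset: int = 0) -> int:
    """File size in bytes implied by the 16-bit big-endian word count at offset."""
    (words,) = struct.unpack_from(">H", data, offset)
    return words * 4  # one TFM word is 4 bytes


def length_matches(data: bytes, offset: int) -> bool:
    """True if the word count stored at offset agrees with the real size."""
    return len(data) >= offset + 2 and declared_size(data, offset) == len(data)
```

For a sample of this variant one would expect length_matches(data, 4) to be
true and length_matches(data, 0) to be false, which is one way to tell the
variants apart.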

Apparently the coding scheme name is also stored at a higher offset (37=33+4
compared with other variants). After the coding scheme name (maximal 39
characters), like (JIS X0208), (TEX KANJI TEXT) or (UNSPECIFIED), the
remaining 25 padding bytes in my examples are expressed by an XML construct
like:
   <Bytes>00000000000000000000000000000000000000000000000000</Bytes>
   <Pos>51</Pos>

Afterwards at offset 77 (4 bytes higher than in other variants) the
font family name (like MINCHO, GOTHIC, UNSPECIFIED or 'OTF KANJI'; maximal 19
characters) is stored. At offset 96 (92 plus 4 compared with other variants)
the seven-bit-safe byte with value 80h again seems to be stored. That is
expressed by an XML construct like:
   <Bytes>0000000000000000800000</Bytes>
   <Pos>88</Pos>
But I do not know if this is always true. So I keep it for the moment and
mention my observations in a remark line.
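
The offsets described above can be read out with a short Python sketch. It
assumes that, as in classic TFM files, each name is preceded by one length
byte (so the length bytes would sit at offsets 36 and 76, directly before the
names at 37 and 77); this is an assumption and not verified for every sample.
The function names read_name and describe_shifted_header are invented for
this example.

```python
def read_name(data: bytes, length_pos: int, max_len: int) -> str:
    """Read a length-prefixed name; the length byte position is an assumption."""
    n = data[length_pos]
    if n > max_len:
        raise ValueError(f"name length {n} exceeds maximum {max_len}")
    start = length_pos + 1
    return data[start:start + n].decode("ascii", errors="replace")


def describe_shifted_header(data: bytes) -> dict:
    """Collect the three observations for the lh=0x40 variant."""
    if len(data) < 97:
        raise ValueError("need at least 97 bytes")
    return {
        "coding_scheme": read_name(data, 36, 39),  # name itself starts at 37
        "family": read_name(data, 76, 19),         # name itself starts at 77
        "seven_bit_safe": data[96] == 0x80,        # byte observed as 80h
    }
```

Samples without ASCII names would show up here as empty or odd strings, so
the sketch can also help to sort samples before running tridscan.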

Now comes the interesting part. At offset 2 the length of the header data is
stored in word units. For some dozens of my inspected TFM samples this value
is 64 (=0x40). The samples in this session all have this value. Together with
other parts this is expressed by an XML construct like:
   <Bytes>004000</Bytes>
   <ASCII> . @</ASCII>
   <Pos>2</Pos>
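
To try such <Bytes>/<Pos> pairs by hand against a sample, a simplified Python
sketch like the following can be used. It only compares fixed byte strings at
fixed positions and is not how trid itself scores definitions; the list
PATTERNS just repeats the four constructs quoted so far, and the function
name matches_all is made up for this example.

```python
# (position, hex bytes) pairs copied from the constructs quoted in this post.
PATTERNS = [
    (2, "004000"),                   # lh = 0x40
    (6, "0012000000"),               # value 18 at offset 6
    (51, "00" * 25),                 # padding after the coding scheme name
    (88, "0000000000000000800000"),  # includes the byte 80h at offset 96
]


def matches_all(data: bytes, patterns=PATTERNS) -> bool:
    """True if every byte string is found at its stated position."""
    for pos, hex_bytes in patterns:
        raw = bytes.fromhex(hex_bytes)
        if data[pos:pos + len(raw)] != raw:
            return False
    return True
```

Running matches_all over a directory of samples gives a quick hint which
patterns are too strict before the definition is regenerated with tridscan.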

As far as I can see there exist fewer than a dozen variants with other lh
values. In the other variants the lh values change only a little. I will
handle the other variants in a future session.

Compared with other variants I get more patterns. Because the mentioned
specification does not fully match, I do not know exactly how to interpret
these patterns and whether they are always true. So I keep most patterns. At
higher offsets after the data header (280=64*4+24) I get short nil sequences
like:
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>278</Pos>
   </Pattern>
   ...
   <Pattern>
      <Bytes>0000</Bytes>
      <Pos>430</Pos>
   </Pattern>
I assume that these are triggered by lucky circumstances (too few
examples). So I deleted these patterns.

Unfortunately I found a few dozen samples with lh=0x40 which do not fit
my definition. In those samples I found no ASCII strings like the coding
scheme name and font family name. Maybe the Japanese names are stored in
UTF-16 or similar. Maybe I will try to handle such samples in a future session.

With the new definition most TFM samples with header size 0x40 are now
recognized and described (see appended trid-v-new.txt trid-new.txt in
output). The definition is "good" in that it does not misidentify non-TFM
samples. And because of some more conditions compared with other variants the
description as "TeX Font Metric" comes first.

The TrID definitions, some samples and output are stored in the archive
tfm_0x40.zip. I hope that my definition can be used in a future version of
triddefs. As mentioned, other variants of TFM exist. I will try to handle
these in a future session.

With best wishes
Jörg Jenderek

Mark0

  • Administrator
  • Hero Member
  • *****
  • Posts: 2743
    • Mark0's Home Page
Re: tfm-tex-0x40.trid.xml for TeX Font Metric; variant with lh=0x40
« Reply #1 on: April 17, 2024, 03:09:40 AM »
Thanks, but it seems that the actual last *.trid.xml file is missing (there are only a lot of backups).

Also, I see that there seem to be really a lot of possible variants, sometimes with very few patterns/differences...
Maybe it would be better to just keep the 2/3 most common ones, if you can identify them.
« Last Edit: April 17, 2024, 03:15:30 AM by Mark0 »

jenderek

  • Sr. Member
  • ****
  • Posts: 375
Hello trid users,

some days ago I looked at the content of an exotic CD-ROM. It also contains
samples which are misidentified. The samples have the TFM file name suffix.

Other variants exist, so in this session I will handle only the variant with
lh=0x40. I forgot to append the last definition
tfm-tex-0x40.trid.xml (so this is now in tfm-tex-0x40.trid.zip).

I thought I had all variants. When considering TFM samples on Windows I
was finished after a dozen. But when considering samples on Linux Mint
(version 21.3) I get some more variants, though at most 30 as far
as I can see. The problem for me was that tridscan does not work with
paths on Mint. So I must transfer the TFM samples to a Windows system or
into a sample directory and then do the scanning procedure. I summarize the
results that I have at the moment in the following table:

definition or lh                   #files   #files   #files   "name"
                                     trid  windows     mint
tfm-tex-0x12.trid.xml                 647    12328    11634   data~2
tfm-tex-0x11.trid.xml                 170      245     1431   data~1
tfm-tex-0x02.trid.xml                 314       11      392   data~4
tfm-tex-0x15.trid.xml                  69       81       21   data~3
tfm-tex-0x40.trid.xml                  46       14       63   data~8
tfm-tex-0x40-foo.trid.xml              28  include  include   data~8
tfm-tex-0x78.trid.xml                  24       25       31   data~7
01h                                     9        9     1603   data~5
21h                                     8        8       17   data~6
75h                                              0        2   data~9
79h                                              0       18   data~10
2ah                                              0       12   data~11
23h                                              0      196   data~12
247h                                             0       14   data~13
45h                                              0       14   data~14
272                                              0       14   data~15
77h                                              0       98   data~16
5bh                                              0       14   data~17
38h                                              0       98   data~18
e0h                                              0      296   data~19
24h                                              0        4   data~20
126h                                             0      296   data~21
33h                                              0        4   data~22
17h                                              0      258   data~23
71h                                              0        4   data~24
32h                                              0        6   data~25
2dh                                              0        4   data~26
30h                                              0        2   data~27
36h                                              0        4   data~28
92h                                              0        2   data~29
5ah                                              0        2   data~30
else $RECYCLE.BIN $IX19EC1.tfm          -        1
sum                                  1315    12722    16554

The column "name" lists how I call each variant at the moment in the patched
file command (see appended file.tmp in output).

As in all matters you must weigh the pros and cons.

So samples described as "data~2" have a recognition rate of 70 percent
and the variant described as "data~1" has a rate of 8 percent. At first
glance this sounds great, with a combined rate of about 80 %.
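
These percentages can be recomputed from the table with a few lines of
Python. The sketch assumes that the rates refer to the mint column (total
16554), because those numbers fit the quoted 70 and 8 percent; the names
MINT_TOTAL, MINT_COUNTS and share are only chosen for this example.

```python
# Counts copied from the "mint" column of the table above.
MINT_TOTAL = 16554
MINT_COUNTS = {"data~2": 11634, "data~1": 1431, "data~5": 1603}


def share(count: int, total: int = MINT_TOTAL) -> float:
    """Percentage of all TFM samples covered by one variant."""
    return 100.0 * count / total


two_biggest = share(MINT_COUNTS["data~2"] + MINT_COUNTS["data~1"])
with_data5 = share(sum(MINT_COUNTS.values()))
print(f"data~2 + data~1: {two_biggest:.1f} %")
print(f"plus data~5:     {with_data5:.1f} %")
```

With only the two biggest variants the coverage stays a little below 80 %,
and adding "data~5" lifts it by nearly ten percentage points.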

If you say you only want to keep the "most important" variants, then at
least "data~5" with samples in the thousands range (1603) must also be kept.

Then you must look at the aim of the TrID utility. One purpose is
"restoration". If after a file system crash you have lost the
correlation of files to names and directories and you then have
ten thousand TFM files, about two thousand of them are "unrecognized".
So this rate is still too low to "restore" your system,
for example.

I do not hate gamers, but before you add definitions for games like GTA
or exotic audio formats like "PC9801 rip" by m-mod.trid.xml you should
first make sure that the files of the operating system and of
important/relevant components like the web browser are already
described. So you must answer how relevant the TeX system is compared
with games, for example. Some decades ago when I was studying, all scientific
institutes at my university were using TeX/LaTeX. This was the only
software that could handle formulas and did not cost money like
software products from Adobe or Apple. Today many office suites can also
do this work, but older publications were done with TeX as
word processor. So in my opinion this system has a higher relevance than
games or similar.

As with virus scanners, "wrong results" are annoying. For many
examples I get, with low priority, the description "Adobe PhotoShop
Brush" by abr.trid.xml. So when you keep such an unreliable definition, why
do you refuse to add definitions for TFM?

The next question is how much effort is needed to describe TFM
samples. This is not so complicated and not rocket science. It is just
hard work and a little knowledge. The main classification is done by the
data header size (lh value). Unfortunately this field is only 16 bits, so
recognition by this field alone is not reliable. When considering all 24
bytes of the header I get more patterns, like values being lower than 256,
which maybe are not always true but make the definition for TFM different
from other file formats. The lh value determines the size of the following
data header. So you may add patterns describing the coding scheme, font family
name and seven-bit-safe byte. Then the definition should be unique enough and
the effort for TFM files is manageable.

You may complain that there are so many definitions, but in the end you must
see that 30 definitions, each considering (with hard work) only about 24
bytes, give 720 bytes in total that describe all TeX TFM samples with about
a 100 % rate (at least on a typical Linux Mint system).

This is also important when you consider another aspect that is
provided by TrID. That item is "security", as it is provided by
VirusTotal for example. If you only have a classification rate of about
80 %, this means 20 % are unclassified and must be considered as
a potential location for malicious code.

So I will continue and deliver definitions for all 30 variants.

With best wishes
Jörg Jenderek