Topic: replaced o-coff.trid.xml for Intel 80386 COFF object *.o *.obj + variants

jenderek

Hello trid users,

some days ago I handled Windows FNT fonts. Some, like ega80woa.fnt and
svgafix.fnt, are misidentified as "Intel ia64 COFF object file" by the file
command. So I looked at about a thousand COFF samples. These have the file
name extensions o and obj.

Many are identified as "Intel 80386 Common Object File Format (COFF) object"
by o-coff.trid.xml (see appended OK/output/trid.txt), but some are not
recognized (see appended output/trid.txt). The file command, however,
identifies these samples correctly as "Intel 80386 COFF object file" (see
output/file-5.39.txt).

So I first tried to update the TrID definition file o-coff.trid.xml by
running tridscan. The second pattern was:
   <Bytes>012E</Bytes>
   <Pos>19</Pos>
This now becomes shorter:
   <Bytes>2E</Bytes>
   <Pos>20</Pos>
After some consideration I believe that this is not always true. I will
explain this later. So only one 2 byte pattern is left. It looks like:
   <Bytes>4C01</Bytes>
   <ASCII> L</ASCII>
   <Pos>0</Pos>
According to the DJGPP COFF specification this is I386MAGIC (magic 0x014c,
stored in little endian byte order).
The documentation of the file command says that at least 4 bytes should be
used for identification purposes.
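
The byte order of the magic can be checked with a small Python sketch.
Python is only my choice for illustration, and the helper name
starts_with_i386_magic is made up here; it comes neither from TrID nor from
the file command:

```python
import struct

I386MAGIC = 0x014C  # magic value named in the DJGPP COFF specification

# Little endian storage puts the low byte first, so the file starts with 4C 01.
magic_bytes = struct.pack("<H", I386MAGIC)
print(magic_bytes.hex().upper())  # 4C01


def starts_with_i386_magic(data: bytes) -> bool:
    """Return True if the buffer begins with the 2 byte i386 COFF magic."""
    return data[:2] == magic_bytes
```

A 2 byte test like this is exactly the weak check that the rest of this post
tries to strengthen with additional patterns.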

So I threw away the current TrID definition and started with a new one. I
ran tridscan on a few examples, looked at the output of the file command and
refined the definition with more examples, interpreting the patterns with
the help of the COFF specifications.

At offset 2 the number of sections is stored as the 2 byte integer f_nscns.
Intel COFF samples use little endian, whereas for Hitachi SH this can also
be big endian. After testing a few hundred COFF examples I get f_nscns
values like:
 1 2 3 4 5 7 8 9 11 12 16 19 20 21 22 30 36 40 42 56 80 89 96 124

So real COFF samples have at least 1 section. Typically COFF samples
have only a few sections for code, data etc. The worst case, with the
highest f_nscns value of 124, was msvcrt.lib. So I assume that the real
maximum of f_nscns stays in the low hundreds (below 256), which means the
upper byte of f_nscns is probably always nil. That is expressed by an XML
construct like:
   <Bytes>00</Bytes>
   <Pos>3</Pos>
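
To make the offsets easier to follow, here is a minimal Python sketch of the
20 byte COFF file header. The field names and order follow the DJGPP COFF
specification; the class and function names are only my illustration, and the
sample header is synthetic, built from the largest values mentioned in this
post:

```python
import struct
from typing import NamedTuple


class CoffFileHeader(NamedTuple):
    f_magic: int   # offset 0, 2 bytes
    f_nscns: int   # offset 2, 2 bytes
    f_timdat: int  # offset 4, 4 bytes
    f_symptr: int  # offset 8, 4 bytes
    f_nsyms: int   # offset 12, 4 bytes
    f_opthdr: int  # offset 16, 2 bytes
    f_flags: int   # offset 18, 2 bytes


HEADER_FORMAT = "<HHIIIHH"  # little endian, 20 bytes in total


def parse_file_header(data: bytes) -> CoffFileHeader:
    """Decode the first 20 bytes of a little endian COFF object."""
    return CoffFileHeader(*struct.unpack_from(HEADER_FORMAT, data, 0))


# Synthetic header, not taken from a real file.
sample = struct.pack(HEADER_FORMAT, 0x014C, 124, 0, 0x35120, 1546, 0, 0)
header = parse_file_header(sample)
print(header.f_nscns, sample[3])  # 124 0
```

The byte at offset 3 only becomes non zero once f_nscns reaches 256.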

Many values inside COFF examples are stored as 4 byte little endian
integers. So in theory an unsigned value of up to 4294967295 (just under
4 GiB) could be stored. A file pointer in particular must be lower than the
file size.

But when looking at the output of the file command with a magic file for
such COFF examples I saw that the symbol table pointer f_symptr, stored as a
4 byte integer at offset 8, never holds large values. The highest value
found was 0x35120. So the upper byte seems to be nil. That is expressed by
an XML construct like:
   <Bytes>00</Bytes>
   <Pos>11</Pos>
That is true as long as no pointer exceeds the 16 MiB limit.

At offset 12 the number of symbols is stored as the 4 byte integer f_nsyms.
The largest value was 1546. So the two upper bytes seem to be nil. That is
true as long as the number of symbols stays below 65536 (64 Ki).
At offset 16 the optional header size is stored as the 2 byte integer
f_opthdr. The documentation says that object files should have a value of
0. For COFF executables a non zero value is possible.
These two facts are expressed by an XML construct like:
   <Bytes>00000000</Bytes>
   <Pos>14</Pos>
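
A short Python sketch shows why one 4 byte nil pattern at offset 14 covers
both facts. The function name is made up for illustration, and the values
70000 and 28 are arbitrary counterexamples, not taken from real files:

```python
import struct


def bytes_14_to_17(f_nsyms: int, f_opthdr: int) -> bytes:
    """Return the bytes found at file offsets 14..17.

    f_nsyms occupies offsets 12..15 and f_opthdr offsets 16..17, so the
    result is the upper half of f_nsyms followed by all of f_opthdr.
    """
    raw = struct.pack("<IH", f_nsyms, f_opthdr)  # offsets 12..17
    return raw[2:]


print(bytes_14_to_17(1546, 0).hex())   # 00000000, the pattern matches
print(bytes_14_to_17(70000, 0).hex())  # 01000000, too many symbols
print(bytes_14_to_17(1546, 28).hex())  # 00001c00, optional header present
```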

The absence of an optional header has a consequence: directly after the COFF
header, at offset 20, the first section starts with its own header. This
starts with the 8 byte section name string s_name. Looking at the output of
the patched file command I found typical names like .text or .data. In some
"exotic" cases I also found strings like:
   .debug$S .drectve .testseg
So for all my inspected examples the name starts with a point character
(0x2E) followed by a mostly lower case ASCII phrase. That is expressed by an
XML construct like:
   <Bytes>2E</Bytes>
   <Pos>20</Pos>
This seems to be typical for Intel and Microsoft compiler suites, but I
remember that some Borland compiler suites use the upper case phrase TEXT
instead of .text and DATA instead of .data. So the above construct may not
always be true.

At offset 18 flags are stored as the 2 byte integer f_flags. Not all bits
are used. In the upper byte only the lowest bit is used. That is the
F_AR32WR flag (0x0100), which means that the file is 32-bit little endian.
So the byte at offset 19 can have the value zero or one. But in Texas
Instruments COFF objects, for example, more bits are used.
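
As a small illustration in Python, assuming the value 0x0100 for F_AR32WR
from the DJGPP COFF specification (the helper name is made up):

```python
import struct

F_AR32WR = 0x0100  # flag value from the DJGPP COFF specification


def flags_high_byte(f_flags: int) -> int:
    """Return the byte that ends up at file offset 19."""
    return struct.pack("<H", f_flags)[1]


print(flags_high_byte(0x0004))             # 0
print(flags_high_byte(F_AR32WR | 0x0004))  # 1
```

Because both values occur, I use no pattern for the byte at offset 19.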

At offsets 28 and 32 the physical and virtual addresses are stored as the
4 byte integers s_paddr and s_vaddr. For unlinked objects, these addresses
are relative to the object's address space (i.e. the first section is always
at offset zero). So for the first section these values are always nil. That
is expressed by an XML construct like:
   <Bytes>0000000000000000</Bytes>
   <Pos>28</Pos>

At offset 36 the section size is stored as the 4 byte integer s_size. The
4 GiB limit seems never to be reached. For the inspected samples the upper
byte is nil (size below 16 MiB). That is expressed by an XML construct like:
   <Bytes>00</Bytes>
   <Pos>39</Pos>

At offset 40 the file pointer to the raw data of the first section is stored
as s_scnptr. This seems to be low (below the 64 KiB limit). So the two upper
bytes are nil. That is expressed by an XML construct like:
   <Bytes>0000</Bytes>
   <Pos>42</Pos>

At offset 44 the file pointer to the relocation entries of the first section
is stored as s_relptr. This seems to be low (below the 16 MiB limit). So the
upper byte is nil. That is expressed by an XML construct like:
   <Bytes>00</Bytes>
   <Pos>47</Pos>

At offset 48 the file pointer to the line number entries of the first
section is stored as s_lnnoptr. This seems to be low (below the 64 KiB
limit). So the two upper bytes are nil. That is expressed by an XML
construct like:
   <Bytes>0000</Bytes>
   <Pos>50</Pos>
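
The section header fields discussed so far can be summarized in a Python
sketch. The field names and sizes follow the DJGPP COFF specification; the
class and function names are my own illustration, and the sample buffer is
synthetic:

```python
import struct
from typing import NamedTuple


class CoffSectionHeader(NamedTuple):
    s_name: bytes   # offset 20, 8 bytes
    s_paddr: int    # offset 28, 4 bytes
    s_vaddr: int    # offset 32, 4 bytes
    s_size: int     # offset 36, 4 bytes
    s_scnptr: int   # offset 40, 4 bytes
    s_relptr: int   # offset 44, 4 bytes
    s_lnnoptr: int  # offset 48, 4 bytes
    s_nreloc: int   # offset 52, 2 bytes
    s_nlnno: int    # offset 54, 2 bytes
    s_flags: int    # offset 56, 4 bytes


SECTION_FORMAT = "<8sIIIIIIHHI"  # little endian, 40 bytes in total
FIRST_SECTION_OFFSET = 20        # valid only when f_opthdr is 0


def parse_first_section(data: bytes) -> CoffSectionHeader:
    """Decode the first section header of an object without optional header."""
    return CoffSectionHeader(
        *struct.unpack_from(SECTION_FORMAT, data, FIRST_SECTION_OFFSET)
    )


# Synthetic example: 20 placeholder bytes for the file header, then ".text".
buffer = bytes(20) + struct.pack(
    SECTION_FORMAT, b".text", 0, 0, 0x1234, 0x3C, 0x200, 0, 2, 0, 0x20
)
section = parse_first_section(buffer)
print(section.s_name.rstrip(b"\0"), hex(section.s_size))
```

The first section header ends at offset 59, so the nil patterns above all
fall inside it.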

Patterns at offsets of 60 and higher lie behind the first section header
(further section headers or section data). So I deleted all such high
patterns like:
   <Pattern>
      <Bytes>00000000</Bytes>
      <Pos>88</Pos>
   </Pattern>
   <Pattern>
      <Bytes>000000</Bytes>
      <Pos>93</Pos>
   </Pattern>
With this TrID definition all my 80386 COFF objects are now recognized (see
appended output/trid-v-new.txt). If somebody finds bigger COFF objects, then
some nil patterns in the definition will become shorter or vanish.
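
Put together, the patterns described in this post behave roughly like the
following Python sketch. This is only my reconstruction from the text above,
not the content of the XML file in the archive, and it does not model how
TrID itself scores patterns; the function name is made up:

```python
import struct

# (position, bytes) pairs as described in the post for the i386 variant.
PATTERNS = [
    (0, bytes.fromhex("4C01")),   # I386MAGIC, little endian
    (3, b"\x00"),                 # upper byte of f_nscns
    (11, b"\x00"),                # upper byte of f_symptr
    (14, bytes(4)),               # upper half of f_nsyms and f_opthdr
    (20, b"\x2E"),                # first section name starts with a point
    (28, bytes(8)),               # s_paddr and s_vaddr
    (39, b"\x00"),                # upper byte of s_size
    (42, bytes(2)),               # upper half of s_scnptr
    (47, b"\x00"),                # upper byte of s_relptr
    (50, bytes(2)),               # upper half of s_lnnoptr
]


def matches_i386_coff(data: bytes) -> bool:
    """Return True if every pattern is found at its position."""
    return all(data[pos:pos + len(pat)] == pat for pos, pat in PATTERNS)


# Synthetic object header for a quick check, not a real file.
example = struct.pack("<HHIIIHH", 0x014C, 3, 0, 0x35120, 1546, 0, 0x0104)
example += struct.pack("<8sIIIIIIHHI", b".text", 0, 0, 0x1000, 0x8C,
                       0x108C, 0, 2, 0, 0x20)
print(matches_i386_coff(example))
```

A bigger object, for example one with an f_symptr above 16 MiB, would fail
this check, which is why some nil patterns may have to shrink later.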

Instead of the generic MIME type application/octet-stream I now use a user
defined one. That is expressed by a line like:
   <Mime>application/x-coff</Mime>

Still, 716 other COFF samples are not recognized by TrID. But the file
command identifies these examples as "Intel amd64 COFF object file" (see
appended amd64/output/file-5.39.txt).
As before, I ran tridscan on a few examples, looked at the output of the
file command and refined the definition with more examples, interpreting the
patterns with the help of the COFF specifications. So I get
o-coff-amd64.trid.xml. Because all COFF variants share the same header
layout, the only differences are the 2 byte magic at the beginning and maybe
some flags. So if one pattern becomes shorter or vanishes, the same should
happen in the other COFF definitions. In the amd64 variant the second
pattern first looked like:
   <Bytes>0000000000</Bytes>
   <Pos>13</Pos>
This becomes shorter for bigger f_nsyms values. So it then becomes:
   <Bytes>00000000</Bytes>
   <Pos>14</Pos>

All my inspected amd64 examples have a text and a data section. That was
expressed inside the global strings section by lines like:
   <String>.DATA</String>
   <String>.TEXT</String>
And these sections also begin at fixed positions. That was described by XML
constructs like:
   <Pattern>
      <Bytes>002E746578740000000000000000000000</Bytes>
      <ASCII> . . t e x t</ASCII>
      <Pos>19</Pos>
   </Pattern>
   <Pattern>
      <Bytes>0000200050602E646174610000000000000000000000</Bytes>
      <ASCII> . .   . P ` . d a t a</ASCII>
      <Pos>54</Pos>
   </Pattern>
But after synchronizing with the i386 examples, we know that this is not
always true. So only the point character of the first section name is
assumed to be always present. So the above constructs now become:
   <Bytes>2E</Bytes>
   <Pos>20</Pos>

The distinguishing pattern for the AMD variant is the start magic
IMAGE_FILE_MACHINE_AMD64 with the value 0x8664, stored in little endian.
This is expressed by an XML construct like:
   <Bytes>6486</Bytes>
   <ASCII> d</ASCII>
   <Pos>0</Pos>

According to the documentation the file name extension cof could occur, but
unfortunately I found no examples with that extension. According to the
documentation COFF libraries use the file name extension lib, whereas obj
and o are used for COFF objects. The library examples are identified by the
file command and by the new TrID definitions in the same manner as COFF
objects (see appended lib/output/file.tmp and amd64/lib/output/file.tmp).
Unfortunately I only found 3 COFF libraries, all variants of msvcrt.lib. So
I do not know if there is a reliable way to distinguish libraries from
objects. For my 3 COFF libraries the first section name was ".drectve". So
for the moment I also add the LIB name extension. This is now expressed by a
line like:
 <Ext>O/OBJ/LIB</Ext>
and also:
 <FileType>
 Intel amd64 Common Object File Format (COFF) object or library
 </FileType>

The updated Wikipedia page about COFF now shows that COFF exists for other
CPU architectures. So I changed the remark line in the TrID definitions
from
   <Rem>There exist more COFF objects for other CPUs</Rem>
to
   <Rem>
   with limits (256 for f_nscns, 16 MiB f_symptr, 64 KiB f_nsyms,
   16 MiB s_size, 64 KiB s_scnptr, 16 MiB s_relptr, 64 KiB s_lnnoptr)
   and 1st section name starting with point.
   LIB extension is used for libraries.
   </Rem>

Then I created a third TrID definition, o-coff-ia64.trid.xml. This is for
Intel Itanium CPUs. That is expressed by a line like:
   <FileType>
   Intel ia64 Common Object File Format (COFF) object or library
   </FileType>
Unfortunately I only found 1 example. But according to the documentation it
shares most of its beginning with Intel 80386 COFF. The difference is the
2 byte start magic. Here that is IMAGE_FILE_MACHINE_IA64 with the value
0x0200. That is expressed by an XML construct like:
   <Bytes>0002</Bytes>
   <Pos>0</Pos>
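
The three start magics can be compared side by side with a small Python
sketch. The dictionary and function names are only illustrative; the values
are the ones named in this post:

```python
import struct

# Machine magics as named in the post (DJGPP and Microsoft documentation).
MACHINE_MAGICS = {
    "i386": 0x014C,   # I386MAGIC
    "amd64": 0x8664,  # IMAGE_FILE_MACHINE_AMD64
    "ia64": 0x0200,   # IMAGE_FILE_MACHINE_IA64
}


def magic_pattern(value: int) -> str:
    """Return the hex string for the <Bytes> element at position 0."""
    return struct.pack("<H", value).hex().upper()


for name, value in MACHINE_MAGICS.items():
    print(name, magic_pattern(value))
# i386 4C01
# amd64 6486
# ia64 0002
```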

Unfortunately this pattern also occurs in Windows FNT fonts with version 2
(fnt-windows-v2.trid.xml) and in DEGAS hi-res bitmaps described by
bitmap-pi3-degas.trid.xml (see appended ia64/output/trid-v.txt). So it
becomes clear that a magic pattern of only 2 bytes is too unspecific for
recognition, and it is important to keep some additional characteristic nil
patterns.
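
Why the three formats collide can be shown in a few lines of Python. The
variable names are made up; the field meanings (version word 0x0200 stored
little endian for FNT version 2, resolution word 2 stored big endian for
DEGAS high resolution) are my reading of those formats:

```python
import struct

ia64_magic = struct.pack("<H", 0x0200)      # COFF f_magic, little endian
fnt_v2_version = struct.pack("<H", 0x0200)  # FNT version word, little endian
degas_hi_res = struct.pack(">H", 2)         # DEGAS resolution word, big endian

# All three files start with the same two bytes 00 02.
print(ia64_magic == fnt_v2_version == degas_hi_res)  # True
print(ia64_magic.hex())                              # 0002
```

With only the bytes at position 0 to go on, such files cannot be told
apart, so the nil patterns further into the header have to do that work.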

With the 3 definitions all my COFF samples are now recognized (see appended
output/trid-v-new.txt).

Unfortunately there also exist COFF executables, but I did not find such
examples. There, some other bits in the flags field are set and such
examples probably have an additional optional header. So such examples are
not described by my TrID definitions.

The TrID definitions, some examples and the output are stored in the archive
coff.zip. I hope that my XML files can be used in a future version of
triddefs.

With best wishes
Jörg Jenderek

Mark0

Re: replaced o-coff.trid.xml for Intel 80386 COFF object *.o *.obj + variants
« Reply #1 on: February 08, 2021, 10:44:48 PM »
Many thanks!