Author Topic: updated hxs.trid.xml for Microsoft compiled help format 2.0 + replacement/varian

jenderek

Hello trid users,

A few days ago I handled some MZ executables. Some of them have the file
name extension HXS.

So I ran the trid utility on my dozen HXS examples. All are described
correctly as "Microsoft compiled help format 2.0" by hxs.trid.xml, but with
a low recognition rate in the ten percent range. Often the first description
is "Win32 Executable (generic)" by exe-win.trid.xml, with a rate of about
one third. But I also found samples like POWERPNT.HXS where the first
description is "GRASP animation" by gl-grasp.trid.xml (see appended
output/trid-v-old.txt).

For comparison I also ran the file command (version 5.43) on these samples.
There they are described as "PE32 executable" with the sub classification
"(DLL)" and "Intel 80386", but with a wrong mime type and file name suffix
(see appended output/file-5.43.txt). The newest file command (msdos,v 1.161
2022/11/21) now describes the HXS samples correctly with the sub
classification "(Microsoft compiled help format 2.0)". It also shows some
additional information like "2 sections", which this version uses as an
additional test criterion (see appended output/file.txt). Furthermore it now
shows a reasonable mime type (see appended output/file-i.txt) and the
correct file name suffix (see appended output/file-ext.txt).

For comparison I also ran the file format identification utility DROID (see
https://sourceforge.net/projects/droid/). There all real samples are
described at least as "Windows Portable Executable" with mime type
application/vnd.microsoft.portable-executable. DROID accepts the EXE
extension for this format but not the EFI and HXS extensions
(EXTENSION_MISMATCH true). The samples that the file command describes with
the phrase "PE32+ executable" are described as the 64 bit variant by PUID
fmt/900, via an additional test for the byte sequence 0B02. The samples that
the file command describes with the phrase "PE32 executable" are described
as the 32 bit variant by PUID fmt/899, via an additional test for the byte
sequence 0B01. The samples not detected by the file command are described as
"Windows Portable Executable" without any version field by PUID x-fmt/411
(see appended droid-hxs-cpl.csv).

The current definition uses the HTML page itolitlsformat.html on the web
site speakeasy.org as reference. Unfortunately this server no longer exists,
but the page can be found on the web site russotto.net. I would not adopt it
as the new reference on that basis alone, but the Russotto link is also
cited as a reference on the Wikipedia page about Microsoft Help 2. So it now
becomes the new reference URL, replacing the old one that no longer exists.
This is now expressed by a line like:
    <RefURL>http://www.russotto.net/chm/itolitlsformat.html</RefURL>

The current definition has no mime type, and no explicit one exists for HXS
samples. But the documentation says that the help content is embedded inside
a PE executable container, and a mime type exists for that format. So HXS
examples should get the mime type of the container format. That is what
DROID and the newest file command do. So this is now expressed by a line
like:
   <Mime>application/vnd.microsoft.portable-executable</Mime>

With the current definition the HXS samples are recognized, but not as the
first entry. The recognition rate could be higher because the definition is
stripped down too much. I will explain why.

So I ran tridscan on my dozen HXS examples and generated a replacement
variant hxs-386.trid.xml. In the current definition the front block contains
an XML pattern like:
   <Bytes>0040000000504500004C</Bytes>
   <ASCII> . @ . . . P E . . L</ASCII>
   <Pos>59</Pos>
If an XML construct contains a string-like pattern, then an equivalent line
should occur in the global strings, like:
   <String>PE''L</String>
But the current definition does not contain a global string section. If you
have a complete and official file specification, you can strip the
definition down to the parts relevant for recognition. Then the string line
is redundant. But if you do not know the file format and add samples where
the characteristic strings appear at other offsets, then the XML pattern
vanishes completely whereas the string line survives.

The byte sequence 50450000 is the starting string pattern PE\0\0 of PE
executables. So the above XML construct, stripped down to the relevant part,
would become:
   <Bytes>504500004C</Bytes>
   <ASCII> P E . . L</ASCII>
   <Pos>64</Pos>
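
As a minimal sketch of what such a front block pattern means, the check below
compares a <Bytes> hex string with the data at offset <Pos>. It is not taken
from TrID itself: the function name pattern_matches and the synthetic header
are assumptions made only for illustration.

```python
def pattern_matches(data: bytes, hex_bytes: str, pos: int) -> bool:
    """Compare a <Bytes> hex string with the file content at offset <Pos>."""
    pattern = bytes.fromhex(hex_bytes)
    return data[pos:pos + len(pattern)] == pattern


# Synthetic header shaped like the HXS samples (assumption, not a real file).
header = bytearray(72)
header[0:2] = b"MZ"
header[60:64] = bytes.fromhex("40000000")    # offset of the PE header = 64
header[64:69] = bytes.fromhex("504500004C")  # "PE\0\0" followed by "L"

print(pattern_matches(bytes(header), "0040000000504500004C", 59))  # True
print(pattern_matches(bytes(header), "504500004C", 64))            # True
print(pattern_matches(bytes(header), "504500004C", 60))            # False
```

Both the original pattern at position 59 and the stripped one at position 64
match the same bytes; only the position and length differ.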

That means the new exe header starts at offset 64 (hexadecimal 40). The
offset of the new exe header is stored at offset 60 (hexadecimal 3C) as the
4 byte little endian integer variable e_lfanew. With this information the
construct becomes:
   <Bytes>4000000050450000</Bytes>
   <ASCII> @ . . . P E . .</ASCII>
   <Pos>60</Pos>
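
The lookup via e_lfanew can be sketched in Python as follows. This is only an
illustration under the assumptions stated in the comments; the function name
pe_header_offset and the synthetic sample are hypothetical.

```python
import struct
from typing import Optional


def pe_header_offset(data: bytes) -> Optional[int]:
    """Return the offset of the PE signature, or None if it is not found."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None
    # e_lfanew: 4 byte little endian integer at offset 0x3C (60)
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        return None
    return e_lfanew


# Synthetic header shaped like the HXS samples (assumption, not a real file).
sample = bytearray(72)
sample[0:2] = b"MZ"
struct.pack_into("<I", sample, 0x3C, 0x40)
sample[0x40:0x44] = b"PE\0\0"

print(pe_header_offset(bytes(sample)))  # 64
```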

According to the Microsoft documentation, the PE magic is followed by the
machine type as a 2 byte little endian integer. For Intel 386, as shown by
the file command, this value is 0x014C (stored as byte sequence 4C01, or
L\001 as a string). If a variant for x64 exists, marked by the phrase
"x86-64" in the file command output, that value would be 0x8664 (stored as
byte sequence 6486, or d\x86 as a string). So if all examples are for Intel
386, then this construct should look like:
   <Bytes>40000000504500004C01</Bytes>
   <ASCII> @ . . . P E . . L .</ASCII>
   <Pos>60</Pos>

After the machine type, the number of sections is stored as a 2 byte little
endian integer. For the HXS examples the file command reports the value 2.
For real PE executables or libraries this value is higher. Three is a
typical value, for sections of code, data and resources. As mentioned in the
documentation, one section contains the help content. This section is
labeled .its. Apparently the HXS examples also contain another section for
resources.
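
Reading the machine type and the number of sections could be sketched like
this. The function name, the MACHINE_NAMES table and the synthetic header are
assumptions for illustration; the PE header is assumed to start at offset 64
as in my samples.

```python
import struct

# Only the two machine types discussed here (not a complete list).
MACHINE_NAMES = {0x014C: "Intel 80386", 0x8664: "x86-64"}


def coff_machine_and_sections(data: bytes, pe_offset: int = 64):
    """Read Machine and NumberOfSections, which follow the 4 byte PE magic."""
    machine, n_sections = struct.unpack_from("<HH", data, pe_offset + 4)
    return machine, n_sections


# Synthetic header (assumption, not a real file).
header = bytearray(72)
header[64:68] = b"PE\0\0"
header[68:72] = bytes.fromhex("4C010200")  # machine 0x014C, 2 sections

machine, n_sections = coff_machine_and_sections(bytes(header))
print(MACHINE_NAMES.get(machine, hex(machine)), n_sections)
```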

So most Windows executables and libraries are now excluded by the XML
construct, which becomes:
   <Bytes>40000000504500004C010200</Bytes>
   <ASCII> @ . . . P E . . L . . .</ASCII>
   <Pos>60</Pos>
The information about the HXS examples can be verified by a command like:
   pelook.exe -h WINWORD.HXS
So we see that the resource section apparently has the name .rsrc. This
would be expressed inside the global string section by lines like:
   <String>.RSRC</String>
   <String>.ITS</String>

If the section names appear at the same offsets in all samples, that would
be expressed inside the front block by XML constructs like:
   <Bytes>2E72737263000000</Bytes>
   <ASCII> . r s r c</ASCII>
   <Pos>312</Pos>
   <Bytes>2E697473</Bytes>
   <ASCII> . i t s</ASCII>
   <Pos>352</Pos>
In the current definition these look like:
   <Bytes>72737263000000</Bytes>
   <ASCII> r s r c</ASCII>
   <Pos>313</Pos>
   <Bytes>402E697473</Bytes>
   <ASCII> @ . i t s</ASCII>
   <Pos>351</Pos>
By default the Intel compiler generates section names starting with a point
character (. = 2E) followed by lowercase letters, whereas some Borland
compilers produce names without the point character and in upper case. So if
one section has the name .its, then with a standard configuration the other
section name cannot be rsrc. The correct second name must be .rsrc. So
apparently the current definition is stripped down too much here as well.
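
The section names can also be read directly from the section table. The
sketch below assumes the standard PE layout (section table after the 24 byte
new exe header plus the optional header, 40 bytes per entry with an 8 byte
name). The synthetic image and its optional header size of 224 bytes are my
assumptions, not values taken from the samples; with them the names land at
offsets 312 and 352.

```python
import struct


def section_names(data: bytes, pe_offset: int = 64):
    """Return (offset, name) for every entry of the PE section table."""
    (n_sections,) = struct.unpack_from("<H", data, pe_offset + 6)
    (opt_size,) = struct.unpack_from("<H", data, pe_offset + 20)
    table = pe_offset + 24 + opt_size
    result = []
    for i in range(n_sections):
        start = table + 40 * i
        raw = data[start:start + 8]  # name is padded with null bytes
        result.append((start, raw.rstrip(b"\0").decode("ascii", "replace")))
    return result


# Synthetic image (assumption, not a real file).
image = bytearray(312 + 2 * 40)
image[64:68] = b"PE\0\0"
struct.pack_into("<HH", image, 68, 0x014C, 2)  # machine, number of sections
struct.pack_into("<H", image, 84, 224)         # size of optional header
image[312:317] = b".rsrc"
image[352:356] = b".its"

print(section_names(bytes(image)))  # [(312, '.rsrc'), (352, '.its')]
```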

Because the HXS samples are not object files, they contain an optional
header after the new exe header (24 bytes). If the new exe header starts at
offset 64 (40h) in the examples, then the optional header starts at offset
88 (64+24). The optional header starts with a 2 byte magic. The value 0x107
identifies it as a ROM image, and 0x20B identifies it as a PE32+ executable.
The HXS examples are described as "PE32 executable". The corresponding magic
for such files is 0x10B (stored as byte sequence 0B01). So this information
would be expressed by an XML construct like:
   <Bytes>0B01</Bytes>
   <Pos>88</Pos>
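
A small sketch of this magic check, assuming the PE header at offset 64 as in
my samples. The names OPTIONAL_MAGIC and optional_header_kind are
hypothetical and the buffer is synthetic.

```python
import struct

# The three magic values mentioned above.
OPTIONAL_MAGIC = {0x107: "ROM image", 0x10B: "PE32", 0x20B: "PE32+"}


def optional_header_kind(data: bytes, pe_offset: int = 64) -> str:
    """Classify the optional header by its 2 byte little endian magic."""
    (magic,) = struct.unpack_from("<H", data, pe_offset + 24)
    return OPTIONAL_MAGIC.get(magic, "unknown")


# Synthetic image (assumption): only the magic at offset 88 is filled in.
image = bytearray(90)
image[88:90] = bytes.fromhex("0B01")

print(optional_header_kind(bytes(image)))  # PE32
```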

At relative offset 68 in the optional header the subsystem is stored as a 2
byte little endian integer. So if the optional header starts at 88, then
this field appears at absolute offset 156 (=88+68). The subsystem field
names the subsystem required to run the image. Typical values are 2 for
Windows GUI or 3 for Windows CUI. Because HXS files contain no executable
code, I find the value 0 for unknown subsystem here. So this is one really
significant part to distinguish HXS from others like DLL. The only PE
executable I found with unknown subsystem was joy.cpl from WINE 1.7.28, but
there I found more PE sections than expected. These 3 sections are .text,
.reloc and .rsrc, with no .its section. So the distinguishing XML construct
is like:
   <Bytes>0000</Bytes>
   <Pos>156</Pos>
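
The two criteria together (2 sections and subsystem 0) can be sketched as one
check. This is only an illustration of the reasoning above, not how TrID
works internally; the names looks_like_hxs and fake_image are hypothetical
and the images are synthetic.

```python
import struct


def looks_like_hxs(data: bytes, pe_offset: int = 64) -> bool:
    """True if the image has exactly 2 sections and subsystem 0 (unknown)."""
    (n_sections,) = struct.unpack_from("<H", data, pe_offset + 6)
    (subsystem,) = struct.unpack_from("<H", data, pe_offset + 24 + 68)
    return n_sections == 2 and subsystem == 0


def fake_image(n_sections: int, subsystem: int) -> bytes:
    """Build a synthetic header with only the two tested fields filled in."""
    image = bytearray(158)
    image[64:68] = b"PE\0\0"
    struct.pack_into("<H", image, 70, n_sections)
    struct.pack_into("<H", image, 156, subsystem)
    return bytes(image)


print(looks_like_hxs(fake_image(2, 0)))  # True: shaped like an HXS sample
print(looks_like_hxs(fake_image(3, 0)))  # False: shaped like joy.cpl
print(looks_like_hxs(fake_image(3, 2)))  # False: shaped like a GUI DLL
```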

So I ran tridscan on my dozen HXS examples to generate hxs-386.trid.xml.
Unfortunately this definition is based on only a dozen examples, so I get
many nil patterns and many string lines. So I stripped this definition down
to hxs-minmal.trid.xml according to the observations and considerations
above. I also kept the mentioned key words inside the front block and the
global string section, like:
      <String>{0A9007C6-4076-11D3-8789-0000F8105754}</String>
      <String>MSCOMPRESSED</String>
      <String>ITOLITLS</String>
      <String>IFCM</String>
      <String>AOLL</String>
      <String>CAOL</String>
      <String>LZXC</String>

With the 386 variant of the TrID definition my HXS examples are still
described, but now this description comes first. With the minimal variant my
HXS examples are still described, but the rate is higher than for all
definitions concerning EXE (see appended output/trid-v-news.txt). The TrID
definitions and output are stored in the archive hxs.zip. I hope that my
variants can be used in a future version of triddefs. If the variants work
OK, then the generic hxs.trid.xml can be deleted to avoid the low
recognition rate.

With best wishes
Jörg Jenderek

Mark0

I will immediately update the existing HXS definition with your updated one.
Then I'll try to test the other two with some other HXS samples.
Thanks!