Topic: replacement for java-class.trid.xml, exe-ub.trid.xml; CAFEBABE magic

jenderek

Hello trid users,

Some days ago I ran TrID on dozens of Mac OS X Mach-O universal dynamically
linked shared libraries (*.dylib), Mach-O bundles (*.bundle), Mach-O
executables without a filename extension, and thousands of Java byte-code
files (*.class).

All inspected samples are described by exe-ub.trid.xml as "Mac OS X
Universal Binary executable". So all Java byte-code files are misidentified
with 40% probability as such executables. Conversely, an executable-like
file is misidentified with a 60% rate as Java byte-code (see appended
output/trid-v-old.txt).

The file command (see https://en.wikipedia.org/wiki/File_(command)) also has
some difficulty distinguishing the two, but it identifies the example files
correctly as "Mach-O universal binary" (see appended
output/file-k-5.39.txt), because the file command uses another method to
detect such binaries.

TrID identifies all such examples by the 4-byte magic string at the
beginning. This is identical for Java byte-code and Mach-O universal
binaries. So in both definition files this is expressed in the pattern
section by an XML construct like:
   <Bytes>CAFEBABE</Bytes>
   <Pos>0</Pos>

The difference between the two definition files is that java-class.trid.xml
contains in its global string section an additional line like:
   <String>JAVA</String>
So every Mach-O universal binary which contains the string Java will be
misidentified as a Java class file.
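To illustrate the collision, here is a minimal Python sketch of the old java-class rule as described above. It is only a simplified model, not TrID's actual matching engine; the name `old_java_rule`, the case-insensitive string search and the sample bytes are assumptions made for illustration.

```python
MAGIC = bytes.fromhex("CAFEBABE")  # shared by Java class files and Mach-O universal binaries


def old_java_rule(data: bytes) -> bool:
    """Simplified model: magic at offset 0 plus the string JAVA anywhere in the file.

    Case-insensitive matching is an assumption made for this sketch.
    """
    return data[:4] == MAGIC and b"JAVA" in data.upper()


# Hypothetical universal binary: magic, 2 architectures, and an embedded path.
fake_macho = MAGIC + (2).to_bytes(4, "big") + b"\x00" * 40 + b"/usr/lib/libjava.dylib"
print(old_java_rule(fake_macho))  # True
```

Under these assumptions, any universal binary that merely links against or mentions something Java-related satisfies the old rule.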

So I looked at how the file command distinguishes them and then tried to
adopt this method for TrID in the replacement java-class-new.trid.xml.

According to the Java class file page on Wikipedia, the major version is
stored at offset 6 as a 2-byte value in big-endian order. It ranges from 45
(=0x2D for JDK 1.1) to 58 (=0x3A for Java SE 14) in the year 2020. The file
command takes values above 30 as characteristic of Java class files.
Unfortunately TrID has no construct for testing "above a value", but the
upper byte of the major version is always null as long as Java never reaches
a major version number of 256 or higher, which I assume is very unlikely. So
the major version part is expressed by an additional XML construct like:
   <Bytes>00</Bytes>
   <Pos>6</Pos>
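As a small check of this reasoning, the following Python sketch reads the major version from a class file header. The function name `class_major_version` and the 8-byte sample header are made up for illustration; only the offsets and values come from the description above.

```python
import struct


def class_major_version(data: bytes) -> int:
    """Return the 2-byte big-endian major version stored at offset 6."""
    (major,) = struct.unpack_from(">H", data, 6)
    return major


# Hypothetical 8-byte header: magic, minor version 0, major version 58 (Java SE 14).
header = bytes.fromhex("CAFEBABE" "0000" "003A")
major = class_major_version(header)
print(major, header[6] == 0x00)  # 58 True
```

Any major version below 256 leaves the byte at offset 6 at zero, which is what the pattern tests.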

At offset 4 the minor version is stored as a 2-byte value in big-endian
order. Theoretically a minor version of 65535 can exist. But after testing
some thousand Java class files I only found low values like 0 or 3. So it is
very unlikely that a minor version number of 256 or higher exists, and the
upper byte of the minor version is then also null. That is expressed
together with the magic string by an XML construct like:
   <Bytes>CAFEBABE00</Bytes>
   <Pos>0</Pos>

Unfortunately this was not sufficient to distinguish Mach-O from Java class
files. At offset 8 the constant pool count is stored, which is apparently
always non-zero. That can be verified by the output of a patched file
command (see appended output/file.txt). Furthermore, I now mention this in
the remark line instead of the old comment.
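Why the two version patterns alone are not enough can be seen in a short Python sketch. The helper `matches` is only a rough model of a <Bytes>/<Pos> pair, not TrID's real engine, and both sample headers are hypothetical.

```python
def matches(data: bytes, pattern_hex: str, pos: int) -> bool:
    """True if the hex pattern occurs at the given offset (model of <Bytes>/<Pos>)."""
    pattern = bytes.fromhex(pattern_hex)
    return data[pos:pos + len(pattern)] == pattern


def java_patterns(data: bytes) -> bool:
    # Magic plus upper byte of the minor version, and upper byte of the major version.
    return matches(data, "CAFEBABE00", 0) and matches(data, "00", 6)


# Hypothetical headers: a Java SE 14 class with a constant pool count of 31,
# and a universal binary with 2 architectures, the first being i386 (CPU type 7).
java_class = bytes.fromhex("CAFEBABE0000003A001F")
fat_macho = bytes.fromhex("CAFEBABE0000000200000007")

print(java_patterns(java_class), java_patterns(fat_macho))  # True True
print(int.from_bytes(java_class[8:10], "big"))              # 31
```

Both headers satisfy the patterns, because a small architecture count also leaves the bytes at offsets 4 and 6 at zero.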

In the current java-class.trid.xml the mime type application/java-byte-code
is used. But according to the IANA site this mime type is not officially
registered. So I replaced it with a user-defined one (starting with x-) that
is shown by the file command (see appended output/file-ik-5.39.txt) and by
http://extension.nirsoft.net/class. That is now expressed by a line like:
   <Mime>application/x-java-applet</Mime>

In exe-ub.trid.xml there is no reference URL. So in the replacement
definition I added the Wikipedia page about the Mach-O file format. That is
now expressed by a line like:
   <RefURL>https://en.wikipedia.org/wiki/Mach-O</RefURL>

According to the documentation, not only Mac OS X Universal Binary
executables without a file name extension (like sgdisk and file) are
described, but also dynamically linked shared libraries (file name
extension dylib) and Universal Binary bundles (file name extension bundle).
So I removed the phrase "executable" and replaced it with "(generic)" in the
replacement definition. I also changed the TrID definition name from
exe-ub.trid.xml to ub-gen.trid.xml. The possible file name extensions are
now shown by a line like:
   <Ext>DYLIB/BUNDLE/O</Ext>

Instead of the generic mime type application/octet-stream I now use the
user-defined one that is shown by the file command (see appended
output/file-ik-5.39.txt). That is now expressed by a line like:
   <Mime>application/x-mach-binary</Mime>

According to the file command, the number of architectures is stored at
offset 4 as a 4-byte value in big-endian order. A typical value is 2, for
Mac OS X Universal Binaries for the i386 and x86_64 architectures. I often
also find examples with value 1 for one of these x86 architectures. I also
found a few samples with value 3 and one example with value 4,
libclang_rt.asan_watchos_dynamic.dylib (see appended output/file.txt).

There seem to exist only about two dozen CPU architectures for embedding.
The highest number listed by the file command is 18 for ppc. So in the worst
case a Mach-O universal binary apparently contains at most 18 binaries. The
file command therefore considers a value below 20 as characteristic of
Mach-O files, compared with the "high" value for Java classes. Such "low"
values mean that the 3 upper bytes of the architecture count are null. That
is expressed together with the magic string by an XML construct like:
   <Bytes>CAFEBABE000000</Bytes>
   <Pos>0</Pos>
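The architecture count can be read with a few lines of Python. This is only a sketch: the function name `fat_arch_count` and the sample header are hypothetical, and the threshold 20 is the one attributed to the file command above.

```python
import struct

FAT_MAGIC = 0xCAFEBABE


def fat_arch_count(data: bytes) -> int:
    """Return the 4-byte big-endian architecture count stored at offset 4."""
    magic, count = struct.unpack_from(">II", data, 0)
    if magic != FAT_MAGIC:
        raise ValueError("not a CAFEBABE file")
    return count


# Hypothetical header of a universal binary with 2 architectures.
header = bytes.fromhex("CAFEBABE00000002")
count = fat_arch_count(header)
# Any count below 256 leaves the three upper bytes at offsets 4..6 at zero.
print(count, count < 20, header[:7] == bytes.fromhex("CAFEBABE000000"))  # 2 True True
```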

According to the file command, the CPU type of the first architecture is
stored at offset 8 as a 4-byte value. The upper byte seems to be 1 for 64-bit
architectures and 0 for 32-bit architectures. The remaining 3 bytes
apparently contain low values like 7 for x86 CPUs and 12 (0Ch) for ARM CPUs.
The highest mentioned number is 18 for PowerPC. That means the 2 bytes in
the middle, at offsets 9 and 10, are null. That is now expressed by an XML
construct like:
   <Bytes>0000</Bytes>
   <Pos>9</Pos>
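The following Python sketch decodes that CPU type field and models the pattern. It assumes the field starts at offset 8, directly after the architecture count, which is what <Pos>9</Pos> for its two middle bytes implies. The function names and both sample headers are hypothetical.

```python
CPU_ARCH_ABI64 = 0x01000000  # 64-bit flag in the upper byte of the CPU type


def first_cpu_type(data: bytes) -> tuple:
    """Return (base CPU type, is_64bit) from the 4-byte big-endian value at offset 8."""
    raw = int.from_bytes(data[8:12], "big")
    return raw & 0x00FFFFFF, bool(raw & CPU_ARCH_ABI64)


def middle_bytes_null(data: bytes) -> bool:
    """Model of the <Bytes>0000</Bytes> pattern at <Pos>9</Pos>."""
    return data[9:11] == b"\x00\x00"


# Hypothetical headers: a universal binary whose first slice is x86_64,
# and a Java SE 14 class whose first constant pool entry has tag 0x0A.
fat_macho = bytes.fromhex("CAFEBABE0000000201000007")
java_class = bytes.fromhex("CAFEBABE0000003A001F0A")

print(first_cpu_type(fat_macho))      # (7, True)
print(middle_bytes_null(fat_macho))   # True
print(middle_bytes_null(java_class))  # False
```

As far as I know, in a class file with at least one constant pool entry the byte at offset 10 is the tag of the first entry, and tag 0 is not used, so this pattern should not match such class files.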

That was sufficient for me to distinguish Java class files from Mach-O.
At offset 12 the CPU sub type is stored as a 4-byte value. So the same
consideration as for the CPU type can be applied to the sub type if needed.

With the 2 TrID definitions all my Mac OS X Universal Binaries and Java
class files are described correctly (see appended output/trid-new.txt) and
the recognition rate is raised (see appended output/trid.txt).

The TrID definitions, some examples and the output are stored in the archive
class_macho.zip. I hope that the 2 XML files can be used in a future version
of triddefs as replacements after additional tests with other exotic CPU
architectures.

With best wishes
Jörg Jenderek

Mark0

Re: replacement for java-class.trid.xml, exe-ub.trid.xml; CAFEBABE magic
« Reply #1 on: August 25, 2020, 09:44:49 PM »
Will check, thanks!