Hello TrID users,
some days ago I had to handle some Linux kernels, so I ran the TrID utility
on such Linux kernel samples. All samples are described with low priority as
"Master Boot Record dump" by mbr-dump.trid.xml. "New" kernel variants like
the linux64 sample are described with highest priority as "Linux kernel x86
bootable" by dat-linux.trid.xml (see appended output/trid-v-old.txt).
For comparison I also ran the file command (version 5.44) on such samples
(see appended output/file-5.44.txt). Here all samples are recognized and
described. With the keep-going option (-k) I also get the low priority
description, for which the phrase "DOS/MBR boot sector" is used as
describing text. But here all samples are described by a phrase like
"Linux kernel" (see appended output/file-k-5.44.txt).
Unfortunately no file name suffix is shown for "old" kernel variants, but
for the "new" kernel variant 4 name extensions /dat/bin/lnx are listed (see
appended output/file-ext-5.44.txt). Only the generic MIME type
application/octet-stream is shown (see appended output/file-i-5.44.txt).
For comparison I also ran the file format identification utility DROID (see
https://sourceforge.net/projects/droid/). Here many "modern" examples are
recognized and described as "Windows Portable Executable" with version
"64 bit" by PUID fmt/900.
I had difficulty finding such "old" samples, because nowadays most samples
use the "new" format, especially all samples for UEFI computers. Some people
say it is the year 2023, so do not care about such old, seldom used
variants. Unfortunately the old format is still used by some tools like the
boot loader lilo (see
https://en.wikipedia.org/wiki/LILO_(bootloader) or
https://www.joonet.de/lilo/ ) or older memtest tools (Memtest86+, see
https://www.memtest.org/ or
https://en.wikipedia.org/wiki/Memtest86%2B).
The variant is also used by Linux ELKS (see
https://github.com/jbruchon/elks or
https://en.wikipedia.org/wiki/Embeddable_Linux_Kernel_Subset).
What all the tools describe with a phrase like "boot" is triggered by the
boot signature. That is expressed inside the TrID definition in the front
block section by an XML construct like:
<Bytes>55AAEB</Bytes>
<ASCII> U</ASCII>
<Pos>510</Pos>
According to the Linux/x86 boot protocol mentioned by the reference URL, all
"modern" Linux kernel variants are characterised by the magic signature
"HdrS". This is expressed inside dat-linux.trid.xml by an XML construct like:
<Bytes>48647253</Bytes>
<ASCII> H d r S</ASCII>
<Pos>514</Pos>
It is also written there that this field is supported since boot protocol
version 2.0. Now you must read between the lines: that also means "older"
Linux kernel variants with a boot protocol lower than version 2.0 (that is
version 1.x or maybe 0.y) do not have this field. Therefore "older" Linux
kernels are not recognized by the current TrID definitions. But the file
command is able to recognize such "older" kernel variants.
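The two checks described above (boot signature at 510, "HdrS" at 514) can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name classify and the returned texts are made up, and the buffers are synthetic, not real kernels.

```python
BOOT_SIG_POS = 510   # position of the 55AA pattern quoted above
HDRS_POS = 514       # position of the "HdrS" pattern quoted above


def classify(data: bytes) -> str:
    """Rough guess only; a real identifier tests far more than this."""
    if data[BOOT_SIG_POS:BOOT_SIG_POS + 2] != b"\x55\xaa":
        return "no boot signature"
    if data[HDRS_POS:HDRS_POS + 4] == b"HdrS":
        return "boot signature and HdrS (protocol 2.0 or later)"
    return "boot signature only (maybe protocol below 2.0)"


# Synthetic 1 KiB buffers standing in for an "old" and a "new" sample.
old = bytearray(1024)
old[510:512] = b"\x55\xaa"
new = bytearray(old)
new[514:518] = b"HdrS"

print(classify(bytes(old)))
print(classify(bytes(new)))
```

A sample that lands in the last branch is exactly the case that is so far only described as "Master Boot Record dump".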
So I looked inside its database sources (Magdir/linux) to see how it does
it. Unfortunately the situation is not described very clearly. At least the
applied methods test the starting bytes. All tests assume that the first 4
bytes are b8c0078e. Together with the next byte d8, this sequence represents
the x86 machine instructions "mov ax,0x07C0 ; mov ds,ax". It is written there
that the following byte sequence should be d8b800908ec0b9000129f629.
Obviously this is not always true. It is true for the memtest86+ samples,
but not for the lilo bootsect.b sample and also not for the ELKS Linux
sample. See the following table:
bytes from offset 4         source
d8b800908ec0b9000129f629    file command
d8b800908ec0b9000129f629    memtest86+
d8b800018ec0b9000131f631    Linux ELKS
d8cd122d2000c1e006bb0090    bootsect.b lilo
For the lilo sample, the following byte sequence represents the x86 machine
instructions "int 0x12 ; sub ax,0x0020 ; shl ax,0x06 ; mov bx,0x9000"
according to the source bootsect.S inside the lilo sources. I had great
difficulty understanding such items, because first I had to find the
corresponding source text files and then I had to produce debug listings.
Often the compilation already fails because some header files are missing.
For the nasm assembler the -l option does this. Unfortunately I needed one
day to get the same effect with the gcc compiler via the additional
"-Wa,-adhln -g" options. So I was able to interpret the byte sequences for
the memtest86+ samples.
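The byte table above can also be compared mechanically. The following Python sketch is only an illustration: the names KNOWN_TAILS, common_len and match_start are my own, and the byte strings are simply copied from the table. It shows that after the 4 byte prefix only the single byte d8 is shared by all variants.

```python
PREFIX = bytes.fromhex("b8c0078e")   # first 4 bytes tested by file

# Bytes from offset 4 onwards, copied from the table above.
KNOWN_TAILS = {
    "file command / memtest86+": bytes.fromhex("d8b800908ec0b9000129f629"),
    "ELKS (Embeddable Linux Kernel Subset)": bytes.fromhex("d8b800018ec0b9000131f631"),
    "bootsect.b lilo": bytes.fromhex("d8cd122d2000c1e006bb0090"),
}


def common_len(seqs) -> int:
    """Number of leading bytes that are identical in all sequences."""
    n = 0
    for column in zip(*seqs):
        if len(set(column)) != 1:
            break
        n += 1
    return n


def match_start(data: bytes) -> list:
    """Names of the table rows whose start bytes equal those of data."""
    if not data.startswith(PREFIX):
        return []
    return [name for name, tail in KNOWN_TAILS.items() if data[4:16] == tail]


print(common_len(list(KNOWN_TAILS.values())))
print(match_start(PREFIX + KNOWN_TAILS["bootsect.b lilo"]))
```

So a pattern longer than the 5 bytes B8C0078ED8 at the start would already exclude some of these variants.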
As reference I could use a URL to an "old" bootsect.S. But then you are left
on your own and must interpret the assembler instructions yourself. So
instead I use a page from the Linux i386 Boot Code HOWTO on tldp.org. This is
now expressed by a line like:
<RefURL>https://tldp.org/HOWTO/Linux-i386-Boot-Code-HOWTO/bootsect.html</RefURL>
Because it is similar to the "newer" kernel described by dat-linux.trid.xml,
I chose a similar text. This is expressed by a line like:
<FileType>Linux kernel/bootloader/tools x86 bootable (v1)</FileType>
I do this because it does not only describe Linux kernels like the ELKS
variant. It is also used to start similar stuff like the lilo boot loader or
memory checking tools like memtest86+.
Instead of the generic MIME type application/octet-stream I chose a
user-defined one. That is expressed by a line like:
<Mime>application/x-linux-kernel</Mime>
Knowing what I should expect, I ran tridscan to generate
bin-linux-v1.trid.xml. Then I tried to understand what the items mean and to
refine the definition blocks. In the front block section the starting x86
machine instructions are expressed by an XML construct like:
<Bytes>B8C0078ED8</Bytes>
<Pos>0</Pos>
At the end of the boot sector we find the boot signature, which is expressed
as expected by an XML construct like:
<Bytes>55AA</Bytes>
<ASCII> U</ASCII>
<Pos>510</Pos>
Some bytes before it some variables are stored, which are obviously the same
as in the modern variant. At offset 498 (01F2 hexadecimal) the root flags are
stored as the 2 byte little endian variable root_flags. Obviously value 1
means the root is mounted read-only and value 0 means it is not mounted
read-only. So the upper byte is always nil. This is expressed by an XML
construct like:
<Bytes>00</Bytes>
<Pos>499</Pos>
According to the file command the swap device number is stored at offset 502
as a 2 byte value. In my samples this value is zero. According to the
documentation the ram size is stored at offset 504 as the 2 byte variable
ram_size. In my samples this value is zero too. This is expressed by an XML
construct like:
<Bytes>00000000</Bytes>
<Pos>502</Pos>
Assuming that other values could also occur here, I deleted that pattern.
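To look at these fields in a sample, something like the following Python sketch could be used. It assumes the offsets given above (498, 502, 504) and little endian 2 byte values. The function name read_old_fields and the key swap_dev are only chosen for illustration, and the sector is synthetic.

```python
import struct


def read_old_fields(sector: bytes) -> dict:
    """Read the 2 byte little endian fields near the end of the boot sector."""
    (root_flags,) = struct.unpack_from("<H", sector, 498)
    swap_dev, ram_size = struct.unpack_from("<HH", sector, 502)
    return {"root_flags": root_flags, "swap_dev": swap_dev, "ram_size": ram_size}


# Synthetic sector: root mounted read-only, the other two fields zero.
sector = bytearray(512)
sector[498] = 1

print(read_old_fields(bytes(sector)))
```

With root_flags limited to 0 or 1, the byte at offset 499 stays 00, which is why that single byte is usable as a pattern while the 4 bytes at 502 are not.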
I also got some nil patterns in the x86 instruction blocks. These are
expressed by XML constructs like:
<Pattern>
<Bytes>0000</Bytes>
<Pos>462</Pos>
</Pattern>
<Pattern>
<Bytes>000000000000</Bytes>
<Pos>473</Pos>
</Pattern>
When I look inside the binaries and source files, I see that the x86 machine
code block ends before the block with the stored variable fields. There I
found in many samples a boot message string like \x0d\x0aLoading\0 that is
displayed. The remaining bytes until the stored variable block are filled
with nil bytes for padding. So I deleted these XML constructs.
In the global strings section there was 1 line that looks like:
<String>RQSP</String>
When I looked in the debug listing I saw that this is the byte sequence for
the x86 machine instructions "pushw %dx ; pushw %cx ; pushw %bx ; pushw %ax".
I do not know if this is always true. Because I do not know if the 5 byte
starting pattern is unique enough, I just kept it and mention this fact in
the remark line.
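The mapping from the string RQSP to the push instructions can be checked with a small Python sketch. It relies on the one byte x86 opcodes 0x50 to 0x57 (push of a 16 bit register in 16 bit code). The table and function names are only for illustration.

```python
# One byte x86 opcodes 0x50..0x57: push of a 16 bit register in 16 bit code.
PUSH_R16 = {
    0x50: "ax", 0x51: "cx", 0x52: "dx", 0x53: "bx",
    0x54: "sp", 0x55: "bp", 0x56: "si", 0x57: "di",
}


def decode_pushes(code: bytes) -> list:
    """Decode a run of push opcodes; raises KeyError on any other byte."""
    return ["push " + PUSH_R16[b] for b in code]


print(decode_pushes(b"RQSP"))
# ['push dx', 'push cx', 'push bx', 'push ax']
```

So the string is just a side effect of four register pushes in a row, and other boot code may push in another order or not at all.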
With the new TrID definition such "old" Linux kernel variant samples are now
also described (see appended output/trid-v-new.txt). The TrID definitions, a
few samples and the output are stored in the archive linux_trid.zip. I hope
that my definition can be used in a future version of triddefs.
With best wishes
Jörg Jenderek