Author Topic: 3 replacement of dsk-skf*.trid.xml for IBM SKF disk image  (Read 936 times)

jenderek

3 replacement of dsk-skf*.trid.xml for IBM SKF disk image
« on: May 23, 2022, 01:53:52 AM »
Hello trid users,

some days ago I handled files that start with the value 0xAA, so I looked
for other files starting with this value. That led me to some floppy disk
images with the DSK file name extension.

When running TrID on several hundred such examples I get unexpected output.
Many examples (like 06200D19.DSK, 06200D9.DSK, 2M256R-K.DSK and H1-NK.DSK)
are not recognized and are described as "Unknown!".

For comparison I checked these examples with the file command utility. When
running the file command (version 5.41), the examples undetected by TrID are
all recognized (see appended output/file-5.41.txt). The newest version also
shows some additional information (like sectors, cylinders, heads, media
descriptor, FAT and root entries; see appended output/file.txt).
The examples described by the file command with an additional "old" are
described by TrID with an additional "(old)" phrase by
dsk-skf-old.trid.xml. The examples described by the file command with an
additional "compressed" are described by TrID with an additional "(comp)"
phrase by dsk-skf-comp.trid.xml. The third variant, examples described by the
file command without an additional "compressed" or "old", is described by
TrID with an additional "(uncomp)" phrase by dsk-skf.trid.xml (see appended
output/trid-v-dsk.txt).

Luckily the TrID tool with option -v shows the file name extension used and
a reference URL. That is expressed in the definition by lines like:
 <RefURL>
 http://fileformats.archiveteam.org/wiki/LoadDskF/SaveDskF
 </RefURL>

Links to samples and suitable software are also listed there. From there I
downloaded most of my inspected examples. I verified the information partly
with the 7z package tool (see appended output/7z-l-dsk.txt) and the decoding
tool deark (see appended output/deark-l-dsk.txt) using command lines like:
   7z l -tFAT *.DSK
   deark -l -m loaddskf  06200D19.DSK
7z has problems with images like DISK1-2.DSK and 2M256R-K.DSK. These are
compressed, and 7z does the listing via its FAT module. As the name hints,
this module is made for "normal" floppy images with a FAT file system. So
apparently 7z has no knowledge of floppy images whose content is compressed.
I also often got errors like "Unexpected end of archive" here. The SAVEDSKF
tool does not store or count unused sectors at the end, so that is probably
the reason for these error messages.
The deark tool (version 1.6.1) has an explicit module named "loaddskf" for
such images. So apparently it can also handle compressed images; see the
message text "Format: LoadDskF (new, compressed)". But it fails on examples
like 2-NK.DSK with the message "Warning: This file does not appear to contain
a valid FAT directory structure."

The "new" and uncompressed examples were described by dsk-skf.trid.xml. It
contains just one pattern for the starting bytes, expressed by an XML
construct like:
   <Bytes>AA59F00000020000010002E0002100200B091300</Bytes>
   <ASCII> . Y . . . . . . . . . . . !</ASCII>
   <Pos>0</Pos>
With the help of the documentation we can begin to interpret these values.

After the starting 2 byte magic AA59 the media type is stored as a 2 byte
integer. For most floppy examples this value is hexadecimal F000, but I
also found examples with the values F900 and FE00. The current TrID
definition assumes that this value is always 0xF0. So samples with unusual
media descriptors like H1-NK.DSK and 06200D19.DSK are not recognized by the
current TrID definition.
At offset 4 the sector size is stored as a 2 byte integer in little endian.
So the sequence 0002 found in the inspected examples means sector size 512
(0x0200 when read as little endian).
At offset 6 the cluster mask is stored as a 1 byte integer. That is the
number of sectors per cluster minus 1. The sequence 00 found here means 1
sector per cluster.
At offset 7 the cluster shift is stored as a 1 byte integer. That is
log2(cluster size / sector size). The sequence 00 found here means cluster
size / sector size = 1.
At offset 8 the number of reserved sectors is stored as a 2 byte integer in
little endian. The sequence 0100 found here means 1 reserved sector (0x0001
when read as little endian).
At offset 10 the number of FAT copies is stored as a 1 byte integer. The
sequence 02 found here means 2 FAT copies (that is the standard).
At offset 11 the number of root directory entries is stored as a 2 byte
integer in little endian. The sequence E000 found here means 224 entries
(0x00E0 when read as little endian).

At offset 13 the sector number of the first cluster is stored as a 2 byte
integer in little endian. The sequence 2100 found here means the first
cluster is at sector 33 (0x0021 when read as little endian).

At offset 17 the number of sectors per FAT is stored as a 1 byte integer.
The sequence 09 found here means a FAT with 9 sectors.

At offset 18 the sector number of the root directory is stored as a 2 byte
integer in little endian. The sequence 1300 found here means the root
directory starts at sector 19 (0x0013 when read as little endian).
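
The header walk-through above can be checked with a few lines of Python.
This is only a minimal sketch: the field names are my own hypothetical
labels, the layout is just the offsets listed above, and the 2 bytes at
offset 15 are not described in this post, so they are kept as an opaque
value.

```python
import struct

# Layout of the first 20 header bytes, following the offsets listed above.
# "<" = little endian without padding, H = 2 byte unsigned, B = 1 byte unsigned.
# All field names are hypothetical labels chosen for this sketch.
HEADER_FORMAT = "<2sHHBBHBHHHBH"
FIELD_NAMES = (
    "magic",                 # offset 0
    "media_type",            # offset 2
    "sector_size",           # offset 4
    "cluster_mask",          # offset 6
    "cluster_shift",         # offset 7
    "reserved_sectors",      # offset 8
    "fat_copies",            # offset 10
    "root_entries",          # offset 11
    "first_cluster_sector",  # offset 13
    "field_at_15",           # offset 15, not described in this post
    "sectors_per_fat",       # offset 17
    "root_dir_sector",       # offset 18
)


def parse_header(data: bytes) -> dict:
    """Decode the first 20 bytes of an uncompressed 'new' DSK header."""
    size = struct.calcsize(HEADER_FORMAT)
    if len(data) < size:
        raise ValueError(f"need at least {size} bytes, got {len(data)}")
    return dict(zip(FIELD_NAMES, struct.unpack_from(HEADER_FORMAT, data, 0)))


# The byte sequence from the current dsk-skf.trid.xml pattern.
sample = bytes.fromhex("AA59F00000020000010002E0002100200B091300")
header = parse_header(sample)
for name, value in header.items():
    print(f"{name:22} {value}")
```

For the sample this gives sector size 512, 224 root entries, first cluster
at sector 33, 9 sectors per FAT and root directory at sector 19, matching
the values worked out by hand above.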

After running tridscan with the examples mentioned in the sample section
that have a media descriptor other than F0, this XML begins to shrink.
Afterwards I looked at the new XML and the observed values. Then I created
normal FAT images with other unusual values, started a DOS-like system in
the VirtualBox emulator and ran SAVEDSKF there to create a DSK image from
that virtual floppy. Then I repeated these steps.

Finally I ended up with two XML constructs like:
   <Pattern>
      <Bytes>AA59</Bytes>
      <ASCII> . Y</ASCII>
      <Pos>0</Pos>
   </Pattern>
   <Pattern>
      <Bytes>000002</Bytes>
      <Pos>3</Pos>
   </Pattern>


The first describes the 2 byte magic.

After the magic bytes the media type is stored as a 2 byte integer. Its
value is the first byte of the FAT, and it is also stored as a single byte
inside the boot sector. That means the upper byte of the media type is never
used, so it seems to be nil in all examples. This is covered by the first 00
byte of the second pattern. The following sequence 0002 means sector size
512. So I created virtual floppies with other sector sizes. But when I tried
to create DSK images with the SAVEDSKF tool I always got garbage. Maybe this
tool is too old to handle exotic sector sizes. Assuming that other imaging
software exists that can handle them, the XML constructs would become:

   <Pattern>
      <Bytes>AA59</Bytes>
      <ASCII> . Y</ASCII>
      <Pos>0</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>3</Pos>
   </Pattern>

That means only 3 bytes are used for identification and only 2 of them are
non-zero. The file command documentation recommends using at least 4 bytes
for identification; otherwise the identification is not unique enough and
you get too many false hits that annoy the users. For the two other variants
the same consideration applies, only with other starting magic bytes.
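
To make the trade-off concrete, here is a minimal sketch of positional
pattern matching. It is a simplified model and not how TrID itself is
implemented or how it scores hits; the function name and the F9 test header
are made up for illustration.

```python
def matches(data: bytes, patterns) -> bool:
    """Return True if every (position, bytes) pair is found at its position."""
    return all(data[pos:pos + len(raw)] == raw for pos, raw in patterns)


# Original single pattern from dsk-skf.trid.xml.
FULL = [(0, bytes.fromhex("AA59F00000020000010002E0002100200B091300"))]
# Shrunk patterns: 2 byte magic plus the unused upper byte of the media type.
SHORT = [(0, bytes.fromhex("AA59")), (3, bytes.fromhex("00"))]

standard = bytes.fromhex("AA59F00000020000010002E0002100200B091300")
# Synthetic header for illustration only: same bytes, media descriptor F9.
unusual = bytes.fromhex("AA59F900") + standard[4:]

print(matches(standard, FULL), matches(unusual, FULL))
print(matches(standard, SHORT), matches(unusual, SHORT))

# Chance that 3 fixed bytes match in uniformly random data (an idealization).
print(1 / 256 ** 3)
```

The full pattern rejects the F9 header while the short one accepts both.
With only 3 fixed bytes roughly 1 in 16.7 million random inputs would match,
and real files are far from random, which is why longer patterns are
preferred.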


But from looking at the DSK specification I know that there must be other
unique sequences.

So I ran tridscan on the different DSK examples and created 3 replacements
for the DSK definitions.

In dsk-skf-comp.trid.xml the second XML construct looked like:

   <Pattern>
      <Bytes>0000020000010002E000</Bytes>
      <Pos>3</Pos>
   </Pattern>

Because I know it must be the same as in the two other variants, I can
shrink this to something like:

   <Pattern>
      <Bytes>000002</Bytes>
      <Pos>3</Pos>
   </Pattern>

As explained before, assuming that sector sizes other than 512 (=0x0200) may
also exist, this now becomes in all three definitions:

   <Pattern>
      <Bytes>00</Bytes>
      <Pos>3</Pos>
   </Pattern>

In dsk-skf-old.trid.xml there were XML constructs like:


   <Pattern>
      <Bytes>00</Bytes>
      <Pos>14</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>19</Pos>
   </Pattern>

Because I know it must be the same as in the two other variants, and these
constructs do not exist there, I can delete them from the "old" definition.

The next pattern looks like:
   <Pattern>
      <Bytes>50000200</Bytes>
      <ASCII> P</ASCII>
      <Pos>24</Pos>
   </Pattern>

Because I know it must be the same as in the two other variants, and there
this construct is shorter, I replace it with the shrunk variant. So this now
becomes:

   <Pattern>
      <Bytes>00</Bytes>
      <Pos>25</Pos>
   </Pattern>

At offset 24 the number of cylinders is stored as a 2 byte integer in little
endian. I found values like 40 and 80, which are the typical standard values
for PC floppies. To gain more capacity you may reduce the gaps between the
magnetic parts, but this works reliably only when raising the cylinder count
by a few cylinders. Since the mid-nineties there have been super floppy disks
with a capacity of several hundred MB and a correspondingly large number of
cylinders, in the thousands. The development of DOS and OS/2, and also of
imaging software like SAVEDSKF, stopped before that time. There is newer
software like 7z or deark, but only read support is implemented there. I have
not seen new imaging software implementing the IBM DSK format. The reason is
that floppies are outdated and have been replaced by USB sticks. When
handling floppy images now, the better way is to use a standard floppy image
and compress it with standard compression tools like zip (IMZ format) or, if
you want higher compression, with newer tools like bzip2 or better. So in
short the cylinder value is "low", and the upper byte of that value is nil in
all real world examples, as expressed by the above XML construct.


At offset 26 the number of heads is stored as a 2 byte integer in little
endian. I found values like 2 and 1 here, which are the typical values for
standard PC floppies. I do not know if exotic hardware with more heads
exists, but the theoretical upper limit is 255, because the head value is
stored in the BIOS parameter block as a single byte value. So in short the
head value is "low" and the upper byte of that value is always nil. That is
expressed by an XML construct like:

   <Pattern>
      <Bytes>00</Bytes>
      <Pos>27</Pos>
   </Pattern>

Then in all definitions the next construct looks like:

   <Pattern>
      <Bytes>0000000000</Bytes>
      <Pos>29</Pos>
   </Pattern>

At offset 28 the number of sectors per track is stored as a 2 byte integer
in little endian. I found low values like 8, 15, 18 and 36 here, so the upper
byte was nil. I did not check whether higher values are possible, so I am
generous here and assume that they are. Therefore I remove the byte at
position 29 from the pattern.

At offset 30, 4 bytes are unused and therefore seem to be always nil. So
the above construct now becomes:

   <Pattern>
      <Bytes>00000000</Bytes>
      <Pos>30</Pos>
   </Pattern>
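
The geometry fields can be read the same way. The following is a rough
sketch under my reading of the layout: cylinders at offset 24, heads at
offset 26, sectors per track in the 2 bytes after that (offset 28), then 4
unused bytes. The function name and the test header are made up, and the
capacity calculation assumes 512 byte sectors.

```python
import struct


def parse_geometry(header: bytes) -> dict:
    """Read cylinders, heads and sectors per track from offsets 24, 26, 28."""
    cylinders, heads, sectors_per_track = struct.unpack_from("<HHH", header, 24)
    return {
        "cylinders": cylinders,
        "heads": heads,
        "sectors_per_track": sectors_per_track,
        # True when the bytes kept in the shrunk patterns are all nil.
        "pattern_bytes_nil": header[25] == 0 and header[27] == 0
        and header[30:34] == b"\x00" * 4,
    }


# Synthetic header for illustration: only offsets 24..33 are filled in.
# 50 00 = 80 cylinders, 02 00 = 2 heads, 12 00 = 18 sectors per track.
header = bytes(24) + bytes.fromhex("500002001200") + bytes(4)
geometry = parse_geometry(header)
print(geometry)

# Capacity assuming 512 byte sectors (the sector size is stored at offset 4).
capacity = (geometry["cylinders"] * geometry["heads"]
            * geometry["sectors_per_track"] * 512)
print(capacity)
```

With 80 cylinders, 2 heads and 18 sectors per track this gives 1474560
bytes, the usual 1.44 MB floppy.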

After the header an ASCII comment may follow, which seems to be
terminated by \r\n\0. In all my inspected examples with a comment, the
comment starts directly after the header, that is at offset 40
(hexadecimal 28). In theory the comment could appear later, which is why
the offset of the comment string itself is stored at position 36 as a 2
byte value. Following Mr. Spock logic you would expect this value to be
something like zero or negative for examples without a comment.
Unfortunately the value is 40 there as well.
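
A small sketch of how a reader could pull out the comment under these
observations. The function names and the test images are made up for
illustration, and treating the 2 bytes at position 38 as the offset of the
first sector is an assumption based on the sample bytes, not something
verified against the specification here.

```python
import struct


def read_comment(image: bytes) -> str:
    """Return the header comment, or an empty string if there is none."""
    # Offset 36: comment offset, as described above.
    # Offset 38: offset of the first sector. This is an assumption based on
    # the bytes 0002 (=0x200) that follow the comment offset in the samples.
    comment_offset, first_sector_offset = struct.unpack_from("<HH", image, 36)
    raw = image[comment_offset:first_sector_offset]
    end = raw.find(b"\x00")      # the comment seems to end with \r\n\0
    if end != -1:
        raw = raw[:end]
    return raw.decode("ascii", errors="replace").rstrip("\r\n")


def make_image(comment: bytes) -> bytes:
    """Build a synthetic 512 byte header area, for illustration only."""
    area = bytearray(512)
    struct.pack_into("<HH", area, 36, 40, 512)
    area[40:40 + len(comment)] = comment
    return bytes(area)


print(repr(read_comment(make_image(b"Test disk\r\n\x00"))))
print(repr(read_comment(make_image(b""))))
```

Both test images carry the value 40 at position 36, so the offset alone
cannot tell whether a comment is present; only the bytes found at that
offset can.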

Unfortunately there are samples like 06200B13.DSK, 06200D19.DSK and
06200D9.DSK (this seems to concern the "old" variant) where the first
sector is far away (1st sector at 0x200) but the possible comment part is
mainly filled with nil bytes. So this was expressed by a construct like:

   <Pattern>
      <Bytes>28000002000000000000000000000000000000000
      <ASCII> (</ASCII>
      <Pos>36</Pos>
   </Pattern>

So if you find "old" examples with comments, this construct shrinks like in
the others and now becomes:

   <Pattern>
      <Bytes>2800</Bytes>
      <ASCII> (</ASCII>
      <Pos>36</Pos>
   </Pattern>

In all my inspected examples the comment offset points directly after the
header. This is also true for examples without a comment. So at the moment I
just keep this construct and mention the observation in a remark line.
If somebody finds examples where the comment is not stored directly after
the header, then this construct will vanish.

In my inspected "old" examples the first sector is stored at offset 512
(=0x200), and all examples were made by and belong to IBM OS/2. That is
visible from the OEM-ID "IBM 20.0" or "IBM 10.2". This was expressed by an
XML construct like:

   <Pattern>
      <Bytes>9049424D20</Bytes>
      <ASCII> . I B M</ASCII>
      <Pos>514</Pos>
   </Pattern>

When the first sector does not begin at offset 512, or the floppy was not
made by IBM tools, this is no longer true. So I delete this construct.

All other constructs at higher positions were also generated by lucky
circumstances. For example, my inspected "old" examples had no volume label.
That was expressed by an XML construct like:


   <Pattern>
      <Bytes>4E4F204E414D45202020204641542020202020</Bytes>
      <ASCII> N O   N A M E         F A T</ASCII>
      <Pos>555</Pos>
   </Pattern>

And this was also expressed inside the global strings section by lines like:

   <String>NO NAME    FAT</String>

For examples with a label this is no longer true. So I delete such
parts.
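
To see why these two patterns depend on the sample set, it helps to decode
them. A short sketch: the hex strings are copied from the patterns above,
and the positions are taken relative to a first sector at offset 512. In a
FAT boot sector the OEM name starts at offset 3 and the volume label (when
the extended boot record is present) at offset 43.

```python
# Hex strings copied from the two patterns above.
oem = bytes.fromhex("9049424D20")
label = bytes.fromhex("4E4F204E414D45202020204641542020202020")

# Positions relative to a first sector that starts at offset 512.
print(514 - 512, 555 - 512)

print(repr(oem[1:].decode("ascii")))     # text part after the 0x90 byte
print(repr(label[:11].decode("ascii")))  # 11 byte volume label field
print(repr(label[11:].decode("ascii")))  # start of the file system type
```

So the first pattern ties the definition to an OEM name beginning with
"IBM ", and the second to the default label "NO NAME". Both are properties
of the disk content, not of the DSK container.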

The DSK images contain binary data, so most tools classify them generically
as MIME type application/octet-stream. But the newest file command uses its
own user-defined type (see appended output/file-i.txt). This is now expressed
in the replacements by a line like:

   <Mime>application/x-ibm-dsk</Mime>

With the 3 replacement TrID definitions all DSK examples are now
described ( see appended output/trid--v-new-v). The TrID definitions, some
examples and the output are stored in the archive dsk_file.zip. I hope that
my 3 XML files can be used in a future version of triddefs.

With best wishes
Jörg Jenderek


Mark0

Re: 3 replacement of dsk-skf*.trid.xml for IBM SKF disk image
« Reply #1 on: July 14, 2022, 02:23:13 PM »
Thanks!