Author Topic: 2 variants mat-l4-*.trid.xml for Matlab MAT-File *.mat

jenderek

2 variants mat-l4-*.trid.xml for Matlab MAT-File *.mat
« on: July 02, 2021, 05:53:35 PM »
Hello trid users,

A few days ago I inspected some Matlab examples with the file name extension
mat.

Most examples, like one_by_zero_char.mat, are described correctly by
mat-l5.trid.xml as "Matlab Level 5 MAT-File". But a few examples, like
testcomplex_4.2c_SOL2.mat, are only described as "Unknown!" (see appended
output/trid-v-old.txt).

For comparison I also ran the file utility (version 5.40). It describes the
recognized examples as "Matlab v5 mat-file" with the variants "(big endian)"
and "(little endian)" and the other examples as "data" (see appended
output/file-5.40.txt).

For the recognized examples, the page about MAT on the File Formats Archive
Team website is given as the related URL. I also used it in the new variant
definitions, with lines like:
   <RefURL>
   http://fileformats.archiveteam.org/wiki/MAT
   </RefURL>

That page mentions the MAT-File Format documentation matfile_format.pdf.
Besides the Level 5 MAT-File format, it also explains the Level 4 MAT-File
format. So I see that the unrecognized MAT samples are just older Level 4
examples. Following that document, I inspected these examples with a patched
file command. Most of them are big endian variants (see appended
output/file.tmp).

So I ran tridscan on these examples to generate mat-l4-be.trid.xml. Following
mat-l5.trid.xml, I chose a similar description. That is expressed by a line
like:

   <FileType>Matlab Level 4 MAT-File (big-endian)</FileType>

The same MIME type as for Level 5 is used. That is expressed by a line like:

   <Mime>application/x-matlab-data</Mime>

According to the specification, such MAT files start with a 20-byte header of
5 long integers that describe certain attributes of the matrix.
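
As a rough illustration of this layout, here is a minimal Python sketch (not
part of mat-l4.zip). The function name, the dictionary keys and the sample
bytes are my own, and the sample assumes that the stored name length counts
the terminating null byte.

```python
import struct

def parse_l4_header(data, byteorder=">"):
    """Unpack the five 4-byte integers and the matrix name of a Level 4 header."""
    if len(data) < 20:
        raise ValueError("a Level 4 header needs at least 20 bytes")
    type_flag, rows, cols, imag, name_len = struct.unpack(byteorder + "5i", data[:20])
    # The name starts at offset 20; cut it at the first null byte.
    raw_name = data[20:20 + name_len].split(b"\x00", 1)[0]
    return {
        "type": type_flag, "rows": rows, "cols": cols, "imag": imag,
        "name_len": name_len, "name": raw_name.decode("ascii", errors="replace"),
    }

# Synthetic big endian header (hypothetical values, not one of the attached samples)
sample = struct.pack(">5i", 1000, 3, 1, 1, 12) + b"testcomplex\x00"
info = parse_l4_header(sample)
print(info["type"], info["rows"], info["cols"], info["imag"], info["name"])
```

Passing "<" as byteorder reads the same five fields from a little endian
sample.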

At offset 0 the type flag is stored. The biggest possible value is 4052
(=0xFD4), which means the 2 upper bytes are always 0. In decimal the type
integer is represented as MOPT, where M counts the thousands and indicates
the numeric format of numbers on the machine. For big endian (that means
Macintosh, SPARC, Apollo, SGI, HP 9000/300 and other Motorola systems) the M
value is 1. So the lowest flag value is 1000 (=3E8 hexadecimal) and the
highest is 1052 (=41C hexadecimal). The highest hexadecimal value with 3 as
second-lowest byte is 3FF (=1023 decimal). That range covers floating point
numbers (P=0 for double-precision 64-bit or P=1 for single-precision 32-bit)
and 32-bit integers (P=2). So the second-lowest byte, which is the third byte
in the file, is 3 or 4. The value 4 only occurs for 16-bit signed integers
(P=3), 16-bit unsigned integers (P=4) and 8-bit unsigned integers (P=5). The
case with value 3 is expressed by an XML construct like:

   <Pattern>
      <Bytes>000003</Bytes>
      <Pos>0</Pos>
   </Pattern>
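
The arithmetic behind the 3 or 4 in that byte can be checked with a small
Python sketch. The helper names are my own, and T is assumed to run from 0 to
2 for the three matrix kinds (numeric, text, sparse).

```python
def split_mopt(type_flag):
    """Split the decimal type flag into its digits M, O, P and T."""
    m, rest = divmod(type_flag, 1000)
    o, rest = divmod(rest, 100)
    p, t = divmod(rest, 10)
    return m, o, p, t

def third_byte(type_flag):
    """Third byte of the type field when stored as a 4-byte big endian integer."""
    return type_flag.to_bytes(4, "big")[2]

# Big endian means M=1; P runs over the six number formats, T over 0..2.
for p in range(6):
    flags = [1000 + 10 * p + t for t in range(3)]
    print(p, flags, [hex(third_byte(f)) for f in flags])
```

For P=0, 1 and 2 the third byte comes out as 0x3, for P=3, 4 and 5 as 0x4.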

At offset 4 the number of rows in the matrix is stored as a 4-byte integer.
So in theory the field can hold values up to about 4 billion, but in my
examples only "low" values like 1, 3 and 8 occur. For samples with fewer than
256 rows the 3 upper bytes are nil. That is expressed by an XML construct
like:

   <Pattern>
      <Bytes>000000</Bytes>
      <Pos>4</Pos>
   </Pattern>

At offset 8 the number of columns in the matrix is stored as a 4-byte
integer. So in theory the field can hold values up to about 4 billion, but in
my examples only "low" values like 1, 3, 4, 5 or 43 occur. For samples with
fewer than 256 columns the 3 upper bytes are nil. That is expressed by an XML
construct like:

   <Pattern>
      <Bytes>000000</Bytes>
      <Pos>8</Pos>
   </Pattern>

At offset 12 the imaginary flag is stored as a 4-byte integer. If it is 1,
the matrix has an imaginary part. If it is 0, there is only real data. That
also means that the 3 upper bytes are always nil. That is expressed by an XML
construct like:

   <Pattern>
      <Bytes>000000</Bytes>
      <Pos>12</Pos>
   </Pattern>

At offset 20 the null-terminated matrix name is stored as an ASCII string
like testcomplex, and at offset 16 the length of this string is stored as a
4-byte integer. So in theory the field can hold values up to about 4 billion,
but in my examples only "low" values like 2, 10, 11, 12, 16 and 18 occur. For
samples with a matrix name length lower than 256 the 3 upper bytes are nil.
That is expressed by an XML construct like:

   <Pattern>
      <Bytes>000000</Bytes>
      <Pos>16</Pos>
   </Pattern>
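
To see what these five patterns accept and reject, here is a minimal Python
sketch. It only does the plain byte comparisons and is not how TrID itself
scores a file; the function name and the synthetic headers are my own.

```python
import struct

# (position, bytes) pairs copied from the XML constructs above
BE_PATTERNS = [
    (0, bytes.fromhex("000003")),
    (4, bytes.fromhex("000000")),
    (8, bytes.fromhex("000000")),
    (12, bytes.fromhex("000000")),
    (16, bytes.fromhex("000000")),
]

def all_patterns_match(data, patterns):
    """True if every byte sequence is found at its position in data."""
    return all(data[pos:pos + len(seq)] == seq for pos, seq in patterns)

# Hypothetical headers: a double matrix (type 1000) and a 16-bit one (type 1030)
double_matrix = struct.pack(">5i", 1000, 3, 1, 1, 12) + b"testcomplex\x00"
int16_matrix = struct.pack(">5i", 1030, 3, 1, 0, 2) + b"a\x00"
print(all_patterns_match(double_matrix, BE_PATTERNS))
print(all_patterns_match(int16_matrix, BE_PATTERNS))
```

The first header passes. The second one fails at offset 0 because type 1030
is stored as 00 00 04 06, and a matrix with 256 or more rows or columns would
fail in the same way at offset 4 or 8.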

Unfortunately I only found 2 little endian examples. So I ran tridscan on
these examples to generate mat-l4-le.trid.xml.

The same considerations as for big endian also apply here, but the byte order
is reversed. So the matrix name length is expressed by an XML construct like:

   <Pattern>
      <Bytes>000000</Bytes>
      <Pos>17</Pos>
   </Pattern>

Unfortunately all my inspected examples have type zero, so this pattern is
not very unique. According to the documentation, for little endian (PC, 386,
486, DEC RISC) machines the M value is 0. That means the highest type value
is 52 (=34 hexadecimal), so the 3 upper bytes are always nil. In little
endian that is expressed by an XML construct like:

   <Pattern>
      <Bytes>000000</Bytes>
      <Pos>1</Pos>
   </Pattern>
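
How the positions shift between the two byte orders can be sketched in
Python. The helper and the field labels are my own, and the positions for
rows, columns and imaginary flag are simply derived by the same rule.

```python
def high_bytes_pos(field_offset, little_endian):
    """Position of the three high-order bytes of a 4-byte integer field."""
    return field_offset + 1 if little_endian else field_offset

FIELD_OFFSETS = {"type": 0, "rows": 4, "columns": 8, "imaginary": 12, "name length": 16}

for name, offset in FIELD_OFFSETS.items():
    print(name, "BE Pos", high_bytes_pos(offset, False),
          "LE Pos", high_bytes_pos(offset, True))

# Highest little endian type value 52, stored low byte first
print((52).to_bytes(4, "little").hex())
```

In little endian the low byte comes first, so the three nil bytes of each
field sit one position later than in big endian.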

With the second definition my two little endian examples are now described as
"Matlab Level 4 MAT-File (little-endian)", but only with a low recognition
rate in the single-digit percent range (see appended
LE/output/trid-v-new.txt). The leading description is "Adobe PhotoShop Brush"
by abr.trid.xml or "V-Ray Material (binary)" by vismat.trid.xml.

The only way to improve recognition would be to generate up to 18 little
endian variants (6 number formats, from 8-bit unsigned integers to 64-bit
double-precision floating-point numbers, each with 3 matrix kinds: numeric,
text or sparse), but the problem of the unspecific patterns still remains.
Many ISO 9660 CD images also look like a Matlab Level 4 MAT-File because such
samples also contain many nil bytes at the right positions.
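
Both points can be illustrated with a short Python sketch. The names are my
own, and the pattern positions 5, 9 and 13 are assumed by mirroring the big
endian definition (only 1 and 17 are quoted in this post).

```python
# Positions of the three nil bytes per field in the little endian definition
# (5, 9 and 13 are assumptions, see above)
LE_ZERO_POSITIONS = [1, 5, 9, 13, 17]

def only_zero_patterns_match(data):
    """True if every little endian pattern finds its three nil bytes."""
    return all(data[pos:pos + 3] == b"\x00\x00\x00" for pos in LE_ZERO_POSITIONS)

# The 18 possible little endian type values (M=0): 6 number formats x 3 kinds
le_types = [10 * p + t for p in range(6) for t in range(3)]
print(len(le_types), le_types)

# A block of nil bytes passes every check, although it is no MAT file at all
zeros = bytes(20)
print(only_zero_patterns_match(zeros))
```

Separate variants could pin down the first byte to one of these 18 values,
but the other four fields would still be matched by nil bytes only.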

The big endian variants are now recognized with 100 % (see appended
output/trid-v-new.txt).

The TrID definitions, some examples and the output are stored in the archive
mat-l4.zip. I hope that my 2 XML files can be used in a future version of
triddefs.

According to the specification, the header looks different for VAX and Cray
machines. So for such machine types other TrID definition variants may have
to be created.

With best wishes
Jörg Jenderek


Mark0

Re: 2 variants mat-l4-*.trid.xml for Matlab MAT-File *.mat
« Reply #1 on: July 03, 2021, 09:50:09 PM »
Uhm... the LE def, with just 0s patterns, is probably very weak. 
I'll try and experiment a bit.

Thanks, as usual!