Author Topic: 3 cdf-v.trid.xml for Common Data Format .CDF (Read 5233 times)

jenderek · « **on:** March 04, 2022, 02:00:28 AM »

Hello TrID users,

some days ago I worked with some NetCDF data samples. According to some
documentation, the CDF file name extension was also used besides NC. So I
searched my systems for examples with that extension.

Some examples are described as "Unknown!" by TrID (see appended
v2.5/output/trid-old.txt, v2.6/output/trid-old.txt and
v3/output/trid-old.txt).

When running the file command (version 5.41) on such examples, they are
described as "Common Data Format". Some examples are additionally
described as "(Version 2.5 or earlier)" (see appended
v2.5/output/file-5.41.txt), some as "(Version 2.6 or 2.7)" (see
appended v2.6/output/file-5.41.txt) and some as "(Version 3 or later)"
(see appended v3/output/file-5.41.txt).

The file command reports a user-defined MIME type (see appended
file-i-5.41.txt). This is now expressed by a line like:
   <Mime>application/x-cdf</Mime>

With this information I was able to find a page about the Common Data
Format on the File Formats Archive Team web site. That is expressed by
a line like:
   <RefURL>http://fileformats.archiveteam.org/wiki/Common_Data_Format</RefURL>

That page also links to the specification of the CDF internal format, a
PDF document named cdf35ifd.pdf. With its help I was able to refine the
definitions generated by tridscan. According to that documentation, the
formats of version 2.6 and of the earlier version 2.5 are nearly the
same. The differences are the starting magic and the stored version
information. In version 3 the file structure is similar to the earlier
versions. The only difference is that the fields for record sizes and
offsets are 8 bytes wide instead of 4 bytes. So a single common
definition for all versions is not possible. However, a refinement
step made in one definition can also be applied to the definitions for
the other versions.
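
The effect of the wider fields on every later offset can be sketched in a few lines of Python. This is only an illustration: the field list follows the CDR fields walked through in this post, and the labels (for example "gdr_offset" and "compression") are names chosen for the sketch, not names from the specification.

```python
# Field widths in bytes for the start of a CDF file, as described in this post.
# "wide" marks the size/offset fields that are 4 bytes in v2 and 8 bytes in v3.
FIELDS = [
    ("magic", 4), ("compression", 4), ("cdr_size", "wide"), ("record_type", 4),
    ("gdr_offset", "wide"), ("version", 4), ("release", 4), ("encoding", 4),
    ("flags", 4), ("rfuA", 4), ("rfuB", 4), ("increment", 4),
    ("rfuD", 4), ("rfuE", 4), ("copyright", 0),
]

def offsets(v3: bool) -> dict:
    """Return the byte offset of each field for the v2 or the v3 layout."""
    wide = 8 if v3 else 4
    result, pos = {}, 0
    for name, size in FIELDS:
        result[name] = pos
        pos += wide if size == "wide" else size
    return result

v2_layout, v3_layout = offsets(False), offsets(True)
for name, _ in FIELDS:
    print(f"{name:12} {v2_layout[name]:3} {v3_layout[name]:3}")
```

The two columns printed are the version 2 and version 3 offsets of each field, which is why a single <Pos> value cannot serve both layouts.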

So I ran tridscan on the samples in the v2.5 directory to generate
netcdf-v2.5.trid.xml. Afterwards I checked and refined this
definition. The first XML construct looked like:
   <Bytes>0000FFFF0000FFFF0000</Bytes>
   <Pos>0</Pos>

The first 4 bytes are the magic pattern. The next 4 bytes determine
whether the file is a regular file (big endian hexadecimal 0000FFFF) or
a compressed one (big endian hexadecimal CCCC0001). Unfortunately my
few inspected samples are all regular. At offset 8 the CDR size is
stored as a 4-byte integer. In my version 2.6 examples this always has
the value 130h. Among my version 2.5 examples, one example
(ge_k0_cpi_19921231_v02.cdf) has the value 7C9h. In my version 3
examples the observed value was 130h, stored as an 8-byte integer.
Assuming that other CDR sizes and compressed examples also exist, the
first XML construct now shrinks to:
   <Bytes>0000FFFF</Bytes>
   <Pos>0</Pos>
For version 2.6 this becomes:
   <Bytes>CDF26002</Bytes>
   <Pos>0</Pos>
For version 3 this becomes:
   <Bytes>CDF30001</Bytes>
   <Pos>0</Pos>
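
As a rough illustration of what these three magic patterns distinguish, here is a minimal Python sketch. The byte values are the ones quoted above; the function and dictionary names are made up for the example and are not part of TrID or of the file command.

```python
# Magic values quoted in this post, mapped to the descriptions used by file 5.41.
MAGICS = {
    bytes.fromhex("0000FFFF"): "Version 2.5 or earlier",
    bytes.fromhex("CDF26002"): "Version 2.6 or 2.7",
    bytes.fromhex("CDF30001"): "Version 3 or later",
}

def cdf_family(header: bytes):
    """Return the version family for the first 4 bytes, or None if unknown."""
    return MAGICS.get(header[:4])

def is_compressed(header: bytes) -> bool:
    """True if bytes 4..7 hold the compressed marker CCCC0001."""
    return header[4:8] == bytes.fromhex("CCCC0001")

sample = bytes.fromhex("CDF300010000FFFF")
print(cdf_family(sample), is_compressed(sample))
```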

At offset 12 (16 for version 3) the type of the record is stored. At
the beginning of the file the record type is always CDR (CDF Descriptor
Record, value 1).
At offset 16 (20, as an 8-byte integer, for version 3) the offset of the
GDR is stored. Often this is 138h in earlier version 2 examples and 140h
in version 2.6, but I also found 7D1h (ge_k0_cpi_19921231_v02.cdf). In
version 3 examples this value was 140h. These 2 facts were expressed
by an XML construct like:
   <Bytes>000000010000</Bytes>
   <Pos>12</Pos>
Assuming that other GDR offsets may also appear, this now becomes:
   <Bytes>00000001</Bytes>
   <Pos>12</Pos>
and in cdf-v3.trid.xml this becomes:
   <Bytes>00000001</Bytes>
   <Pos>16</Pos>
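
The difference between the two positions can be reproduced with Python's struct module. This is a sketch under the layout described above; the two headers are synthetic byte strings built from the values reported in this post, not real sample files, and the function name is hypothetical.

```python
import struct

def read_cdr_start(data: bytes, v3: bool):
    """Return (cdr_size, record_type, gdr_offset), read from offset 8.

    Version 2 stores three 4-byte big endian integers; version 3 widens
    the CDR size and the GDR offset to 8 bytes.
    """
    fmt = ">qiq" if v3 else ">iii"
    return struct.unpack_from(fmt, data, 8)

# Synthetic headers (magic + regular-file marker + the three fields).
hdr26 = bytes.fromhex("CDF260020000FFFF") + struct.pack(">iii", 0x130, 1, 0x140)
hdr3 = bytes.fromhex("CDF300010000FFFF") + struct.pack(">qiq", 0x130, 1, 0x140)

print(read_cdr_start(hdr26, v3=False))
print(read_cdr_start(hdr3, v3=True))
# The record type 00000001 sits at offset 12 in v2 and at offset 16 in v3.
print(hdr26[12:16].hex(), hdr3[16:20].hex())
```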

Besides the starting magic, the version is stored explicitly in the
file. At offset 20 the main version number (of course 2) is stored,
followed by the release number (a low value like 4, 5, 6 or 7). The
increment number is stored at offset 44. For version 3 these offsets
are 28, 32 and 52. This was expressed by an XML construct like:
   <Bytes>00000002000000</Bytes>
   <Pos>20</Pos>

Assuming that higher release numbers also occur, only the main
version number remains. That is now expressed by an XML construct like:
   <Bytes>00000002</Bytes>
   <Pos>20</Pos>
and in cdf-v3.trid.xml this becomes:
   <Bytes>00000003</Bytes>
   <Pos>28</Pos>
An interesting example is a1_k0_mpa_20050804_v02.cdf: it has the magic
of version 2.6-2.7, but the stored version is 2.4.7 (see appended
v2.6/output/file.tmp).
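
A minimal sketch of reading the stored version, assuming the offsets given above (20, 24 and 44 for version 2; 28, 32 and 52 for version 3). The header below is synthetic and only imitates the combination seen in a1_k0_mpa_20050804_v02.cdf; the function name is made up for the example.

```python
import struct

def stored_version(data: bytes, v3: bool) -> str:
    """Return "version.release.increment" from the CDR fields."""
    positions = (28, 32, 52) if v3 else (20, 24, 44)
    parts = [struct.unpack_from(">i", data, pos)[0] for pos in positions]
    return ".".join(str(p) for p in parts)

# Synthetic header: magic of version 2.6-2.7, but stored version 2.4.7.
hdr = bytearray(48)
hdr[0:4] = bytes.fromhex("CDF26002")
struct.pack_into(">i", hdr, 20, 2)   # main version
struct.pack_into(">i", hdr, 24, 4)   # release
struct.pack_into(">i", hdr, 44, 7)   # increment
print(stored_version(bytes(hdr), v3=False))
```

This shows why the magic and the stored version need to be treated as separate pieces of evidence: a file can carry one family's magic and another release's version numbers.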

At offset 28 (36 for v3) the encoding for attribute entry and variable
values is stored as a 4-byte big endian integer. In my version 2
examples I only found the value 1, which means network encoding. In
version 3 examples I also found the value 6, which means Intel
representation. The highest allowed value is 9 for Power PC
representation. That means the 3 upper bytes of that value are always
nil.
At offset 32 (40 for v3) the Boolean flags are stored as a 4-byte big
endian integer. The value for "not MD5 checksum" is 16, and the higher
bits are reserved for future use and are always clear. That means the
value is always below 32, so the 3 upper bytes are always nil. That was
expressed in my definition by an XML construct like:
   <Pattern>
      <Bytes>00000001000000</Bytes>
      <Pos>28</Pos>
   </Pattern>
So, allowing for other encodings, this now becomes:
   <Pattern>
      <Bytes>000000</Bytes>
      <Pos>28</Pos>
   </Pattern>
   <Pattern>
      <Bytes>000000</Bytes>
      <Pos>32</Pos>
   </Pattern>
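
The reasoning behind the two 3-byte patterns can be checked quickly in Python. This is only a sketch of the arithmetic, using the value ranges stated above (encodings 1 to 9, flags below 32); the helper name is hypothetical.

```python
import struct

def upper_bytes_are_nil(value: int) -> bool:
    """True if a 4-byte big endian integer starts with three zero bytes."""
    return struct.pack(">I", value)[:3] == b"\x00\x00\x00"

# Every encoding value and every flag value in the stated ranges fits the pattern.
assert all(upper_bytes_are_nil(e) for e in range(1, 10))
assert all(upper_bytes_are_nil(f) for f in range(32))
# Only a value of 256 or more would break the 3-byte pattern.
assert not upper_bytes_are_nil(256)
```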

At offset 36 (44 for v3) rfuA is stored as a 4-byte integer, followed by
rfuB. These are reserved for future use and are set to nil. They are
followed by the increment number, which is low. This was expressed by
an XML construct like:
   <Bytes>0000000000000000000000</Bytes>
   <Pos>36</Pos>
Assuming that higher increments may exist, this now becomes:
   <Bytes>0000000000000000</Bytes>
   <Pos>36</Pos>

At offset 48 (56 for v3) rfuD is stored as a 4-byte integer, followed by
rfuE. These are reserved for future use and are set to -1. They are
followed by the copyright message lines, which are similar in my
inspected examples. That was expressed by an XML construct like:
   <Bytes>FFFFFFFFFFFFFFFF0A4E5353444320436F6D6D6F6E204461746120466F726D6174</Bytes>
   <ASCII> . . . . . . . . . N S S D C C o m m o n D a t a F o r m a t</ASCII>
   <Pos>48</Pos>
Assuming that other copyright messages without line feeds also exist,
this now becomes:
   <Bytes>FFFFFFFFFFFFFFFF</Bytes>
   <Pos>48</Pos>
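
To see the refined version 2.5 patterns working together, the following Python sketch checks them against a header. It is a simplified stand-in for position-based matching, not TrID's actual implementation; the pattern list is copied from the refined constructs in this post, while the header is synthetic and the release number 5 in it is just an example value.

```python
import struct

# (position, hex bytes) pairs from the refined v2.5 constructs.
V25_PATTERNS = [
    (0, "0000FFFF"),
    (12, "00000001"),
    (20, "00000002"),
    (28, "000000"),
    (32, "000000"),
    (36, "0000000000000000"),
    (48, "FFFFFFFFFFFFFFFF"),
]

def matches(data: bytes, patterns) -> bool:
    """True if every pattern's bytes occur at its position in data."""
    for pos, hex_bytes in patterns:
        expected = bytes.fromhex(hex_bytes)
        if data[pos:pos + len(expected)] != expected:
            return False
    return True

# Synthetic v2.5 header: magic, regular marker, CDR size, record type,
# GDR offset, version, release, encoding, flags, rfuA, rfuB, increment,
# rfuD, rfuE.
header = struct.pack(">IIiiiiiiiiiiii",
                     0x0000FFFF, 0x0000FFFF, 0x130, 1, 0x138,
                     2, 5, 1, 0, 0, 0, 0, -1, -1)
print(matches(header, V25_PATTERNS))
```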

So the copyright field starts at offset 56 (64 for version 3). Its
maximum length is 1945 for versions prior to 2.5 and 256 from then on.
Because the examples often contain similar text, I got XML constructs
like:
   <Pattern>
      <Bytes>2843</Bytes>
      <ASCII> ( C</ASCII>
      <Pos>82</Pos>
   </Pattern>
   <Pattern>
      <Bytes>53</Bytes>
      <ASCII> S</ASCII>
      <Pos>276</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00000000000000000000000000000000000000</Bytes>
      <Pos>296</Pos>
   </Pattern>
and in the global strings section I got lines like:
   <String>GODDARD SPACE FLIGHT CENTER</String>
   <String>COMMON DATA FORMAT (CDF)</String>
   <String>(INTERNET -- CDFSUPPORT</String>
   <String>(C) COPYRIGHT 1990-</String>
   <String>MARYLAND 20771 USA</String>
   <String>.GSFC.NASA.GOV)</String>
   <String>GREENBELT</String>
Allowing for other copyright messages, the above items were removed.

After the copyright message I often got short nil byte sequences like:

   <Pattern>
      <Bytes>000000</Bytes>
      <Pos>312</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>1907</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>1909</Pos>
   </Pattern>
These are probably coincidences in my samples. So I deleted these
nil sequences.

In the global strings section there are short lines that are obviously
coincidences, like:
      <String>C COM</String>
      <String>E'''4</String>
      <String>E'''8</String>
      <String>Y IN</String>
So I deleted such lines.

Assuming that versions 2.5 and 2.6 are in principle the same, the
same keywords should occur in both definitions. Where this is not the
case, I deleted or shortened lines in the global strings section like:

      <String>ACE SCIENCE</String>
      <String>ATION</String>
      <String>CENTER</String>
      <String>PI_AFFILIATION</String>
      <String>PI_NAME</String>
      <String>UNITS FOR TIME_PB5</String>
      <String>YEARDAY MSEC</String>
So only a few lines survived, like:
      <String>FORMAT</String>
      <String>LABLAXIS</String>
      <String>PROJECT</String>
      <String>SOURCE_NAME</String>
      <String>VALIDMAX</String>
      <String>VALIDMIN</String>
I am not sure whether these are required or are coincidences, so I
kept these lines. But when comparing with version 3 and applying the
same logic, most of these lines also vanished.
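
The "keep only what occurs in both definitions" logic amounts to a set intersection, sketched below in Python. The fragments are hypothetical: the strings themselves come from this post, but how they are split between the two definitions is made up for the example, and the function name is an assumption.

```python
import re

def strings_in(definition_xml: str) -> set:
    """Collect the text of every <String> element in a definition fragment."""
    return set(re.findall(r"<String>(.*?)</String>", definition_xml))

# Hypothetical fragments standing in for the v2.5 and v2.6 definitions.
v25 = """
   <String>FORMAT</String>
   <String>LABLAXIS</String>
   <String>PI_NAME</String>
"""
v26 = """
   <String>FORMAT</String>
   <String>LABLAXIS</String>
   <String>YEARDAY MSEC</String>
"""

common = strings_in(v25) & strings_in(v26)
print(sorted(common))
```

Repeating the intersection with the strings of a third definition shrinks the set further, which matches the observation that most lines vanish once version 3 is included.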

With the 3 new TrID definitions all of my inspected CDF examples are
now described correctly as "Common Data Format", with 3 subclasses
for the different versions (see appended v2.5/output/trid-v-new.txt,
v2.6/output/trid-v-new.txt and v3/output/trid-v-new.txt).

The TrID definitions, some examples and the output are stored in the
archive cdf_.zip.

With best wishes
Jörg Jenderek

Mark0 · « **Reply #1 on:** July 14, 2022, 01:37:49 AM »

Thanks, and sorry for the delay!
