Hello TrID users,
some days ago I handled some NetCDF data samples with the file name
extension NC.
When I run TrID on such examples, most are described correctly as "NetCDF
Network Common Data Form" by netcdf.trid.xml (see appended
output/trid-v-old.txt).
The newest file command (version >5.41) describes such examples as "NetCDF
Data Format data", and additionally reports the variant "(64-bit offset)"
for examples like tst_small_64bit.nc. Furthermore, some examples like
bad_cdf5_begin.nc are described only as "data" (see appended
output/file.txt).
For comparison I also ran the file format identification utility DROID
(see https://sourceforge.net/projects/droid/). It identifies the 64-bit
variant as "netCDF-3 64-bit" by PUID fmt/283, whereas the other variant is
described as "netCDF-3 Classic" by PUID fmt/282. It also skips the example
bad_cdf5_begin.nc (see appended output/nc-DROID.csv). Like the Wikipedia
page, it mentions 2 MIME types:
application/netcdf
application/x-netcdf
But when checking the MIME types on iana.org, the first is not officially
registered. It is only found on the page "Provisional Standard Media Type
Registry" and is described as being for temporary use or test purposes
only. TrID shows only the first type. The file command shows only the
second one (see appended output/file-i.txt). The FreeDesktop.org shared
MIME database also uses the second one, with the describing text "Network
Common Data Form". That information can be found, for example, on the web
site reposcope.com. Because the first type is not officially registered, I
changed this to the second type. This is now expressed by a line like:
<Mime>application/x-netcdf</Mime>
So I ran tridscan on the 32-bit samples to generate netcdf-v3.trid.xml.
Afterwards I checked and refined this definition. The first XML construct
looks like:
<Bytes>434446010000</Bytes>
<ASCII> C D F</ASCII>
<Pos>0</Pos>
According to the specification, the file starts with the 3-byte string CDF
followed by a version byte. This has value 1 for the classic format and
value 2 for the 64-bit variant. Afterwards comes a 4-byte big-endian
integer. The value 0xFFFFFFFF indicates an indeterminate record count, which
allows streaming data. Other values indicate the length of the record
dimension (=numrecs). By lucky circumstance the inspected examples contain
only low numrecs values and no streaming data. Assuming that higher numrecs
values or streaming data exist, this XML construct now becomes:
<Bytes>43444601</Bytes>
<ASCII> C D F</ASCII>
<Pos>0</Pos>
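As a rough illustration, the header layout described above can be sketched
in Python. This is only a sketch of the 8 header bytes as described in this
post, not of how TrID works internally. The function name
parse_classic_header and the keys of the returned dictionary are made up for
this example.

```python
import struct

STREAMING = 0xFFFFFFFF  # numrecs value meaning "indeterminate record count"


def parse_classic_header(data: bytes):
    """Return version and numrecs from the first 8 bytes, or None if no match."""
    if len(data) < 8 or data[:3] != b"CDF":
        return None
    version = data[3]
    # Assumption following this post: only version byte 1 (classic) and
    # 2 (64-bit offset) are accepted.
    if version not in (1, 2):
        return None
    (numrecs,) = struct.unpack(">I", data[4:8])  # 4-byte big-endian integer
    return {
        "version": version,
        "numrecs": numrecs,
        "streaming": numrecs == STREAMING,
    }
```

For example, parse_classic_header(bytes.fromhex("4344460100000002")) reports
version 1 with numrecs 2, while a header with numrecs bytes FFFFFFFF is
flagged as streaming. Both would be matched by the shortened 4-byte pattern,
but the second would be missed by the 6-byte pattern generated by tridscan.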
The second XML item looks like:
<Bytes>0000000A000000</Bytes>
<Pos>8</Pos>
At offset 8 the dim_list is stored. The value 0000000Ah is the tag for a
list of dimensions (NC_DIMENSION), followed by a 4-byte integer value
containing the number of elements in the following sequence (nelems).
Obviously all of my inspected examples have, by lucky circumstance,
NC_DIMENSION and a low nelems value. According to the specification, 8 zero
bytes at this position mean that the list is not present. To also catch such
examples, the above XML construct must be deleted.
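The two cases at offset 8 can be sketched as follows. This is a minimal
sketch under the layout described above (4-byte tag followed by 4-byte
nelems, both big-endian). The function name read_dim_list_header and the
returned labels are only chosen for illustration.

```python
import struct

NC_DIMENSION = 0x0000000A  # tag for a list of dimensions


def read_dim_list_header(data: bytes):
    """Interpret the 8 bytes at offset 8 of the header."""
    if len(data) < 16:
        raise ValueError("need at least 16 header bytes")
    tag, nelems = struct.unpack(">II", data[8:16])
    if tag == 0 and nelems == 0:
        return ("absent", 0)  # 8 zero bytes: no dimension list
    if tag == NC_DIMENSION:
        return ("dimensions", nelems)
    return ("unexpected", nelems)
```

A pattern 0000000A000000 at offset 8 only covers the "dimensions" case with
nelems below 256, which is why it is too strict for a general definition.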
The third XML item looks like:
<Bytes>000000</Bytes>
<Pos>16</Pos>
At offset 16 the elements of the list start. By lucky circumstance the
values there are "low" in my examples. Assuming that higher values are also
possible, the above construct must be deleted as well.
Then I ran tridscan on the 64-bit samples to generate netcdf-v3-64.trid.xml.
Afterwards I checked and refined this definition. The first construct looks
like:
<Bytes>43444602000000</Bytes>
<ASCII> C D F</ASCII>
<Pos>0</Pos>
According to the specification it is nearly the same as for the 32-bit
variant. The only difference is that the version byte has value 2 and the
OFFSET variables are 64-bit signed integers instead of 32-bit. So just one
XML construct remains, which now looks like:
<Bytes>43444602</Bytes>
<ASCII> C D F</ASCII>
<Pos>0</Pos>
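How the two refined definitions separate the variants can be sketched like
this. The dictionary DEFINITIONS is a hypothetical in-memory form of the
Pos/Bytes pairs from the two XML files, and matching_definitions is a made-up
helper. The real matching logic of TrID may differ.

```python
# Hypothetical in-memory form of the two refined definitions: (Pos, Bytes).
DEFINITIONS = {
    "netcdf-v3.trid.xml": [(0, bytes.fromhex("43444601"))],
    "netcdf-v3-64.trid.xml": [(0, bytes.fromhex("43444602"))],
}


def matching_definitions(data: bytes):
    """Return the names of all definitions whose patterns all match."""
    return [
        name
        for name, patterns in DEFINITIONS.items()
        if all(data[pos:pos + len(pat)] == pat for pos, pat in patterns)
    ]
```

With this sketch a header starting with 43444602 matches only
netcdf-v3-64.trid.xml, and a header starting with 43444601 matches only
netcdf-v3.trid.xml.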
Unfortunately, web sites like Wikipedia give little information about the
64-bit variant, especially about its different magic. So I use another web
site here. That is expressed by a line like:
<RefURL>https://www.loc.gov/preservation/digital/formats/fdd/fdd000330.shtml</RefURL>
There exist exactly 2 variants of the netCDF classic format, because in
newer versions (4.0 and above) the file format is based on HDF5 with a
completely different starting magic (like \211HDF\r\n\032\n).
So in the current netcdf.trid.xml only the common part of these 2 variants
is described, by an XML construct like:
<Bytes>434446</Bytes>
<ASCII> C D F</ASCII>
<Pos>0</Pos>
So you might think that does not hurt. But the file command and the DROID
tool also check for a valid version byte. So the example bad_cdf5_begin.nc
with invalid version byte value 5 is not described as a NetCDF file by these
tools. As the name suggests, this example is an invalid NetCDF file for test
purposes. It can be found in the nc_test subdirectory of the NetCDF-C
package (versions 4.8.0 and 4.8.1). So I think it is better to remove
netcdf.trid.xml and keep just the two new variants to avoid
misidentification of such "bad" examples.
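The difference between the old 3-byte pattern and the two new 4-byte
patterns can be sketched as below. The header used in the usage note is a
made-up 8-byte stand-in with version byte 5, not the real content of
bad_cdf5_begin.nc, and the helper names old_matches and new_matches are only
for illustration.

```python
OLD_PATTERN = bytes.fromhex("434446")  # "CDF", as in the current netcdf.trid.xml
NEW_PATTERNS = (
    bytes.fromhex("43444601"),  # netcdf-v3.trid.xml
    bytes.fromhex("43444602"),  # netcdf-v3-64.trid.xml
)


def old_matches(data: bytes) -> bool:
    """True if the data starts with the bare "CDF" magic."""
    return data.startswith(OLD_PATTERN)


def new_matches(data: bytes) -> bool:
    """True if the data starts with "CDF" plus version byte 1 or 2."""
    return data.startswith(NEW_PATTERNS)
```

For the stand-in header b"CDF\x05" + bytes(4), old_matches returns True but
new_matches returns False, which mirrors the behaviour of the file command
and DROID described above.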
According to some documentation, besides the NC file name extension, CDF was
also used. That was expressed by a line like:
<Ext>CDF/NC</Ext>
Unfortunately, in my inspected examples I only found the extension NC, but I
keep this line in the 32-bit variant. According to www.file-extensions.org,
CDF was used as the file name extension, but in 1994 it was changed to NC to
avoid a clash with other file formats that also use the CDF file name
extension. And according to the release notes, the 64-bit offset format was
introduced in version 3.6.0-beta1. That is dated between 2004-02-03 (version
3.5.1) and 2004-08-24 (version 3.6.0-beta3). So I assume that for the 64-bit
variant only examples with the NC file name extension exist. So in
netcdf-v3-64.trid.xml this is expressed by a line like:
<Ext>NC</Ext>
With the 2 new TrID definitions, all of my inspected NetCDF examples are
still described correctly, now as "NetCDF Network Common Data Form (v3)",
and the misidentification has vanished (see appended
output/trid-v-new.txt).
The TrID definitions, some examples and the output are stored in the archive
nc_.zip.
With best wishes
Jörg Jenderek