Author Topic: updated aup3.trid.xml sqlite-3x.trid.xml for Audacity 3 Project  (Read 662 times)

jenderek

  • Sr. Member
  • ****
  • Posts: 375
Hello trid users,

some days ago i send definition for SQLite Write-Ahead Log shared memory.
There i found many different unexpected suffix. So for control reasons i
look for extensions of related SQLite database. For every shared memory file
exist a companion database file without 4 byte phrase -shm at the end. One
of such sort has 4 bytes suffix AUP3 for Audacity 3 Project.

So i run trid utility on such Audacity examples. My AUP3 samples are
described as correctly generic as "SQLite 3.x database" with mime type
application/x-sqlite3 by sqlite-3x.trid.xml. But my samples are not
described as "Audacity 3 Project" by aup3.trid.xml (See appended
output/trid-v-old.txt).

For comparison reason i also run file command (version 5.44) on such
samples. Here these AUP3 samples are described as "SQLite 3.x database" with
additional information like application id 1096107097 (=41554459 hexadecimal
or AUDY string; see appended output/file-5.44.txt) and mime type
application/vnd.sqlite3 (see appended output/file-i-5.44.txt)

For comparison reason i also run the file format identification utility
DROID ( See https://sourceforge.net/projects/droid/). Here the examples are
also recognized. These are described as "Audacity Project File" with version
"3.x" and without mime type by PUID fmt/1826 (See appended
output/droid-aup3.csv)

There exist an official registered mime type application/vnd.sqlite3 at
iana.org for SQLite 3.x database. For the inspected Audacity Project samples
i found no mime type. Because Audacity Projects are just SQLite 3 database
these should at least get that mime type instead of generic mime type
application/octet-stream or deprecated application/x-sqlite3. That is now
expressed inside TrID definitions by line like:
   <Mime>application/vnd.sqlite3</Mime>

So i run tridscan on my AUP3 samples to update aup3.trid.xml.  With the help
of file format specification i look at my TrID definition and try to
understand why some constructs appear or where i can refine the definition.

In the first 16 bytes are the string "SQLite format 3\0". Afterwards comes
the database page size in bytes stored as 2 byte in big endian. The value 1
must be interpreted as 65536.  So these two facts are expressed by the first
construct:
   <Bytes>53514C69746520666F726D61742033000001</Bytes>
   <ASCII> S Q L i t e   f o r m a t   3</ASCII>
   <Pos>0</Pos>
I do not know if this page size is used for all Audacity 3 Project
samples. So i mention this observation in a remark line.

At offset 20 unused "reserved" space at the end of each page is stored as
byte. This is usually usually 0. This is also observed for AUP3 samples.  At
offset 21 maximum embedded payload fraction is stored as byte. This must be
64 (40 hexadecimal).  At offset 22 Minimum embedded payload fraction is
stored as byte. This must be 32 (20 hexadecimal).  At offset 23 Leaf payload
fraction is stored as byte. This must be 32 (20 hexadecimal).  At offset 24
file change counter is stored as 4 byte big endian integer.  So these these
facts are expressed by the second construct:
   <Bytes>00402020000000</Bytes>
   <ASCII> . @</ASCII>
   <Pos>20</Pos>
Normally the file counter is "low". So in the examples the upper 3 bytes
were nil, but when you doing many changes in the projects this value can
reach 4 GB limit. So the last 3 bytes in second XML construct will vanish
and this becomes like:
   <Bytes>00402020</Bytes>
   <ASCII> . @</ASCII>
   <Pos>20</Pos>

At offset 40 the schema cookie is stored as as 4 byte big endian integer. In
old definition this value was 3, but in my examples this value is 4. In
other database i also found higher values like A8.  At offset 44 the schema
format number is stored as 4 byte big endian integer. Supported schema
formats are 1, 2, 3, and 4 The last one is often found. Also for AUP3
samples.  offset 48 the suggested page cache size is stored as 4 byte big
endian integer. Often this value is nil. This is also observed for Audacity
samples.  offset 52 the page number of the largest root b-tree page is
stored as 4 byte big endian integer. This only applies when in auto-vacuum
or incremental-vacuum modes. Otherwise this value is zero.  At offset 56 the
database text encoding is stored as 4 byte big endian integer. A value of 1
means UTF-8. The value 2 means UTF-16le and 3 means UTF-16be encoded. In
Audacity examples this value was 1 At offset 60 the "user version" as read
and set by the user_version is stored as 4 byte big endian integer. In AUP3
samples this value was 50331648 ( hexadecimal 03000000).  At offset 64 the
vacuum mode is stored as 4 byte big endian integer.  True (non-zero) for
incremental-vacuum mode and false (zero) otherwise.  In AUP3 samples this
value was nil.  At offset 68 the "Application ID" set by PRAGMA
application_id is stored as 4 byte big endian integer.  That is the most
important sub classification feature to distinguish the Audacity databases
from others.  For Audacity this is hexadecimal byte sequences 41554459 or
expressed as AUDY string. That is also used by DROID tool. That is evil side
with open software. Everyone can take such software lite SQLite, but using
another file extension and other features like specific application id and
not telling these differences. This in the end leads to situation like in
bible view building attempt a tower of Babylon or in Tolkien view like
transforming elves to orcs.  At offset 72 20 reserved bytes for expansion
are stored. These must be zero. So these facts were expressed by the third
construct:
 <Bytes>000000030000000400000000000000000000000103000000000000004155445900</Bytes>
 <ASCII> . . . . . . . . . . . . . . . . . . . . . . . . . . . . A U D Y</ASCII>
 <Pos>40</Pos>
Assuming that also schema cookie can be higher and reach 4 GB limit and
neglecting 20 reserved bytes this construct now becomes like:
 <Bytes>00000004000000000000000000000001030000000000000041554459</Bytes>
 <ASCII> . . . . . . . . . . . . . . . . . . . . . . . . A U D Y</ASCII>
 <Pos>44</Pos>

At offset 96 the compile-Time Library version number SQLITE_VERSION_NUMBER
is stored as 4 byte big endian integer. In current definition that was byte
sequence 002E43C3 ( 3032003 decimal or read as 3.32.3). In my examples i get
decimal value 3035005 ( that can be read as 3.35.5 or hexadecimal as
002E4F7D. In other database i get values like: 0 3007014 3008011 3016002
3017000 3022000 3028000 3031001 3032003 3035005 The header end at offset
100. There other parts start.  So these facts were expressed by the last
construct:
   <Bytes>002E43C30D00000004FD3D00FF6AFED1FD8FFD3D00</Bytes>
   <ASCII> . . C . . . . . . . = . . j . . . . . =</ASCII>
   <Pos>96</Pos>
Assuming that other SQLITE_VERSION_NUMBER and higher may exist and
neglecting non header parts this construct can be deleted.

With the updated trid definitions now my AUP3 examples are described (see
appended output/trid-v-new.txt). TrID definitions, few samples and output
are stored in archive aup3_.zip. I hope that my definition can be used in
future version of triddefs.

With best wishes
Jörg Jenderek

Mark0

  • Administrator
  • Hero Member
  • *****
  • Posts: 2732
    • Mark0's Home Page
Re: updated aup3.trid.xml sqlite-3x.trid.xml for Audacity 3 Project
« Reply #1 on: April 23, 2023, 02:28:00 PM »
Thanks for the updated definitions!