Author Topic: db-plocate.trid.xml for plocate database  (Read 717 times)

jenderek

db-plocate.trid.xml for plocate database
« on: September 11, 2023, 11:55:27 AM »
Hello trid users,

Some weeks ago I handled some SQLite database samples. Often these have
the file name suffix DB, so I looked for such samples on my
systems. Unfortunately this suffix is also used for other database formats. In
this session I will handle the plocate database. After handling some mlocate
databases, typically named mlocate.db, I looked for such standard search
database samples on my Linux Mint 21.1 system. Surprisingly at first glance,
the locate utility mlocate has been replaced there by plocate, because it is
faster and its database is smaller according to its own documentation. So here
plocate is the standard locate utility instead of mlocate.

Typically the database is stored as /var/lib/plocate/plocate.db (on Mint
21.1), but another database name and path can be given by option
parameters. The utility is described, for example, on the Wikipedia page:
     https://en.wikipedia.org/wiki/Locate_(Unix)

How to call this program is described in the Linux User Manual page
plocate(1). You can find it on the web, for example via this link:
      https://manned.org/plocate.1

The companion that creates or updates a database is called
updatedb.plocate. This command is described in the Linux User Manual page
updatedb.plocate(8), so I use that URL as reference inside the new TrID
definition db-plocate.trid.xml. This is expressed by lines like:
   <RefURL>
   https://manned.org/updatedb.plocate.8
   </RefURL>

Because of the Debian alternatives system, the programs mlocate and plocate
and their man pages are symlinked under the program names locate and
updatedb. The parameters are also the same. The difference is the database,
which uses another format and name (plocate.db instead of mlocate.db). There
also exists a tool plocate-build that converts an mlocate database to a
plocate database, as described in man page plocate-build(8).

So I ran the trid utility on such plocate database samples. All of my samples
are described as "Unknown!" (see appended output/trid-v-old.txt).

For comparison I also ran the file format identification utility DROID
(see https://sourceforge.net/projects/droid/). Here the examples are wrongly
described as "Thumbs DB file" with version "XP" and mime type
application/vnd.microsoft.windows.thumbnail-cache by PUID fmt/682, based only
on the DB extension.

For comparison I also ran the file command (version 5.45) on such
samples. Here the samples are only described generically as "data" (see
appended output/file-5.45.txt) with the generic mime type
application/octet-stream (see appended output/file-i-5.44.txt). The correct
file suffix is also not recognized (see appended output/file-ext-5.45.txt).

Instead of the generic mime type I chose a user defined one. That is expressed
by a line like:
   <Mime>application/x-plocate</Mime>

Luckily plocate is open source. So with the help of the header file db.h I
tried to understand and improve the trid patterns. According to it the
samples start with the 8 byte magic \0plocate. At offset 8 the version is
stored as uint32_t. In my examples the version was 1. So these 2 facts were
expressed in the Front Block section by an XML construct like:
   <Bytes>00706C6F6361746501000000</Bytes>
   <ASCII> . p l o c a t e</ASCII>
   <Pos>0</Pos>
According to the header file 2 is the current version. Maybe the version will
increase by some steps, but I assume it will not jump over the 255 limit. So
the upper 3 bytes will be nil for a very long time. So this becomes:
   <Pattern>
      <Bytes>00706C6F63617465</Bytes>
      <ASCII> . p l o c a t e</ASCII>
      <Pos>0</Pos>
   </Pattern>
   <Pattern>
      <Bytes>000000</Bytes>
      <Pos>9</Pos>
   </Pattern>
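
The two patterns above can also be checked by hand against a sample. What
follows is only a minimal Python sketch, not TrID's own matching code; the
function name looks_like_plocate and the synthetic sample bytes are
assumptions for illustration.

```python
MAGIC = b"\x00plocate"  # 8 byte magic at offset 0

def looks_like_plocate(data: bytes) -> bool:
    """True if data has the magic at offset 0 and three nil bytes at offset 9."""
    if len(data) < 12:
        return False
    return data[0:8] == MAGIC and data[9:12] == b"\x00\x00\x00"

# Synthetic start of a header: magic followed by version 1 (little-endian uint32_t)
sample = MAGIC + (1).to_bytes(4, "little")
print(looks_like_plocate(sample))
```

The byte at offset 8 is deliberately not compared, so any version from 0 to
255 would still match.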

At offset 12 the hashtable_size is stored as uint32_t. For "empty" samples I
get the value 1 here and for the "real" example plocate.db the value 1b5c3h
(see patched file.tmp in output). At offset 16 extra_ht_slots is stored as
uint32_t. In all examples the value was 10h. So these observations are
expressed by an XML construct like:
   <Bytes>0010000000</Bytes>
   <Pos>15</Pos>

I do not know whether extra_ht_slots can take a value other than 10h. So I
assume that this variable always has the value 10h, keep it and mention it in
a remark line. For real samples the hashtable size increases, so I assume
that its value can reach the 4 GB limit and the nil byte at offset 15 must
go. So the above construct becomes:
   <Bytes>10000000</Bytes>
   <Pos>16</Pos>

At offset 20 num_docids is stored as uint32_t. For empty samples I get the
value 0 here and for the "real" example a132h. So the 2 upper bytes are nil.
That was expressed by an XML construct like:
   <Bytes>0000</Bytes>
   <Pos>22</Pos>
Assuming that this value can reach the 4 GB limit, the above construct vanishes.
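
The position of such a nil byte pattern follows from the field offset, the
field width and the largest value seen, because the fields are stored
little-endian. A small Python sketch of that arithmetic follows; the helper
names nil_upper_bytes and pattern_pos are made up for this illustration.

```python
def nil_upper_bytes(value: int, width: int) -> int:
    """Count the most significant zero bytes of value stored in width bytes."""
    used = (value.bit_length() + 7) // 8  # bytes actually needed for the value
    return width - used

def pattern_pos(offset: int, value: int, width: int) -> int:
    """Position of the first nil byte of a little-endian field at offset."""
    return offset + width - nil_upper_bytes(value, width)

# num_docids: uint32_t at offset 20, largest observed value a132h
print(nil_upper_bytes(0xA132, 4), pattern_pos(20, 0xA132, 4))
```

For num_docids this gives 2 nil bytes starting at position 22, which is the
pattern shown above.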

At offset 24 hash_table_offset_bytes is stored as uint64_t. For empty samples
I get the value 0x78=120 here (that is some bytes behind the header) and for
the "real" example af3e56h. So the 5 upper bytes are nil. That was expressed
by an XML construct like:
   <Bytes>0000000000</Bytes>
   <Pos>27</Pos>
Assuming that this value can reach the 16 EB limit, the above construct vanishes.

At offset 32 filename_index_offset_bytes is stored as uint64_t. For empty
samples I get the value 0x70=112 here (that is some bytes behind the header)
and for the "real" example aa34bef. So the 5 upper bytes are nil.
At offset 40 max_version is stored as uint32_t. In my examples I get the value
2 here. At offset 44 zstd_dictionary_length_bytes is stored as uint32_t. In my
empty examples I get the value 0 here and 1024=0400h for the "real" example.
So these facts are expressed by an XML construct like:
   <Bytes>00000000000200000000</Bytes>
   <Pos>35</Pos>
According to the header file the values 1 and 2 are listed for
max_version. Assuming that this value also always stays below the 255 limit,
that the offset value can reach the 16 EB limit and that other values for
zstd_dictionary_length_bytes are possible, this becomes:
   <Bytes>000000</Bytes>
   <Pos>41</Pos>

At offset 48 zstd_dictionary_offset_bytes is stored as uint64_t. For my empty
examples I get the value 0 here and for real examples 70h. So the 7 upper
bytes are nil. That is expressed by an XML construct like:
   <Bytes>00000000000000</Bytes>
   <Pos>49</Pos>
Assuming that this value can reach the 16 EB limit, the above construct vanishes.

At offset 56 directory_data_length_bytes is stored as uint64_t. For my empty
examples I get the value 9 here and for the real example 48f0bh, but the
interpretation depends on some version fields. So the 5 upper bytes are nil.
That is expressed by an XML construct like:
   <Bytes>0000000000</Bytes>
   <Pos>59</Pos>
Assuming that this length can reach the 16 EB limit, the above construct vanishes.

At offset 64 directory_data_offset_bytes is stored as uint64_t. For my empty
examples I get the value 198h here and for the real example 1461af9h, but the
interpretation depends on some version fields. So the 4 upper bytes are nil.
At offset 72 next_zstd_dictionary_length_bytes is stored as uint64_t. For my
empty examples I get the value 0 here and for the real example 400h, but the
interpretation depends on some version fields. So the 6 upper bytes are nil.
That is expressed by XML constructs like:
   <Bytes>0000000000</Bytes>
   <Pos>68</Pos>
   <Bytes>000000000000</Bytes>
   <Pos>74</Pos>
Assuming that directory_data_offset_bytes can reach the 16 EB limit and that
other values for next_zstd_dictionary_length_bytes are possible, the above
constructs vanish.

At offset 80 next_zstd_dictionary_offset_bytes is stored as uint64_t. For my
empty examples I get the value 0 here and for the real example 14aaa04h, but
the interpretation depends on some version fields. So the 4 upper bytes are
nil. That is expressed by an XML construct like:
   <Bytes>00000000</Bytes>
   <Pos>84</Pos>
Assuming that this value can reach the 16 EB limit, the above construct vanishes.

At offset 88 conf_block_length_bytes is stored as uint64_t. For my empty
examples I get values like 93h, 97h, 98h, 219h, 21ch and 21eh here and for the
real example 219h, but the interpretation depends on some version fields. So
the 6 upper bytes are nil. That is expressed by an XML construct like:
   <Bytes>000000000000</Bytes>
   <Pos>90</Pos>

These are the few hundred bytes for the same 4 configuration variables as
in the mlocate database. These variable names are expressed inside the global
strings by lines like:
   <String>PRUNE_BIND_MOUNTS</String>
   <String>PRUNENAMES</String>
   <String>PRUNEPATHS</String>
   <String>PRUNEFS</String>
I see no reason why this block should grow exorbitantly beyond the FFFFh limit
(=65535). So I keep the above construct.
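
A rough way to check a sample for these four names by hand is a plain
substring search. This Python sketch only approximates the idea; TrID's own
string matching may work differently, and the sample bytes below are invented
for illustration rather than copied from a real configuration block.

```python
GLOBAL_STRINGS = [b"PRUNE_BIND_MOUNTS", b"PRUNENAMES", b"PRUNEPATHS", b"PRUNEFS"]

def has_all_strings(data: bytes) -> bool:
    """True if every configuration variable name occurs somewhere in data."""
    return all(name in data for name in GLOBAL_STRINGS)

# Invented stand-in for a configuration block, not the real on-disk layout
sample = b"\x00PRUNE_BIND_MOUNTS\x00PRUNEFS\x00PRUNENAMES\x00PRUNEPATHS\x00"
print(has_all_strings(sample))
```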

At offset 96 conf_block_offset_bytes is stored as uint64_t. For my empty
examples I get the value 1a1h here and for the real example 14aae04h, but the
interpretation depends on some version fields. So the 4 upper bytes are nil.
That is expressed by an XML construct like:
   <Bytes>00000000</Bytes>
   <Pos>100</Pos>
Assuming that this offset can reach the 16 EB limit, the above construct vanishes.

At offset 104 the check_visibility Boolean is stored (see the -l option of
updatedb). That is the end of the header. So in my opinion all patterns at
higher offsets are generated by lucky circumstances, like:
   <Pattern>
      <Bytes>00000000000000</Bytes>
      <Pos>105</Pos>
   </Pattern>
   <Pattern>
      <Bytes>FF</Bytes>
      <Pos>155</Pos>
   </Pattern>
   <Pattern>
      <Bytes>000000</Bytes>
      <Pos>227</Pos>
   </Pattern>
   <Pattern>
      <Bytes>000000</Bytes>
      <Pos>253</Pos>
   </Pattern>
   <Pattern>
      <Bytes>0000</Bytes>
      <Pos>258</Pos>
   </Pattern>
   <Pattern>
      <Bytes>000000</Bytes>
      <Pos>261</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>278</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>333</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>338</Pos>
   </Pattern>
   <Pattern>
      <Bytes>65</Bytes>
      <ASCII> e</ASCII>
      <Pos>421</Pos>
   </Pattern>
   <Pattern>
      <Bytes>64</Bytes>
      <ASCII> d</ASCII>
      <Pos>426</Pos>
   </Pattern>
So I deleted the above patterns.
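
The header fields walked through above can be read in one step with Python's
struct module. This is a sketch based only on the offsets given in this post,
not on plocate's own code. The field at offset 72 is named
next_zstd_dictionary_length_bytes here on the assumption that it is a length,
and check_visibility=False in the synthetic sample is just a placeholder.

```python
import struct
from collections import namedtuple

# "<" = little-endian without padding, so every field sits at the offset
# named in the text (8s = magic, I = uint32_t, Q = uint64_t, ? = bool).
HEADER_FORMAT = "<8sIIIIQQIIQQQQQQQ?"

Header = namedtuple("Header", [
    "magic", "version", "hashtable_size", "extra_ht_slots", "num_docids",
    "hash_table_offset_bytes", "filename_index_offset_bytes",
    "max_version", "zstd_dictionary_length_bytes",
    "zstd_dictionary_offset_bytes",
    "directory_data_length_bytes", "directory_data_offset_bytes",
    "next_zstd_dictionary_length_bytes", "next_zstd_dictionary_offset_bytes",
    "conf_block_length_bytes", "conf_block_offset_bytes", "check_visibility",
])

def parse_header(data: bytes) -> Header:
    """Unpack the first 105 bytes of a plocate database into named fields."""
    return Header._make(struct.unpack_from(HEADER_FORMAT, data, 0))

# Synthetic "empty" header built from values quoted in this post;
# 93h is one of several observed conf_block_length_bytes values.
sample = struct.pack(HEADER_FORMAT, b"\x00plocate", 1, 1, 0x10, 0,
                     0x78, 0x70, 2, 0, 0, 9, 0x198, 0, 0, 0x93, 0x1A1, False)
header = parse_header(sample)
print(struct.calcsize(HEADER_FORMAT))
print(header.version, header.max_version, hex(header.conf_block_offset_bytes))
```

For a real database the first 105 bytes of the file would be passed to
parse_header instead of the synthetic sample.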

With the new trid definition my plocate database examples are now recognized
and described (see appended trid-v-new.txt in output). The TrID definition,
some samples and the output are stored in the archive plocate.zip. I hope that
my definitions can be used in a future version of triddefs.

With best wishes
Jörg Jenderek


Mark0

Re: db-plocate.trid.xml for plocate database
« Reply #1 on: September 15, 2023, 01:30:38 PM »
Thanks!