Topic: updated lz4*.trid.xml for LZ4 compressed stream + variants like Mozilla

jenderek

Hello TrID users,

some days ago I ran the CCleaner cleanup tool on Windows. One of its options
is called "Unused File Extension". When I used this option it complained about
the file name suffix LZ4. So I looked on my systems for such files.

Then I ran the TrID utility on my LZ4 examples. Many samples are described as
"LZ4 compressed stream" with mime type application/lz4 by lz4.trid.xml. A few
samples like mtools.conf.lz4 are described as "LZ4 compressed stream (old)" by
lz4-old.trid.xml with mime type application/lz4. One example, test-v1.lz4, is
described as "Unknown!". I also found samples named webext.sc.lz4 which are
also described as unknown (see appended output/trid-v-old.txt).

For comparison I also ran the file format identification utility DROID
(see https://sourceforge.net/projects/droid/). It recognizes none of the
examples.

For comparison I also ran the file command (version 5.44) on these samples.
When it recognizes a sample, it calls it "LZ4 compressed data". What TrID
describes with the additional phrase (old) is described here with the
additional phrase (v0.1-v0.9), for example mtools.conf.lz4. What TrID
describes without an additional phrase is described here with the additional
phrase (v1.4+), for example hosts.lz4.

The "middle-aged" sample test-v1.lz4, which is not recognized by TrID, is
described here with the additional phrase (v1.0-v1.3). So I mention this
version information in the remark line. I did not find real examples of this
variant myself, so I constructed such a sample with a hex editor.

So I created a TrID definition lz4-v1.trid.xml by transferring the magic
pattern from Magdir/compress in the file command's database. But I do not know
whether such a "middle-aged" variant really exists. According to file, it also
starts with a similar characteristic 4 byte sequence. This is expressed by an
XML construct like:
   <Bytes>03214C18</Bytes>
   <ASCII> . ! L</ASCII>
   <Pos>0</Pos>
In lz4.trid.xml this construct looks like:
   <Bytes>04224D18</Bytes>
   <ASCII> . " M</ASCII>
   <Pos>0</Pos>
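
The same 4 byte check can be tried outside TrID. The following Python lines
are only a sketch: the name classify_lz4 is made up for this post, the labels
follow the wording of the file command, and the third magic value (for the
"old" v0.1-v0.9 variant) is not quoted above. It is an assumption based on the
LZ4 legacy magic number 0x184C2102 stored in little-endian order.

```python
# Map the first four bytes of a file to the variant names used by file(1).
LZ4_MAGICS = {
    bytes.fromhex("04224D18"): "LZ4 compressed data (v1.4+)",
    bytes.fromhex("03214C18"): "LZ4 compressed data (v1.0-v1.3)",
    # Assumption: value not shown in this post (legacy magic 0x184C2102, LE).
    bytes.fromhex("02214C18"): "LZ4 compressed data (v0.1-v0.9)",
}


def classify_lz4(header: bytes) -> str:
    """Describe the 4 byte magic at offset 0, or return 'Unknown!'."""
    return LZ4_MAGICS.get(header[:4], "Unknown!")
```

To use it, read the first 4 bytes of a sample opened in binary mode and pass
them to classify_lz4.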

The reference URL used so far jumps to the page about Luftschiff Zeppelin 4, a
German experimental airship. Apparently this is the wrong page; the LZ4
disambiguation page shows that the correct page is another one. This is now
expressed by a line like:
   <RefURL>https://en.wikipedia.org/wiki/LZ4_(compression_algorithm)</RefURL>
For the middle-aged variant I chose the page on the Archive Team file formats
web site. This is expressed by a line like:
   <RefURL>http://fileformats.archiveteam.org/wiki/LZ4</RefURL>

The current definition uses the mime type application/lz4, but when I look at
IANA no such type is registered. On Linux machines another type is used. This
is also shown by the file command (see appended output/file-i-5.44.txt). So,
following the shared mime database, I now also use this type. This is
expressed by a line like:
   <Mime>application/x-lz4</Mime>

I and other people often complain about Microsoft's behaviour, but open
software is not the holy grail in every field either, and Mozilla Firefox and
Thunderbird are considered flagships in that field. Some samples named
webext.sc.lz4 are found in some Mozilla Firefox and Thunderbird user
directories. I found such samples on Windows, Raspbian and Mint operating
systems. Now comes the evil part, which I call orcifying software, and it is
not the first time Mozilla has taken such steps. It is like elves turning into
orcs, as told in Tolkien's tales. They probably took standard software like
the LZ4 compression algorithm and modified it. This step is OK, but they do
not mention what they did, and there exists no file format specification and
no viewing or unpacking tools. The file type is also not officially
registered. Some people say "may the source be with you", but when unpacking
the Firefox or Thunderbird packages I get hundreds of MB of source.
Unfortunately nearly nobody has enough expertise and time to find the needed
explanations there. And the worst part is that they use the lz4 suffix. So
everybody assumes that such a file can be unpacked with standard tools like
lz4. But this does not work.

There exist similar samples with suffix MOZLZ4 or JSONLZ4. The same problem
occurs there, but a different suffix is used and there exist software tools
like lz4jsoncat which can do the uncompressing.

Because I found no real documentation for that file format, I use the generic
page about Mozilla instead. This is expressed by a line like:
   <RefURL>https://en.wikipedia.org/wiki/Mozilla</RefURL>

I also tried other tools like lz4jsoncat, but they do not work with these
webext.sc.lz4 files. I also looked for the signatures found in other real LZ4
samples and did not find such hints. So I am also not sure that the LZ4
compression algorithm is used in these samples. So I keep the generic mime
type here. That is expressed by a line like:
   <Mime>application/octet-stream</Mime>
Characteristic are the first bytes. These are expressed by an XML construct
like:
   <Bytes>6D6F7A4A5353434C7A34307630303100</Bytes>
   <ASCII> m o z J S S C L z 4 0 v 0 0 1</ASCII>
   <Pos>0</Pos>
This looks similar to the definition mozlz4.trid.xml for the MOZLZ4/JSONLZ4
suffixes. There the starting bytes are described by an XML construct like:
   <Bytes>6D6F7A4C7A343000</Bytes>
   <ASCII> m o z L z 4 0</ASCII>
   <Pos>0</Pos>

There the next 4 bytes contain the original uncompressed file size.
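
The two Mozilla signatures and the size field can be inspected with a short
Python sketch. The function name describe_mozilla_header is hypothetical, and
reading the 4 size bytes as a little-endian unsigned integer is an assumption,
because the byte order is not stated here. For the mozJSSCLz40v001 header no
field layout is assumed at all.

```python
import struct

MOZ_JSSC_MAGIC = bytes.fromhex("6D6F7A4A5353434C7A34307630303100")  # mozJSSCLz40v001 + NUL
MOZ_LZ4_MAGIC = bytes.fromhex("6D6F7A4C7A343000")  # mozLz40 + NUL


def describe_mozilla_header(data: bytes) -> str:
    """Report which Mozilla signature, if any, starts the given bytes."""
    if data.startswith(MOZ_JSSC_MAGIC):
        return "mozJSSCLz40v001 (webext.sc.lz4 style)"
    if data.startswith(MOZ_LZ4_MAGIC):
        size_field = data[8:12]
        if len(size_field) < 4:
            return "mozLz40 (truncated size field)"
        # Assumption: the size is a little-endian unsigned 32 bit value.
        (size,) = struct.unpack("<I", size_field)
        return f"mozLz40, uncompressed size {size} bytes"
    return "no Mozilla header"
```

Passing the first 16 bytes of a sample is enough for both signatures.
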
Because of these uncertainties I keep all constructs in my new definition.
There are some short nil byte sequences inside the front block section. These
are expressed by XML constructs like:
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>19</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>24</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>33</Pos>
   </Pattern>
Maybe some "short" file sizes are also stored there. But I am only guessing
and do not know this. Maybe somebody else knows and can explain the items I
observed.
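
Whether a given sample really has these bytes at the listed positions can be
checked with a small Python sketch. The names FRONT_BLOCK_PATTERNS and
matches_front_block are made up for illustration, and the list contains only
the signature at position 0 and the three nil bytes shown above, not the
complete definition.

```python
# (position, expected bytes) pairs taken from the constructs shown above.
FRONT_BLOCK_PATTERNS = [
    (0, bytes.fromhex("6D6F7A4A5353434C7A34307630303100")),
    (19, b"\x00"),
    (24, b"\x00"),
    (33, b"\x00"),
]


def matches_front_block(data: bytes, patterns=FRONT_BLOCK_PATTERNS) -> bool:
    """Return True only if every pattern is found at its fixed position."""
    return all(
        data[pos:pos + len(expected)] == expected
        for pos, expected in patterns
    )
```

Running this over several webext.sc.lz4 samples and dropping one pattern at a
time would show which of the nil bytes are constant and which are accidental.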

In the global strings I also get many lines. Here too I do not know which are
relevant and which are optional, so I keep all lines. I hope that other users
can refine my observed patterns.

I also consider such things a security issue. All virus writers say thanks to
Mozilla. Now they can put their malicious code, packed, under such a name or a
similar one. Because lz4 files are compressed, nobody regards other packed
stuff here as strange. Because it is not standard LZ4 compression and no virus
scanner complains, such people could put their packed malicious code inside
such directories. Because of the needle in a haystack principle it is
difficult to detect, and anti-virus software tools must use a lot of resources
and AI power to detect and protect. So instead of wasting much time and
resources, the IT companies should first spend effort on simple things like
describing and officially registering their file formats.

With the updated and new TrID definitions all my LZ4 samples are now
described. The TrID definitions and output are stored in the archive
lz4_many.zip. I hope that my definitions can be used in a future version of
triddefs.

With best wishes
Jörg Jenderek

Mark0

Thanks!