Author Topic: ark-rpm-src-v30.trid.xml for source variant of RPM Package  (Read 401 times)

jenderek

« on: March 18, 2024, 03:28:11 AM »
Hello trid users,

some days ago I looked at the content of an exotic CD-ROM. It also contains
Red Hat packages. Most samples have the RPM file name suffix.

So I ran the trid utility on these packages. All samples are recognized and
described as "RPM Package (generic)" by ark-rpm.trid.xml. The mime type
application/x-rpm is shown and RPM is displayed as file name suffix (see
appended trid-v-old.txt in output). The Wikipedia page about RPM Package
Manager is used as reference. That is expressed by a line like:
   <RefURL>http://en.wikipedia.org/wiki/RPM_Package_Manager</RefURL>

For comparison I also ran the file format identification utility DROID
(see https://sourceforge.net/projects/droid/). It also recognizes the
samples and describes them as "RPM Package Manager file". No mime type is
listed and only the RPM suffix is considered valid (see EXTENSION_MISMATCH in
appended droid-rpm.csv in output). DROID also does a sub classification. Three
different versions are listed (1 2 3). These are identified via the PUIDs
fmt/793, fmt/794 and fmt/795. There I found a hint about the unusual items I
had observed. It is written that RPM files currently appear in two defined
types. The first is binary package files containing the compiled version of
certain software. The second is source package files containing the source
code used to produce a package. These have an appropriate tag in the file
header that distinguishes them from binary RPMs, causing them to be extracted
to /usr/src on installation. Source package files customarily carry the file
extension ".src.rpm". So that is only half of the truth.

For comparison I also ran the file command (version 5.45) on these samples.
The samples are also recognized. They are described generically as "RPM" (see
appended file-5.45.txt in output). It also shows the version information, as
DROID does. The three versions are shown as v1, v2 and v3.0. After the
version, bin is shown for binary packages and src for source packages.
Afterwards the canonical architecture name is shown in most cases. For some
samples that information is not shown. Obviously this is a bug in the current
file command. That made me nervous when looking at the output for package
collections. The mime type application/x-rpm is also shown (see appended
file-i-5.45.txt in output). No file name suffix is shown (see appended
file-ext-5.45.txt in output).

On Linux, according to the shared MIME-info database, the binary samples are
called "RPM package" and application/x-rpm is shown as mime type. The samples
are recognized just by looking for the 4 byte sequence \xed\xab\xee\xdb at the
beginning. Also rpm is listed as suffix. But for source packages
application/x-source-rpm is shown as mime type and 2 name endings are listed
(*.src.rpm *.spm). That information can be seen in the freedesktop.org.xml.in
source, found for example on gitlab.freedesktop.org. Unfortunately I found no
example with SPM suffix on my systems. To verify this information I searched
for the SPM suffix and found at least 2 different web sites with
information. These sites are:

 https://filext.com/file-extension/SPM
 https://www.file-extensions.org/spm-file-extension-source-package-manager-data

On Wikipedia it is explicitly explained that instead of the longer name ending
.src.rpm also .spm can be used, to overcome the limits of old file systems
like FAT16 (with 8+3 name length).
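
The magic and suffix rules described above can be sketched in a few lines of
Python. This is only an illustration of the rules quoted from
freedesktop.org.xml.in, not the way the shared MIME-info library really
resolves types. The function name guess_rpm_mime and the case insensitive
suffix comparison are assumptions made for this sketch.

```python
from typing import Optional

# 4 byte signature at the beginning of every RPM package
RPM_MAGIC = b"\xed\xab\xee\xdb"

# name endings listed for application/x-source-rpm
SOURCE_SUFFIXES = (".src.rpm", ".spm")


def guess_rpm_mime(file_name: str, head: bytes) -> Optional[str]:
    """Return a mime type guess, or None if the RPM magic is missing.

    file_name -- name of the file, only its ending is inspected
    head      -- at least the first 4 bytes of the file
    """
    if not head.startswith(RPM_MAGIC):
        return None
    # Assumption: compare suffixes case insensitively (FOO.SPM on FAT16).
    if file_name.lower().endswith(SOURCE_SUFFIXES):
        return "application/x-source-rpm"
    return "application/x-rpm"


if __name__ == "__main__":
    for name in ("foo-1.0-1.src.rpm", "FOO.SPM", "foo-1.0-1.i386.rpm"):
        print(name, guess_rpm_mime(name, RPM_MAGIC))
```

Note that this guess relies on the file name only. A renamed source package
would be reported as binary, which is why a content based check is more
reliable.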

Luckily I found information about the RPM package file format on the Archive
Team web site. That is expressed inside the new definition by a line like:
   <RefURL>http://fileformats.archiveteam.org/wiki/RPM</RefURL>

The Wikipedia page currently used as reference is mentioned there as a
link. The advantage is that download links to samples and software are also
listed there. Furthermore a direct link to the package file format is listed
in the specifications section.

So I ran tridscan on my inspected source samples to create a new definition.
The two file name suffixes are expressed by a line like:
   <Ext>RPM/SPM</Ext>
The mime type is expressed by a line like:
   <Mime>application/x-source-rpm</Mime>

So I looked at the generated patterns and tried to understand and refine them
by looking at the specification. The first construct looks like:
   <Bytes>EDABEEDB0300000100</Bytes>
   <Pos>0</Pos>
According to the documentation the files begin with the signature bytes
EDABEEDB. Afterwards comes the major version byte followed by the minor
version byte. In my inspected samples these are 3 and 0, which is read as
version 3.0. Unfortunately none of the inspected samples is an older
version. Maybe the patterns shrink with older samples. So at the moment I
concentrate on version three and therefore use the definition name
ark-rpm-src-v30.trid.xml. At offset 6 the type is stored as 2 byte big
endian. The value indicates whether this is a source or binary package. This
value shall be 0 to indicate a binary package. Then apparently value 1 means
source package. At offset 8 the architecture number is stored as 2 byte big
endian. There exist only a dozen or so possible values. At the moment the
highest numbers are 23 for loongarch64 and 255 for noarch. So the upper byte
of this number will always be nil.
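
The first 10 bytes described above can be decoded with a short Python
sketch. It only follows the field layout given in the text (magic, major,
minor, type, architecture number). The function name parse_lead_start and the
returned dictionary keys are made up for this illustration.

```python
import struct

RPM_MAGIC = bytes.fromhex("EDABEEDB")

# type values as described in the text: 0 = binary, 1 = source
PACKAGE_KINDS = {0: "binary", 1: "source"}


def parse_lead_start(data: bytes) -> dict:
    """Decode offsets 0..9 of an RPM file (all numbers big endian)."""
    if len(data) < 10:
        raise ValueError("need at least 10 bytes")
    # 4 byte magic, 1 byte major, 1 byte minor, 2 byte type, 2 byte archnum
    magic, major, minor, pkg_type, archnum = struct.unpack(">4sBBHH", data[:10])
    if magic != RPM_MAGIC:
        raise ValueError("RPM signature EDABEEDB not found")
    return {
        "version": f"{major}.{minor}",
        "type": pkg_type,
        "kind": PACKAGE_KINDS.get(pkg_type, "unknown"),
        "archnum": archnum,
    }


if __name__ == "__main__":
    # The 9 bytes of the first construct plus one made-up architecture byte.
    sample = bytes.fromhex("EDABEEDB0300000100") + b"\x01"
    print(parse_lead_start(sample))
```

The pattern EDABEEDB0300000100 covers exactly these fields except for the
lower byte of the architecture number, which differs between samples.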

The second construct looks like:
 <Bytes>00000000000000000000000000000000000000000000000000000000000000000000
 00010005000000000000000000000000000000008EADE80100000000000000</Bytes>
 <Pos>42</Pos>
At offset 10 the package name is stored. This string is null terminated and
has a length of 66. Apparently the maximal package name length is not used in
my samples, so I got many nil bytes. At offset 76 osnum is stored as 2 byte
big endian. This indicates the operating system. It shall be 1, but in a few
binary packages (MainActor-2_06linux.rpm openssl-0.9.8zh-2.aix5.1.ppc.rpm
openssl-1.0.2s-1.aix5.1.ppc.rpm PGPCommandLine-10.4.1.54-MP2-aix5.3.ppc.rpm) I
found value 255. At offset 78 the signature type is stored as 2 byte big
endian. According to the specification this value shall be 5. At offset 80 16
reserved bytes are stored. In all my inspected samples these are nil, except
in example MainActor-2_06linux.rpm (this starts with 44d2ffbfda5c0508). At
offset 96 the rpmheader structure starts with the 4 byte magic
"\216\255\350\001" (8eade801 hexadecimal). This is followed by 4 reserved
bytes. This value shall be "\000\000\000\000". At offset 104 nindex is stored
as 4 byte big endian. That is the number of index records that follow this
header record. In my examples I get "low" values (like 2 3 4 5 7 8 9 10).
Assuming that the package name and the number of index records can reach
their maximum, most of the first nil bytes and the last 3 nil bytes vanish
from the above XML construct. So this now becomes:
 <Bytes>0000010005000000000000000000000000000000008EADE80100000000</Bytes>
 <Pos>75</Pos>
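
The remaining fields up to nindex can be read the same way. The following
Python sketch decodes offsets 10 to 107 as laid out above and checks the
refined construct at position 75. The names parse_lead_tail and make_sample
are hypothetical, and make_sample only builds a synthetic buffer for
demonstration, it is not a real package.

```python
import struct

HEADER_MAGIC = bytes.fromhex("8EADE801")
# Refined construct from the definition: <Bytes> at <Pos>75</Pos>
REFINED_PATTERN = bytes.fromhex(
    "0000010005000000000000000000000000000000008EADE80100000000"
)
REFINED_POS = 75


def parse_lead_tail(data: bytes) -> dict:
    """Decode name, osnum, signature type, reserved bytes and nindex."""
    if len(data) < 108:
        raise ValueError("need at least 108 bytes")
    # offset 10: 66 byte name, 2 byte osnum, 2 byte signature type,
    # 16 reserved bytes (ends at offset 96)
    name_field, osnum, sig_type, reserved = struct.unpack_from(
        ">66sHH16s", data, 10
    )
    # offset 96: 4 byte header magic, 4 reserved bytes, 4 byte nindex
    magic, hdr_reserved, nindex = struct.unpack_from(">4s4sI", data, 96)
    window = data[REFINED_POS:REFINED_POS + len(REFINED_PATTERN)]
    return {
        "name": name_field.split(b"\x00", 1)[0].decode("latin-1"),
        "osnum": osnum,
        "signature_type": sig_type,
        "reserved_is_nil": reserved == bytes(16),
        "header_magic_ok": magic == HEADER_MAGIC and hdr_reserved == bytes(4),
        "nindex": nindex,
        "matches_refined_pattern": window == REFINED_PATTERN,
    }


def make_sample(name: str, nindex: int) -> bytes:
    """Build a synthetic 108 byte buffer (hypothetical test helper)."""
    encoded = name.encode("latin-1")
    if len(encoded) > 65:
        raise ValueError("name must leave room for the terminating nil")
    return (
        bytes.fromhex("EDABEEDB0300")     # magic, version 3.0
        + struct.pack(">HH", 1, 1)        # type 1 (source), made-up archnum
        + encoded.ljust(66, b"\x00")      # null padded package name
        + struct.pack(">HH", 1, 5)        # osnum 1, signature type 5
        + bytes(16)                       # reserved
        + HEADER_MAGIC + bytes(4)         # header magic and reserved bytes
        + struct.pack(">I", nindex)       # number of index records
    )


if __name__ == "__main__":
    print(parse_lead_tail(make_sample("hello-1.0-1", 7)))
```

A sample with osnum 255 or non nil reserved bytes, as seen in the few binary
packages named above, would not match the refined pattern.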

At higher offsets some short nil sequences occur, like:
   <Pattern>
      <Bytes>0000</Bytes>
      <Pos>108</Pos>
   </Pattern>
   <Pattern>
      <Bytes>0000</Bytes>
      <Pos>112</Pos>
   </Pattern>
   <Pattern>
      <Bytes>000000</Bytes>
      <Pos>116</Pos>
   </Pattern>
   <Pattern>
      <Bytes>0000</Bytes>
      <Pos>120</Pos>
   </Pattern>
   <Pattern>
      <Bytes>000000</Bytes>
      <Pos>124</Pos>
   </Pattern>
   <Pattern>
      <Bytes>0000</Bytes>
      <Pos>128</Pos>
   </Pattern>
   <Pattern>
      <Bytes>000000</Bytes>
      <Pos>132</Pos>
   </Pattern>
   <Pattern>
      <Bytes>000000</Bytes>
      <Pos>136</Pos>
   </Pattern>
   <Pattern>
      <Bytes>0000</Bytes>
      <Pos>140</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>144</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>184</Pos>
   </Pattern>
I assume that these are triggered by lucky circumstances. Some 4 byte fields
in other structures do not reach their maximum. Assuming that such values can
become "high" or occur at other offsets, these patterns vanish.
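
The effect can be shown with a small Python sketch. A 4 byte big endian field
that holds a low value starts with nil bytes, so a scan over a handful of
similar samples picks them up as common patterns. The function name
leading_zero_bytes is made up, and the sketch makes no claim about which
field sits at which offset.

```python
import struct


def leading_zero_bytes(value: int) -> int:
    """Count the leading 00 bytes of value stored as 4 byte big endian."""
    packed = struct.pack(">I", value)
    return len(packed) - len(packed.lstrip(b"\x00"))


if __name__ == "__main__":
    for value in (7, 1000, 70000, 0x01000000):
        packed = struct.pack(">I", value)
        print(value, packed.hex(), leading_zero_bytes(value))
```

Values below 256 give three leading nil bytes and values below 65536 give
two, which fits the 000000 and 0000 sequences listed above. As soon as one
sample carries a larger value in such a field, the nil bytes are no longer
common to all samples.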

In the global strings section I get lines like:
   <String>CPIO'GZIP'9</String>
   <String>.SPEC</String>
   <String>.TAR.</String>
   <String>LINUX</String>

Apparently these are triggered by the nature of RPM. RPM packages contain
gzip compressed CPIO archives. The sources are packed as TAR archives. And the
build instructions are stored in SPEC text files. I do not know if it is
possible to use other compression or packing formats here.

With the new definition all my source packages are still recognized and
described with the correct suffix (see appended trid-v-new.txt in output).

The TrID definitions, some samples and the output are stored in archive
rpm_.zip. I hope that my definition can be used in a future version of
triddefs.

With best wishes
Jörg Jenderek


Mark0

Re: ark-rpm-src-v30.trid.xml for source variant of RPM Package
« Reply #1 on: March 20, 2024, 10:48:44 PM »
Thanks Joerg!