Author Topic: 2 replacement for pst*.trid.xml for Microsoft OutLook Personal Folder *.PST  (Read 1024 times)

jenderek

Hello trid users,

Some days ago I ran the Piriform CCleaner tool. It complained about the file
name extension PAB, so I looked for such examples and related files on my
systems.


When running TrID on Microsoft Outlook PST examples I got unexpected output.

All PST examples are described by pst.trid.xml as the ANSI variant of
Microsoft Outlook Personal Folder. Even the Unicode examples are described
as such, and even worse, Unicode examples with an unusual version like
test-v37.pst are definitely wrongly described as only the ANSI variant. But
a PST example can be either the ANSI or the Unicode variant, not both at the
same time. Furthermore the DROID samples x-fmt-248-signature-id-260.pst and
x-fmt-249-signature-id-261.pst are described as Outlook Personal Folders.
But these are not real Outlook examples. They contain just the first few
dozen bytes of such Outlook files (see appended output/trid-v-old.txt). So
the current pst.trid.xml is in reality a generic description of all such
Microsoft PST examples.

For comparison I also ran the file format identification utility DROID (see
https://sourceforge.net/projects/droid/). It identifies many examples as
"Microsoft Outlook Personal Folders". It also shows year ranges under
version. The ANSI variant is described by PUID x-fmt/248 with the additional
range 1997-2002, whereas the Unicode variant is described by PUID x-fmt/249
with the range 2003-2007. So here we also get the year information that is
shown by the file command. Some examples with unusual versions (14 and 37)
like test-v16.pst and test-v37.pst are not recognized (see appended
output/droid-pst.csv).

For comparison I also checked these examples with the file command utility.
When running the file command (newest version, >5.42) all PST examples are
described as expected (see appended output/file.txt).

Luckily DROID and the file command show a related URL and the file name
extensions used. With this information I was able to find a page about
Personal Folder File on the File Formats Archive Team web site. There a link
to the official Microsoft description [MS-PST].pdf is mentioned, and the
unofficial PFF format specification is listed as "Personal Folder File (PFF)
format.pdf". So instead of the generic website microsoft.com I now use this
page as the new reference. That is expressed by lines like:
   <RefURL>
   http://fileformats.archiveteam.org/wiki/Personal_Folder_File
   </RefURL>

Instead of the generic MIME type application/octet-stream I display
application/vnd.ms-outlook, which is mentioned on the reference site and
used by the newest file command (see appended output/file-i.txt). But this
is not mentioned on other sites and is not officially registered, so maybe
this must be changed again. At the moment this is expressed by a line like:
   <Mime>application/vnd.ms-outlook</Mime>
      
In the current definitions the first and most significant recognition
happens via the 4 starting bytes, called dwMagic in the specification. That
is expressed by an XML construct like:
   <Bytes>2142444E</Bytes>
   <ASCII> ! B D N</ASCII>
   <Pos>0</Pos>

The next characteristic is a 2-byte sequence called wMagicClient at offset
8. For PST examples this is the string SM, for PAB examples it is AB, and
for OST examples it is SO. This is expressed by an XML construct like:
   <Bytes>534D</Bytes>
   <ASCII> S M</ASCII>
   <Pos>8</Pos>
So these 2 constructs inside pst.trid.xml describe ALL PST examples.
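
The two checks described above can be sketched in a few lines of Python.
This is only an illustration of the byte positions quoted in this post, not
how TrID itself works; the function name client_type and the returned labels
are made up for the example.

```python
from typing import Optional

MAGIC = b"!BDN"  # dwMagic at offset 0 (hex 21 42 44 4E)
# wMagicClient at offset 8, values as quoted in the post
CLIENTS = {b"SM": "PST", b"AB": "PAB", b"SO": "OST"}


def client_type(header: bytes) -> Optional[str]:
    """Return "PST", "PAB" or "OST", or None if the header does not match."""
    if len(header) < 10 or header[0:4] != MAGIC:
        return None
    # Unknown wMagicClient values also give None.
    return CLIENTS.get(header[8:10])
```

Note that both the ANSI and the Unicode variant pass this check, which is
why a definition built only on these two constructs stays generic.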

At offset 10 the file format version is stored as the 2-byte integer wVer in
little endian. Unfortunately this version variable wVer is not clearly
explained. It is written that this value must be 14 (=Eh) or 15 (=Fh) if the
file is an ANSI PST file. From version 21 (=15h, according to the unofficial
documentation) or for values greater than 23 (=17h) it is a Unicode PST file
(UTF-16 little-endian).
The two versions 14 and 23 seem to be the common ones. pst-unicode.trid.xml
assumes that the version is always 23. That is expressed by an XML construct
like:
   <Bytes>17</Bytes>
   <Pos>10</Pos>
But according to the documentation the highest mentioned value is 37. If the
value is 37, it indicates that the file was written by an Outlook version
that supports Windows Information Protection (WIP). Therefore examples with
unusual versions like test-v37.pst are not described as the Unicode variant.
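
To make the version rule concrete, here is a small Python sketch that reads
wVer and sorts a header into ANSI or Unicode. The thresholds are the ones
quoted above (14 or 15 for ANSI, 21 and higher treated as Unicode, which is
an assumption based on the unofficial documentation); the function name
pst_variant is hypothetical.

```python
import struct


def pst_variant(header: bytes) -> str:
    """Classify a PST header by wVer (2 bytes, little endian, offset 10)."""
    (w_ver,) = struct.unpack_from("<H", header, 10)
    if w_ver in (14, 15):
        return "ANSI"
    if w_ver >= 21:  # assumption: everything from 21 up is Unicode
        return "Unicode"
    return "unknown"
```

A header with wVer 37 has the byte 25h at offset 10, so a single-byte
pattern 17h at that position cannot match it, while the comparison above
still reports Unicode.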

In the "newer" variant not only has the format changed to Unicode, but the
size of some fields also grew from 32-bit to 64-bit, or their meaning
changed. So after the first twenty-four bytes the fields appear at other
positions.


So I ran tridscan on my ANSI PST examples to generate pst-v1997.trid.xml.
Afterwards I looked at the constructs and cleaned up the lines.

At offset 10 the file format version is stored as the 2-byte integer wVer in
little endian. Here I found low values like 14 or 15, so the upper byte is
nil. This must always be true for the ANSI variant, because from about value
23 on the format is Unicode.
At offset 12 the client file format version is stored as the 2-byte integer
wVerClient. Here I found the value 19 (=13h) for PST and 22 (=16h) for PAB
examples. That was expressed by the third XML construct like:
   <Bytes>001300</Bytes>
   <Pos>11</Pos>
The documentation does not say that this is always true, so maybe samples
with other values exist. So this construct now becomes:
   <Bytes>00</Bytes>
   <Pos>11</Pos>
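
How such fixed-position constructs behave can be tried out with a few lines
of Python. This is a simplified sketch, not TrID's real matching code: each
pattern is a (position, hex string) pair mirroring <Pos> and <Bytes>, and
the names ANSI_PATTERNS and matches are invented for the example. Only the
three constructs discussed so far are listed, not a complete definition.

```python
# (position, hex bytes) pairs, mirroring <Pos> and <Bytes>
ANSI_PATTERNS = [
    (0, "2142444E"),  # dwMagic "!BDN"
    (8, "534D"),      # wMagicClient "SM"
    (11, "00"),       # upper byte of wVer is nil
]


def matches(data: bytes, patterns) -> bool:
    """True if every pattern is found at its fixed position."""
    for pos, hex_bytes in patterns:
        expected = bytes.fromhex(hex_bytes)
        if data[pos:pos + len(expected)] != expected:
            return False
    return True
```

The fewer bytes a pattern pins down, the more samples it tolerates, which is
the reason for shrinking 001300 at position 11 to a single 00.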

At offset 20 four bytes are stored as dwReserved2. Implementations should
ignore this value and SHOULD NOT modify it. Creators of a new PST file must
initialize this value to zero. That was expressed by the fourth construct
like:
   <Bytes>00</Bytes>
   <Pos>23</Pos>
So I deleted this.

At offset 24 the next BID is stored as the 4-byte integer bidNextB in little
endian. This was expressed by the fifth construct, which looks like:
   <Bytes>00</Bytes>
   <Pos>27</Pos>
When inspecting more examples we should find other values here, so I
deleted that construct.

At offset 28 the next page BID is stored as the 4-byte integer bidNextP in
little endian. This was expressed by the 6th construct, which looks like:
   <Bytes>00</Bytes>
   <Pos>31</Pos>
When inspecting more examples we should find other values here, so I
deleted that construct.

At offset 36 a fixed array of 32 NodeIDs of 4 bytes each is stored as rgnid.
A typical value is 0x400 for NID_TYPE_NORMAL_FOLDER. That is expressed by
XML constructs like:
   <Pattern>
      <Bytes>00000004000000040000</Bytes>
      <Pos>34</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00000004000000040000000400000004
      <Pos>114</Pos>
   </Pattern>
Assuming that other listed types can also occur here, I deleted these
constructs.
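
For anyone who wants to look at the rgnid values of their own samples, the
array can be read with a short Python sketch. The offsets 36 (ANSI) and 44
(Unicode) are the ones given in this post; the function name read_rgnid is
hypothetical.

```python
import struct


def read_rgnid(header: bytes, unicode_variant: bool = False):
    """Return the 32 rgnid entries (4 bytes each, little endian)."""
    offset = 44 if unicode_variant else 36
    return list(struct.unpack_from("<32I", header, offset))
```

An entry of 0x400 is stored as the bytes 00 04 00 00, which explains the
repeating 00000400 runs that tridscan picked up.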

At offset 164 four bytes are stored as dwReserved. This is unused space and
MUST be set to zero. Together with the last bytes of rgnid that was
expressed by a construct like:
   <Bytes>00000000000000</Bytes>
   <Pos>162</Pos>
Assuming that other rgnid values can also occur, this now becomes:
   <Bytes>0000000000</Bytes>
   <Pos>164</Pos>

At offset 168 the file size is stored as the 4-byte integer ibFileEof in
little endian. At offset 172 the offset to the last AMap page is stored as
the 4-byte integer ibAMapLast in little endian. In the examples I inspected
the file size is apparently below 4 GiB and the ibAMapLast values are not
too different. That was expressed by a construct like:
   <Bytes>000024</Bytes>
   <ASCII> . . $</ASCII>
   <Pos>171</Pos>
When the file size reaches the 4 GiB limit and different ibAMapLast values
occur, that construct vanishes.
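
The two values can be read from an ANSI header with a small Python sketch.
The offsets 168 and 172 are taken from the description above; the function
name ansi_root_sizes is made up for illustration.

```python
import struct


def ansi_root_sizes(header: bytes):
    """Return (ibFileEof, ibAMapLast) from an ANSI header.

    Both are 4-byte little-endian integers, at offsets 168 and 172.
    """
    ib_file_eof, ib_amap_last = struct.unpack_from("<II", header, 168)
    return ib_file_eof, ib_amap_last
```

The pattern 000024 at position 171 spans the top byte of ibFileEof and the
two low bytes of ibAMapLast, so it only holds while those particular bytes
happen to agree across all samples.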

At offset 172 the file offset of the last data allocation table is stored
as a 4-byte integer in little endian. This looks like:
   <Bytes>00</Bytes>
   <Pos>175</Pos>
Assuming that other values also occur here, I deleted this construct.

At offset 176 the total available data size is stored as a 4-byte integer in
little endian. At offset 180 the total available page size is stored as a
4-byte integer in little endian. This looks like:
   <Pattern>
      <Bytes>0000</Bytes>
      <Pos>179</Pos>
   </Pattern>
   <Pattern>
      <Bytes>0000</Bytes>
      <Pos>182</Pos>
   </Pattern>
Assuming that other values also occur here, I deleted these constructs.

At offset 184 the descriptor index back pointer is stored as a 4-byte
integer. At offset 188 the descriptor index file offset is stored as a
4-byte integer. This looks like:
   <Bytes>0000</Bytes>
   <Pos>187</Pos>
   <Bytes>00</Bytes>
   <Pos>191</Pos>
Assuming that other values also occur here, I deleted these constructs.

At offset 192 the file offset index back pointer is stored as a 4-byte
integer. At offset 196 the file offset index file offset is stored as a
4-byte integer. This looks like:
   <Bytes>0000</Bytes>
   <Pos>195</Pos>
Assuming that other values also occur here, I deleted this construct.

At offset 200 the allocation table validation type is stored as the 1-byte
integer variable fAMapValid. The value 0 means invalid (INVALID_AMAP), 2
means valid (VALID_AMAP2), and 1 also means valid but is deprecated
(VALID_AMAP1). Before it the 8 bytes of a BREF structure are stored. After
fAMapValid the 1-byte variable bReserved and the 2-byte wReserved are
stored. Implementations should ignore these values and should not modify
them. Creators of a new PST file MUST initialize these values to zero. That
was expressed by a construct like:
   <Bytes>000001000000</Bytes>
   <Pos>198</Pos>
At the least this must shrink to something expressed by a construct like:
   <Bytes>000000</Bytes>
   <Pos>201</Pos>
But you cannot rely 100 percent on these reserved bytes always being
nil. So I deleted that construct completely.
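
The three fAMapValid values can be decoded with a tiny lookup, sketched here
in Python. The ANSI offset 200 is from the paragraph above and the Unicode
offset 248 is the one given for the Unicode variant in this post; the names
AMAP_STATES and amap_state are hypothetical.

```python
# fAMapValid values as listed in the post
AMAP_STATES = {0: "INVALID_AMAP", 1: "VALID_AMAP1", 2: "VALID_AMAP2"}


def amap_state(header: bytes, unicode_variant: bool) -> str:
    """Name the fAMapValid byte (offset 200 for ANSI, 248 for Unicode)."""
    offset = 248 if unicode_variant else 200
    return AMAP_STATES.get(header[offset], "unknown")
```

Since all three values occur in valid files, pinning this byte to 01 in a
definition would reject samples written with the other states.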

At offset 204 the 128 bytes of the deprecated FMap are stored as rgbFM. This
is no longer used and MUST be filled with 0xFF. Readers should ignore the
value of these bytes. In my examples that was expressed by a construct like:
   <Bytes>000000000000000000000000000000000000000000000000000000000
   <Pos>216</Pos>
My examples contain zeros here, so obviously the rule is not always
followed. So I deleted that construct completely.

At offset 332 the 128 bytes of the deprecated FPMap are stored as rgbFP.
This is no longer used and MUST be filled with 0xFF. Readers should ignore
the value of these bytes. In my examples that was expressed by a construct
like:
   <Bytes>FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
   <Pos>333</Pos>
Since that is obviously not always true, I deleted that construct
completely.

Now comes the interesting part. At offset 460 bSentinel is stored as a
1-byte integer. It must be set to 0x80. At offset 461 the encryption type is
stored as the 1-byte integer bCryptMethod. Zero means no encryption. One is
used for encryption with the 'permutation algorithm', two for encryption
with the 'cyclic algorithm', and 16 for encryption with Windows Information
Protection (WIP). My ANSI examples were encrypted with the permutation
algorithm. Afterwards, up to offset 511, come some reserved areas. Again it
is written that these fields should be initialized to zero and that software
should ignore these values and should not modify them. That was expressed by
part of an XML construct like:
   8001000000000000000000000000000000000000</Bytes>
So this shrinks to something like:
   <Bytes>80</Bytes>
   <Pos>460</Pos>
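
Reading the encryption type of an ANSI sample could look like the following
Python sketch. It uses only the offsets and values named above; the names
CRYPT_METHODS and ansi_crypt_method and the short labels are made up for the
example.

```python
# bCryptMethod values as listed in the post
CRYPT_METHODS = {0: "none", 1: "permutation", 2: "cyclic", 16: "WIP"}


def ansi_crypt_method(header: bytes):
    """Return the encryption label of an ANSI header.

    Returns None if the header is too short or bSentinel (offset 460)
    is not 0x80. bCryptMethod is the byte at offset 461.
    """
    if len(header) < 462 or header[460] != 0x80:
        return None
    return CRYPT_METHODS.get(header[461], "unknown")
```

Keeping only the sentinel byte 80 in the definition leaves bCryptMethod
free, so samples with any of the four encryption types still match.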


So I ran tridscan on my Unicode PST examples to generate pst-v2003.trid.xml.
Afterwards I looked at the constructs and cleaned up the lines.

At offset 10 the file format version is stored as the 2-byte integer wVer in
little endian. Here I found low values like 23 or 37, so the upper byte is
nil. This need not be true when higher versions arise. At offset 12 the
client file format version is stored as the 2-byte integer wVerClient. Here
I also found the value 19 (=13h) for the Unicode variant. That seems to be
the standard at the moment. At offset 14 bPlatformCreate and bPlatformAccess
are stored as 1-byte integers. These values must be set to 0x01, but 0x02
was also found in a PST recovered by scanpst. At offset 16 two 4-byte values
are stored as dwReserved1 and dwReserved2. Implementations should ignore
these values and SHOULD NOT modify them. Creators of a new PST file must
initialize them to zero. But according to the unofficial documentation they
sometimes contain values like 40000000h (unclean unmount?). At offset 24
eight bytes are stored as bidUnused. According to the unofficial
documentation this sometimes contains values like 0400000001000000h. That
was expressed by a construct like:
   <Bytes>001300010100000000000000000400000001000000</Bytes>
   <Pos>11</Pos>
Assuming that other values can also occur here, I deleted that construct.

At offset 32 the next page BID is stored as the 8-byte integer bidNextP. By
lucky circumstances the upper bytes were zero in my examples. This was
expressed by a construct like:
   <Bytes>0000000000</Bytes>
   <Pos>35</Pos>
When inspecting more examples we should find other values here, so I
deleted that construct.

At offset 40 a monotonically increasing value is stored as the 4-byte
integer dwUnique in little endian. At offset 44 a fixed array of 32 NodeIDs
of 4 bytes each is stored as rgnid. A typical value is 0x400 for
NID_TYPE_NORMAL_FOLDER. That is expressed by XML constructs like:
   <Bytes>000004000000040000</Bytes>
   <Pos>43</Pos>
   <Bytes>04000000040000</Bytes>
   <Pos>161</Pos>
When inspecting more examples we should find other values here, so I
deleted those constructs.

At offset 172 unused bytes are stored as the 8-byte integer qwUnused. These
should be set to zero. At offset 180 the root structure starts, with 72
bytes. It also starts with reserved bytes, stored as the 4-byte integer
dwReserved. Implementations should ignore this value and should not modify
it. Creators of a new PST file must initialize this value to zero. At offset
184 the total file size is stored as the 8-byte integer ibFileEof. That was
expressed by constructs like:
   <Bytes>000000000000</Bytes>
   <Pos>170</Pos>
   <Bytes>0000000000000000</Bytes>
   <Pos>177</Pos>
Assuming that these are not always true, this now becomes:
   <Bytes>00000000</Bytes>
   <Pos>180</Pos>

From offset 184 to 191 the total file size is stored as the 8-byte integer
ibFileEof. The upper limit is not reached in my examples, so the upper bytes
are nil. From offset 192 to 199 the offset of the last data allocation table
is stored as the 8-byte integer ibAMapLast. That was expressed by constructs
like:
   <Bytes>0000000000</Bytes>
   <Pos>188</Pos>
   <Bytes>00000000</Bytes>
   <Pos>196</Pos>
Assuming that larger file sizes and higher ibAMapLast values exist, these
constructs vanish.
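
Why the five zero bytes at position 188 are fragile can be shown with a
short Python experiment. It builds a dummy Unicode header with chosen
ibFileEof and ibAMapLast values (8 bytes each, little endian, at offsets 184
and 192 as described above) and tests the old pattern. The function name is
hypothetical and the sample values are arbitrary.

```python
import struct


def pattern_at_188_matches(ib_file_eof: int, ib_amap_last: int) -> bool:
    """Check the old construct: five zero bytes at positions 188..192."""
    header = bytearray(200)
    struct.pack_into("<QQ", header, 184, ib_file_eof, ib_amap_last)
    # 188..191 are the upper half of ibFileEof, 192 is the low byte
    # of ibAMapLast.
    return bytes(header[188:193]) == bytes(5)
```

With a small file the pattern holds, but a file of 4 GiB or more sets a bit
in the upper half of ibFileEof and the pattern no longer matches.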

At offset 200 the total available data size is stored as the 8-byte integer
cbAMapFree. At offset 208 the total available page size is stored as the
8-byte integer cbPMapFree. This was expressed by constructs like:
   <Bytes>000000000000</Bytes>
   <Pos>203</Pos>
   <Bytes>0000000000</Bytes>
   <Pos>211</Pos>
Assuming other and higher values, these vanish.

At offset 216 the descriptor index back pointer is stored as two 8-byte
integers as BREFNBT. This was expressed by constructs like:
   <Bytes>000000000000</Bytes>
   <Pos>219</Pos>
   <Bytes>00000000</Bytes>
   <Pos>228</Pos>
Assuming other and higher values, these vanish.

At offset 232 the file offset index back pointer is stored as two 8-byte
integers as BREFBBT. At offset 248 the allocation table validation type is
stored as the 1-byte integer fAMapValid. In my examples only the value 1
occurred, which means valid. Afterwards the 1-byte variable bReserved and
the 2-byte wReserved are stored. Implementations should ignore these values
and should not modify them. Creators of a new PST file MUST initialize these
values to zero. At offset 252 four unused alignment bytes are stored as
dwAlign. These must be set to zero. These were expressed by constructs like:
   <Bytes>000000000000</Bytes>
   <Pos>235</Pos>
   <Bytes>000000000100000000000000</Bytes>
   <Pos>244</Pos>
So this now becomes like:
   <Bytes>00000000</Bytes>
   <Pos>252</Pos>

At offset 256 the 128 bytes of the deprecated FMap are stored as rgbFM. This
is no longer used and MUST be filled with 0xFF. Readers should ignore the
value of these bytes. In my examples that was expressed by constructs like:
 <Bytes>0000000000000000000000000000000000000000000000000000000000000000000000</Bytes>
 <Pos>257</Pos>
 <Bytes>00000000000000000000000000000000000000000000000000000000000000000000000000</Bytes>
 <Pos>293</Pos>
 <Bytes>0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000</Bytes>
 <Pos>331</Pos>
My examples contain zeros here, so obviously the rule is not always
followed. So I deleted these constructs completely.

At offset 384 the 128 bytes of the deprecated FPMap are stored as rgbFP.
This is no longer used and MUST be filled with 0xFF. Readers should ignore
the value of these bytes. Now comes the interesting part. At offset 512
bSentinel is stored as a 1-byte integer. It must be set to 0x80. In my
examples that was expressed by constructs like:
 <Bytes>FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF</Bytes>
 <Pos>385</Pos>
 <Bytes>FFFFFFFFFFFFFFFFFFFFFFFFFF      FFFFFFFFFFFFFFFFFF80</Bytes>
 <Pos>419</Pos>
So this now becomes like:
 <Bytes>80</Bytes>
 <Pos>512</Pos>
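
Because the same fields sit at different positions in the two variants, a
reader has to pick the offset by version first. The Python sketch below
collects some of the offsets quoted in this post and checks bSentinel
accordingly. The table and function names are invented, and treating every
wVer other than 14 or 15 as the Unicode layout is a simplifying assumption.

```python
import struct

# Offsets as quoted in this post: (ANSI, Unicode)
OFFSETS = {
    "ibFileEof": (168, 184),
    "fAMapValid": (200, 248),
    "rgbFM": (204, 256),
    "rgbFP": (332, 384),
    "bSentinel": (460, 512),
}


def sentinel_ok(header: bytes) -> bool:
    """Check that bSentinel is 0x80 at the position for this variant."""
    (w_ver,) = struct.unpack_from("<H", header, 10)
    column = 0 if w_ver in (14, 15) else 1  # assumption, see lead-in
    pos = OFFSETS["bSentinel"][column]
    return len(header) > pos and header[pos] == 0x80
```

This is also why one generic definition cannot carry the sentinel check:
the byte 80 has to be expected at 460 in one variant and at 512 in the
other.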

At offset 514 two reserved bytes are stored as rgbReserved. These must be
set to zero. That is expressed by a construct like:
   <Bytes>0000</Bytes>
   <Pos>514</Pos>

At offset 516 the next available index pointer is stored as the 8-byte
integer bidNextB. At offset 524 a weak CRC32 of the previous 516 bytes is
stored as the 4-byte integer dwCRCFull. These were expressed by a construct
like:
   <Bytes>0000000000</Bytes>
   <Pos>519</Pos>
Assuming other and higher values, this vanishes.

With the 2 new TrID definitions the PST examples are now described as either
the ANSI or the Unicode variant. Furthermore the misidentification of the
DROID PST samples has vanished (see appended output/trid-new-v.txt). The
TrID definitions, some examples and the output are stored in the archive
pab_.zip. I hope that my XML files can be used in a future version of
triddefs.

The considerations made for PST examples are probably also true for
Outlook Exchange Offline Storage (*.OST, handled by ost.trid.xml) and
Microsoft Personal Address Book (*.PAB, handled by pst.trid.xml), because
these use a shared file format. Unfortunately I myself have only a very old
Outlook version, so I must run an old Windows 98 in a virtual box emulator.
Therefore I had problems generating real examples of my own.

With best wishes
Jörg Jenderek



Mark0

I have a small cache of this kind of file to check, so I will return to this at a later time.
Meanwhile, thanks as usual.