Hello TrID users,
a few days ago I looked at my ZIP archive collection. When I ran TrID on
more than 300 Open Publication Structure eBooks with the file name
extension epub, most were described correctly by epub.trid.xml. But a
few, like welcome.epub, were only identified as ZIP compressed archive
by ark-zip.trid.xml (see appended output/trid-v.txt).
For comparison I also ran other file identification tools.
The tool DROID (see
http://digital-preservation.github.io/droid/)
does describe welcome.epub as "epub format" by signature id 483 (see
output/epub-droid.csv).
EPUB documents are just ZIP containers. This is expressed by the
XML construct:
<Bytes>504B0304</Bytes>
<ASCII> P K</ASCII>
<Pos>0</Pos>
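As a small illustration of what this pattern tests, here is a minimal
Python sketch that checks for the same four bytes at offset 0. The
function name is hypothetical and this is not how TrID itself is
implemented.

```python
# The four signature bytes from the <Bytes> element: "PK" 0x03 0x04.
ZIP_LOCAL_MAGIC = bytes.fromhex("504B0304")


def starts_with_zip_magic(data):
    """Return True if the data begins with the ZIP local file header signature."""
    return data[:4] == ZIP_LOCAL_MAGIC
```

Every EPUB passes this test, but so does any other ZIP archive, so on
its own it cannot distinguish the two.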
So I looked at the output of the decompression tool 7-Zip with the list
and show technical information options (see appended
output/7z-l-slt.txt) and at the output of unzip with the verbose zipinfo
option (see output/unzip-Zv.txt).
All EPUB samples contain a file with the 8-byte name mimetype. Its
content is stored uncompressed and consists of the 20-byte MIME type
string "application/epub+zip". In most samples this archive member is
the first one, but not in welcome.epub. There it is the last one, but
this is only a problem for recognition by the current file(1) command.
Who does not obey the convention used by most others? Adobe.
The "strange" EPUB is part of Adobe Digital Editions (version 4.5 for
me) and is found in the "My Digital Editions" subdirectory inside the
Documents directory in my HOME directory.
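The properties of the mimetype member described above can also be
checked with a short script. The following is only a sketch using
Python's standard zipfile module. The function names are hypothetical,
and build_sample creates a small in-memory stand-in, not a real EPUB.

```python
import io
import zipfile


def describe_mimetype_member(source):
    """Report where the 'mimetype' member sits and how it is stored.

    `source` may be a file path or a binary file object.
    """
    with zipfile.ZipFile(source) as zf:
        infos = zf.infolist()
        info = zf.getinfo("mimetype")  # raises KeyError if the member is missing
        # Physical position: rank of this member's local header offset.
        offsets = sorted(i.header_offset for i in infos)
        return {
            "position": offsets.index(info.header_offset) + 1,
            "members": len(infos),
            "stored": info.compress_type == zipfile.ZIP_STORED,
            "content": zf.read(info),
        }


def build_sample(mimetype_first):
    """Hypothetical two-member stand-in for an EPUB, built in memory."""
    members = [
        ("mimetype", "application/epub+zip"),
        ("META-INF/container.xml", "<container/>"),
    ]
    if not mimetype_first:
        members.reverse()
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        for name, text in members:
            zf.writestr(name, text)
    buf.seek(0)
    return buf
```

Calling describe_mimetype_member("welcome.epub") on the real file should
report a position equal to the number of members, if the member really
is the last one as in my sample.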
In the "bad" examples the mimetype member has some extra fields, namely
the universal time field (ID 0x5455) and the Unix UID and GID field
(ID 0x7875).
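An extra field block is a sequence of records, each with a 2-byte ID and
a 2-byte length (both little-endian) followed by the payload. The
following Python sketch walks such a block. The function name and the
sample payload values (timestamp 0, UID and GID 1000) are made up for
illustration and are not taken from welcome.epub.

```python
import struct


def parse_extra(extra):
    """Split a ZIP extra field block into (header_id, payload) records."""
    records = []
    pos = 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        records.append((header_id, extra[pos + 4:pos + 4 + size]))
        pos += 4 + size
    return records


# Illustrative block: time field (0x5455) followed by UID/GID field (0x7875).
sample_extra = (
    struct.pack("<HHBI", 0x5455, 5, 1, 0)                # flags + mtime
    + struct.pack("<HHBBIBI", 0x7875, 11, 1, 4, 1000, 4, 1000)
)

for header_id, payload in parse_extra(sample_extra):
    print(hex(header_id), len(payload))
```

Because the IDs are stored little-endian, 0x5455 appears in the file as
the bytes 55 54 ("UT") and 0x7875 as 75 78 ("ux").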
According to the page about the ZIP file format on Wikipedia, the file
name (for the inspected samples, mimetype) is stored at the end of the
local file header, followed by an optional m bytes of extra field. That
is followed by the member data (for the inspected samples, the MIME type
string application/epub+zip). So for "good" examples without extra
fields we find these 2 ASCII strings concatenated. That was expressed
in the global section by a line like:
<String>MIMETYPEAPPLICATION</String>
In welcome.epub the file name mimetype is followed by an extra field
byte sequence starting with UT (that is, 55 54 for the time fields), and
the type string application/epub+zip appears some bytes later. This is
now expressed in the updated epub.trid.xml by 2 lines like:
<String>APPLICATION</String>
<String>MIMETYPE</String>
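To make the difference concrete, here is a Python sketch that builds a
local file header by hand, once without and once with an extra field,
and tests the strings against the result. It assumes that TrID global
strings are matched case-insensitively anywhere in the file, which the
uppercase form suggests. The helper names and the extra field payload
are hypothetical.

```python
import struct
import zlib


def local_file_record(name, data, extra=b""):
    """Build a local file header plus stored (uncompressed) member data."""
    header = struct.pack(
        "<4sHHHHHIIIHH",
        b"PK\x03\x04",      # local file header signature
        10,                 # version needed to extract
        0,                  # general purpose flags
        0,                  # compression method 0 = stored
        0,                  # modification time
        0x21,               # modification date (1980-01-01)
        zlib.crc32(data),
        len(data),          # compressed size
        len(data),          # uncompressed size
        len(name),
        len(extra),
    )
    return header + name + extra + data


def has_string(blob, text):
    """Case-insensitive substring test (assumed TrID string behaviour)."""
    return text.upper() in blob.upper()


# Time extra field: ID 0x5455 is written as the bytes 55 54 ("UT").
UT_EXTRA = struct.pack("<HHBI", 0x5455, 5, 1, 0)

good = local_file_record(b"mimetype", b"application/epub+zip")
bad = local_file_record(b"mimetype", b"application/epub+zip", UT_EXTRA)

for label, blob in (("good", good), ("bad", bad)):
    print(label,
          has_string(blob, b"MIMETYPEAPPLICATION"),
          has_string(blob, b"MIMETYPE") and has_string(blob, b"APPLICATION"))
```

The combined string is found only in the "good" record, while the two
separate strings are found in both.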
Similar considerations apply to the central directory file header. At
the end of an entry the file name is stored, followed by the optional
extra field and file comment. Afterward comes the next entry, starting
with the ZIP magic string PK. So for "good" examples without extra
fields and file comments we find the concatenated string mimetypePK.
That was expressed in the global section by a line like:
<String>MIMETYPEPK</String>
In the "bad" welcome.epub the file name mimetype is here followed by an
extra field byte sequence, again starting with UT for the time stamps.
So in the updated TrID definition the line in the global string section
now becomes:
<String>MIMETYPE</String>
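The central directory case can be reproduced in the same way. This
sketch uses Python's zipfile module to write a single-member archive
with and without an extra field, then looks only at the central
directory part. The helper names are hypothetical, the archive is
assumed to have no archive comment, and case-insensitive matching of
TrID strings is again an assumption.

```python
import io
import struct
import zipfile

# Time extra field (ID 0x5455, "UT"): flags byte plus 4-byte timestamp.
UT_EXTRA = struct.pack("<HHBI", 0x5455, 5, 1, 0)


def single_member_zip(extra=b""):
    """Write an archive holding only a stored 'mimetype' member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo("mimetype")  # ZIP_STORED by default
        info.extra = extra
        zf.writestr(info, "application/epub+zip")
    return buf.getvalue()


def central_directory(data):
    """Return the bytes from the central directory start to the end of file.

    Assumes a 22-byte end-of-central-directory record (no archive comment).
    """
    fields = struct.unpack("<4sHHHHIIH", data[-22:])
    signature, cd_offset = fields[0], fields[6]
    if signature != b"PK\x05\x06":
        raise ValueError("end of central directory record not found")
    return data[cd_offset:]


good_cd = central_directory(single_member_zip())
bad_cd = central_directory(single_member_zip(UT_EXTRA))

for label, blob in (("good", good_cd), ("bad", bad_cd)):
    print(label, b"MIMETYPEPK" in blob.upper(), b"MIMETYPE" in blob.upper())
```

In a single-member archive the PK that follows the file name belongs to
the end-of-central-directory record rather than to a next entry, but the
effect on the string match is the same.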
With the updated TrID definition all examples are now detected (see
appended output/trid-new-v.txt). The TrID definitions, some examples and
the output are stored in the archive epub.zip. I hope that my XML file
can be used in a future version of triddefs.
With best wishes
Jörg Jenderek