Author Topic: updated ldif.trid.xml for LDAP Data Interchange Format + remark variant  (Read 774 times)

jenderek

Hello trid users,

A few days ago I had to update my contacts and transfer them to a newly
installed system (Linux Mint 21.1), and something did not work as expected.

I am astonished and shocked, because Linux Mint is described as one of the
best distributions and is supposed to be very user friendly.  I have installed
evolution, KAddressBook, Thunderbird with the CardBook plugin, gramps
DW-Kontakte and the corresponding libraries.

Some years ago I created many of my contact entries with KAddressBook or with
Thunderbird and a plugin that is no longer supported.  For some contacts I
have already found the original VCF text files, but for others I have not.
So I tried to export them again as VCF. When I looked at the exported
samples, in every (!) program at least one of the following fields was
missing or its content had changed:
CATEGORIES
CLASS
GENDER
GEO
LANG
LOGO
ORG
REV
TZ

Some tools complain that they can import only a specific version, but most
tools import the VCF samples without any error or warning message while
swallowing some of the data like a black hole. When re-exporting, some fields
have vanished and, even worse, some contents have changed, sometimes in a way
that is no longer parseable, as for the ORG field, or some fields like LOGO
are replaced by X-HD-PHOTO.

Instead of wasting money, time and resources on AI, companies and
organizations should first put more effort into basic IT matters. Like the
octane number for gasoline, the format for electronic business cards is for
the most part standardized. In principle VCF files are simply text files with
a few dozen keywords, so it should not be too difficult to handle all of
these fields correctly. Shame on the developers and on the leaders of
companies and organisations who do not insist on doing the really important
things first.

So in my desperate efforts to get all my data back I also tried other contact
formats. One of them has the file name suffix ldif.

So I ran the trid utility on my LDIF examples. Many samples are described as
"LDAP Data Interchange Format" with mime type application/octet-stream by
ldif.trid.xml. But many others are not recognized and are described as
"Unknown!" (see appended output/file-5.44.txt).

For comparison I also ran the file command (version 5.44) on these samples.
Here none of the samples is recognised; all are described generically as
"text" (see appended output/file-5.44.txt). Therefore only the generic mime
type text/plain is shown (see appended output/file-i-5.44.txt) and no file
name suffix is shown (see appended output/file-ext-5.44.txt).

For comparison I also ran the file format identification utility DROID
(see https://sourceforge.net/projects/droid/). Here many examples are also
recognised. They are described as "LDAP Data Interchange Format" without mime
type by PUID fmt/611, but many are not recognized here either (see appended
output/droid-ldif.csv). But the recognition rate is higher: a few more
samples are recognized here, such as:
provision_basedn_options.ldif
provision_configuration_modify.ldif
provision_init.ldif

These samples are part of the Samba software suite. So I ran tridscan to
update ldif.trid.xml and then looked at what had changed. These samples do
not contain the objectClass keyword, so in the global strings section a line
like the following vanishes:
      <String>OBJECTCLASS</String>

Then only one characteristic pattern survives. It is expressed inside the
front block section by an XML construct like:
   <Bytes>646E3A20</Bytes>
   <ASCII> d n :</ASCII>
   <Pos>0</Pos>
That is the pattern used by the DROID tool for recognition.
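
To make the construct above more concrete, here is a minimal Python sketch of
what such a front block pattern means: the hex string from <Bytes> is compared
with the file content at the offset given in <Pos>. The function name and the
sample line are made up for illustration; this is not how TrID or DROID are
implemented internally.

```python
def matches_front_pattern(data: bytes, hex_pattern: str, pos: int = 0) -> bool:
    """Return True if `data` holds the bytes given as hex at offset `pos`."""
    pattern = bytes.fromhex(hex_pattern)
    return data[pos:pos + len(pattern)] == pattern


# 646E3A20 is "dn:" followed by one space character
print(bytes.fromhex("646E3A20"))

# hypothetical first line of an LDIF sample
sample = b"dn: cn=example,dc=example,dc=org\n"
print(matches_front_pattern(sample, "646E3A20", pos=0))
```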

With this knowledge I patched the file command. I then got an intermediate
output (see appended output/file.tmp). There we see that a few samples like
test-ako.ldif are not recognized, because their first line is not matched by
the search pattern. There the first line looks like:

dn:: Y249SsO2cmcgSmVuZGVyZWssbWFpbD1qb2VyZy5qZW4uZGVyLmVrQGdteC5uZXQ=

So here the first colon is followed by a second colon instead of a space
character. The Wikipedia page about LDIF explains this observation: if the
attribute name is followed by '::', the data are encoded into ASCII using
base64 encoding. So the XML construct now becomes:
   <Bytes>646E3A</Bytes>
   <ASCII> d n :</ASCII>
   <Pos>0</Pos>
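
As a small illustration of the two forms of the first line, here is a Python
sketch that reads the value after dn: or dn::. The function name read_dn and
the sample values are made up, and decoding the base64 data as UTF-8 is an
assumption on my side.

```python
import base64


def read_dn(line: str) -> str:
    """Return the value of an LDIF 'dn:' or 'dn::' line."""
    if line.startswith("dn::"):
        # '::' marks a base64 encoded value; UTF-8 is assumed here
        return base64.b64decode(line[4:].strip()).decode("utf-8")
    if line.startswith("dn:"):
        return line[3:].strip()
    raise ValueError("not a dn line")


# build an encoded sample line instead of hard-coding one
encoded = base64.b64encode("cn=Jörg,dc=example".encode("utf-8")).decode("ascii")
print(read_dn("dn:: " + encoded))
print(read_dn("dn: cn=plain,dc=example"))
```

Both lines start with the 3 bytes dn:, which is why the shortened pattern
matches both variants.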

When we look at the output of the patched file command we see that the still
unrecognized samples do not match the above pattern. They start with the hash
character #, which obviously marks a comment or remark line.

So I ran tridscan on such samples to generate ldif-rem.trid.xml, so that we
know what to expect here.  Because these samples start with the comment
marker, we now get only one XML construct inside the front block section. It
looks like:
   <Bytes>23</Bytes>
   <ASCII> #</ASCII>
   <Pos>0</Pos>
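
For a tool that reads the text line by line, the remark variant could be
handled by skipping the leading comment lines first. The following Python
sketch shows the idea. The function name looks_like_ldif is made up, and it is
only a heuristic of mine, not what TrID, file or DROID actually do.

```python
def looks_like_ldif(text: str) -> bool:
    """Heuristic: the first line that is neither blank nor a comment starts with 'dn:'."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and remark lines
        return line.startswith("dn:")
    return False  # only comments or an empty file


print(looks_like_ldif("# some remark\ndn: cn=example\n"))
print(looks_like_ldif("# only a remark\n"))
```

A pattern at a fixed position like <Pos>0</Pos> cannot do this, because the
comment lines can have any length.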

With just a few such examples I get lines like these in the global strings
section:
      <String>CONFIGURATION</String>
      <String>DESCRIPTION</String>
      <String>INFORMATION</String>
      <String>OBJECTCLASS</String>
      <String>DIRECTORY</String>
      <String>UPDATE</String>
      <String>.LDIF</String>
      <String>FLAGS</String>
      <String>GROUP</String>
      <String>NAME</String>

As expected, such lines begin to vanish when tridscan is run with more
comment samples.  In the end we would expect to be left with the common
denominator, as in the non-comment variant. That would be the 3 byte sequence
dn:. Unfortunately TrID is not able to use such short patterns inside the
global strings section.  So with enough samples I have to stop at one line in
the global strings section. It looks like:
      <String>OBJECT</String>
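
The vanishing of strings can be illustrated with a simplified Python sketch
that keeps only the upper-cased words common to all samples. This is only a
model of the effect and not how tridscan works internally: the function name,
the word-based splitting and the minimum length of 4 are assumptions, and the
three sample texts are made up.

```python
import re


def common_strings(samples, min_len=4):
    """Return the upper-cased words of at least `min_len` characters found in every sample."""
    common = None
    for text in samples:
        words = {w for w in re.findall(r"[A-Z.]+", text.upper()) if len(w) >= min_len}
        common = words if common is None else common & words
    return common or set()


samples = [
    "# Update configuration\ndn: cn=a\nobjectClass: top\n",
    "# Group information\ndn: cn=b\nobjectClass: group\n",
    "# Remark only\ndn: cn=c\ndescription: x\n",
]
print(common_strings(samples[:1]))  # many candidate strings
print(common_strings(samples[:2]))  # fewer remain
print(common_strings(samples))      # nothing long enough is left
```

The sequence dn: is present in every sample, but it is shorter than the
assumed minimum length and so never shows up in the result.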

With the updated TrID definition and the remark variant, most of my LDIF
files are now described as before, but a few remark samples without the
objectClass keyword, like provision_rootdse_add.ldif, are not recognized due
to TrID limitations (see appended output/trid-v-new.txt).

LDIF files are just text files. So in principle the generic mime type
text/plain is OK, but on my Linux systems they get their own type. According
to the shared MIME-info database, where this is called "LDIF address book",
this is now expressed by a line like:
      <Mime>text/x-ldif</Mime>

With the updated TrID definition my LDIF files are now described as before,
but with a specific mime type. The TrID definitions and output are stored in
the archive ldif_.zip. I hope that my definition can be used in a future
version of triddefs.

With best wishes
Jörg Jenderek


Mark0

Thanks!

I'm updating the ldif.trid.xml definition, but the -rem one with just the string "OBJECT" seems a bit too generic.