Author Topic: updated 3 ged*.trid.xml for GEDCOM Family History + variants  (Read 809 times)

jenderek

updated 3 ged*.trid.xml for GEDCOM Family History + variants
« on: April 27, 2023, 06:11:07 PM »
Hello trid users,

a few days ago I was working with some genealogical databases. One
format uses the filename extension ged.

So I ran the TrID utility on my GED examples. Most of my genealogical
samples are correctly described as "GEDCOM Family History". All of
these definitions look for the string "0 HEAD" at or "near" the
beginning. When a sample is encoded as ASCII and the string is at the
very beginning, as in age-all.ged, it is identified by ged.trid.xml.

Samples like lang-all.ged, encoded as UTF-8 with a byte order mark
(BOM = EFBBBF) at the start, are described by ged-utf8.trid.xml with
the additional phrase (UTF-8). Because of the BOM they are also
described, with lower priority, as "Text - UTF-8 encoded" by
txt-utf-8.trid.xml. Samples like char_utf16le-2.ged, encoded as UTF-16
little endian with a BOM (= FFFE) at the start, are described by
ged-utf16.trid.xml with the additional phrase (UTF-16LE). Because of
the BOM they are also described, with lower priority, as "Text -
UTF-16 (LE) encoded" by txt-utf-16-le.trid.xml.

A few samples like char_utf16be-2.ged are only described generically,
though correctly, as "Text - UTF-16 (BE) encoded" by
txt-utf-16-be.trid.xml. A few samples like char_utf16be-1.ged are
misidentified as "Adobe PhotoShop Brush" by abr.trid.xml, and some
like char_utf16le-1.ged are described as "Unknown!" (see appended
output/trid-v-old.txt).

For comparison I also ran the file command (version 5.44) on these
samples. Here all genealogical samples are recognized. Most are
described as "GEDCOM genealogy". The samples that TrID does not
recognize are often also described as "GEDCOM data" when the
keep-going option is used. The tool also shows some additional
information: the version and the encoding (4 types: ASCII, UTF-8,
UTF-16 little-endian and UTF-16 big-endian). It also shows the mime
type text/plain (see appended output/file-i-5.44.txt) and no file name
suffix (see appended output/file-ext-5.44.txt).

For comparison I also ran the file format identification utility
DROID (see https://sourceforge.net/projects/droid/). Here most
examples are also recognized. These are described as "Genealogical
Data Communication (GEDCOM) Format" with the ged suffix and without a
mime type, by PUID fmt/851. The UTF-16 encoded samples are not
recognized. My example MY-FTW4.GED is not recognized either (see
appended output/droid-ged.csv).

The samples are just text files, so the generic mime type text/plain
is OK in principle. On my Pi (Debian 11 based) ged samples are
associated with application/x-gedcom according to the shared MIME
database. A second suffix, gedcom, is also listed there, but I have
not found any such examples myself.
On the GEDCOM page on Wikipedia other types are listed. For the zip
compressed variant with the gdz suffix the officially registered type
application/vnd.familysearch.gedcom+zip is listed. Apparently somebody
assumed that application/vnd.familysearch.gedcom is the mime type for
the non-zipped variant, but when I look at iana.org no such type
exists. The correct type is expressed in the definition by a line
like:
   <Mime>text/vnd.familysearch.gedcom</Mime>

Next I looked at why the 3 tools behave differently. The DROID tool
looks for the GEDC and VERS tags to get the version. Apparently these
tags are missing in old variants like MY-FTW4.GED. Furthermore, as a
first test it looks for the 6-byte sequence "0 HEAD" near the
beginning. Therefore all UTF-16 encoded samples are missed.
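
To illustrate the last point, here is a small Python sketch (not
DROID's actual signature logic, which is not shown here). The function
name has_ascii_head and the 16-byte search window are assumptions made
only for this illustration. It shows that a plain 6-byte search for
"0 HEAD" still matches after a UTF-8 BOM but cannot match UTF-16 text,
because every character is followed (or preceded) by a zero byte.

```python
# The 6 ASCII bytes that a simple signature would search for.
NEEDLE = b"0 HEAD"


def has_ascii_head(data: bytes, window: int = 16) -> bool:
    """Return True if ASCII "0 HEAD" occurs within the first `window` bytes.

    The window size is an assumed value standing in for "near the beginning".
    """
    return NEEDLE in data[:window]


text = "0 HEAD\n1 GEDC\n"

ascii_sample = text.encode("ascii")
utf8_bom_sample = b"\xef\xbb\xbf" + ascii_sample          # BOM EF BB BF
utf16le_sample = b"\xff\xfe" + text.encode("utf-16-le")   # BOM FF FE

print(has_ascii_head(ascii_sample))     # True
print(has_ascii_head(utf8_bom_sample))  # True, the BOM only shifts the match
print(has_ascii_head(utf16le_sample))   # False, zero bytes break the sequence
```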

Samples encoded as UTF-16 little endian with a BOM are described by
ged-utf16.trid.xml. This is expressed by an XML construct like:
   <Bytes>FFFE3000200048004500410044</Bytes>
   <ASCII> . . 0 .   . H . E . A . D</ASCII>
   <Pos>0</Pos>

By transferring this to big endian I created
ged-utf16-be_bom.trid.xml. Such samples are recognized by an XML
construct like:
   <Bytes>FEFF003000200048004500410044</Bytes>
   <ASCII> . . . 0 . . . H . E . A . D</ASCII>
   <Pos>0</Pos>
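
As a cross-check, the two BOM patterns can be derived from the string
"0 HEAD" itself. The following Python sketch does that; the helper
name signature_hex is hypothetical and not part of TrID.

```python
import codecs


def signature_hex(text: str, encoding: str, bom: bytes = b"") -> str:
    """Encode `text`, prepend an optional BOM, and return upper-case hex."""
    return (bom + text.encode(encoding)).hex().upper()


# Big endian with BOM (FE FF): equals the <Bytes> value quoted above.
print(signature_hex("0 HEAD", "utf-16-be", codecs.BOM_UTF16_BE))
# FEFF003000200048004500410044

# Little endian with BOM (FF FE): note the trailing 00 of the last character.
print(signature_hex("0 HEAD", "utf-16-le", codecs.BOM_UTF16_LE))
# FFFE300020004800450041004400
```

The little-endian result ends with 00, the high byte of "D". The
pattern quoted for ged-utf16.trid.xml stops one byte earlier, so it is
a prefix of the computed value.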

Instead of the Wikipedia page I now use the GEDCOM page on the Archive
Team file formats wiki as the reference. This is expressed by a line
like:
   <RefURL>http://fileformats.archiveteam.org/wiki/GEDCOM</RefURL>

The Wikipedia page is also mentioned there as a link. Furthermore,
links to samples are listed there, as well as a link to a document
about GEDCOM version detection. From it you can see that the BOM can
be missing in UTF-16 samples. Such samples are also recognized by the
file command.

For big endian without a BOM I created ged-utf16-be.trid.xml. Such
samples are recognized by an XML construct like:
   <Bytes>003000200048004500410044</Bytes>
   <ASCII> . 0 . . . H . E . A . D</ASCII>
   <Pos>0</Pos>

The same variant exists for little endian, so I created
ged-utf16-le.trid.xml. Such samples are recognized by an XML construct
like:
   <Bytes>300020004800450041004400</Bytes>
   <ASCII> 0 . . . H . E . A . D .</ASCII>
   <Pos>0</Pos>
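
Taken together, the variants can be summarized in a short Python
sketch. This is only an illustration of the prefix matching, not how
TrID works internally: the labels and the function name classify are
hypothetical, and for simplicity every pattern is tested at offset 0
only (the ASCII definition also accepts "0 HEAD" near the beginning).

```python
import codecs
from typing import Optional

HEAD = "0 HEAD"

# (label, byte prefix) pairs, all checked at offset 0.
SIGNATURES = [
    ("GEDCOM (UTF-8, BOM)", codecs.BOM_UTF8 + HEAD.encode("ascii")),
    ("GEDCOM (UTF-16LE, BOM)", codecs.BOM_UTF16_LE + HEAD.encode("utf-16-le")),
    ("GEDCOM (UTF-16BE, BOM)", codecs.BOM_UTF16_BE + HEAD.encode("utf-16-be")),
    ("GEDCOM (UTF-16LE, no BOM)", HEAD.encode("utf-16-le")),
    ("GEDCOM (UTF-16BE, no BOM)", HEAD.encode("utf-16-be")),
    ("GEDCOM (ASCII)", HEAD.encode("ascii")),
]


def classify(data: bytes) -> Optional[str]:
    """Return the label of the first signature that prefixes `data`, else None."""
    for label, prefix in SIGNATURES:
        if data.startswith(prefix):
            return label
    return None


print(classify(b"0 HEAD\r\n1 GEDC\r\n"))
print(classify(codecs.BOM_UTF16_BE + "0 HEAD\n".encode("utf-16-be")))
print(classify("0 HEAD\n".encode("utf-16-le")))
print(classify(b"not a GEDCOM file"))
```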

With the updated and the new variant TrID definitions, all my
genealogical GED examples are now described (see appended
output/trid-v-new.txt). The TrID definitions, a few samples and the
output are stored in the archive ged_trid_.zip. I hope that my
definitions can be used in a future version of triddefs.

Unfortunately the ged suffix is also used for a graphic image format,
as in HUNTERS.GED.

With best wishes
Jörg Jenderek


Mark0

Re: updated 3 ged*.trid.xml for GEDCOM Family History + variants
« Reply #1 on: April 30, 2023, 12:25:56 PM »
Thanks!