Author Topic: updated diff.trid.xml for diff output + unified variant  (Read 3383 times)

jenderek

  • Sr. Member
  • ****
  • Posts: 375
updated diff.trid.xml for diff output + unified variant
« on: December 27, 2023, 09:51:14 PM »
Hello TrID users,

A few days ago I had to handle some patch files. Unfortunately there are about
a dozen different variants. In this session I will handle the "unified"
samples.

So I ran the TrID utility on such examples. Some samples were not recognized
(see the attached trid-v-old.txt in uni/output).

For comparison I also ran the file format identification utility DROID
(see https://sourceforge.net/projects/droid/). It only describes samples with
the suffix DIF. Based on that extension alone, samples (like fdiskpt.dif) are
wrongly described as "VisiCalc Database" according to PUID x-fmt/368.

For comparison I also ran the file command (version 5.45) on such samples.
Here the samples are recognized and described as "unified diff output" (see
the attached file-5.45.txt in uni/output). When running the file command with
the keep-going option -k, such samples are additionally described, with lower
priority, by a second phrase as "text" (see the attached file-k-5.45.txt in
uni/output). The MIME type here is not text/plain but text/x-diff (see the
attached file-i-5.45.txt in uni/output). No file name suffix is listed (see
the attached file-ext-5.45.txt in uni/output).

The samples already recognized by TrID are also described by the file
command, with lower priority, as "diff output" (see the attached
file-k-5.45.txt in output). When I inspect such samples, they start with the
5-byte phrase "diff ". This is followed by command line options like --git,
-Naur, -urN or -uprN (see the attached file.tmp in output).

On Linux, according to the shared MIME-info database, such samples are called
"Differences between files". Here text/x-patch is used as the MIME type, while
text/x-diff is listed as an alias; the type is a subclass of text/plain. The
samples are recognized just by looking for the 4-byte sequence "=== " or the
5-byte sequence "diff " at the beginning. Two suffixes (diff, patch) are
listed. That information can be seen in the source freedesktop.org.xml.in,
found for example on gitlab.freedesktop.org.
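
A minimal sketch of such a prefix test in Python, assuming only the two magic
strings mentioned above. The function name looks_like_patch and the sample
byte strings are made up for illustration; this is not how the MIME database
itself is implemented.

```python
# Magic strings named above: 4-byte "=== " and 5-byte "diff " at offset 0.
PATCH_PREFIXES = (b"=== ", b"diff ")


def looks_like_patch(data: bytes) -> bool:
    """Return True if data starts with one of the prefixes listed above."""
    return data.startswith(PATCH_PREFIXES)


# Hypothetical sample contents, not taken from the attached archive.
print(looks_like_patch(b"diff -Naur a/x b/x\n"))   # True
print(looks_like_patch(b"--- a/x\n+++ b/x\n"))     # False
```

The second call shows why a check on these two prefixes alone would miss a
sample that starts directly with a "--- " line.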

Unfortunately there is no precise documentation about this file format or
about how it differs from other diff documents. Existing definitions like
diff.trid.xml use the Wikipedia page about the diff utility as reference.
Luckily that page also has a section about the unified format, so I use that
as reference. That is expressed by a line like:
 <RefURL>https://en.wikipedia.org/wiki/Diff_utility#Unified_format</RefURL>

I chose the MIME type mentioned in the Linux shared database. That is
expressed by a line like:
   <Mime>text/x-patch</Mime>

Such output is created and used by the diff and patch utilities. Therefore
these 2 names are often used as file name suffixes. The old FAT file system
has an 8+3 limit for file names, so the maximal suffix length there is 3.
Apparently dif is used there instead of diff, and pch instead of patch.
According to the patch documentation, if patch cannot find a place to install
a hunk of the patch, it writes the hunk to a reject file, which normally has
the name of the output file plus a .rej suffix or similar. These extensions
are also listed on https://file-extension.net/seeker/ . So I mention this fact
in the remark line.

So in the end I found 5 suffixes. That is expressed by a line like:
   <Ext>DIFF/DIF/PATCH/PCH/REJ</Ext>

Then I looked for samples of the other variant. There I found at least 3
suffixes (DIFF/PATCH/PCH). By Mister Spock logic, DIF must then also be valid.
I do not know whether rejected patches exist for this variant. So I updated
diff.trid.xml; file name suffix and MIME type are expressed by lines like:
   <Ext>PATCH/PCH/DIFF/DIF</Ext>
   <Mime>text/x-patch</Mime>

After running tridscan I looked at the generated diff-unified.trid.xml. As
expected, the first pattern is characteristic for a unified diff. It looks
like:
   <Bytes>2D2D2D20</Bytes>
   <ASCII> - - -</ASCII>
   <Pos>0</Pos>
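
To make the pattern easier to read, here is a small Python sketch that decodes
the hex string and tests it at the given offset. The helper name
matches_pattern and the sample bytes are assumptions for illustration only;
this is not TrID's own matching code.

```python
def matches_pattern(data: bytes, hex_bytes: str, pos: int) -> bool:
    """Check whether data contains the hex-encoded bytes at offset pos."""
    pattern = bytes.fromhex(hex_bytes)
    return data[pos:pos + len(pattern)] == pattern


# The <Bytes> value decodes to three minus signs and a space.
print(bytes.fromhex("2D2D2D20"))  # b'--- '

# Hypothetical sample content starting with a unified diff header.
print(matches_pattern(b"--- a/file.c\n+++ b/file.c\n", "2D2D2D20", 0))  # True
```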

Unfortunately I found a few examples (like ltmain-as-needed.diff) where other
text such as "Bug-Debian:" comes before the characteristic "--- " sequence. So
such samples are still not recognized by the TrID command (see the attached
trid-v-new.txt in else/output).
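
For such samples a line-based check could look like the following Python
sketch. It assumes the layout described in the Wikipedia section on the
unified format: a "--- " line, then a "+++ " line, then a hunk header starting
with "@@ ". The function name find_unified_header and the sample text are
invented for illustration; this heuristic is taken neither from TrID nor from
the file command.

```python
def find_unified_header(text: str) -> int:
    """Return the 0-based index of the first line that starts a
    '--- ' / '+++ ' / '@@ ' triple, or -1 if there is none."""
    lines = text.splitlines()
    for i in range(len(lines) - 2):
        if (lines[i].startswith("--- ")
                and lines[i + 1].startswith("+++ ")
                and lines[i + 2].startswith("@@ ")):
            return i
    return -1


# Hypothetical sample with preamble text before the unified header.
sample = (
    "Bug-Debian: (hypothetical preamble text)\n"
    "--- a/example.txt\n"
    "+++ b/example.txt\n"
    "@@ -1 +1 @@\n"
    "-old\n"
    "+new\n"
)
print(find_unified_header(sample))  # 1
```

Requiring all three lines in a row makes the check stricter than a bare
"--- " prefix, at the cost of having to read beyond the first bytes.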

I also do not know under which conditions each of the two variants is created.
I assume this depends on the diff options used. If somebody knows, please tell
me and add the fact to the remark line.

The TrID definitions, some samples and the output are stored in the archive
diff_.zip. I hope that my definitions can be used in a future version of
triddefs.

As described at the beginning, there are some other kinds of difference
output. I will try to handle these in a future session.

With best wishes
Jörg Jenderek

Mark0

  • Administrator
  • Hero Member
  • *****
  • Posts: 2839
    • Mark0's Home Page
Re: updated diff.trid.xml for diff output + unified variant
« Reply #1 on: December 29, 2023, 01:59:02 AM »
Thanks Jörg!
I'll update the diff.trid.xml with the added extensions and MIME type.
The new diff-unified.trid.xml, however, is a bit too generic and matches a lot of unrelated filetypes.

jenderek

Re: updated diff.trid.xml for diff output + unified variant
« Reply #2 on: December 29, 2023, 02:37:29 PM »
You said that the new diff-unified.trid.xml is "a bit too generic" and matches
a lot of unrelated filetypes. Can you send me such misidentified samples? The
file command in principle uses the same pattern plus a few more tests, and
there recognition seems to work all right.

Thanks
Jörg

Mark0

Re: updated diff.trid.xml for diff output + unified variant
« Reply #3 on: December 31, 2023, 12:10:11 PM »
I rechecked my file repository, and you are right indeed: most of the matches I saw are sources from some more or less exotic scripting languages, plus some other non-public / proprietary data formats that seem very unlikely to be found in the wild.
I'll add the definition.
Thanks again!