Author Topic: Smallest files that can be identified

m^2

  • Newbie
  • *
  • Posts: 10
Smallest files that can be identified
« on: December 29, 2006, 11:26:15 PM »
There's no way to identify a file that is only a couple of bytes long, but in many cases 100 bytes is enough. From your knowledge and experience, do you have any idea where the lower bound on identifiable file size lies?
I don't call TrID for files under 16 bytes, but for performance reasons it's important for me to make this limit as high as possible without losing correctness.
I could look for the shortest signature in your definitions, but that gives me no certainty that a shorter one won't appear in the future, so this is more of a general question than a technical one.
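To make the "shortest signature" idea concrete, here is a minimal sketch. The signature table is hypothetical and is not TrID's actual definition set; `min_identifiable_size` and `worth_identifying` are illustrative names, not part of any TrID interface.

```python
# Hypothetical signature table for illustration only.
# These are well-known magic numbers, NOT TrID's real definitions.
SIGNATURES = {
    "Lua compiled script": b"\x1bLua",
    "RIFF container": b"RIFF",
    "PNG image": b"\x89PNG\r\n\x1a\n",
}


def min_identifiable_size(signatures):
    """Length in bytes of the shortest signature in the table."""
    return min(len(sig) for sig in signatures.values())


def worth_identifying(file_size, signatures=SIGNATURES):
    """True if a file of this size could match at least one signature."""
    return file_size >= min_identifiable_size(signatures)
```

The threshold is derived from the table rather than hard-coded, so it would drop automatically if a shorter definition were added later, which is exactly the uncertainty described above.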

Mark0

  • Administrator
  • Hero Member
  • *****
  • Posts: 2743
    • Mark0's Home Page
Re: Smallest files that can be identified
« Reply #1 on: December 29, 2006, 11:35:28 PM »
Not an easy question. Some filetypes can be identified by very short patterns: Lua compiled scripts, for example, take just 4 bytes (of course that's just the unique header; the smallest real script will be larger). The same goes for RIFF containers. I think some filetypes could exist with just a couple of bytes of header and another two of minimum content, so 4 bytes is probably a "right" minimum. It could also be configurable, anyway.
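As an illustration of how little data such a header check needs, here is a small sketch that looks only at the first 4 bytes. It assumes the commonly documented magic numbers (ESC followed by "Lua" for compiled Lua, and the ASCII tag "RIFF"); `guess_by_header` is a hypothetical helper and does not reflect how TrID matches files internally.

```python
# Commonly documented 4-byte headers (assumed here for illustration).
LUA_MAGIC = b"\x1bLua"   # ESC + "Lua" at the start of compiled Lua chunks
RIFF_MAGIC = b"RIFF"     # chunk tag at the start of RIFF containers


def guess_by_header(data: bytes):
    """Return a type name if the first 4 bytes match a known header, else None."""
    head = data[:4]
    if head == LUA_MAGIC:
        return "Lua compiled script"
    if head == RIFF_MAGIC:
        return "RIFF container"
    return None
```

A match on the header alone says nothing about whether the rest of the file is valid; it only shows that 4 bytes can be enough to produce a candidate type.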
« Last Edit: December 29, 2006, 11:37:09 PM by Mark0 »

m^2

Re: Smallest files that can be identified
« Reply #2 on: December 29, 2006, 11:40:54 PM »
Quote from: Mark0
Not an easy question. Some filetypes can be identified by very short patterns: Lua compiled scripts, for example, take just 4 bytes (of course that's just the unique header; the smallest real script will be larger). The same goes for RIFF containers. I think some filetypes could exist with just a couple of bytes of header and another two of minimum content, so 4 bytes is probably a "right" minimum. It could also be configurable, anyway.
Are you sure that the probability of incorrect identification of a 32-bit file is low?
I don't think configurability is a good idea: there IS a lower limit (bigger than 0 ;) ) and no configuration changes that.
ADDED:
Let's say that "low" means no more than 60% ;)
« Last Edit: December 29, 2006, 11:45:15 PM by m^2 »

Mark0

Re: Smallest files that can be identified
« Reply #3 on: December 29, 2006, 11:47:04 PM »
Quote from: m^2
Are you sure that the probability of incorrect identification of a 32-bit file is low?
Sorry, I'm not sure I understand what you mean (you know, my English...)?

m^2

Re: Smallest files that can be identified
« Reply #4 on: December 29, 2006, 11:54:09 PM »
In other words:
Are you sure that when TrID gets a 32-bit file and says that it's, e.g., a compiled Lua script, I can believe it really is?

Mark0

Re: Smallest files that can be identified
« Reply #5 on: December 30, 2006, 12:14:27 AM »
For Lua, yes. Some filetypes have pretty strong characterization.
With others, you can't be too sure even if you have a couple of MB available.
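A rough way to see why a 4-byte header can count as strong characterization: under the simplifying assumption that unrelated data is uniformly random (real files are not, so treat this as intuition only), the chance of hitting one specific signature by accident shrinks by a factor of 256 per byte. `random_match_probability` is a hypothetical helper name.

```python
def random_match_probability(signature_len_bytes: int) -> float:
    """Chance that uniformly random bytes equal one fixed signature of this length."""
    return 256.0 ** -signature_len_bytes


# A 2-byte signature matches random data about once in 65,536 tries;
# a 4-byte signature about once in 4.3 billion.
two_byte = random_match_probability(2)
four_byte = random_match_probability(4)
```

This does not cover formats whose headers are common text or shared with other formats, which is one reason some types stay ambiguous even with megabytes of data.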

m^2

Re: Smallest files that can be identified
« Reply #6 on: December 30, 2006, 12:21:13 AM »
Thank you, I'll have to lower the limit I use.