In many cases you can, but you DO have to analyse the contents of the file.
Where I work we have a Java implementation of the Unix "file" command that parses the "magic" file that the "file" command uses.
The "magic" file is available on most Unix systems or from the web,
e.g. http://www.garykessler.net/library/magic.html
The magic file describes magic numbers (and other indicators) found in files that allow those files to be identified by MIME type.
No, I can't give it to you.
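To give a feel for the basic idea, though, here is a minimal sketch (not our implementation). It only compares the first few bytes against a small hard-coded table instead of parsing a real "magic" file. The class name MagicSniffer, the Signature helper and the choice of table entries are all made up for this example.

```java
import java.nio.charset.StandardCharsets;

final class MagicSniffer {
    /** Hypothetical table entry: bytes expected at offset 0 and the MIME type they imply. */
    static final class Signature {
        final byte[] magic;
        final String mimeType;

        Signature(byte[] magic, String mimeType) {
            this.magic = magic;
            this.mimeType = mimeType;
        }
    }

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // A handful of well-known signatures; a real "magic" file has far more rules.
    private static final Signature[] SIGNATURES = {
        new Signature(new byte[] {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A}, "image/png"),
        new Signature(new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF}, "image/jpeg"),
        new Signature(ascii("GIF87a"), "image/gif"),
        new Signature(ascii("GIF89a"), "image/gif"),
        new Signature(ascii("%PDF-"), "application/pdf"),
    };

    /** Returns the MIME type of the first matching signature, or null if nothing matches. */
    static String detect(byte[] head) {
        for (Signature s : SIGNATURES) {
            if (startsWith(head, s.magic)) {
                return s.mimeType;
            }
        }
        return null;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(detect(ascii("GIF89a...rest of file...")));
        System.out.println(detect(ascii("plain text, no signature")));
    }
}
```

Returning null for "unknown" keeps the sketch short; the caller can then fall back to whatever the headers or the file extension claim.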
matfud
Yup, you're correct. It is not definitive. It cannot correctly identify all file types.
However, it is very good.
It no longer just looks for magic numbers in the first few bytes. It allows analysis at offsets from known places, computed offsets (such as reading byte 24 as an integer because it tells you the location of another piece of information you may need), and a few other tricks. It also includes hierarchical descriptions: find marker A and it indicates a bla/foo type of file; derivatives of that original check can then refine the classification to bla/foo-a or bla/foo-b.
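A rough sketch of a computed offset plus a hierarchical refinement, for an entirely invented format: the marker "BLAF", the big-endian pointer at byte 24 and the 'a'/'b' subtype byte are assumptions made up for illustration, not rules from any real magic file.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class IndirectMagic {
    // Hypothetical format: "BLAF" at offset 0 marks a bla/foo file.
    private static final byte[] MARKER = "BLAF".getBytes(StandardCharsets.US_ASCII);
    // Hypothetical: a 4-byte big-endian integer at byte 24 points at a subtype byte.
    private static final int POINTER_AT = 24;

    /** Returns "bla/foo", "bla/foo-a", "bla/foo-b", or null if the marker is absent. */
    static String classify(byte[] data) {
        // Parent check: without the marker, none of the derived checks apply.
        if (data.length < MARKER.length) {
            return null;
        }
        for (int i = 0; i < MARKER.length; i++) {
            if (data[i] != MARKER[i]) {
                return null;
            }
        }
        String type = "bla/foo";

        // Computed offset: read the integer at byte 24 and treat it as a location.
        if (data.length < POINTER_AT + 4) {
            return type;
        }
        int target = ByteBuffer.wrap(data).getInt(POINTER_AT); // ByteBuffer defaults to big-endian
        if (target < 0 || target >= data.length) {
            return type; // pointer leads outside the data we have; keep the coarse type
        }

        // Derived checks: refine the parent classification.
        if (data[target] == 'a') {
            return type + "-a";
        }
        if (data[target] == 'b') {
            return type + "-b";
        }
        return type;
    }
}
```

When a refinement cannot be evaluated, the sketch falls back to the coarser bla/foo answer instead of failing, which is one reasonable way to treat the parent/derivative relationship.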
The magic file is now a simple programming language. It's not Turing complete (as far as I know), but it does allow a lot of flexibility in how you describe the markers you want to look for.
It can never be 100% precise but it is very good.
The set we currently use does not handle all of the options available in the magic file, but it can still result in reading up to 30K into a file (on rare occasions; most are resolved in a few hundred bytes). It can be very expensive to use, but it really helps when the headers we receive do not seem to match the content or, more often, when we know the remote site does not provide correct headers (which happens more often than you might think).
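One way to keep that cost bounded is to peek at a fixed-size prefix and then rewind the stream, so whatever consumes the content afterwards still sees it from the start. This is only a sketch under my own assumptions: PrefixReader, peek and the MAX_PREFIX constant are hypothetical names, and the stream must support mark/reset (wrap it in a BufferedInputStream if it does not).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class PrefixReader {
    // Assumed upper bound, taken from the "up to 30K" figure above.
    static final int MAX_PREFIX = 30 * 1024;

    /** Reads at most limit bytes from the stream, then rewinds it to where it was. */
    static byte[] peek(InputStream in, int limit) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            in.mark(limit);
            byte[] buf = new byte[limit];
            int total = 0;
            while (total < limit) {
                int n = in.read(buf, total, limit - total);
                if (n < 0) {
                    break; // end of stream before the limit was reached
                }
                total += n;
            }
            in.reset();
            return Arrays.copyOf(buf, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned prefix can be handed to whatever detection logic you use, and its result compared with the content type the remote site claimed.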
matfud