The art of file type identification
Recently I’ve been working with the maintainer of file(1) to merge in various file magic entries floating around on the internet. I can’t say it is an exciting experience (‘boring’ is the better word: it means searching everywhere for specs and for files to test against, and many proprietary formats are undocumented anyway), but the ongoing work has given me a firmer grasp of how things can improve.
Many misidentifications arise from incorrect or lazy magic entries. There are 2 main sources of problems:
- Overbroad identification
- Matching against words in spoken languages
Overbroad identification
Overbroad identification can be subdivided into 4 categories:
- searching for just one byte or two
- searching for simple patterns possibly used by many other kinds of files
- matching against a wide range of bytes
- premature positive announcement
As examples, the DOS .com executable and MPEG-related entries are over-zealous in their matching. Too many DOS executable entries match on the first byte only. Even worse, this first byte has a wide range of possible values!
| Byte | Type | Value | Identification string |
|---|---|---|---|
| 0 | byte | 0xe9 | DOS executable (COM) |
| 0 | byte | 0x8c | DOS executable (COM) |
| 0 | byte | 0xeb | DOS executable (COM) |
| 0 | byte | 0xb8 | DOS executable (COM) |
Personally, I found all Windows icon files becoming ‘MPEG sequence’ because both have the header 0x00000100, and some UTF-16 text becoming MPEG as well, since a certain MPEG entry only requires the first 2 bytes of the file to be 0xFFFE, thus clobbering the Unicode BOM. (The latter problem was fixed recently, and I have submitted a fix for the former.)
In many cases, file(1) is already yelling “Hey, I got something!” while the identification process is still ongoing and the correct subtype is still being checked. The bitmap image entry is a good example:
| Byte | Type | Value | Identification string |
|---|---|---|---|
| 0 | string | “BM” | PC bitmap data |
| >14 | leshort | 40 | \b, Windows 3.x format |
Upon inspection of the first 2 bytes, a premature announcement of “PC bitmap data” is made, even though the value at offset 14 (the 15th byte onward) must still match before the file can be confirmed as a Windows bitmap.
Matching against words in spoken languages
The most obvious cases are matches against English words. European languages also count here. (For Asian languages, read on.) QuickTime is a prominent example:
| Byte | Type | Value | Identification string |
|---|---|---|---|
| 4 | string | “wide” | Apple QuickTime movie (unoptimized) |
| 4 | string | “skip” | Apple QuickTime movie (modified) |
| 4 | string | “free” | Apple QuickTime movie (modified) |
| 4 | string | “idat” | Apple QuickTime movie (unoptimized) |
Even though each entry matches at a specific location only (here the word must start at the 5th byte), the chance of a text file having one of these words at that position is not to be underestimated. OK, so ‘idat’ is not a word, but how about ‘validate’ and ‘consolidate’? Note that file identification is mostly a ‘substring’ match against binary data. In this context, text files and other ASCII documents are affected most.
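To make the collision concrete, here is a minimal Python sketch of what the four QuickTime entries above amount to. It is not how file(1) is implemented; the function name `naive_quicktime_match` is made up for illustration, and only the offset and the four strings come from the table.

```python
# The four strings from the QuickTime entries above, all tested at offset 4.
QUICKTIME_TAGS = (b"wide", b"skip", b"free", b"idat")


def naive_quicktime_match(data: bytes):
    """Return the matching tag as text, or None if bytes 4..7 match no tag.

    Hypothetical helper: it mimics a magic entry that looks at one
    4-byte string at one fixed offset and nothing else.
    """
    tag = data[4:8]
    if tag in QUICKTIME_TAGS:
        return tag.decode("ascii")
    return None


# Ordinary English sentences are enough to trigger a match.
for text in (b"The free software movement", b"Elucidate this point", b"Hello world"):
    print(text, "->", naive_quicktime_match(text))
```

The first two sentences are plain text, yet they satisfy the ‘free’ and ‘idat’ entries simply because of where the letters happen to fall.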
For Asian languages, the problem can be argued to be either more serious or less serious. Why? Many Asian text encodings, like Big5, GB2312, KSC5601 and Shift JIS, use bytes with the high bit set, so such text may be misidentified by ordinary binary magic entries, but not by entries that match ASCII words. Right now there is no statistical data to support either side.
Intentional deception
False positives are not only caused by lax rules. Anti-forensic tools are known to exploit the same weakness to deceive forensic software and prevent it from correctly identifying file types. As a preliminary example of how to cheat, look at these commands:
$ file /a.bmp
/a.bmp: PC bitmap data
$ echo 'MZ123123123123123123123123123123123' > /a.exe
$ file /a.exe
/a.exe: MS-DOS executable
This works because the basic identification involves the first 2 bytes only. Any text file or binary starting with “BM” or “MZ” is reported as a bitmap or an MS-DOS executable respectively.
Metasploit is known to have announced the Transmogrify software, which performs exactly this kind of disguise. For some reason only rumours are still floating around: the Google cache used to point to an old version of the Metasploit Anti-forensics project, but nowhere can I find the real thing, nor does Metasploit mention anything remotely close now. A cease and desist letter from lawyers? Vaporware? Who knows.
But from an anti-forensics viewpoint, this strategy sounds perfectly valid as a way to waste investigators’ time or hide things from their sight.
Possible workarounds
The single most effective cure is obvious: more in-depth matching. Taking the earlier Windows bitmap example, if one also checks the value at offset 14, the probability of false identification can be reduced substantially. Match as much as possible, and delay the announcement as long as possible.
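As a rough illustration of “match as much as possible, announce as late as possible”, the Python sketch below refuses to say “bitmap” until several header fields agree. The function name `looks_like_bmp` and the choice of checks are my own assumptions, not what file(1) does; the field layout follows the publicly documented BMP file header (2-byte tag, 4-byte file size, two reserved shorts, 4-byte pixel data offset, then the DIB header size at offset 14).

```python
import struct

# DIB header sizes of the publicly documented BMP variants (assumed list).
KNOWN_DIB_HEADER_SIZES = {12, 16, 40, 52, 56, 64, 108, 124}


def looks_like_bmp(data: bytes) -> bool:
    """Return True only if several header fields agree, not just the 'BM' tag."""
    if len(data) < 18 or data[:2] != b"BM":
        return False
    # Little-endian fields starting at offset 2: file size, two reserved
    # shorts, pixel data offset, DIB header size (the value at offset 14).
    _file_size, _res1, _res2, pixel_offset, dib_size = struct.unpack_from(
        "<IHHII", data, 2
    )
    if dib_size not in KNOWN_DIB_HEADER_SIZES:
        return False
    # Pixel data must start after both headers and inside the data we have.
    return 14 + dib_size <= pixel_offset <= len(data)


# A text file starting with "BM" no longer passes.
print(looks_like_bmp(b"BMW is a car maker, and this is just text.\n"))
```

The stored file size is deliberately ignored in this sketch; a stricter check could compare it with the real length, at the risk of rejecting sloppy but genuine files.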
Another interesting way is a probabilistic approach. This involves scanning many sample files of each type, picking out the bytes they have in common, and training the software on a growing collection of such samples.
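A toy version of this idea can be sketched in a few lines of Python. This is only my own simplification (bytes that are identical at the same offset in every sample, then a score for how many of them an unknown file shares); it is not the algorithm of any real tool, and the names `common_bytes` and `score` are invented for the example.

```python
def common_bytes(samples, window=16):
    """Return {offset: byte} for offsets where every sample has the same byte.

    Only the first `window` bytes are examined, and never more than the
    shortest sample contains.
    """
    if not samples:
        return {}
    limit = min(window, *(len(s) for s in samples))
    first = samples[0]
    return {
        i: first[i]
        for i in range(limit)
        if all(s[i] == first[i] for s in samples)
    }


def score(data, pattern):
    """Fraction of the learned offsets at which `data` agrees with the pattern."""
    if not pattern:
        return 0.0
    hits = sum(
        1 for offset, value in pattern.items()
        if offset < len(data) and data[offset] == value
    )
    return hits / len(pattern)


# Two made-up samples sharing an 8-byte prefix and differing afterwards.
samples = [b"\x89PNG\r\n\x1a\nAAAA", b"\x89PNG\r\n\x1a\nBBBB"]
pattern = common_bytes(samples)
print(sorted(pattern))
print(score(b"\x89PNG\r\n\x1a\nZZZZ", pattern), score(b"hello world!", pattern))
```

With only two samples the learned pattern is fragile, which is exactly the weakness of this approach: it needs a large and varied collection before the shared bytes mean anything.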
In-depth matching
This is easier said than done, as one must be clear about the internal structure of a file format before deciding which bytes to use for verification. This is particularly hard for proprietary or undocumented formats. The Microsoft compound document format (OLE data stream) is a good example: until the recent (forced) publication of the format, people had to keep guessing how it could be dissected. Right now I’m still seeing this format fought over between “Microsoft Office document” and “Microsoft Installer”, both of which have the same header.
Another difficulty arises when 2 different magic values occur in the same file. This is considered rare, but I’ve seen it in action before, when mounting a file system. Older versions of mount(8) used a similar strategy to guess the file system type automatically, saving users the trouble of specifying it manually. However, when a partition is formatted with one file system and then reformatted with another, it is possible that the first file system’s magic string is not overwritten by the second format. mount(8) stopped at the first magic string it found, and so never detected the newer file system. Thanks to util-linux-ng, this strategy is not used in mount anymore. (The strategy is retained in cfdisk(8) instead, but that’s another story.) Since file(1) can’t escape from this strategy yet, detecting file system images will remain challenging.
One possible solution for file system detection might be reversing the detection order: check for the last possible magic word first, then work all the way back to the start of the file system. Though this sounds fine, I haven’t done any verification at all.
Avoiding words from spoken languages is pretty much impossible. In particular, many magic words are ASCII words, so they must be matched. Not to mention that any random 8-bit magic string may collide with arbitrary Asian text in a legacy encoding that the entry submitter doesn’t understand or use. As mentioned before, in-depth matching is the only rescue here.
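The difference between “first magic wins” and the reversed order can be sketched as follows. This is an unverified illustration in the same spirit as the idea itself: the three signatures (XFS tag at offset 0, NTFS OEM ID at offset 3, ext2/3/4 superblock magic 0xEF53 stored little-endian at offset 1080) are the only format facts used, while the table, the function names and the fake image are assumptions for the example.

```python
# (name, offset, magic bytes) for three well-known file system signatures.
SIGNATURES = [
    ("XFS", 0, b"XFSB"),
    ("NTFS", 3, b"NTFS    "),
    ("ext2/3/4", 1080, b"\x53\xef"),
]


def first_hit(image: bytes):
    """Old strategy: stop at the first signature that matches."""
    for name, offset, magic in SIGNATURES:
        if image[offset:offset + len(magic)] == magic:
            return name
    return None


def hits_last_first(image: bytes):
    """Reversed strategy: collect every match, highest offset first."""
    found = [
        (offset, name)
        for name, offset, magic in SIGNATURES
        if image[offset:offset + len(magic)] == magic
    ]
    return [name for _offset, name in sorted(found, reverse=True)]


# Fake image: a stale XFS tag at the start plus an ext superblock magic.
image = bytearray(2048)
image[0:4] = b"XFSB"
image[1080:1082] = b"\x53\xef"
print(first_hit(bytes(image)), hits_last_first(bytes(image)))
```

Whether the highest-offset signature really is the most recent one in general is precisely the part that still needs verification; returning all candidates at least lets the caller see the conflict.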
Probabilistic approach
One very obvious weakness: it needs a really large collection of files before identification becomes even barely accurate. I don’t think any file identification software uses it as its only method. However, when combined with the traditional method, it can prove excellent for undocumented or proprietary formats.
TrID is a good example: it already has a reasonably accurate database, and further identifications rely on user submissions. Submitting does not require the user to study file formats; instead, one runs a program called TrIDScan over a file collection and submits the generated XML output. From my limited understanding of TrIDScan, it seems to search for common substrings in a collection of files and record them as XML for future inclusion in TrID.
So a textfile that begins with ‘BMW’ is marked as MS Bitmap? I’m seeing it here now on a system and, come on, that can’t be correct?
How do I fix it?
As noted in the later part of my post, the most appropriate way to ensure correct detection is in-depth checking of the MS bitmap structure, not just the first 2 bytes, which are way too ambiguous. For example, Wikipedia provides a more detailed description of the MS bitmap file format. Even just checking the next 4 bytes in addition to the first 2 would reduce false positives to, say, 1% or even less.
Ok, so how do I tell the file command to use a larger part of a file to figure out the type? If You know, please help
Have you tried the newest version? If MS Bitmap is still misidentified by the newest version, you can submit a patch to the maintainer of the file(1) program, Christos Zoulas. More info is available on its home page. And of course, you would need to understand the format of the magic file used to describe file formats as well. On the other hand, if you’re just a user, there’s not much you can do except ask on its mailing list and wait for somebody else to enhance the detection.