Extracting metadata from fragments or corrupted files.

Post by marcbraulio » Tue Dec 04, 2012 8:00 pm

Good day. getID3 is just plain awesome, but something is puzzling me. It was my understanding that getID3 could read corrupted files as well as fragments of a file. So far I have had great success with that; however, some files apparently need to be absolutely intact for any significant metadata to be extracted. Removing even a single byte from anywhere in the file (beginning, middle, or end) makes metadata extraction fail completely. A good example of such a file is the "Big Buck Bunny IPod 5G 320×180, 62 MB" MP4 video (download: http://download.blender.org/peach/bigbu ... 20x180.mp4). Could anyone explain why getID3 needs every last byte in order to extract the metadata from this file?

I am working with the following code to get fragments of a file (reference: viewtopic.php?t=1197)

$filename = tempnam('/tmp','getid3');
if (file_put_contents($filename, file_get_contents('http://getid3.org/demo/test.mp3', false, null, 0, 32768))) {
   if (require_once('/path/to/getid3/getid3.php')) {
      $getID3 = new getID3;
      $ThisFileInfo = $getID3->analyze($filename);
      echo '<pre>'.print_r($ThisFileInfo, true).'</pre>';
   }
   unlink($filename);
}
Thank you, any pointers are appreciated.

James Heinrich
getID3() v1 developer
Post by James Heinrich » Tue Dec 04, 2012 9:04 pm

Depending on the file format and where the corruption or truncation occurs, getID3 (or any program) will have varying degrees of success analyzing it. Truncating or overwriting bytes at the very beginning of the file (most especially the first 4 bytes, but generally anything in the first few hundred bytes, depending on file format) will generally cause havoc and may render the file too corrupt for analysis. Truncation of the file beyond the header (which, again depending on file format, could end a few dozen bytes in or a few MB in from the front of the file) is generally "safe", with the caveat that playtime, bitrate and of course any data stored at the end of the file (e.g. ID3v1) would be missing or wrong. The actual data portion of any file format is not parsed. For MP3, usually at least the first couple dozen frame headers are parsed to confirm that a valid bitstream has been located, but the actual audio data is not.

The MP4 container is based on the Quicktime file format, which is probably the most convoluted recursive container format that getID3 supports, so it wouldn't entirely surprise me that it could be "fragile". You should be able to write ASCII art all over the data portion of the video and getID3 won't care, as long as the file structure remains intact. However, due to the excessively recursive nature of the Quicktime file format, it behaves poorly with truncated files. Taking your BigBuckBunny example, you'll get almost nothing out of it because the vast majority of the file is contained inside the "mdat" chunk, which spans offsets 28 - 64,656,780 (basically the whole file). Nothing inside that chunk will be parsed, because getID3 can't read the complete chunk and has to assume that something is corrupt (which it is). The last 247 bytes of the file contain the text metadata ("Big Buck Bunny", "Blender Foundation", etc.), so if they had been inserted before the mdat chunk they would be accessible (they're not nested). But text metadata is often added at the end of the file to minimize rewriting if the metadata is changed (just a few hundred bytes need to be rewritten, not gigabytes).

So, short answer: the Quicktime file format behaves poorly (worse than most other file formats) with regard to truncation.
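
To see where a truncated file stops making sense, it can help to list its top-level chunks. The sketch below is not getID3 code: `listTopLevelAtoms()` is a hypothetical helper that assumes the standard MP4/Quicktime atom header (a 4-byte big-endian size followed by a 4-character type) and flags any atom whose declared length runs past the end of the file.

```php
<?php
// Hypothetical helper (not part of getID3): list the top-level atoms of an
// MP4/Quicktime file and report whether each one is fully present on disk.
function listTopLevelAtoms(string $path): array
{
    $atoms = [];
    $fp    = fopen($path, 'rb');
    $size  = fstat($fp)['size'];
    $pos   = 0;
    while ($pos + 8 <= $size) {
        fseek($fp, $pos);
        $header = fread($fp, 8);
        $len    = unpack('N', substr($header, 0, 4))[1]; // 32-bit big-endian size
        $type   = substr($header, 4, 4);                 // four-character atom type
        if ($len === 1) {
            // Size 1 means a 64-bit size follows the type (needs 64-bit PHP).
            $ext = fread($fp, 8);
            if (strlen($ext) < 8) {
                break;
            }
            $len = unpack('J', $ext)[1];
        } elseif ($len === 0) {
            // Size 0 means the atom runs to the end of the file.
            $len = $size - $pos;
        }
        if ($len < 8) {
            break; // malformed header: stop rather than loop forever
        }
        $atoms[] = [
            'type'     => $type,
            'offset'   => $pos,
            'length'   => $len,
            'complete' => ($pos + $len <= $size), // false if the file ends inside this atom
        ];
        $pos += $len;
    }
    fclose($fp);
    return $atoms;
}
```

On a truncated copy of a file laid out like the one described above, the "mdat" entry would come back with `complete` set to false, and nothing after it would be listed because the declared length points beyond the end of the file.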

Post by marcbraulio » Tue Dec 04, 2012 11:32 pm

Thank you for the in-depth response as well as the short answer, very educational and helpful. Clearly, I will have to rethink my strategy, as I was hoping to analyze video files stored in Amazon S3 (potentially in the 40 GB+ range) without having to copy them over to the server.

Post by James Heinrich » Wed Dec 05, 2012 12:04 am

You might (emphasis supplied, this is untested) have success by "faking" the local file, for example by downloading the first and last 1MB sections and writing a local temp file padded to the correct byte length with null data in the middle. As in, write the first 1MB, write (filesize-2MB) of null, write the last 1MB. Even then, a 40GB temp file is pretty hefty, so you may have to experiment whether this approach is feasible or not.
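
A minimal sketch of that padding idea, assuming you already have the first and last sections as strings and know the total file size. `buildPaddedFile()` is a hypothetical name, not part of getID3. It relies on `ftruncate()` extending a file with null bytes; whether the padded region actually occupies disk space depends on the filesystem, which this sketch does not control.

```php
<?php
// Hypothetical helper: build a local stand-in for a large remote file from
// its first and last bytes, with null bytes filling the gap in between.
function buildPaddedFile(string $head, string $tail, int $totalSize): string
{
    if (strlen($head) + strlen($tail) > $totalSize) {
        throw new InvalidArgumentException('head and tail are longer than the file');
    }
    $path = tempnam(sys_get_temp_dir(), 'getid3');
    $fp   = fopen($path, 'r+b');
    ftruncate($fp, $totalSize);               // extend to full length with null bytes
    fseek($fp, 0);
    fwrite($fp, $head);                       // first section at offset 0
    fseek($fp, $totalSize - strlen($tail));
    fwrite($fp, $tail);                       // last section flush with the end
    fclose($fp);
    return $path;
}
```

The returned path can then be passed to `$getID3->analyze()` and removed with `unlink()` afterwards, in the same way as the temp file in the first post.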

Post by marcbraulio » Wed Dec 05, 2012 7:50 pm

Thanks for the tip, much appreciated. I ran some tests on "Big Buck Bunny IPod 5G 320×180, 62 MB" and apparently "faking it" works flawlessly! I'll post an update if I come across a file that will fail when I try to "fake it".

Post by marcbraulio » Wed Dec 05, 2012 8:20 pm

The only issue I am having right now, which is unrelated to getID3, is that there is apparently no way to get just the ending bytes of a remote file. If the file were already on the local system I could use PHP's "fseek", but otherwise it seems to be nearly impossible. If you have any ideas on how I can read parts of a remote file, please don't hesitate to share them. Many thanks!

Post by James Heinrich » Wed Dec 05, 2012 8:44 pm

Actually fseek also works if you fopen an HTTP URL. fseek($fp, -1000000, SEEK_END) may or may not work (untested), but since you know the filesize it shouldn't be a problem to seek to the absolute offset of where you want, e.g. fseek($fp, 123465789763543, SEEK_SET) -- assuming you have 64-bit PHP that is (32-bit PHP doesn't play with files larger than 4GB, and may not play nice with files larger than 2GB).

You can also investigate using curl --range 0-1000000 and curl --range -1000000, or curl --range 0-1000000,-1000000 if you can handle multipart responses.
http://curl.haxx.se/docs/manpage.html
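
The same range requests can be made from PHP through the cURL extension's `CURLOPT_RANGE` option. The following is a sketch only: `rangeSpec()` and `fetchRange()` are hypothetical helper names, and it assumes the cURL extension is loaded and that the server answers range requests with HTTP 206. Note that byte ranges are inclusive, so the first 1,000,000 bytes are `0-999999`.

```php
<?php
// Hypothetical helpers, assuming the PHP cURL extension is available.

// Build a range value: the first $n bytes ("0-(n-1)") or the last $n bytes ("-n").
function rangeSpec(int $n, bool $fromEnd = false): string
{
    return $fromEnd ? '-' . $n : '0-' . ($n - 1);
}

// Fetch one byte range of a remote file and return it as a string.
function fetchRange(string $url, string $range): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RANGE, $range);
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    // 206 Partial Content means the server honoured the range;
    // a plain 200 means it ignored the range and sent the whole file.
    if ($body === false || $status !== 206) {
        throw new RuntimeException("range request failed or was ignored (HTTP $status)");
    }
    return $body;
}

// Usage (needs network access):
// $head = fetchRange($url, rangeSpec(1048576));        // first 1 MB
// $tail = fetchRange($url, rangeSpec(1048576, true));  // last 1 MB
```

One caveat of this sketch: if the server ignores the range, the whole body is downloaded before the status check runs, so for very large files it would be safer to confirm range support on a small request first.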

Post by marcbraulio » Wed Dec 05, 2012 11:55 pm

That's strange, I had zero success getting any variation of fseek to work with remote files... take the following example for instance:

Code: Select all

$fh = fopen("http://download.blender.org/peach/bigbuckbunny_movies/BigBuckBunny_320x180.mp4", "rb");

fseek($fh, 5242880, SEEK_SET);

$data = fread($fh, 5242880);

fclose($fh);
PHP will throw the following error: Warning: fseek(): stream does not support seeking in C:\wamp\www\metadata.php on line 27

The PHP manual used to say that fseek would not work over HTTP or FTP, but I can't find a reference to that anymore.
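
As the warning says, the stream opened over http:// cannot be seeked. One alternative that stays within PHP's stream functions is to ask the server for the wanted bytes with a `Range` request header set through a stream context. This is a sketch under assumptions: `httpRangeContext()` is a hypothetical helper, and it only helps if the server honours range requests.

```php
<?php
// Hypothetical helper: build a stream context that requests $length bytes
// starting at $offset, instead of seeking on the HTTP stream.
function httpRangeContext(int $offset, int $length)
{
    $last = $offset + $length - 1; // byte ranges are inclusive
    return stream_context_create([
        'http' => [
            'method' => 'GET',
            'header' => "Range: bytes=$offset-$last\r\n",
        ],
    ]);
}

// Usage (needs network access and a server that honours Range):
// $data = file_get_contents($url, false, httpRangeContext(5242880, 5242880));
```

If the server ignores the header it will send the file from the beginning, so the length of the returned data is worth checking before using it.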

I did, however, manage a few hours ago to get just the end of the file by using cURL with the "CURLOPT_RANGE" option that you mentioned. From what I understand, this will only work if the server is willing to honor the "Range" request header, which in my case it does because Amazon S3 is configured that way. Thank you so very much for all your help. In all honesty, you are incredibly knowledgeable and helpful.
