VBR resource hog!!!

filmo
User
Posts: 22
Joined: Fri Dec 08, 2006 7:16 pm
Location: Los Angeles

VBR resource hog!!!

Post by filmo » Mon Jun 16, 2008 8:58 pm

Hi,

How would I disable the 'file walk' on VBR files that don't have headers?

We allow clients to upload MP3 files, and sometimes they upload VBR files without VBR headers. When this happens it kills our server, because our scripts try to parse the VBR files (30 MB to 100+ MB each) frame by frame. (They upload hour-long files for transcription, and walking a 60 MB VBR file is a non-trivial resource hog.)

I'd like to set it up so that, for files above a certain size in MB, if getID3 discovers it needs to walk the file, it just returns an error for that file instead of attempting the walk. (We would then run a LAME conversion on the VBR file to convert it to CBR, which would solve the problem.)

Any help would be appreciated, even if it's just pointing at the right section of code that determines when to 'walk' frame-by-frame on a VBR file.
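
A rough sketch of one way to do the size guard outside getID3, before the file is ever analyzed: skip any ID3v2 tag, peek at the start of the audio data for a VBR header marker ("Xing", "Info" or "VBRI"), and flag large files without one for re-encoding. The function names, the 8 KB probe window and the size threshold are hypothetical and not part of getID3, and the substring test is only a heuristic, not a real frame parse.

```php
<?php
// Hypothetical pre-check, NOT part of getID3.

/** Length in bytes of a leading ID3v2 tag, or 0 if there is none. */
function id3v2TagLength(string $head): int
{
    if (strlen($head) < 10 || substr($head, 0, 3) !== 'ID3') {
        return 0;
    }
    // Bytes 6-9 hold a "syncsafe" size: 7 useful bits per byte.
    $size = 0;
    for ($i = 6; $i < 10; $i++) {
        $size = ($size << 7) | (ord($head[$i]) & 0x7F);
    }
    $hasFooter = (ord($head[5]) & 0x10) !== 0; // footer adds 10 more bytes
    return 10 + $size + ($hasFooter ? 10 : 0);
}

/** True if the file is larger than $maxBytes and no VBR header marker is found. */
function needsReencode(string $path, int $maxBytes, int $probeBytes = 8192): bool
{
    if (filesize($path) <= $maxBytes) {
        return false; // small enough that a full frame walk is tolerable
    }
    $fp = fopen($path, 'rb');
    if ($fp === false) {
        throw new RuntimeException("cannot open $path");
    }
    $head = (string) fread($fp, 10);
    fseek($fp, id3v2TagLength($head), SEEK_SET);
    $probe = (string) fread($fp, $probeBytes);
    fclose($fp);

    foreach (['Xing', 'Info', 'VBRI'] as $marker) {
        if (strpos($probe, $marker) !== false) {
            return false; // a header seems to be present, no full walk expected
        }
    }
    return true; // large and apparently headerless
}
```

A caller could then send files for which needsReencode() returns true straight to the LAME conversion step instead of to getID3.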

Allan Hansen
getID3() v2 developer
Posts: 445
Joined: Sun May 04, 2003 2:22 pm
Location: Holmegaard, Denmark

Post by Allan Hansen » Mon Jun 16, 2008 9:21 pm

Try line 21 in mp3 module.

Might not be the correct one though.


Line 21

Post by filmo » Mon Jun 16, 2008 9:49 pm

All line 21 does is define how many frames to scan to determine if there's a VBR header or not.

Need to find the code that occurs after it is determined that a VBR header is NOT found and that a full walk (from the first frame to the last frame in the file) will be needed.

At some point in the code after a header is not found, a walk through the entire file occurs to determine the bit rate. Seems like "RecursiveFrameScanning" at line 1070 might be involved, but I haven't read all the code to fully understand the logic yet.

Once that area is isolated, I'm going to insert a file size check which will then break out of the walk if size is greater than X.


Walk Trigger

Post by filmo » Wed Jun 18, 2008 5:31 pm

If anybody knows the logic well enough to point me in the right direction please let me know. Thanks.

James Heinrich
getID3() v1 developer
Posts: 1476
Joined: Fri May 04, 2001 4:00 pm
Location: Northern Ontario, Canada

Post by James Heinrich » Wed Jun 18, 2008 11:49 pm

I haven't looked at that part of the code for a while, but I believe that around line 876 of module.audio.mp3.php if you change

Code: Select all

if ($recursivesearch) {
to something like

Code: Select all

if ($recursivesearch && (filesize($file) < 999999)) {
that should do what you're looking for.


$recursivesearch

Post by filmo » Thu Jun 19, 2008 6:44 pm

Thanks, I'll take a look this afternoon and see if that's the trigger area. I'll post my code changes back.


Problem Area is the bitrate histogram

Post by filmo » Sat Jun 21, 2008 1:36 am

I've run some tracking on the module and the code near 876 isn't the culprit. It seems designed to find CBR files hiding as VBR.

Here's my modification to that area that didn't make a difference in terms of looping:

Code: Select all

if ($recursivesearch) {
	$thisfile_mpeg_audio['bitrate_mode'] = 'vbr';
	if (getid3_mp3::RecursiveFrameScanning($fd, $ThisFileInfo, $offset, $nextframetestoffset, true) && $ThisFileInfo['filesize'] < MAX_VBR_FILE_SIZE_TO_SCAN) {
		$recursivesearch = false;
		$thisfile_mpeg_audio['bitrate_mode'] = 'cbr';
	}
	if ($thisfile_mpeg_audio['bitrate_mode'] == 'vbr') {
		$ThisFileInfo['warning'][] = 'VBR file with no VBR header. Bitrate values calculated from actual frame bitrates.';
	}
}
The area that IS using up the resources is around line 1461 and is used to build the bit rate histogram.

Code: Select all

$FastMode = false;
$SynchErrorsFound = 0;
while (getid3_mp3::decodeMPEGaudioHeader($fd, $synchstartoffset, $dummy, false, false, $FastMode)) {
	$FastMode = true;
	$thisframebitrate =    $MPEGaudioBitrateLookup[$MPEGaudioVersionLookup[$dummy['mpeg']<-- Snip -->['bitrate']];

	if (empty($dummy['mpeg']['audio']['framelength'])) {
		$SynchErrorsFound++;
	} else {
		$ThisFileInfo['mpeg']['audio']['bitrate_distribution'][$thisframebitrate]++;
		$ThisFileInfo['mpeg']['audio']['stereo_distribution'][$dummy['mpeg']['audio']['channelmode']]++;
		$ThisFileInfo['mpeg']['audio']['version_distribution'][$dummy['mpeg']['audio']['version']]++;

		$synchstartoffset += $dummy['mpeg']['audio']['framelength'];
	}
}
Because $synchstartoffset gets incremented until the EOF is found, this loop takes forever on large files.

On an 8 MB VBR file that was 62 minutes long (11 kHz, 17.3 kbps VBR bitrate), it walked through this loop 72,415 times to build the bitrate histogram. WOW!!!
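
For a sense of scale, the iteration count can be estimated from playtime and sample rate alone, since every frame holds a fixed number of samples. A small sketch, assuming the file is Layer III at exactly 11025 Hz (576 samples per frame for MPEG-2 and MPEG-2.5, versus 1152 for MPEG-1); the function name is made up for illustration.

```php
<?php
// Estimate how many frames a full walk has to visit.
// Layer III: 1152 samples/frame for MPEG-1, 576 for MPEG-2 and MPEG-2.5.
function estimateFrameCount(float $seconds, int $sampleRateHz, int $samplesPerFrame): int
{
    return (int) ceil($seconds * $sampleRateHz / $samplesPerFrame);
}

// 62 minutes at 11025 Hz, 576 samples per frame
echo estimateFrameCount(62 * 60, 11025, 576), "\n"; // 71204
```

That lands within a couple of percent of the 72,415 iterations observed, so the loop count is about one pass per frame, as expected.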

Unfortunately, it doesn't seem like it's simply a matter of wrapping this code in:

Code: Select all

if ($ThisFileInfo['filesize'] < MAX_VBR_FILE_SIZE_TO_SCAN) {

   LOOPING CODE BLOCK

}
because $bittotal and $framecounter are important in determining the length of the file.

I'm going to try a few more things, but if you have any advice on the ramifications of turning off the bitrate histogram routine when filesize > X, or any thoughts on how to solve this problem elegantly, please let me know. My concerns are mostly with the unintended consequences. I don't doubt that I can finagle something that works with headerless VBR files, but I'm worried about breaking it for something I'm not anticipating. Thanks.

One other thought: is there a C library call that could be used instead of PHP to scan the file when the bitrate has to be determined via a full file scan? When I open the same VBR file in any desktop program, it determines the bitrate instantly, which leads me to believe that a compiled solution might be the right answer in this area. (PHP is taking 10 to 12 seconds for small files and 120 to 180 seconds for large ones, e.g. 100 MB VBR files.)


Post by James Heinrich » Sun Jun 22, 2008 6:57 pm

Around line 1457, replace the code block with this:

Code: Select all

$FastMode = false;
$SynchErrorsFound = 0;
$frames_scanned   = 0;
$max_frames_scan  = 5000;
while (getid3_mp3::decodeMPEGaudioHeader($fd, $synchstartoffset, $dummy, false, false, $FastMode)) {
	$FastMode = true;
	$thisframebitrate = $MPEGaudioBitrateLookup[$MPEGaudioVersionLookup[$dummy['mpeg']['audio']['raw']['version']]][$MPEGaudioLayerLookup[$dummy['mpeg']['audio']['raw']['layer']]][$dummy['mpeg']['audio']['raw']['bitrate']];

	if (empty($dummy['mpeg']['audio']['framelength'])) {
		$SynchErrorsFound++;
	} else {
		@$ThisFileInfo['mpeg']['audio']['bitrate_distribution'][$thisframebitrate]++;
		@$ThisFileInfo['mpeg']['audio']['stereo_distribution'][$dummy['mpeg']['audio']['channelmode']]++;
		@$ThisFileInfo['mpeg']['audio']['version_distribution'][$dummy['mpeg']['audio']['version']]++;

		$synchstartoffset += $dummy['mpeg']['audio']['framelength'];
	}
	if ($max_frames_scan && (++$frames_scanned >= $max_frames_scan)) {
		$pct_data_scanned = (ftell($fd) - $ThisFileInfo['avdataoffset']) / ($ThisFileInfo['avdataend'] - $ThisFileInfo['avdataoffset']);
		$ThisFileInfo['warning'][] = 'too many MPEG audio frames to scan, only scanned first '.$max_frames_scan.' frames ('.number_format($pct_data_scanned * 100, 1).'% of file) and extrapolated distribution, playtime and bitrate may be incorrect.';
		foreach ($ThisFileInfo['mpeg']['audio'] as $key1 => $value1) {
			if (!preg_match('/_distribution$/i', $key1)) { // preg_match instead of eregi(), which was removed in PHP 7
				continue;
			}
			foreach ($value1 as $key2 => $value2) {
				$ThisFileInfo['mpeg']['audio'][$key1][$key2] = round($value2 / $pct_data_scanned);
			}
		}
		break;
	}
}
if ($SynchErrorsFound > 0) {
	$ThisFileInfo['warning'][] = 'Found '.$SynchErrorsFound.' synch errors in histogram analysis';
	//return false;
}
That will scan at most 5000 frames (you can, of course, change this to what you think is a reasonable number) and extrapolate the probable bitrate distribution histograms (and thereby playtime and average bitrate) based on the first part of the file. Naturally, the less of the file you scan the more likely you are to get inaccurate results, but this will greatly speed things up. For example, my 97:31 test file (19.6kbps, 112484 frames) had these results:

Code: Select all

ScannedFrames  ScanTime  Bitrate  Playtime   Error
112484         6.184     19691    97:31      0.00%
100000         5.440     19767    97:09      0.39%
 75000         4.209     20018    95:55      1.67%
 50000         2.896     20298    94:36      3.08%
 20000         1.321     20632    93:04      4.78%
 10000         0.800     21021    91:21      6.75%
  5000         0.536     20620    93:07      4.71%
  1000         0.324     20415    94:03      3.68%
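
The extrapolation step on its own can be sketched like this: scale every histogram bucket by the inverse of the fraction of audio data scanned. The function names and the sample counts in the usage note are made up for illustration and are not getID3 API.

```php
<?php
/** Scale partial histogram counts up to an estimate for the whole file. */
function extrapolateDistribution(array $counts, float $fractionScanned): array
{
    if ($fractionScanned <= 0 || $fractionScanned > 1) {
        throw new InvalidArgumentException('fraction must be in (0, 1]');
    }
    $scaled = [];
    foreach ($counts as $key => $count) {
        $scaled[$key] = (int) round($count / $fractionScanned);
    }
    return $scaled;
}

/** Frame-weighted mean bitrate of a histogram keyed by bitrate in bps. */
function averageBitrate(array $counts): float
{
    $frames = array_sum($counts);
    if ($frames == 0) {
        return 0.0;
    }
    $total = 0;
    foreach ($counts as $bitrate => $count) {
        $total += $bitrate * $count;
    }
    return $total / $frames;
}
```

With invented counts such as [16000 => 300, 24000 => 200] from 5% of a file, the scaled histogram is [16000 => 6000, 24000 => 4000]. The average bitrate stays at 19200 bps either way, because scaling only changes the estimated frame total (and therefore playtime), not the proportions. Any error comes from the scanned part not being representative of the rest.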


Worked Great!!

Post by filmo » Mon Jun 23, 2008 7:13 pm

Your code does the trick and fixes the problem. Thanks.

Do you think sampling the file at different locations would be more accurate than all the frames from the head?

For example, with $max_frames_scan = 15000, would it be more accurate to scan the first 15000 frames, or to divide that into 3 pieces and move $synchstartoffset forward between them?

In other words, scan the first 5000 frames, then move $synchstartoffset forward to about the halfway point and read another 5000 frames, and then read the last 5000 frames.

Thus on a 100,000 frame file, you'd read the following frames:

0-5000
(100,000/2 - 5000/2) = 47,500 to 52,500
95,000 to 100,000
= total of 15,000 frames scanned at 3 separate locations?
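
The arithmetic above generalizes to any number of evenly spread segments. A sketch in frame terms only, with a made-up function name; on a headerless VBR file the frame index isn't known without walking the file, so a real version would have to work in byte offsets instead.

```php
<?php
/**
 * Evenly spread $segments windows of $framesPerSegment frames over a file,
 * with the first window at the head and the last one ending at the final frame.
 * Returns [start, end) pairs in frame numbers.
 */
function segmentFrameRanges(int $totalFrames, int $framesPerSegment, int $segments): array
{
    // Too few frames (or a single segment): one window from the head.
    if ($segments < 2 || $framesPerSegment * $segments >= $totalFrames) {
        return [[0, min($totalFrames, $framesPerSegment * max($segments, 0))]];
    }
    $ranges = [];
    $lastStart = $totalFrames - $framesPerSegment;
    for ($i = 0; $i < $segments; $i++) {
        $start = intdiv($lastStart * $i, $segments - 1);
        $ranges[] = [$start, $start + $framesPerSegment];
    }
    return $ranges;
}

print_r(segmentFrameRanges(100000, 5000, 3));
```

For 100,000 frames in 3 segments of 5000 this gives 0 to 5000, 47,500 to 52,500 and 95,000 to 100,000, matching the figures above.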

I'm going to give coding that a shot building off your fix. Thanks again. It's a huge difference in processing overhead on large files and relatively accurate given the large file lengths.


Re: Worked Great!!

Post by James Heinrich » Tue Jun 24, 2008 12:35 am

filmo wrote: Do you think sampling the file at different locations would be more accurate than all the frames from the head?
Yes. Scanning 1000 frames from 5 places in the file would be far more likely to be accurate than scanning 5000 frames from one place in the file. The more different locations you sample, even if the total number of frames scanned remains constant, the more likely your overall average will be accurate. Unfortunately, the more locations you scan, the more overhead you have. Each time you seek to a new position in the file, you can't be certain which byte offset contains the nearest start of an MPEG frame, so you'll need to scan through the next <1kB to find the start of the next frame and keep scanning from there (also, you really should verify that the apparent start-of-frame is genuine and not a false synch, which adds more overhead). This seeking and scanning for a frame start would probably add overhead equivalent to several hundred frames of scanning, but should pay off with increased accuracy.
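
To illustrate the resynchronization step after a seek, here is a minimal sketch that looks for the 11-bit frame sync in a buffer and rejects candidates whose header fields hold reserved values. The function names are hypothetical, and this is only a field sanity check: it does not decode the frame or confirm that another valid header follows at the computed next offset, which is the stronger verification described above.

```php
<?php
/** Cheap plausibility test for a 4-byte MPEG audio frame header at $pos. */
function looksLikeFrameHeader(string $bytes, int $pos): bool
{
    if ($pos < 0 || $pos + 3 >= strlen($bytes)) {
        return false; // need 4 header bytes
    }
    $b0 = ord($bytes[$pos]);
    $b1 = ord($bytes[$pos + 1]);
    $b2 = ord($bytes[$pos + 2]);
    if ($b0 !== 0xFF || ($b1 & 0xE0) !== 0xE0) {
        return false; // the 11 sync bits must all be set
    }
    if ((($b1 >> 3) & 0x03) === 0x01) {
        return false; // reserved MPEG version
    }
    if ((($b1 >> 1) & 0x03) === 0x00) {
        return false; // reserved layer
    }
    if (($b2 >> 4) === 0x0F) {
        return false; // invalid bitrate index
    }
    if ((($b2 >> 2) & 0x03) === 0x03) {
        return false; // reserved sample rate index
    }
    return true;
}

/** Offset of the first plausible frame header in $buffer, or null if none. */
function findFrameSync(string $buffer): ?int
{
    $limit = strlen($buffer) - 4;
    for ($i = 0; $i <= $limit; $i++) {
        if (looksLikeFrameHeader($buffer, $i)) {
            return $i;
        }
    }
    return null;
}
```

Random audio data can still pass these checks by chance, so a match should be treated as a candidate rather than a confirmed frame start.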

I will see if I can modify the codeblock to work properly with multi-position sampling.


Post by James Heinrich » Tue Jun 24, 2008 12:22 pm

Hmm, my theories prove somewhat inaccurate.

First, time to scan is very closely related to the number of frames; jumping to different sections of the file doesn't add significant overhead. Testing with 5000 frames total in different groupings:

Code: Select all

Segments  ScanTime  Playtime  Frames  Bitrate  Error
actual    -         97:31     112484  19691    0.00%
 1        0.534     93:07     107410  20621    4.72%
 2        0.533     97:33     112496  19682    0.05%
 3        0.523     96:41     111474  19860    0.86%
 4        0.543     97:35     112490  19678    0.07%
 5        0.528     99:21     114520  19328    1.84%
10        0.535     95:48     110356  20043    1.79%
As with any sampling, there is always a chance of error, even though in general larger sample sizes and/or samples from more points should produce a more accurate estimate. Still, you can see in my sample file that sampling 5000 frames from multiple points in the file gives error rates comparable to or better than the previous results where I was scanning the first 75000 frames, so I'm fairly pleased with that. Obviously for the most accurate results you would want to scan 100% of the file, but this seems an acceptable compromise for speed.
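
The error column appears to be the relative difference between the estimated and the actual average bitrate (an inference from the figures, not something stated explicitly). A quick sketch that reproduces three of the rows under that assumption:

```php
<?php
/** Relative error of an estimate against the true value, in percent. */
function relativeErrorPct(float $estimate, float $actual): float
{
    return abs($estimate - $actual) / $actual * 100;
}

$actualBps = 19691; // full-scan average bitrate from the table
foreach ([1 => 20621, 2 => 19682, 5 => 19328] as $segments => $bps) {
    printf("%2d segment(s): %.2f%%\n", $segments, relativeErrorPct($bps, $actualBps));
}
// prints 4.72%, 0.05% and 1.84%
```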

My code, replacing approx lines 1451-1473 of v1.7.8b2, looks like this:

Code: Select all

$dummy = array('error'=>$ThisFileInfo['error'], 'warning'=>$ThisFileInfo['warning'], 'avdataend'=>$ThisFileInfo['avdataend'], 'avdataoffset'=>$ThisFileInfo['avdataoffset']);
$synchstartoffset = $ThisFileInfo['avdataoffset'];
fseek($fd, $ThisFileInfo['avdataoffset'], SEEK_SET);

// you can play with these numbers:
$max_frames_scan  = 50000;
$max_scan_segments = 10;

// don't play with these numbers:
$FastMode = false;
$SynchErrorsFound = 0;
$frames_scanned   = 0;
$this_scan_segment = 0;
$frames_scan_per_segment = ceil($max_frames_scan / $max_scan_segments);
$pct_data_scanned = 0;
for ($current_segment = 0; $current_segment < $max_scan_segments; $current_segment++) {
	$frames_scanned_this_segment = 0;
	if (ftell($fd) >= $ThisFileInfo['avdataend']) {
		break;
	}
	$scan_start_offset[$current_segment] = max(ftell($fd), $ThisFileInfo['avdataoffset'] + round($current_segment * (($ThisFileInfo['avdataend'] - $ThisFileInfo['avdataoffset']) / $max_scan_segments)));
	if ($current_segment > 0) {
		fseek($fd, $scan_start_offset[$current_segment], SEEK_SET);
		$buffer_4k = fread($fd, 4096);
		for ($j = 0; $j < (strlen($buffer_4k) - 4); $j++) {
			if (($buffer_4k[$j] == "\xFF") && ($buffer_4k[$j + 1] > "\xE0")) { // synch detected (square-bracket offsets; the {} string-offset syntax was removed in PHP 8)
				if (getid3_mp3::decodeMPEGaudioHeader($fd, $scan_start_offset[$current_segment] + $j, $dummy, false, false, $FastMode)) {
					$calculated_next_offset = $scan_start_offset[$current_segment] + $j + $dummy['mpeg']['audio']['framelength'];
					if (getid3_mp3::decodeMPEGaudioHeader($fd, $calculated_next_offset, $dummy, false, false, $FastMode)) {
						$scan_start_offset[$current_segment] += $j;
						break;
					}
				}
			}
		}
	}
	$synchstartoffset = $scan_start_offset[$current_segment];
	while (getid3_mp3::decodeMPEGaudioHeader($fd, $synchstartoffset, $dummy, false, false, $FastMode)) {
		$FastMode = true;
		$thisframebitrate = $MPEGaudioBitrateLookup[$MPEGaudioVersionLookup[$dummy['mpeg']['audio']['raw']['version']]][$MPEGaudioLayerLookup[$dummy['mpeg']['audio']['raw']['layer']]][$dummy['mpeg']['audio']['raw']['bitrate']];

		if (empty($dummy['mpeg']['audio']['framelength'])) {
			$SynchErrorsFound++;
			$synchstartoffset++;
		} else {
			@$ThisFileInfo['mpeg']['audio']['bitrate_distribution'][$thisframebitrate]++;
			@$ThisFileInfo['mpeg']['audio']['stereo_distribution'][$dummy['mpeg']['audio']['channelmode']]++;
			@$ThisFileInfo['mpeg']['audio']['version_distribution'][$dummy['mpeg']['audio']['version']]++;

			$synchstartoffset += $dummy['mpeg']['audio']['framelength'];
		}
		$frames_scanned++;
		if ($frames_scan_per_segment && (++$frames_scanned_this_segment >= $frames_scan_per_segment)) {
			$this_pct_scanned = (ftell($fd) - $scan_start_offset[$current_segment]) / ($ThisFileInfo['avdataend'] - $ThisFileInfo['avdataoffset']);
			if (($current_segment == 0) && (($this_pct_scanned * $max_scan_segments) >= 1)) {
				// file likely contains < $max_frames_scan, just scan as one segment
				$max_scan_segments = 1;
				$frames_scan_per_segment = $max_frames_scan;
			} else {
				$pct_data_scanned += $this_pct_scanned;
				break;
			}
		}
	}
}
if ($pct_data_scanned > 0) {
	$ThisFileInfo['warning'][] = 'too many MPEG audio frames to scan, only scanned '.$frames_scanned.' frames in '.$max_scan_segments.' segments ('.number_format($pct_data_scanned * 100, 1).'% of file) and extrapolated distribution, playtime and bitrate may be incorrect.';
	foreach ($ThisFileInfo['mpeg']['audio'] as $key1 => $value1) {
		if (!preg_match('/_distribution$/i', $key1)) { // preg_match instead of eregi(), which was removed in PHP 7
			continue;
		}
		foreach ($value1 as $key2 => $value2) {
			$ThisFileInfo['mpeg']['audio'][$key1][$key2] = round($value2 / $pct_data_scanned);
		}
	}
}

if ($SynchErrorsFound > 0) {
	$ThisFileInfo['warning'][] = 'Found '.$SynchErrorsFound.' synch errors in histogram analysis';
	//return false;
}
You can play with the number of frames and segments to scan and see how it performs for you. I think for now I'm going to leave the default at 40000 frames in 10 segments.

edit: refined code to make it work better for short headerless VBR files (if whole file has < $max_frames_scan frames it doesn't make sense to seek-and-scan in multiple places).
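
For anyone tuning the two numbers, the segment layout used above boils down to evenly spaced byte offsets across the audio data, with the per-segment frame budget rounded up. A standalone sketch of just that arithmetic; the function names are made up, and it leaves out both the ftell() clamp and the resynchronization to a real frame start that the full code performs.

```php
<?php
/** Nominal byte offset at which each scan segment begins. */
function segmentStartOffsets(int $dataStart, int $dataEnd, int $segments): array
{
    $offsets = [];
    $span = $dataEnd - $dataStart;
    for ($i = 0; $i < $segments; $i++) {
        $offsets[] = $dataStart + (int) round($i * $span / $segments);
    }
    return $offsets;
}

/** Frames to scan in each segment so the total stays near $maxFrames. */
function framesPerSegment(int $maxFrames, int $segments): int
{
    return (int) ceil($maxFrames / $segments);
}

print_r(segmentStartOffsets(1000, 11000, 10));
echo framesPerSegment(5000, 10), "\n";
```

With audio data running from byte 1000 to byte 11000 and 10 segments, the segments start at 1000, 2000, and so on up to 10000, and a 5000-frame budget gives 500 frames per segment.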
