high.definition.x264.standards.revision.3.1.addendum.1.nfo

High.Definition.x264.Standards.Revision.3.1.Addendum.1-HDX

As there lately have been several pres with the wrong number of reference frames
we thought we'd explain the reasoning behind the rule set.

On various x264 pages on the internet the following can be read about reference frames:

"Selects the maximum number of reference frames that can be used. Referenced frames
are frames that refer to other frames (eg. if both frames are similar). Having a high
referenced frame will improve quality but slow up encoding. For typical content,
a reference frame of 3 to 5 is recommended. For content with a lot of repetition
(eg. animation), a reference frame of 8 to 10 can be used."

So, higher reference frames means higher quality, why do we then enforce max 4
reference frames on high resolution video? It has to do with hardware players.
The popcorn hour, twix, wd etc. all support Level 4.1 (L4.1) of the ITU-T h264
specification. All graphic cards that have DXVA also support L4.1. So let's see
what L4.1 says about reference frames.

In table 'A-1 Level limits' on page 283 (pdf 305) of the ITU-T specification it
says that MaxDPB for L4.1 is 12288 KiB. MaxDPB is the Maximum Decoded Picture Buffer,
which is the largest size allowed of the decoded picture buffer when decoding a video.
By supporting L4.1, the hardware players must have at least 12 MiB of buffer for
storing the decoded pictures. This means that a video that requires a buffer of
13 MiB is not guaranteed to work on one of these players.

As 16 * 16 pixels macroblocks are used, all resolutions needs to be mod16, for ease of
reading, the maths to make them mod16 is not included below.

The DPB in KiB is calculated as follows:

DPB = vertical resolution * horizontal resolution * 1.5 * reference frames / 1024

If we transform this formula to get the reference frames instead we get:

ref = 12288 * 1024 / (vertical resolution * horizontal resolution * 1.5)

We of course can't use partial frames for referencing and thus the reference frames
should be rounded down to the closest integer. We can also transform this to get the
maximum vertical resolution for a specific reference frames value, here we need to
round the vertical resolution down to mod16:

vertical res = 12288 * 1024 / (horizontal resolution * 1.5 * reference frames)

With the above formula we can conclude that the highest vertical resolution that we
can have ref 5 on and still be L4.1 compliant is 864 pixels.

873.813333 = 12288 * 1024 / ( 1920 * 1.5 * 5 )
864 = floor( 873.813333 / 16 ) * 16

The 1.5 in the calculations above is the YV12 colourspace, it needs 12 bits to store
1 pixel. In other words, 1.5 bytes per pixel.

So, to conclude this, the reason we put ref 4 as max for movies with vertical resolution
greater than 864 in rules is not because we want to be able to encode releases faster.
It's because we want releases to be L4.1 compliant and thus possible to play on the
popcorn hour, twix and other hardware players. And we require at least ref 5 on all
videos where it's possible while still respecting L4.1, this to ensure high quality.

ITU-T specification:
http://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-H.264-200711-I!!PDF-E&type=items

12 bits per pixel for YV12:
http://msdn.microsoft.com/en-us/library/aa904813.aspx#yuvformats_420formats_12bitsperpixel