A view on VP9 and AV1 part 1: specifications

Introduction

The success of a video coding standard depends on many factors. Many articles benchmark the performance of codec implementations or comment on the codec ecosystem, but I have not seen any article about the standardization process or the bitstream format. While video quality is key, I believe the bitstream format matters too, even though it is probably less accessible and harder to communicate about.

This is a two-part analysis. Part 1 (this article) compares the HEVC and VP9/AV1 codecs in the light of their bitstream formats. Part 2 will look at some other important considerations.

Bitstream format comparison with HEVC

The VP9 bitstream format specification was frozen in June 2013, but it seems that some features (e.g. Levels) were added afterwards. AV1 derives from VP9, with most of the additions being made by the Google teams based on the developments of VP10. AV1 is still changing a lot, but its high-level structures are unlikely to change. I expect the authors to add low-level tools such as transform types (ADCT, ADST, …), which are inferred in HEVC but can be custom in AV1, and also tools from the other codec makers (Daala from Mozilla and Thor from Cisco).

VP9 and AV1 bitstreams are quite different from AVC/HEVC in several ways:

  • In VP9/AV1, each frame is complete. There is no such thing as Slices, only Tiles. As a consequence, there is no high-level abstraction like NALUs for HEVC (and AVC and derivatives) and frame fragmentation over RTP is simpler but less robust to packet loss.
  • The state of a VP9/AV1 encoder or decoder can be represented by the set of reference frames plus the arithmetic context. There is no need like in AVC/HEVC to maintain which parameter set (SPS, PPS, …) is active.
  • In VP9/AV1, a “Frame parallel decoding mode” allows frames to be parsed in parallel.
  • The arithmetic coding is much simpler than in AVC/HEVC. The state of the arithmetic coder can be duplicated for a group of frames.
  • Reference Picture Set (RPS) is simpler in VP9.
  • There is no anti-emulation as required in AVC/HEVC. This simplifies the decoding process but at the cost of not being able to transport VP9 or AV1 on MPEG-2 TS.
  • The pixel reconstruction phase of VP9 requires more hardware surface than HEVC, because VP9 uses numbers with higher precision, which requires more adders. In addition, the transform coefficients are ordered in a less predictable way in VP9 (which also results in more hardware surface).
  • There is no such thing as VP9 or AV1 File Format (AVC/HEVC has raw, Annex B, and canonical/MP4). One needs to use the IVF File Format or WebM.
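
To make the anti-emulation point concrete, here is a minimal Python sketch of the emulation prevention mechanism used by AVC/HEVC. Inside a NAL unit payload, a 0x03 byte is inserted after two consecutive zero bytes whenever the next byte is 0x03 or less, so that a start code prefix (0x000001) can never appear inside the payload. The function names are mine and purely illustrative; this is not code from any reference decoder.

```python
def add_emulation_prevention(payload: bytes) -> bytes:
    """Insert 0x03 after any 0x00 0x00 pair followed by a byte <= 0x03."""
    out = bytearray()
    zeros = 0  # number of consecutive zero bytes just written
    for b in payload:
        if zeros >= 2 and b <= 0x03:
            out.append(0x03)  # emulation prevention byte
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)


def remove_emulation_prevention(data: bytes) -> bytes:
    """Drop every 0x03 that directly follows two zero bytes."""
    out = bytearray()
    zeros = 0
    for b in data:
        if zeros >= 2 and b == 0x03:
            zeros = 0  # skip the emulation prevention byte
            continue
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)
```

A decoder has to run the second function on every NAL unit before parsing it. VP9 and AV1 skip this step entirely, which is the simplification (and the MPEG-2 TS limitation) mentioned in the list above.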

In my view, the VP9 specification shines by its simplicity, but there is one concept that seems weird to me (although I did not follow all the codec history): Superframes. The Superframe concept seems to be a workaround for a B-frame patent. A VP9 encoder first produces a frame that won’t be displayed now, packed with a frame that is displayed now; later, the encoder produces an almost empty frame (skip) that only references the previously non-displayed frame. What is also weird is that the header of the Superframe is at the end of the frame (bug filed for AV1). A possible explanation is that Superframes were introduced when some decoders were already out, and that this was the way to avoid breaking those decoders.
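
The "header at the end" layout can be illustrated with a short Python sketch. It follows my reading of the VP9 superframe index: the last byte of the chunk is a marker whose top three bits are 110, followed by two bits giving the size of each frame-size field minus one and three bits giving the number of frames minus one; the same marker byte is repeated at the start of the index, and the frame sizes in between are little-endian. The function name is mine, and this is a sketch rather than a conformant parser.

```python
from typing import List


def parse_superframe_index(chunk: bytes) -> List[int]:
    """Return the sizes of the frames packed in `chunk`, or [] if no index is found."""
    if not chunk:
        return []
    marker = chunk[-1]                      # the header is read from the END
    if (marker & 0xE0) != 0xC0:             # top 3 bits must be 0b110
        return []
    frames = (marker & 0x07) + 1            # frames_in_superframe_minus_1
    size_bytes = ((marker >> 3) & 0x03) + 1  # bytes per frame size field
    index_size = 2 + size_bytes * frames    # two marker bytes plus the sizes
    if len(chunk) < index_size or chunk[-index_size] != marker:
        return []                           # marker not mirrored: not an index
    pos = len(chunk) - index_size + 1
    sizes = []
    for _ in range(frames):
        sizes.append(int.from_bytes(chunk[pos:pos + size_bytes], "little"))
        pos += size_bytes
    return sizes
```

For example, two frames of 5 and 3 bytes followed by the index bytes `C1 05 03 C1` yield `[5, 3]`. A chunk whose last byte does not carry the marker is treated as a single ordinary frame.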

Encoding considerations

VP9 and AV1 have one serious problem: the encoding process may lead to integer overflows, and the encoder cannot predict these overflows before reconstructing the local coding block. Hence, in the worst case, if the overflow happens late in the frame encoding, the encoder might need to re-encode the entire frame. This may be OK for offline encoding, where multipass is used anyway, but it is unacceptable for real-time encoding. More specifically, these overflows may occur when encoding residuals. In practice, any intermediate value may overflow, so you need to check them all. If a check fails, you have to re-encode with another quantizer; otherwise the 18-bit intermediate registers used for the quantizer at the decoder will overflow. So, by making life simple for decoders with 18-bit intermediate registers, VP9 and the current AV1 make encoders’ lives much more difficult, because some combinations are de facto impossible to make conformant.
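
The check-and-retry burden described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `reconstruct_intermediates` is a hypothetical callback standing in for the encoder's local reconstruction (it returns every intermediate value produced for a given quantizer), and the 18-bit width is simply the figure quoted in this article. Neither comes from a real encoder API.

```python
from typing import Callable, Iterable, Optional


def fits_signed(value: int, bits: int = 18) -> bool:
    """True if `value` is representable as a signed two's complement integer of `bits` bits."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return lo <= value <= hi


def pick_conformant_quantizer(
    candidates: Iterable[int],
    reconstruct_intermediates: Callable[[int], Iterable[int]],  # hypothetical
    bits: int = 18,
) -> Optional[int]:
    """Return the first candidate quantizer whose intermediates all fit, else None."""
    for q in candidates:
        # Every intermediate value must be checked, not only the final one.
        if all(fits_signed(v, bits) for v in reconstruct_intermediates(q)):
            return q
    return None
```

The point of the sketch is the cost model: each candidate requires a full local reconstruction before the encoder knows whether the result is conformant, which is what makes the problem painful for real-time encoding.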

Note that there is one bug tagged as ‘won’t fix’ in VP9. The funny fact is that the issue was present in AVC but solved in HEVC (so the VP9 issue was likely imported from AVC).

Conclusion

This part 1 focused on the VP9 and AV1 bitstream specifications. We like VP9 and AV1 for the simplicity of the design and bitstream format. But at the moment, for bitstream reasons, these codecs cannot compete with HEVC for live encoding. VP9 was frozen in 2013, but AV1 is still a few months away from being bitstream frozen (planned for Q1 2017). We hope AV1 contributors will keep up the effort and provide us with a strong competitor for HEVC.

Stay tuned for Part 2 where I’ll talk about considerations such as the factors of success of a codec, patents, standardization, and deployment considerations.

And of course feel free to share and comment on this article 🙂

12 comments on “A view on VP9 and AV1 part 1: specifications”

  1. Kieran Kunhya

    There is no anti-emulation as required in AVC/HEVC. This simplifies the decoding process but at the cost of not being able to transport VP9 or AV1 on MPEG-2 TS.

    If they just mandated one frame per PES they could get away with this. Or they could just accept the chance of emulation which is only really an issue with resilience during loss of MPEGTS packets.

    Kieran

    • Romain Bouqueau

      Hi Kieran,

      Thanks for your input. You are right, those are choices that need to be made.

      About anti-emulation in HEVC, I found something really ugly in the slice_segment_header while writing this article. It was a bit out of scope here, but I’m sure that you and a few readers will understand the consequences of this poor choice (only one in the whole standard):

      entry_point_offset_minus1[i] plus 1 specifies the i-th entry point offset in bytes […]. When present, emulation prevention bytes that appear in the slice segment data portion of the coded slice segment NAL unit are counted as part of the slice segment data for purposes of subset identification.

      Romain

  2. Gary Hughes

    The Superframe seems very similar to the packed bitstream idea that is sometimes used
    to carry B pictures in AVI files. Since AVI files do not support picture reordering (no concept
    of Decode TimeStamps), the B picture is packed in with another picture and a subsequent
    empty picture is used to trigger the presentation of the B picture.

    If you google “Divx packed bitstream” you should find more details.

    gary

    • Romain Bouqueau

      Hi Gary,

      Thanks for your message. Two reviewers of the article made the same remark. So I’m glad you wrote about it in the comments 🙂

      There are two reasons why I chose not to mention it:
      1) I wanted to keep the article short: the first version was way longer so I had to cut large parts (and also split the article into two parts).
      2) Although the technical mechanism is the same, the motivation was different (as you rightfully mention): AVI did it because PTS reordering was not allowed. But Google did it for another reason, and I still think it is to work around a B-frame patent.

      The mechanism plus the signalling at the end of the frame was weird enough for me to consider Superframes error-prone (from a bitstream format perspective). Beyond what I described, a decoder implementer has to parse the content backward (i.e. from the end of the frame), with a possibility of bitstream ambiguity that is avoided by adding a fake final byte; from the VP9 bitstream specification:

      it is a requirement of bitstream conformance that the final byte of a coded frame must not contain a superframe_marker.

      I hope AV1 will make this clearer,

      Romain

  4. David Ronca

    It seems a little premature to talk about the lack of NALU structure or emulation later in AV1 as a given, since the codec spec is not complete. VP9 cannot be extended due to the lack of a system-level syntax. For example, you cannot carry HDR mastering metadata in the bitstream. This was (IMO) a gross oversight, and the same mistake cannot be made with AV1. That is, AV1 must have a bitstream structure that supports extensibility and lossy transmission (i.e. M2TS). This means AV1 needs a system-level syntax including start codes and thus emulation-prevention.

    • Romain Bouqueau

      Hi David,

      Thanks for your input.

      Agreed: since Superframes parse backward, VP9/AV1 now needs a more sophisticated mechanism to be extended. About Emulation, please read the message from Kieran Kunhya above and my reply.

      Doing it right the first time is very difficult. Some of the issues raised about VP9 (and the current AV1 codebase) in this article are quite important. People raised concerns about MPEG being too slow to standardize, but AVC and HEVC are almost bug-free despite their complexity (besides a few drawbacks I’ll talk about in the next article). The first version of AV1 is coming soon (Q1 2017) and I hope that’s enough time to get it right.

      Romain

  5. Alan S. Davis

    Thank you very much for your work. When do you plan to release part 2?

    • Romain Bouqueau

      I plan to release part 2 in mid or late August. The article is almost ready, with just a few editorial changes left to make.
