The High Efficiency Streaming Protocol (HESP) is an adaptive HTTP-based video streaming protocol that brings a superior quality of experience to online viewers while reducing the cost of scaling media delivery by up to 20%. HESP enables sub-second end-to-end latency, as low as 400 ms. With zapping, start-up and seeking times well under 100 ms, it delivers an experience better than existing broadcast solutions. The HESP Alliance is a collaborative community of industry leaders striving to further improve the quality of experience of online viewers and the cost-efficient delivery of online video with HESP solutions.

AiaaS (Ai Ad as a Service) is an intelligent platform that powers AiLiVE & AiAds. It is integrated with Google’s DAI suite and OpenAI.

Aiplayer, AiLiVE & AiAds are compatible with Google DAI.

The GOP size, or the size of your Group of Pictures, is one of the main encoding parameters. It has a direct impact on video bitrate and video quality, and an indirect impact on end-to-end latency. It determines how often a keyframe (or IDR frame) will be available. In LL-HLS, the player requires a keyframe to start decoding, meaning it can start playback only at GOP boundaries. Longer GOPs cause higher start-up delay and higher latency.

Impact of GOP size on video bitrate

Apple’s recommended GOP size is 2 seconds. Typical LL-HLS implementations achieve 3-second end-to-end latency when the GOP is set to 1 second. However, small GOP sizes come at the cost of higher bandwidth consumption. The smaller the GOP size, the more frequent the keyframes. Depending on the video, keyframes can be 10 times larger than P frames, so short keyframe intervals increase the video bitrate and hence the bandwidth consumption.
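As a rough illustration of why frequent keyframes cost bandwidth, the sketch below uses a deliberately simple model that is not taken from the measurements in this article: every P frame has the same size, and a keyframe is a fixed multiple of that size (10x, the upper figure quoted above). The frame rate of 30 fps and the function name are assumptions made for the example. Real encoders behave differently depending on content, so treat the numbers as indicative only.

```python
def relative_bitrate(gop_seconds: float, fps: int = 30, keyframe_ratio: float = 10.0) -> float:
    """Bitrate relative to a hypothetical stream made of P frames only (toy model)."""
    frames_per_gop = max(1, round(gop_seconds * fps))
    # One keyframe per GOP; every other frame is a P frame of size 1.
    total_size = keyframe_ratio + (frames_per_gop - 1)
    return total_size / frames_per_gop


for gop in (0.5, 1, 2, 6, 10):
    print(f"GOP {gop:>4} s -> {relative_bitrate(gop):.2f}x")
```

In this model the overhead shrinks quickly as the GOP grows, which is in line with the direction of the trend described above, although the size of the effect depends on the video.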

In this table, you can see how video bitrate changes with different GOP sizes. To give a comprehensive view, four different types of video were tested:

A movie (Tears of Steel)

An animation (Big Buck Bunny)

A bike race TV program

A static screen streaming.

  • For this calculation, the CRF (Constant Rate Factor) parameter is kept the same for all GOP sizes, forcing the encoder to keep the same video quality at every GOP size. As we can see, at larger GOP sizes we can keep the same video quality while using less bandwidth.
  • The bitrate reduction at large GOP sizes differs depending on the type of video. For example, for the static video type (e.g. screen streaming), we see up to a 70% reduction in bandwidth consumption from GOP 0.5 seconds to GOP 10 seconds. For the other videos, we still see up to a 20% reduction in video bitrate.

Impact of GOP size on video quality

GOP size also has an impact on video quality. The larger the GOP size, the higher the video quality, because for the same bitrate we can put more detail into the P frames when the GOP is larger.

We studied how the GOP size affects the video quality. To measure the video quality, we use the VMAF metric. Below is a brief explanation of VMAF.

What is VMAF?

Video Multi-method Assessment Fusion (VMAF) is a video quality metric designed by Netflix that consolidates four different metrics:

  • Visual Information Fidelity (VIF): considers fidelity loss at four different spatial scales
  • Detail Loss Metric (DLM): measures detail loss and impairments which distract viewer attention
  • Mean Co-Located Pixel Difference (MCPD): measures the temporal difference between frames on the luminance component
  • Anti-Noise Signal-to-Noise Ratio (AN-SNR)

The VMAF score ranges between 0 and 100 (100 being identical to the reference video). A difference of 6 VMAF points is noticeable. The default VMAF model is used in this test.

In the table below, we show how GOP size affects the video quality for different video types. For each encoded video, a VMAF score relative to the reference video has been calculated. The VMAF drop at smaller GOP sizes differs by video type. For all videos except the static screen stream, there is a significant VMAF drop between GOP 10 sec and GOP 0.5 sec. Big Buck Bunny drops by 8 points between GOP 10 sec and GOP 1 sec, which is a noticeable quality degradation.

Please note that the aim of this test was to see the impact of different GOP sizes on the VMAF scores. All videos are encoded at a maximum bitrate of 4 Mbps. The chosen 4 Mbps may not be the bitrate with the highest VMAF score for a given resolution, but matching the highest VMAF score for each resolution in different videos is out of the scope of this test.

Based on the VMAF points, we see that for some types of video, such as static screen streaming, the quality does not improve much with a large GOP size, while you still gain a huge reduction in bandwidth consumption (Table 1). On the other hand, for other types of video, such as Big Buck Bunny, the video quality improves by up to 15 VMAF points (GOP 10 seconds compared with GOP 0.5 seconds), which is considerable since every 6 VMAF points is a visually noticeable difference. The Tears of Steel video shows another pattern: the VMAF improvement is below 6 points (between GOP 1 sec and GOP 10 sec), yet you still get a ~20% bitrate reduction at the largest GOP size (Table 1).

Impact of GOP size on zapping time and latency

In LL-HLS live streaming, if we increase the GOP size to decrease the bandwidth consumption and increase the video quality, we need to sacrifice the short zapping time and/or the latency.

The player requires a keyframe to start decoding, meaning that a large GOP will impact the zapping time and latency of the stream. It can either wait for the following GOP, implying a long startup time and low latency or it can start playback of the current GOP, implying short startup times, but potential latencies of up to the GOP size. Having large GOPs with only one keyframe every 6 seconds, for example, will mean that the player can start playback on a position once every six seconds. This doesn’t mean your zapping time will be six seconds, but it might require your player to start at a higher latency. With the 6 seconds example, starting playback immediately implies that the average additional latency at the start will be 3 seconds, and in the worst case it can reach up to 6 seconds.
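The trade-off between the two start-up strategies can be sketched in a few lines. This is a simplified model, not a player implementation: it assumes keyframes arrive exactly once per GOP, ignores download and decode time, and assumes viewers join at a uniformly random moment within the GOP. The function names are made up for this example.

```python
def join_options(gop_ms: int, offset_ms: int) -> dict[str, dict[str, int]]:
    """Compare both strategies for a viewer joining offset_ms into the current GOP."""
    if not 0 <= offset_ms < gop_ms:
        raise ValueError("offset_ms must fall inside the GOP")
    return {
        # Wait until the next GOP starts: slower start, no added latency.
        "wait_for_next_keyframe": {
            "startup_delay_ms": (gop_ms - offset_ms) % gop_ms,
            "extra_latency_ms": 0,
        },
        # Start from the keyframe of the current GOP: instant start, added latency.
        "start_from_current_keyframe": {
            "startup_delay_ms": 0,
            "extra_latency_ms": offset_ms,
        },
    }


def average_extra_latency_ms(gop_ms: int) -> float:
    """Mean added latency when starting immediately, for a uniform join offset."""
    return gop_ms / 2


print(join_options(6000, 4500))
print(average_extra_latency_ms(6000))  # 3000.0
```

For the 6-second GOP used above, the average added latency comes out at 3 seconds and the worst case approaches the full 6 seconds, matching the figures in the text.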


Optimizing LL-HLS: How does HESP fare against LL-HLS?

Four key factors affect the quality of the low-latency streaming experience when using Apple’s LL-HLS protocol. We also discuss the importance of GOP size and its impact on the overall viewing experience, and provide 4 recommendations that you can implement to ensure the best QoE and the best viewing quality.

  1. GOP size: determines how often a keyframe (or IDR frame) will be available. In LL-HLS, the player requires a keyframe to start decoding, meaning it can start playback only at GOP boundaries. Longer GOPs cause higher start-up delay and higher latency.
  2. Part size: In LL-HLS, the player is not limited to starting playback at segment boundaries and can start at every independent part (a part that starts with a keyframe). The part size has a direct influence on the end-to-end latency in LL-HLS: the smaller the part size, the lower the latency. But it is not that simple. Apple says that parts can be as short as 200 ms, but we need to keep in mind that in LL-HLS the player must start playback with a keyframe. If the part does not start with a keyframe (which is the case when the part size is smaller than the GOP size), the player must either seek back to a part that starts with a keyframe or wait for the next keyframe. For example, consider a GOP size of 2 seconds, a part size of 500 ms, and a playback request sent in the middle of a 6-second segment. The player needs a keyframe to start playback. It must either wait for the next keyframe, which arrives in the third part from now and means a zapping time of at least 1.5 seconds, or seek back two parts, which adds 1 second to the end-to-end latency.
  3. Segment size: The segment size in LL-HLS does not directly impact the latency as it does in traditional HLS. In general, longer segments are desirable because they allow a larger GOP size, which means higher video quality and lower bandwidth consumption. On the other hand, in LL-HLS a large segment size increases the number of parts that you need to list in your playlist. As a result, it affects the size of the playlist (and how much data must be loaded in parallel with the media data). Long segments can therefore significantly increase the size of the playlist, causing overhead on the network and impacting streaming quality. Segments can’t be too small either, since that imposes a smaller GOP size and therefore lower video quality and higher bandwidth consumption.
  4. Buffer size, Network tolerance & ABR in Low Latency Streaming: There is always a trade-off between guaranteeing smooth playback in all (network) conditions and achieving the lowest possible latency. To cope with network and other variations, LL-HLS maintains a buffer to handle jitter and unforeseen hiccups in the video transmission. The larger the buffer, the higher the tolerance for network issues, but also the higher the latency. In LL-HLS, the default is 3 part durations in the buffer. For example, with parts of 400 ms, your buffer will target a size of 1.2 s. Based on our tests, with correct settings for the part and GOP size and a slightly larger part size (for example, around 1 second), the buffer size can be slightly decreased without impact on user experience. However, as a baseline, the buffer should never hold fewer than 2 parts. Network conditions are not always perfect, though. Besides jitter, we also encounter drops and variations in network capacity. To cope with this varying network bandwidth, ABR is needed. For ABR to work effectively, the buffer should be long enough to accommodate the quality switch, just in time before any glitch or rebuffering occurs in playback. Let’s consider the worst case: the buffer size is 2 seconds, the segment is 6 seconds, the GOP size is 3 seconds, and the network bandwidth drops to half of the video bitrate near the end of the segment. The player would need to download a new part from a lower quality that starts with a keyframe. Because we are near the end of the segment and the GOP size is 3 seconds, neither the current part nor the previous part contains a keyframe, and the player must download the third prior part to be able to switch the quality down. So you would need to download 3 seconds of data while you have only 2 seconds of buffer.
If you reduce the GOP size to 2 seconds, you may still get stalls during the ABR switch. Therefore, you need to increase the buffer size to make sure you can have a smooth quality switch. A larger buffer size means longer latency. You might think of reducing the GOP size to smaller values to allow a proper ABR switch down without stalling but, as discussed earlier, a smaller GOP size comes with lower video quality and higher bandwidth consumption, which brings an extra challenge to the ABR itself.
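The part-size example in item 2 (2-second GOP, 500 ms parts, request in the middle of a 6-second segment) can be reproduced with a small calculation. The sketch below rests on two assumptions that are our reading of the example rather than statements from the LL-HLS specification: keyframes sit at exact multiples of the GOP size, and a part can be fetched only once it has been fully published. The function name is hypothetical.

```python
def keyframe_join_costs(gop_ms: int, part_ms: int, request_ms: int) -> tuple[int, int]:
    """Return (wait_ms, seek_back_ms) for a playback request at request_ms.

    wait_ms:      time until the next part that starts with a keyframe is complete.
    seek_back_ms: extra latency when starting from the previous keyframe instead.
    """
    prev_keyframe = (request_ms // gop_ms) * gop_ms
    next_keyframe = prev_keyframe + gop_ms
    # The independent part is usable only after its full duration has been produced.
    wait_ms = (next_keyframe + part_ms) - request_ms
    seek_back_ms = request_ms - prev_keyframe
    return wait_ms, seek_back_ms


print(keyframe_join_costs(gop_ms=2000, part_ms=500, request_ms=3000))  # (1500, 1000)
```

With these inputs the player either waits 1.5 seconds or accepts 1 second of additional latency, which are the two figures given in the example.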

Next, we explain how HESP works, how it differs from LL-HLS, and how the two compare in performance.

HESP is a next-generation online video delivery technology that outperforms the current generation of protocols for low latency streaming at scale. It is an ultra-low latency streaming protocol delivered over HTTP/1.1 using Chunked Transfer-Encoding and Range requests, with a minimalistic manifest that needs only infrequent updates. Two complementary streams are required:

  • The Initialization stream, which contains keyframes that make it possible to start playback at any given moment, not necessarily at the beginning of a segment or at a keyframe interval, and
  • The Continuation stream, which contains the IPB frames and can be played right after the keyframe from the Initialization stream.

HESP offers a broadcast-like experience with sub-second latency and zapping time on any device or platform. It also delivers very low bandwidth consumption compared to other ultra-low latency streaming protocols such as WebRTC. Being delivered over HTTP, it is compliant with standard CDNs and offers low-cost scaling. 

How does an end-to-end HESP solution work?

As mentioned above, HESP is based on using two streams for each quality/track: 

  1. Initialization stream to rapidly start new streams.
  2. Continuation stream for use in normal operation.

What is Initialization Stream?

The initialization stream consists of initialization packets, one for each frame position. The initialization packets are individually addressable. Each contains an IDR frame for its frame position, making it possible to start playback at any given frame, and is packaged in ISOBMFF format.

What is Continuation Stream?

The continuation stream is packaged in CMAF-CTE, albeit with specific configurations for low latency and can start playback immediately after an initialization frame, allowing for very fast channel start and switch times. It is addressed using byte-range requests and is served using Chunked Transfer Encoding for low latency.

The segments in the continuation stream can be lengthy without any limitation for the low latency and fast zapping time, making it possible to have a large GOP size and hence lower bandwidth consumption and higher video quality.
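To make the interplay of the two streams concrete, here is a toy model of the frames a player would decode when joining mid-GOP. It is only a conceptual sketch: it does not reflect the actual HESP packaging, addressing, or manifest format, the class and function names are invented for this example, and B frames are left out for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    position: int
    kind: str    # "IDR" for an initialization packet, otherwise "I" or "P"
    stream: str  # "init" or "continuation"


def frames_to_play(start: int, end: int, gop_frames: int) -> list[Frame]:
    """Frames decoded when joining at frame `start` and playing up to `end` (exclusive)."""
    # The initialization stream offers an IDR frame for every frame position,
    # so playback can begin at `start` even if it is in the middle of a GOP.
    first = Frame(start, "IDR", "init")
    # Playback then continues with the regular frames of the continuation stream.
    rest = [
        Frame(pos, "I" if pos % gop_frames == 0 else "P", "continuation")
        for pos in range(start + 1, end)
    ]
    return [first] + rest


for frame in frames_to_play(start=37, end=42, gop_frames=300):
    print(frame)
```

Only the first frame comes from the initialization stream. Everything after it is taken from the continuation stream, which is why the GOP of the continuation stream can stay large.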

HESP Implementation:

In order to implement HESP, only two components of the video value chain need tailoring:

  1. The packager
  2. The player

HESP works with regular encoders and regular CDNs, as long as these support CTE and byte ranges.

Comparing LL-HLS to HESP:

HESP provides sub-second end-to-end latency together with large GOP sizes (10-12 seconds). Thanks to the initialization stream, the quality switch in ABR is not limited to GOP boundaries and can happen at any given moment. This means HESP is not limited to a small GOP size. The GOP size can therefore be kept large while the buffer stays small (HESP has a sub-second target buffer), so it is possible to have low latency and a smooth quality switch at any time without risk of rebuffering.

By setting the same latency target in HESP as in LL-HLS (~3 sec), you would have more margin to encode the video efficiently, resulting in a lower video bitrate for the same video quality and therefore lower bandwidth consumption.

As described earlier, LL-HLS cannot really exploit the small part size as there are also other consequences to be taken into account; no matter how small the part size is in LL-HLS, you are limited to the keyframe interval to be able to switch the quality in bad network conditions. In HESP, on the other hand, starting the playback is not limited to the GOP boundaries. Therefore, you do not need to sacrifice video quality (smaller GOP) to have the lowest end-to-end latency.  

While LL-HLS cannot really exploit the small part size to have low latency in bad network conditions, HESP offers a small buffer size, low latency, large GOP size, and higher video quality all at the same time.

Depending on the use case and the desired priorities (e.g. latency, bandwidth consumption, video quality and network resiliency), encoding and packaging parameters, as well as buffer size, can be configured differently. Here we go through the most important parameters:

1. GOP: Set your keyframe interval to 2-3 seconds

Based on the explanation in our previous blog, small GOP sizes seem extremely attractive. However, if you have a lot of keyframes, it increases inefficiency in compression, which means you will use more bandwidth and streaming quality will go down for the same bitrate. This effect becomes large when GOP sizes fall below 2s. In case you are interested in lower bandwidth consumption and reasonable start-up time, the recommendation from AiC’s side would be to set your keyframe interval to 2 to 3 seconds. On the other hand, if your priority is to have small start-up delays and low latency, the GOP size should be smaller and should be set in a way that all parts start with a keyframe.

2. Part Size: Use a 400 ms part size for the lowest end-to-end latency

As discussed previously, in an ideal world, the part size and the GOP size should be equal to have the least zapping time because in that case we have all parts marked as “independent” and the player can start the playback at any part boundaries. But having a smaller part size will lead to a lower minimum buffer size and so lower latency. However, too small a part size will cause overhead because of too many HTTP requests that should be handled. If you can guarantee the perfect network condition and your main focus is to have the lowest end-to-end latency, we recommend using 400 msec part size. If instead the network condition is variable and you need to have a smooth playback during network ups and downs and also benefit from extra-low zapping time, we recommend setting your keyframe interval and part size to 1 second as it strikes a balance between latency and viewer experience at start-up.

3. Segment Size: Set it equal to or larger than your GOP size

We’ve established that the segment size should be equal to or larger than your GOP size. It cannot be too small due to consequent poor video quality and it cannot be too large because of the LL-HLS limitations mentioned above. Apple’s recommendation for segment size is 6 seconds for LL-HLS which is a good balance between video quality and overhead in the network. In HESP you won’t have such limitations for large GOP size and long segments which leads to better video quality and lower bandwidth consumption.

4. Buffer Size, Network Tolerance & ABR: Find the best middle ground

For low latency / fast start-up streaming with LL-HLS, it is important to have a clear understanding of the impact of each parameter on the final result. End-to-end latency depends directly on the part size. The zapping time, on the other hand, depends directly on the GOP size and cannot go lower than that, even with a smaller part size.

So the smallest part size gives you the lowest latency, but not necessarily the shortest zapping time (for example, when the part is shorter than the GOP and is not one of its divisors, e.g. 1/2 or 1/3 or … of the GOP). Part sizes smaller than the GOP are not really helpful during the quality switch for ABR, as the quality switch can happen only at independent parts, which correspond to the GOP boundaries.

Therefore, the ideal situation to have the lowest zapping time and latency is to have the part and GOP size equal and as small as possible. A GOP size lower than 1 sec does not really make sense because of the poor video quality and high bandwidth needs, therefore a good value would be 1 second in order to achieve the lowest zapping time, latency and smooth ABR switches with 2 seconds buffer (2 parts). However, the GOP size of 1 second could be demanding for the bandwidth consumption. AiC’s recommendation would be a GOP size of 2 seconds with the part size of 1 second and buffer size of 3 seconds which is a good combination for reasonable video quality, bandwidth consumption, latency and zapping time.
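The rules of thumb from this section can be collected into a small sanity check. The sketch below is only an illustration of the guidance given here (segment at least as long as the GOP, part size dividing the GOP, at least 2 parts of buffer, GOP not below 1 second). The function name and warning texts are made up, and it is not part of any LL-HLS tooling.

```python
def check_ll_hls_settings(gop_ms: int, part_ms: int, segment_ms: int, buffer_ms: int) -> list[str]:
    """Return warnings for settings that conflict with the guidance in this article."""
    warnings = []
    if segment_ms < gop_ms:
        warnings.append("segment is shorter than the GOP")
    if gop_ms % part_ms != 0:
        warnings.append("part size does not divide the GOP size evenly")
    if buffer_ms < 2 * part_ms:
        warnings.append("buffer holds fewer than 2 parts")
    if gop_ms < 1000:
        warnings.append("GOP below 1 second: expect lower quality and higher bandwidth")
    return warnings


# The combination recommended above: 2 s GOP, 1 s parts, 3 s buffer, 6 s segments.
print(check_ll_hls_settings(gop_ms=2000, part_ms=1000, segment_ms=6000, buffer_ms=3000))  # []
```

An empty list means the settings are consistent with the recommendations. Any returned warning points at a parameter worth revisiting.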

Using the RTMP push URL and the stream key you have received when creating a channel, you can start streaming content to that RTMP endpoint.

RTMP pull

If you would like to do pull-based streaming instead of push-based streaming, you can also use your own RTMP pull endpoint and specify this when starting the channel in the next step.

The RTMP push URL generated by AiLiVE is shown as RTMPS. If you would like to stream over RTMP instead, you will have to change “rtmps” to “rtmp”. For example,

rtmps://rtmp.europe-west.aic.live/live

will have to be changed to

rtmp://rtmp.europe-west.aic.live/live

for RTMP-based streaming.
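If the URL is handled in a script, the scheme change can be automated. The helper below is a minimal sketch with a made-up function name. It only rewrites the scheme prefix and leaves every other URL untouched.

```python
def to_plain_rtmp(url: str) -> str:
    """Turn an rtmps:// push URL into its rtmp:// equivalent."""
    prefix = "rtmps://"
    if url.startswith(prefix):
        return "rtmp://" + url[len(prefix):]
    # Anything that is not an RTMPS URL is returned unchanged.
    return url


print(to_plain_rtmp("rtmps://rtmp.europe-west.aic.live/live"))
# rtmp://rtmp.europe-west.aic.live/live
```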