HEVC Decoding on the RPI4 Optimized With arm64 Assembly Code
– The luminance information of a single frame from the Big Buck Bunny open film. Decoded using the RPI4 HEVC decoder (v4l2-request-api) and converted to the correct layout using arm64 assembly code.
As some of you might recall, I discussed several APIs in the previous blog post. Admittedly, I focused more on libvaapi and the ffmpeg API in that post, but I also invested a few hours in attempting to utilise the Raspberry PI 4's video encoding and decoding hardware, unfortunately with success limited to 1080p and h264.
Having revisited the topic of HEVC decoding on the Raspberry PI 4 recently, I stumbled across a few limitations and attempted to resolve some of them. In fact, without these changes a 64bit linux distribution on the RPI4 will not be able to decode 2160p HEVC content at more than ~7fps, which of course does not satisfy most use cases. Jumping ahead a bit: I had to implement an arm64 version of the SAND-layout-to-planar-frame conversion in order to achieve more than 30fps. While this does not yet solve all of the challenges and limitations of this approach, it might get us closer to a good HEVC experience on the RPI4. Follow the discussions about the proposed improvements on GitHub: av_rpi_sand8_lines_to_planar_y8/c8: Add arm64 assembly implementation. (Update on the 18th of January: The changes have already been accepted and merged into the drm_prime_1 branch.)
But before jumping ahead to the solution in detail, let’s discuss the actual steps on how to enable the HEVC decoding first.
Note: This is a significantly rewritten and extended version of an article I published about two weeks ago on this blog. You can check out the original version here: Web Archive: HEVC Decoding on the Raspberry PI 4/arm64. This new article not only covers accessing the hardware decoder but also attempts to improve the software side by resolving some of the performance limitations. It still contains the compilation instructions for ffmpeg and the required configuration changes mentioned in the original blog post.
HEVC Hardware Decoder
As discussed in one of the previous blog posts, the stateful V4L2 API can be used to decode and encode H264 video content with resolutions up to 1080p. This also works well on 64bit operating systems and is supported by upstream ffmpeg. You probably noticed the decoders and encoders with a _v4l2m2m postfix in their names. The new HEVC decoder has to be accessed by different means. Fortunately there is a linux kernel module called rpivid, which makes the functionality available through the stateless V4L2 request API. Distributions shipping a recent enough kernel already include this module. I used Ubuntu 20.10 64bit for these experiments, but the 20.04 release should be fine as well.
In addition to having the kernel load the correct module, we also need a patched version of ffmpeg. The upstream release doesn't yet include support for the V4L2 request API, and we also need additional Raspberry PI-specific features. All of this is available in a fork on GitHub: rpi-ffmpeg/dev/4.3.1/drm_prime_1.
Compiling this release can be achieved with the instructions below. The important steps are preparing the correct kernel headers and configuring the correct build options. The --enable-sand flag is especially important: without it, ffmpeg will compile but won't be able to handle the decoded frames correctly. This is because the HEVC decoder produces frames with a particular layout, which we will discuss in more detail further down in this post.
Enough said, here are the instructions:
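In rough outline, the build looks like the sketch below. The branch name is the one mentioned above; the configure flags are my assumption of what a V4L2-request build needs, so check the fork's README and `./configure --help` before relying on them.

```shell
# Hypothetical build outline -- verify the flag list against the fork itself.
git clone --depth 1 --branch dev/4.3.1/drm_prime_1 \
    https://github.com/jc-kynesim/rpi-ffmpeg.git
cd rpi-ffmpeg

# --enable-sand is required so the decoder's SAND frame layout is handled;
# the other flags enable the V4L2 request API and DRM frame passing.
./configure \
    --enable-v4l2-request \
    --enable-libdrm \
    --enable-sand

make -j"$(nproc)"
sudo make install
```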
Limitations and Performance
Thanks to the great work of the community, the decoder can now be used through the ffmpeg utility and its libavcodec framework. Unfortunately I initially couldn't decode more than 7 frames per second with my sample Big Buck Bunny (3840×2160, 8bit) video file. Running the perf profiler pointed to the av_rpi_sand_to_planar_y8 and av_rpi_sand_to_planar_c8 functions, which fully utilized a single cpu core. In other words, the HEVC decoding itself worked just fine, but some boilerplate code was using up the available resources.
What I wasn’t aware of is that, most likely due to optimizations, the HEVC decoder stores frames in memory in a different format. The community calls this layout SAND (remember the --enable-sand option in the build script?). It basically splits the image into columns of 128 bytes and then stores those consecutively in memory. Of course some stride values also come into play.
The diagram below shows what the SAND layout with a column width of 128 bytes looks like.
The frame data is stored using the YUV color encoding system, meaning we have two buffers to decode. Both use the layout shown above, although when decoding the chrominance information we have to split the values into U and V pixels. Also important: the frame uses YUV420 encoding, meaning both chrominance channels operate at a quarter of the resolution (half the width and half the height).
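To make the layout concrete, here is a small C++ sketch (my own illustration, not ffmpeg code) that computes the byte offset of a luma pixel (x, y) inside a SAND buffer, assuming a column width (stride1) of 128 bytes and a column height of stride2 lines:

```cpp
#include <cstddef>

// Byte offset of luma pixel (x, y) inside a SAND buffer.
// stride1: width of one column in bytes (128 for this layout).
// stride2: height of one column in lines (>= frame height, may include padding).
static size_t sand_offset(unsigned x, unsigned y,
                          unsigned stride1, unsigned stride2) {
    const unsigned column = x / stride1;       // which 128-byte column
    const unsigned x_in_column = x % stride1;  // offset inside that column
    // Columns are stored back to back; within a column, the lines of the
    // image are contiguous, each stride1 bytes long.
    return (size_t)column * stride1 * stride2
         + (size_t)y * stride1
         + x_in_column;
}
```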
For those who would like some test data, I actually dumped a frame from the Big Buck Bunny film for my own tests. The files are available on: imgur.
To further process the frames, a conversion to a planar frame layout is often necessary. Ffmpeg is already able to convert these frames, as explained above. Unfortunately it uses a slow C-based implementation on arm64. The Raspberry PI Foundation has implemented an optimized version for arm32, but arm64 has to fall back to the slow default implementation. In order to improve the experience on arm64, I wrote an assembly version of this algorithm, which I describe in more detail below.
Now, since I don’t write assembly code frequently, and in those rare cases usually for x86_64, I needed to gather some details first. Due to time constraints I didn’t want to work through a whole textbook on arm64 assembly, so I used some online resources instead to quickly get the results I was looking for.
- A good introduction to aarch64 assembly development has been posted by Roger Ferrer Ibáñez over on his blog in a series of posts.
- Wikipedia has a good overview of the calling conventions and describes which registers we can use: Wikipedia: Calling convention#ARM_(A64)
- The official documentation of the calling convention is available on ARM’s website.
- A nice cheat sheet with an instruction overview has been compiled by the University of Washington: cs.washington.edu
- A short presentation created by Matteo Franchin (Arm Ltd.) also gives an interesting quick overview.
- A blog post by Mathieu Garcia gives some input on how to use the vector instructions.
- And for the other cases where I didn’t know how to move on, I simply wrote a small piece of C++ code and compiled it on godbolt.org. Simply set the compiler to “armv8-a clang 11.0.0” and add the argument “-O1” to enable basic optimizations. The resulting assembly code has the exact syntax (GNU assembler) that I needed to extend ffmpeg.
Some other notes: Be careful with which registers you use. The Wikipedia article is quite good, but some other blog posts I came across during my research had conflicting details. In general you can use X9-X15 without any restrictions, X8 if your function doesn’t return its result indirectly through memory, and X0-X7 if you don’t need to keep the arguments passed to your function. Other registers can be used as well, but have to be backed up onto the stack first. Vector registers V0-V7 and V16-V31 can also be freely used; V8-V15 have to be backed up onto the stack as well. Also interesting: the 64bit Xn registers can be accessed as 32bit registers by simply writing Wn instead. In that case the lower 32 bits are used, and instructions writing to Wn set the upper 32 bits of the destination register to 0.
Converting the SAND Layout with Assembly
To ensure that I fully understood how the conversion works, I reimplemented the algorithm in C++. I chose C++ simply because that’s what I use at work and what I’m most familiar with. Once I had a clean implementation, I transferred exactly this logic to assembly code.
The original C++ code for the luma conversion:
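The authoritative version lives in the ffmpeg fork; what follows is my reconstruction of the logic in plain C++ (the function and parameter names are mine, not ffmpeg's). Per output row, it copies one column-width segment at a time from the SAND buffer into the planar destination:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>

// Convert a SAND-layout luma plane into a regular planar layout.
// src_stride1: column width in bytes (128), src_stride2: column height in lines.
static void sand_to_planar_y8(uint8_t *dst, unsigned dst_stride,
                              const uint8_t *src,
                              unsigned src_stride1, unsigned src_stride2,
                              unsigned width, unsigned height) {
    for (unsigned y = 0; y < height; ++y) {
        for (unsigned x = 0; x < width; x += src_stride1) {
            const unsigned column = x / src_stride1;
            const uint8_t *col_start =
                src + (size_t)column * src_stride1 * src_stride2;
            // The last column may be partially used by the visible width.
            const unsigned n = std::min(src_stride1, width - x);
            // Each line within a column is src_stride1 bytes apart.
            std::memcpy(dst + (size_t)y * dst_stride + x,
                        col_start + (size_t)y * src_stride1, n);
        }
    }
}
```

The assembly version follows the same structure, just moving wider chunks per iteration with vector loads and stores.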
And its equivalent in arm64 assembly:
The implementation of the chrominance conversion looks almost exactly the same, except for different vector instructions and some bit-shifts to accommodate the interleaved UV values. That implementation is also shown at the bottom of the source code above.
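For the chroma plane, the extra work is deinterleaving: the SAND buffer stores U and V bytes alternating, and the planar output wants them in two separate planes. Here is a C++ sketch of that idea (again my own illustration with made-up names, not the ffmpeg code; the assembly uses vector deinterleaving loads instead of per-pixel copies):

```cpp
#include <cstddef>
#include <cstdint>

// Deinterleave a SAND-layout chroma plane (UVUV...) into planar U and V.
// width/height are in chroma pixels; one pixel occupies two interleaved bytes.
// src_stride1 (column width in bytes) is even, so a UV pair never straddles
// a column boundary.
static void sand_to_planar_c8(uint8_t *dst_u, uint8_t *dst_v,
                              unsigned dst_stride, const uint8_t *src,
                              unsigned src_stride1, unsigned src_stride2,
                              unsigned width, unsigned height) {
    for (unsigned y = 0; y < height; ++y) {
        for (unsigned x = 0; x < width; ++x) {
            const unsigned xi = 2 * x;  // byte position of U in the row
            const unsigned column = xi / src_stride1;
            const uint8_t *p = src
                + (size_t)column * src_stride1 * src_stride2
                + (size_t)y * src_stride1
                + xi % src_stride1;
            dst_u[(size_t)y * dst_stride + x] = p[0];  // U byte
            dst_v[(size_t)y * dst_stride + x] = p[1];  // V byte
        }
    }
}
```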
Verdict and Open Questions
Now, obviously, the important question is how well this performs. I currently reach around 40fps when decoding the Big Buck Bunny film. Compared to the initial 7fps this is a significant improvement, although I don’t expect it to be the final answer to HEVC decoding. I sent these changes as a pull request to the maintainer of the ffmpeg fork to receive some feedback. Once news is available I will of course update this blog post. (Update: And only a single day later the changes have already been merged ;).)
The next question is, of course, what to do with the decoded frames. I’m interested in video transcoding, and the Raspberry PI 4 only has a 1080p h264 encoder, meaning scaling the image down is a necessity. The CPU obviously isn’t fast enough to do this well, so we would have to utilise the built-in ISP. Unfortunately I haven’t seen any progress in API support in that area yet.
Stay tuned for future updates on this topic, and hopefully other interesting topics on this blog ;-).
Update on the 11th of February
I spent some time on speeding up 10bit/hdr video decoding:
With these changes, video files are currently decoded and converted to planar frames at 16-18fps at 2160p. Still not fast enough for UHD transcoding, but it should be good enough for lower-resolution video. For playback alone, according to jc-kynesim’s answers in the link above, the video output can be passed to the DRM subsystem directly, without the expensive conversion. Unfortunately, for transcoding it looks like we might have reached the limits of this SoC.