Writing this article while listening to Falco’s Vienna Calling brought up some memories of the software projects nurturing my interest in software optimization. The open source mp3 audio encoder lame is one of those and serves as a good motivational example for today’s topic. Made possible thanks to - at the time - innovative algorithms and a considerable amount of software optimization. More modern examples include projects like OpenCV, compiler frameworks like llvm or video decoders like dav1d. All of these projects have in common that they would be much less useful without optimization. In this blog post I’d like to share some of my thoughts on optimization approaches and the possibilities (GPU programming, vector instructions, parallelization, …) at our fingertips.

Optimization is applied in a lot of applications across a vast set of software development ecosystems. For the sake of keeping this blog entry manageable we assume that we talk about compute-bound, rather than memory-bound situations. We also limit ourselves to native code written in C or C++. Of course it looks like Rust will likely be a good candidate for future projects, but for the time being a lot of projects rely on the former two languages. The available tooling and their closeness to hardware make them reasonable choices for performance critical applications.

Profiling and Analysis

Finding the culprit is usually the first action one’s got to take. Profiling is probably the most useful tool to do just that. Nevertheless, using all available resources sometimes poses advantages. Available unit tests and log files often give a good hint on where to look in the profiler output, further reducing the analysis effort. Additionally unit tests are often useful in benchmarking software improvements as we go along with the optimization process. Of course the profiler can be run on the unit tests too. This helps us in verifying the performance impact of changes and pointing out potential drawbacks.

While in the past profilers often used to cost a lot of money, these days there are quite a few options available for free. Even Intel’s VTune is available for free on Intel systems. That said, since a lot of C and C++ developers work in Visual Studio let’s start with that first. To benefit from the profiler, or performance analyzer as it is called, debug symbols have to be enabled first. This is because we usually run the profiler on Release binaries, rather than the Debug ones. There is no point in optimizing something that the compiler does for us anyway. To enable the debug symbols in Visual Studio, follow the instructions below:

# Open the project properties and select the following options
C/C++ -> General -> Debug Information Format -> Program Databse (/Zi)
Linker -> Debugging -> Generate Debug Info -> Generate Debug Information (/DEBUG)

After changing the settings, the performance analyzer can be started from within the application menu. Once the profiler run has completed, Visual Studio will analyze the results and show them within the Visual Studio IDE. What I should mention, is that I am speaking about the full Visual Studio IDE, not VS Code. On Linux we need alternatives. What I tend to use is either Valgrind and perf. Valgrind is incredibly powerful and offers much more than just profiling. The performance profiling plugin is called callgrind and can be executed with the following command.

# compile your application with debug symbols: "-g -ggdb3"
valgrind --tool=callgrind --dump-instr=yes -v ./your-binary --application-args

What one should be aware of, is that valgrind works in slightly different way compared to other profiles. It executes all analysis in a virtual machine, which will cause the absolute performance to decrease significantly. On the plus side we get very detailed results. Those results will be stored in a *.callgrind file, which can be visualized in the open source application kcachegrind.

A less feature-packed but faster alternative is perf. It is a command line application utilizing built-in linux kernel tracing capabilities:

# define the sampling rate with -F
perf record -e instructions:u -F 2500 ./your-binardy -application-args
perf report

Optimization Possibilities

With the initial profiling and analysis having been completed, the major bottlenecks should be known. There are multiple options at our disposal to improve the performance, all varying in their complexity, effort and potential benefit.

Step I: Algorithms

Starting with the likely most crucial step in optimization. After all an inefficient algorithm with a beautiful implementation might still end up performing horribly. Therefore it is important to analyze the algorithm first and decide which steps to take afterwards.

While discussing a specific algorithmic problem, one could probably have endless mathematical discussions in countless areas of expertise. Unfortunately one usually only has a limited amount of time available, therefore we need to have some more approachable solutions. The list below accumulates a few important aspects of algorithm analysis, optimization and implementation I found useful in the past.

  • One of the most important tools for analysis is the Big O notation (Landau-Symbols). It allows us to to define the running time behaviour depending on the given input size. Going through the algorithm and determining the Big-O function of the sub-routines, statements, loops and utilized containers will often point out inefficient sections.

  • When utilizing common containers and algorithms within ones personal creation, it is important to choose wisely. The STL for example comes with implementations for a lot of basic concepts most of us probably have learned about in an undergraduate algorithm class. The std::vector container is used very frequently in all kinds of applications. While often being the right option, it is important to know that it is an implementation of a dynamic array. Knowing that, we are automatically aware of it’s amortized constant insertion time. This also tells us that in order to avoid needless copy operations, we might want to pre-allocate memory if the problem size is known. If the vector doesn’t fulfill all the requirements, the STL ships with implementations for most basic data structures like lists, queues, hash-maps and others.

  • While the list of algorithms is in my opinion less impressive than the list of containers, there are some hidden gems as well. The std::sort function as an example implements intro-sort. An optimized version of quick-sort which falls back to heap-sort in order to guarantee a worst-case running time of O(n*log(n)). It also has an optimization for small data-sets, where it switches to insertion-sort.

  • In many cases the STL already provides a lot the functionality required, but if that is not the case the vast amount of libraries in the C/C++ community offer good options for a lot of problems. Eigen for linear algebra, OpenCV for computer vision, PCL for point-cloud processing and Ceres for solving non-linear equations are just a few examples.

  • Based on my experience it sometimes is possible to pre-compute data. Storing them in lookup-tables and loading them on demand might be an option where the computation can be avoided most of the time. Another approach which goes into this direction is dynamic programming where parts of the computation are avoided due to pre-caching results during computation if they are needed multiple times.

  • A mathemical approach to optimization is Linearization. It frequently is a way to speed up computation significantly. Unfortunately it also requires a lot of understanding for the underlying algorithm, but since it is so beneficial a lot of well-known algorithms already have linearized solutions available. I recall the problem of electric circuit simulation being solved with huge sets of linear equations. I also frequently work with SLAM (Simultaneous Localization and Mapping) algorithms which have linearized solutions.

  • I guess in 2021 I should also mention artificial intelligence and machine learning. Some problems might benefit from such an approach. Just be aware that you need a lot of training data and it will only work well with certain applications (e.g. object detection, classification in general, …).

Of course this list is just a starter but it might give at least some inspiration. Followed up with reading through existing literature, research papers and other open source projects should tell us whether there is anything possible in regards to optimization of the algorithm itself. Once we exhausted our options for the time being, we should move on the next step where we start looking at optimizing the implementation rather than the algorithm.

Step II: Multi-Threading

If an algorithm can be split into multiple work units, parallelization is a legitimate approach to improve processing time. In other cases where the algorithm needs to be executed serially, running it on multiple data-sets simultaneously might solve the given throughput issue too. The only risk in utilizing multi-threading is in synchronization and code complexity. Fortunately there are multiple libraries and frameworks attempting to solve this problem in relatively safe and elegant ways.

Methodology Limitations Notes
OpenMP compiler assisted parallelisation no support for asynchronous programming, thread lifetime controlled by the compiler quick results by using compiler directives, e.g. automatically parallelize loops
std::thread / pthread and others threads - manual thread management with all it’s pros and cons
Threading Building Blocks TBB task based multi-threading - looks like it is targeted at algorithmic projects, used also by other projects like OpenCV
libdispatch task based multi-threading C API, no official Windows support aka Apple’s Grand Central Dispatch
cppcoro task based multi-threading requires C++20 a coroutine library with loads of async programming features
boost/asio: thread_pool task based multi-threading small feature-set originally developed for the asio networking library
libazul task based multi-threading not yet battle-tested my own personal project, supports continuations, future aggregation and more

OpenMP is listed here because I’ve seen it used in academia where it allowed researchers to easily parallelize loops by using compiler directives. It even features support for some GpGpu platforms depending on the used toolchain. Its limitations are that it does not support generic asynchronous programming as well as manual control over thread creation and destruction. A truly interesting option is cppcoro which is based on the coroutines feature in C++20. Unfortunately that’s also it’s biggest issue: C++20 will still need some more time to become widely adopted in the industry. The remaining option is to implement a thread-pool. The basic features are relatively concise and even more complex ones like continuations are manageable. I also shamelessly put my own library into the list which does exactly that, implement a basic thread-pool with some additional functionality for asynchronous programming.

For the performance of the algorithm itself, the chosen thread-pool likely doesn’t have a big impact. Choosing a good framework nevertheless is a good idea, since at least I usually end up using the thread-pool for all kinds of additional features rather than just the algorithm parallelization.

Step III: Assembly and Vector Instructions

While the total processing time does improve with multi-threading, it does nothing for efficiency improvement. This is where other approaches come into play. Writing assembly code with vector instructions is such an option. It just so happens that this most of the time is not necessary since compilers like gcc, clang and msvc are all producing well optimized binaries. One situation where this might not always be the case, is when the underlying algorithm problem can be processed with vector instructions. Compilers obviously also make use of those CPU features but sometimes aren’t able to automatically generate equally efficient code. Especially in cases where it is necessary to slightly tweak how ones algorithm has to be executed, in order to work with vector instructions.

There are two disadvantages of assembly code: The first being the obvious increased complexity compared to more readable high-level languages. The other being its limited portability. The type of assembly code being written depends on the used assembler, the instruction set and the operating system. Fortunately there are some cross-platform assemblers available. With two prominent examples being the GNU assembler (gas) and NASM, although the latter is limited to x86_64. Let’s have a look at a short overview on assembly portability:

Windows (x86_64) Linux (x86_64) Linux (arm64) Android (arm64) macOS/iOS (arm64)
Calling Convention Microsoft x64 calling convention [1] System V AMD64 ABI [1] arm64 [2] arm64 [2] arm64 [3]
Gnu Assembler Support yes yes yes yes yes
MASM Support yes - - - -
NASM Support yes yes - - -

[1] The calling conventions used on x86_64 based platforms. [2] The arm64 calling convetion. [3] Apple mostly follows the arm64 specification with only a few minor differences.

Looking at the table tells us that we might get by, with three different implementations of which two are almost identical. The only situation where we definitely need more variations is when implementing functionality working with operating system specific features. Let’s say we would want to implement stack-full coroutines. In order to do that we would have to backup certain registers and swap-out the stack. This is done in the famous boost::context library. With our algorithmic focus we shouldn’t be in such a situation.

While the maintenance is as mentioned high and the implementation quite a bit of effort, it is definitely not unthinkable of using assembly code with vector instructions to improve algorithm performance. The amount of variations one has to implement is also manageable.

A slightly easier option is using compiler intrinsics. They make it possible to access the instructions without the need to write assembly code directly at the cost of loosing the ability to manually control register allocation. Compiler intrinsics are also architecture and to some degree compiler dependent. Therefore we don’t gain too much by using compiler intrinsics instead of writing assembly code but it is an option that one should consider as well.

Step IV: GpGpu Compute

Some algorithms perform particularly well on graphics hardware. This is due to the problem being well suited to be split up into huge numbers of work units. From experience: image processing projects frequently falls into this type of algorithms. I also successfully ported point cloud processing algorithms to different GpGpu compute stacks. The major disadvantage of this technology is its fragmentation. Perfectly proofing Randall Munroe’s comic on introducing new standards: xkcd.com/927.

Let’s have a brief overview of the most important available options:

Windows (x86_64) Linux (x86_64) Linux (arm64) Android (arm64) macOS/iOS (arm64)
Cuda Nvidia only Nvidia only Nvidia only - -
OpenCL yes yes [1] yes [1] - deprecated
SYCL yes [3] yes [3] - - -
Vulkan yes yes yes yes yes [2]
Metal - - - - yes
OpenGL (ES) yes yes yes yes deprecated
D3D12 yes - - - -
DirectCompute yes - - - -

[1] OpenCL support in case of Linux is strongly dependent on the GPU and drivers. On the desktop side all vendors at least have a proprietary driver that supports at OpenCL 1.2 or newer. In case of embedded systems it gets more complicated. Some examples: NXP’s i.MX8 platform supports OpenCL, Nvidia’s Tegra does not. [2] MoltenVK is a well working Vulkan implementation officially supported and developed by the Khronos group. Apple does not officially support Vulkan. [3] In order to catch up with Nvidia’s Cuda ecosystem, the Khronos group has released SYCL as a language to more easily write OpenCL based applications. Unfortunately the available implementations either only support a limited set of graphics vendors or are in an early state of development.

My personal experience was mostly based on OpenCL starting around the time when it’s future IMHO looked more promising. Apple stepping away from their own creation and slow adoption in the mobile space have definitely hurt OpenCL. I also utilized Cuda for some work due to targeting an Nvidia Tegra based system. Nevertheless truly portable options are rare. I had huge hopes in PoCL at some point, but that never took off on non-linux platforms. In the past, some projects I came across utilised fragment and vertex shaders through OpenGL with varying degrees of success. The only option that looks like it could become a truly cross-platform solution for GpGpu technology is Vulkan. While Apple doesn’t support it, MoltenVK is a high quality implementation of Vulkan on top of Metal. In some small-scale personal tests I managed to get a large project running on an iPad Pro using MoltenVK. The only issue having been that a single transpiled shader (opencl -> spir-v -> metal) did not compile on the iPad. A port to the Metal Shading Language (C++ with some additional keywords) resolved this issue.

As always, recommending a generic solution is impossible. Vulkan certainly has the best changes to become the most versatile solution, although the most likely development will be future frameworks abstracting multiple GpGpu stacks to provide actual portability. Currently the best decision is probably made on a per project basis. If portability is not an issue: Cuda or OpenCL aren’t bad options, otherwise having a look at Vulkan might be a solution.

Case Studies

In the last section I’d like to present some of the applications of software optimization and the results that can be achieved with reasonable effort. Of course the examples are limited to either open source software or publicly available academic work.

Video Transcoding on the RPI4 Decoding JPEG Point Cloud Processing on the GPU
Description The ISP produces images in the so called SAND layout. They need to be converted to a standard layout. [article] An outdated jpeg decoder I did for a university seminar about a decade ago. [original code] A part of my master’s thesis from quite a few years ago. Porting existing point cloud processing algorithms to a GpGpu framework. [thesis]
Issues The conversion is not done by the ISP but executed in software on the CPU. Several implementation issues and an incredibly ineffective bit-stream decoder. Especially slow algorithm performance due to the amount of data that needs to be processed.
Solutions ARM64 Assembly with Vector Instructions AVX2 Assembly, new Bitstream & Huffmann Tree Decoders A port to OpenCL with very few changes in the algorithm.
Results +470% speedup, decoding 40fps instead of 7fps (8bit 2160p hevc) ~2.5 times faster but still slower than libjpeg-turbo ~10 times faster

In the first project title Video Transcoding on the RPI4 I investigated the limited video decoding performance on said single board computer. Since I already described the project in this article I’ll just summarize it briefly. The ISP built into the Raspberry PI 4 produces decoded images in a layout called SAND. Transcoding the video into another format using a software encoder like x264 requires us to convert the image. Unfortunately all optimizations that went into this, have only been developed for 32-bit arm. Figuring this out was relatively effortless, since a quick run of perf immediately pointed out the functions within ffmpeg doing the conversion. Going through those it is obvious that the logic to transform the layout is relatively straight forward and their is no room for algorithmic improvements. While I was briefly thinking about using a GPU shader, the lack of good GPU drivers and the GPU not being known for high performance have not given me a reason to attempt it. I therefore implemented the conversion in arm64 assembly using vector instructions. The performance uplift in my brief tests has been very positive. While the improvements on their own had already been satisfying, the most rewarding part has been that these improvements were merged back into the upstream project!

With the second project I wasn’t sure whether I should include it here. The jpeg decoder is the result of a seminar I took about a decade ago at university and due to the lack of time and experience, it is definitely not a good piece of software. Nevertheless I still had a look at the code base. I also happened to still have an experimental port of the DCT algorithm to AVX2 assembly around, so why not include that as well? In the end it turned out to be a good example for implementation mistakes and algorithmic improvements. The AVX2 assembly code did not have such a significant impact, unsurprisingly the actual major improvements came from a new bit-stream and huffman tree decoder. I used to read out every single bit individually and issue millions of function calls in doing so. After changing this to another algorithm using lookup-tables a speedup factor of 2.5 was relatively easy to achieve. It is still slower than something like libjpeg-turbo, which has been iteratively improved for ages, but it isn’t to far away either.

The last project is also a few years old, but I wanted to include a good example for a GpGPU scenario. Point cloud processing was something I had no prior experience at that point, so completely rewriting the algorithm was not an option. Using assembly code and vector instructions might have been an option but it was expacted that the amount of point cloud data would grow in the future, so a larger speedup was required. I chose OpenCL as the technology because I was hoping it would become more widespread in the years that would follow this project. Although that hope was likely misplaced, the performance improvements in utilizing the GPU was quite good. The algorithmic problem was relatively easy to paralellize and the end result was about ten times faster than the original implementation running on high-end processors.

__

That’s it for today. I hope some of you find these notes helpful or inspirational. For any comments or questions have a look at the About page.