Is Managed Code Slower Than Unmanaged Code?
Ask anyone the question above and they will say that managed code is slower than unmanaged code. Are they right? No, they are not. The problem is that when most people think of .NET they think of other frameworks with a runtime, like Java or Visual Basic; or they may even think about interpreters. They do not think about applications, or what those applications do; they do not think about limiting factors like network or disk access; in short, they do not think.
.NET is not like those frameworks. It has been well thought out and Microsoft has put a lot of effort into making it work well. In this article I will present some code that performs a computationally intensive operation, and I will compile it as both managed C++ and unmanaged C++. Then I will measure how each of these libraries performs. As you will see, .NET is not automatically much slower than unmanaged code; indeed, in some cases it is faster.
Fast Fourier Transform
Data that varies over time (for example, music) will be the combination of various frequency components. A Fourier Transform will convert the time-varying data into its frequency components. I came across Fourier Transforms because I spent six years as a research scientist performing spectroscopy experiments. One experiment I performed produced an interferogram: the sample under investigation produces a response over time when an interference pattern generated from white light is shone on it. The interferogram is the response over time, but the information required was the response of the sample to the frequency of the radiation. So a Fourier Transform was taken of the time-varying data to yield the frequency-varying response. I was performing these measurements in 1993, and at that time a PC running DOS was just about fast enough to allow me to take the time-varying data, perform the Fourier Transform, and display the frequency-based results, all in real time, that is, as the measurements were taken. The limiting factor in this program was the Fourier Transform routine, because it involved so many calculations.
A Fourier Transform on N points will involve 2N² computations, so if you have a thousand data points you will perform two million computations. There is a lot of theory behind Fourier Transforms, and I will not go into details here, but that theory has led to a routine called the Fast Fourier Transform (FFT) which, through careful data manipulation, generates a Fourier Transform involving just 2N log₂ N operations. For example, if you have a thousand data points then using the FFT you will perform about 20,000 computations. The FFT routine still involves performing some trigonometric calculations, and it involves many numeric and array operations. Although the FFT is optimized compared to the straightforward Fourier Transform, it is still a computationally intensive calculation and is a good routine with which to exercise the performance of managed and unmanaged code.
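The divide-and-conquer idea behind the N log₂ N operation count can be sketched with a minimal radix-2 FFT. This is my own illustrative version, not Ooura's routine used in the tests:

```cpp
#include <cmath>
#include <complex>
#include <vector>

static const double PI = std::acos(-1.0);

// Minimal recursive radix-2 FFT: the length must be a power of two.
// Each recursion level does O(N) work and there are log2(N) levels,
// which is where the N log2 N behaviour comes from.
static void fft(std::vector<std::complex<double>>& a) {
    const std::size_t n = a.size();
    if (n <= 1) return;
    std::vector<std::complex<double>> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = a[2 * i];
        odd[i]  = a[2 * i + 1];
    }
    fft(even);
    fft(odd);
    for (std::size_t k = 0; k < n / 2; ++k) {
        // Twiddle factor e^(-2*pi*i*k/n) combines the two half-size results.
        std::complex<double> t = std::polar(1.0, -2.0 * PI * k / n) * odd[k];
        a[k]         = even[k] + t;
        a[k + n / 2] = even[k] - t;
    }
}
```

Feeding this a pure cosine concentrates the output in one frequency bin (and its mirror), which is the same property used to sanity-check the routines later in the article.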
There are many algorithms available, and if you perform a Google search for FFT you will get many thousands of hits. I chose to use the Real Discrete Fourier Transform by Takuya Ooura, mainly because the code was clear and easy to change. I made four copies of Takuya's code: one for unmanaged C++, one for Managed C++, one for C++/CLI, and one for C#.
The only change I made to the unmanaged library was to change the name of the function to fourier and add __declspec(dllexport) so that the method was exported from the library. The Managed C++ code involved a few more changes, but these were relatively minor:

- method parameters were changed to take managed arrays, and the routines used the Array::Length property rather than requiring a separate length parameter
- trigonometric routines were used from the Math class
- the routine was exported as a static member of a public class

The Managed C++ code was then converted to C#, which involved fairly minor changes (mainly in the syntax used to declare arrays). Finally, the Managed C++ code was converted to C++/CLI; this involved a little more work, again mainly because of the way that arrays are declared. The C++/CLI code is compiled with /clr:safe because it does not use any unverifiable constructs.
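The change to the unmanaged library amounts to a one-line attribute on the function. A sketch of the pattern follows; the portability macro and the stub body are mine, and the signature is a hypothetical one modelled on a real-DFT interface, not necessarily Ooura's exact parameter list:

```cpp
// __declspec(dllexport) is MSVC-specific; guard it so this sketch
// also compiles with other compilers.
#if defined(_MSC_VER)
#  define FFT_EXPORT __declspec(dllexport)
#else
#  define FFT_EXPORT
#endif

// extern "C" prevents C++ name mangling, so the test harness can
// find the export by its plain name.
extern "C" FFT_EXPORT void fourier(int n, int isgn, double* a) {
    // Body omitted -- in the article this is Ooura's FFT routine.
    // This stub deliberately leaves the data untouched.
    (void)n; (void)isgn; (void)a;
}
```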
The test harness calls all four routines and times each using the Windows high performance timers through the framework's Stopwatch class. The test data for each measurement is a cosine; I used this to check that the routines worked, because the Fourier Transform of a cosine will be a single spike. The test harness process takes two parameters: the first determines the number of data points that will be tested, and the second is the number of repeats that will be performed. FFT routines work best when the number of data points is a power of two, so the number you give for the first parameter is used as the power (and must be less than 28). Each routine is run a single time without timing so that initialization can be performed: the library is loaded and any JIT compilation is done. This is important because I am interested only in the time taken to perform the calculation. After initialization the calculation is performed within a loop and each timing is stored for later analysis. From these timings the average time is calculated, and the standard error is calculated from the standard deviation. The standard error gives a measure of the spread of the values that were taken.
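The warm-up-then-loop structure of the harness is easy to sketch. Here is a plain C++ rendition, with std::chrono standing in for the framework's Stopwatch class; the function and variable names are my own:

```cpp
#include <chrono>
#include <functional>
#include <vector>

// Run `work` once untimed to absorb one-off costs (library loading,
// JIT compilation), then run it `repeats` times under the timer.
// Returns each timing in milliseconds for later analysis.
static std::vector<double> time_runs(const std::function<void()>& work,
                                     int repeats) {
    work();  // warm-up run, deliberately not recorded
    std::vector<double> timings;
    timings.reserve(repeats);
    for (int i = 0; i < repeats; ++i) {
        auto start = std::chrono::steady_clock::now();
        work();
        auto stop = std::chrono::steady_clock::now();
        timings.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return timings;
}
```

Keeping the warm-up run out of the recorded timings is the key point: otherwise the one-off JIT cost would be averaged into the managed results.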
Occasionally a rogue time will occur (perhaps due to the scheduling of a higher priority thread in another process) and these will have an effect on the mean. For a Normal Distribution the majority of values should be within one standard deviation of the mean, so I treat any value outside of this range as a rogue value. Of course, the initial mean and standard deviation are calculated using the rogue value(s), but their effect will be minimized if large datasets are used. So once the code has calculated the mean and standard deviation, it goes through the dataset, removes values that are outside of the acceptable range, and then calculates the mean and standard error on this new dataset.
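The filtering step just described can be sketched as follows. This is a plain C++ rendition of the procedure, not the article's harness code:

```cpp
#include <cmath>
#include <numeric>
#include <vector>

struct Stats {
    double mean;
    double se;  // standard error = standard deviation / sqrt(sample size)
};

// Mean and standard error of a dataset, after discarding values more
// than one standard deviation from the initial mean ("rogue" timings).
static Stats filtered_stats(const std::vector<double>& data) {
    double mean = std::accumulate(data.begin(), data.end(), 0.0) / data.size();
    double var = 0.0;
    for (double v : data) var += (v - mean) * (v - mean);
    double sd = std::sqrt(var / data.size());

    // Keep only values within one standard deviation of the mean.
    std::vector<double> kept;
    for (double v : data)
        if (std::fabs(v - mean) <= sd) kept.push_back(v);

    // Recompute the statistics on the filtered dataset.
    double m2 = std::accumulate(kept.begin(), kept.end(), 0.0) / kept.size();
    double var2 = 0.0;
    for (double v : kept) var2 += (v - m2) * (v - m2);
    double sd2 = std::sqrt(var2 / kept.size());
    return { m2, sd2 / std::sqrt(static_cast<double>(kept.size())) };
}
```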
The C++ compiler will allow you to optimize code for space and for speed, so I have written a makefile that allows you to compile the libraries for both optimizations and for no optimization. The C# compiler also provides an optimization switch, but this switch does not distinguish between optimization for speed or size, so I just compiled the optimized library once. The results given below are for all of these options. There is a batch file that will call nmake for each option and store the results in a separate folder.
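A sketch of the kind of options involved is below. The flag names are the standard MSVC and csc switches, but the layout is my own, not the article's actual makefile:

```makefile
# Hypothetical fragment: the real makefile is more involved.
# Unmanaged C++ optimization settings, one per build:
#   /Od  no optimization
#   /O1  optimize for size
#   /O2  optimize for speed

unmanaged.dll: fft.cpp
	cl /LD $(OPT) fft.cpp /Fe:unmanaged.dll

managed.dll: fft_mc.cpp
	cl /clr /LD $(OPT) fft_mc.cpp /Fe:managed.dll

# C# has a single optimization switch, so one optimized build suffices:
csharp.dll: fft.cs
	csc /optimize+ /target:library /out:csharp.dll fft.cs
```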
The managed code uses a private assembly and so it is fully trusted. In this case many of the .NET security checks can be optimized away.
I performed two sets of tests on two machines with .NET 2.0. One machine had XPSP2 and a single 850MHz Pentium III processor with 512MB of RAM. The other machine had build 5321 of Vista and a single 2GHz Mobile Pentium 4 processor with 1GB of RAM. In each case I calculated the average of 100 separate FFT calculations on 2^17 (131072) data values. From these values I calculated the standard error from the standard deviation. The results are shown in ms. The results for the Pentium III machine are:
|                | Not Optimized | Optimized For Space | Optimized For Speed |
|----------------|---------------|---------------------|---------------------|
| Unmanaged C++  | 92.88 ± 0.09  | 88.23 ± 0.09        | 68.48 ± 0.03        |
| Managed C++    | 72.89 ± 0.03  | 72.26 ± 0.04        | 71.35 ± 0.06        |
| C++/CLI        | 73.00 ± 0.05  | 72.32 ± 0.03        | 71.44 ± 0.04        |
| C# Managed     | 72.21 ± 0.04  | 69.97 ± 0.08        | 69.97 ± 0.08        |
The results for the Mobile Pentium 4 are:
|                | Not Optimized | Optimized For Space | Optimized For Speed |
|----------------|---------------|---------------------|---------------------|
| Unmanaged C++  | 45.2 ± 0.1    | 30.04 ± 0.04        | 23.06 ± 0.04        |
| Managed C++    | 23.5 ± 0.1    | 23.17 ± 0.08        | 23.36 ± 0.07        |
| C++/CLI        | 23.5 ± 0.1    | 23.11 ± 0.07        | 23.80 ± 0.05        |
| C# Managed     | 23.7 ± 0.1    | 22.78 ± 0.03        | 22.78 ± 0.03        |
As you can see, the results for unmanaged code vary considerably with the optimization settings: the code that is not optimized takes 35% more time than the code optimized for speed on the Pentium III, and 96% more on the Pentium 4 machine. Performing this rough analysis on the values for managed code shows that the optimized code is barely faster than the non-optimized code. This shows that for managed code the optimization performed by the compiler and linker has a relatively small effect on the final executed code; bear this in mind when you read my conclusions derived from these results. Interestingly, in these tests there is little difference between the times for managed code optimized for space and for speed, and on the Vista machine code optimized for space actually runs quicker than code optimized for speed.
Note that the measurements for a particular optimization setting were taken at about the same time, so the relative values between the different types of code (for example, comparing unmanaged C++ with Managed C++) can be considered quite accurate. However, measurements for different optimizations were taken at different times (for example, comparing non-optimized code with speed-optimized code), and the state of the machine is more likely to change between process runs than during a single run. Thus it is less accurate to compare results across different optimizations.
The results for C# code showed that there is little difference between C# and managed C++ in terms of performance. Indeed, the optimized C# library was actually slightly faster than the optimized managed C++ libraries.
Now compare the managed results with the unmanaged results. For non-optimized code the managed code is significantly faster than the unmanaged code; this difference reduces for code optimized for space, where the managed code is still faster. It is only when the code is optimized for speed that the unmanaged code is faster than managed code. The difference between unmanaged code and C# code is just 2% for the Pentium III/XPSP2 machine. There is a similar small difference for the Pentium 4/Vista machine, but here the C# code is quicker.
Think of .NET in these terms. The .NET compiler (managed C++ in this case, but the same can be said for the other compilers) is essentially the equivalent of the parsing engine in the unmanaged C++ compiler. That is, the compiler will generate tables of the types and methods and perform some optimizations based on high level aspects like how loops and branches are handled. Think of the .NET JIT compiler as the back end of the unmanaged compiler: this is the part that really knows about generating code, because it has to generate the low level x86 code that will be executed. The combination of a .NET compiler and the JIT compiler is equivalent to the unmanaged C++ compiler; the only difference is that it is split into two components, meaning that the compilation is split over time. In fact, since JIT compilation occurs at the time of execution, the JIT compiler can take advantage of 'local knowledge' of the machine that will execute the code, and the state of that machine at that particular time, to optimize the code to a degree that is not possible with the unmanaged C++ compiler run on the developer's machine. The results show that the optimization switches in managed C++ and C# have relatively small effects, and that there is only a 2% difference between managed and unmanaged code. Significantly, C# code is as good as, or better than, managed C++ or C++/CLI, which means that your choice to use a managed version of C++ should be based on the language features rather than a perceived idea that C++ will produce 'more optimized code'.
There is nothing in .NET that means it should automatically be much slower than native code; indeed, as these results have shown, there are cases when managed code is quicker than unmanaged code. Anyone who tells you that .NET should be slower has not thought through the issues.
The code for these tests is supplied as C++/CLI code and so it will only compile for .NET 2.0.