CS:APP -- Chapter 05: optimizing program performance (part 1)



prologue

The primary objective of a programmer is to make a program run correctly, and then to make it run fast. We should also ensure that others can make sense of the code, both during review and when modifications are needed.

Several types of activities are involved in optimizing a program:

  1. select an appropriate set of data structures and algorithms
  2. write code in a way that assists the compiler in turning source code into efficient executable code
  3. exploit parallelism (detailed in chapter 12)

This chapter focuses on one ultimate goal: how to make code run correctly and fast, via several different types of program optimization. Two things make code run faster: eliminating unnecessary work, which does not depend on the processor at all, and executing divided work in parallel, which depends heavily on the architecture of the processor. Increasing the degree of parallelism with which we execute our code is discussed further in chapter 12.

My learning objective here is the linear process of applying a series of transformations to the code in a particular order.

A good way of gaining a better understanding of optimization is to look at the generated assembly language at optimization levels -Og and -O1, and even higher.

1. capabilities and limitations of optimizing compilers

Even a compiler with powerful optimization capabilities will not optimize code in an aggressive, radical way, because of the constraints placed on it: the compiler must apply only safe optimizations, i.e., transformations that cannot change the observable behavior of the program.

1.1 memory aliasing


In many cases, the possibility that two pointers refer to the identical memory location, known as memory aliasing, forces the compiler to fall back on safe optimizations only.

An example is proposed on page 535: one function adds *yp to *xp twice, while a second performs *xp += 2 * *yp, which takes fewer memory operations at the assembly-language level. But the compiler cannot optimize the first case directly into the second, because it cannot ensure that the two pointers do not refer to the identical memory location; if they do, the transformation would produce an erroneous result.
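The pair of functions from the book's example looks roughly like this:

```c
/* Adds the value at yp to the value at xp, twice. */
void twiddle1(long *xp, long *yp) {
    *xp += *yp;
    *xp += *yp;
}

/* Seemingly equivalent, with fewer memory references. */
void twiddle2(long *xp, long *yp) {
    *xp += 2 * *yp;
}
```

With xp == yp, twiddle1 multiplies *xp by 4, while twiddle2 multiplies it by only 3, so the compiler cannot safely substitute one for the other.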

1.2 function calls


Another optimization blocker is function calls: reducing multiple calls of one function to fewer calls is unsafe because the function may have side effects. If the function modifies a global variable, each call changes program behavior; such a modification of global state is called a side effect.

In this case, the compiler assumes the worst case and just leaves the function calls intact (unchanged).
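A minimal sketch along the lines of the book's example, where a global counter stands in for the side effect:

```c
long counter = 0;

/* f has a side effect: each call bumps the global counter. */
long f(void) {
    return counter++;
}

/* Four calls: returns 0 + 1 + 2 + 3 = 6. */
long func1(void) {
    return f() + f() + f() + f();
}

/* One call: returns 4 * 0 = 0 -- not equivalent to func1. */
long func2(void) {
    return 4 * f();
}
```

Because rewriting func1 as func2 changes the result, the compiler must leave the four calls in place.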

So even though many types of optimization could exploit a program to its full extent, these constraints limit the set of optimizations a compiler will perform.


Inline substitution (inlining)
Substituting a function call with the body of the function can improve performance when the function is simple. But it can cause problems while the code is being debugged, because the trace for the function is lost; profiling suffers as well, since a function that has been inlined cannot be profiled correctly.
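A sketch of what inlining does to func1 above, assuming the compiler substitutes the body of f at each call site (reusing the counter/f sketch from the previous section):

```c
/* func1 after inline substitution of f: the calls disappear,
 * so a debugger can no longer set a breakpoint on f here,
 * and a profiler attributes none of this work to f. */
long func1_inlined(void) {
    long t = counter++;  /* f() #1 */
    t += counter++;      /* f() #2 */
    t += counter++;      /* f() #3 */
    t += counter++;      /* f() #4 */
    return t;
}
```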

2. expressing program performance

Like the metrics throughput and latency from the last chapter, a metric called "cycles per element", abbreviated CPE, is introduced here, especially for characterizing loop performance.

2.1 cycles per element versus nanoseconds and picoseconds

The time required by the procedure is characterized as a constant plus a factor proportional to the number of elements processed.

2.2 CPE versus cycles per iteration

In comparison with the metric "cycles per iteration", CPE measures performance per element processed rather than per loop iteration, so it remains meaningful even when one iteration handles several elements (for example, after loop unrolling).

-> a least squares fit is used to find the constants in the run-time expression, for example \(60 + 35n\)
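A sketch of how the fit works: given run times \(T_i\) measured for element counts \(n_i\), least squares picks the constants that minimize the squared prediction error:

\[
\min_{K,\ \mathrm{CPE}} \sum_i \bigl(T_i - (K + \mathrm{CPE} \cdot n_i)\bigr)^2
\]

For the example \(60 + 35n\), the overhead K is 60 cycles and the CPE is 35.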

3. program example

My understanding of optimizing programs:

  1. identify where the inefficiencies are
  2. apply a series of transformations in a particular order

prerequisite: malloc() and calloc()
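As a quick refresher, a minimal sketch of the difference between the two allocators:

```c
#include <stdlib.h>

int main(void) {
    /* malloc: one size argument; memory is left uninitialized */
    long *a = malloc(10 * sizeof(long));
    /* calloc: count and element size; memory is zero-initialized */
    long *b = calloc(10, sizeof(long));
    free(a);
    free(b);
    return 0;
}
```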

An example of a vector abstract data type is demonstrated below:

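Roughly, the vector type, its bounds-checked accessor, and the first combining routine follow the book's definitions. In the book, data_t, OP, and IDENT are compile-time parameters of the experiment; the concrete choices here are just for illustration:

```c
typedef long data_t;   /* element type under test, e.g. long   */
#define OP +           /* combining operation, e.g. addition   */
#define IDENT 0        /* identity element for OP              */

typedef struct {
    long len;          /* number of elements                   */
    data_t *data;      /* pointer to the element array         */
} vec_rec, *vec_ptr;

/* Return the length of the vector. */
long vec_length(vec_ptr v) {
    return v->len;
}

/* Retrieve element index, with bounds checking.
 * Returns 0 on failure, 1 on success. */
int get_vec_element(vec_ptr v, long index, data_t *dest) {
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

/* combine1: accumulate all elements of v into *dest using OP. */
void combine1(vec_ptr v, data_t *dest) {
    long i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}
```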

To gauge the performance, CPE is used to reveal how long each procedure takes for different data types and operations. One point emphasized at the beginning of this chapter is that selecting appropriate data types and operations, such as replacing a combination of arithmetic operations with shift operations, can make a big difference in program performance.


One way of handing control to the programmer is specifying the optimization level, from -Og with basic optimization up to -O2 or even higher. Moving from -O0 to -O1 already gives a big leap in performance.

4. eliminating loop inefficiencies

Observe that the test condition of the for loop contains a call to vec_length: every time the test condition is evaluated, vec_length gets called, yet its return value does not change as the loop proceeds.

| step | description |
| --- | --- |
| 1 | hoist the call ahead of the loop: `long length = vec_length(v);` |
| 2 | test against the saved value: `i < length` |

In this case, the computation is moved to an earlier section of the code that is not evaluated as often. This kind of optimization is known as code motion.
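The transformation in code, per the book's combine2 (compare with combine1 above):

```c
/* combine2: move the call to vec_length out of the loop (code motion). */
void combine2(vec_ptr v, data_t *dest) {
    long i;
    long length = vec_length(v);  /* computed once, before the loop */
    *dest = IDENT;
    for (i = 0; i < length; i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}
```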

Note: because the compiler is limited to safe optimizations, it will not perform this transformation on its own. We therefore need to assist the compiler in turning source code into efficient machine code; put another way, many ambiguities can be removed by writing the code in a smarter way.

| inefficiency | limitation of the compiler | our implementation |
| --- | --- | --- |
| a function is called every time the loop test runs | the compiler cannot determine whether the call has side effects | code motion? loop unrolling? |

5. reducing procedure calls

The sources of inefficiency in combine1() are listed below:

  1. vec_length() in the test condition of the for loop is called on every iteration as the loop proceeds.
  2. retrieving an element of the data array involves bounds checking every time.

solution for case 1

code motion: the computation of length is moved to an earlier block of code that is not evaluated as often.

solution for case 2

bounds checking seems unnecessary in this case, because the loop's test condition already guarantees that every index is valid.
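Applying both fixes gives roughly the book's combine3, which bypasses the bounds-checked accessor and indexes the array directly:

```c
/* Return a pointer to the start of the underlying data array. */
data_t *get_vec_start(vec_ptr v) {
    return v->data;
}

/* combine3: direct data access, no per-element bounds check. */
void combine3(vec_ptr v, data_t *dest) {
    long i;
    long length = vec_length(v);
    data_t *data = get_vec_start(v);
    *dest = IDENT;
    for (i = 0; i < length; i++) {
        *dest = *dest OP data[i];
    }
}
```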

The result turns out differently than we might have thought: both implementations have almost the same measured performance. Yet at the source-code level, the latter case should perform better than the original one.

6. eliminating unneeded memory references

Up to this point, a deeper understanding of assembly language is needed. In combine3(), the value designated by the pointer dest is read at the start of each iteration, right after being written at the end of the previous one. This reading and writing is wasteful because the two values are necessarily equal.

Within the loop block of combine3, every iteration reads *dest from memory, combines it with an element, and writes the result back to memory. Introducing one local variable as an accumulator eliminates these per-iteration memory references and accelerates the whole procedure.
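The fix is roughly the book's combine4: accumulate in a local variable, which the compiler can keep in a register, and store to *dest only once after the loop (get_vec_start as in combine3 above):

```c
/* combine4: accumulate in a register-resident local, not in memory. */
void combine4(vec_ptr v, data_t *dest) {
    long i;
    long length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;           /* local accumulator */
    for (i = 0; i < length; i++) {
        acc = acc OP data[i];     /* no memory read/write of *dest */
    }
    *dest = acc;                  /* single store at the end */
}
```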


The demonstration above focuses on optimizing the behavior of the program itself; the succeeding sections turn to the processor architecture in search of further improvement.


7. understanding modern processors

Unlike optimizations of the program itself, seeking optimizations that exploit the processor to its full extent requires understanding the processor architecture.

Just as a high-level program differs from its assembly counterpart, what the source code originally describes is far from what the processor actually implements at the machine level. Instruction-level parallelism provides a different view: several instructions run simultaneously, rather than as a uniform series of stages from updating the PC to write-back.

Two different lower bounds characterize the maximum performance of a program:

  • latency bound: reached when a series of operations must be performed in strict sequence, because the result of one operation is needed before the next can begin
  • throughput bound: the raw computing capacity of the processor's functional units

From my own perspective, each bound acts as a barrier on performance.

7.1 overall operation

Modern processors, far more complicated than the pipelined processor of chapter 4, are described as superscalar because they can execute many instructions on every clock cycle, and out of order (in an order different from that of the machine-level program).


7.2 functional unit performance

latency: the number of clock cycles required to perform an instruction from start to finish

issue time: the minimum number of clock cycles between two successive, independent instructions of the same type

capacity: the number of functional units capable of performing the operation

The modern processor provides multiple functional units for operations such as integer addition and multiplication as well as floating-point operations, which enables better use of computing resources. Breaking data dependencies and then executing operations in parallel is an interesting topic that will be discussed in detail later.

Given the program's behavior and a fixed computer system, two bounds are proposed here:

latency bound: the minimum CPE required when a sequence of operations must be performed in strict sequential order
throughput bound: the minimum CPE based on the maximum rate at which the functional units can produce results
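As a rough reading of these definitions (my summary, not a formula from the book):

\[
\text{latency bound} = \text{latency}, \qquad
\text{throughput bound} = \frac{\text{issue time}}{\text{capacity}}
\]

For instance, taking floating-point multiplication with latency 5, issue time 1, and capacity 2 (the figures given for the book's reference machine), the latency bound is 5.0 CPE and the throughput bound is 0.5 CPE.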

The next several sections focus on how these bounds affect the performance of a procedure, and then on how to approach these bounds and improve performance further.

7.3 an abstract model of processor operation

A tool for analyzing the performance of a program at the assembly-language level is the data-flow graph. It mainly shows the data dependencies among registers and identifies the critical path, which is the chain of dependencies that limits performance.

Two points need mentioning here: instructions are decoded into primitive operations, also called micro-operations, and the registers involved fall into the four categories listed below.

| name | description |
| --- | --- |
| read-only | used as a source operand, never modified within the loop |
| write-only | used only as the destination of data-movement operations |
| local | read and written within an iteration, but with no data dependence from one iteration to the next |
| loop | both read and written, with the value generated in one iteration used in the next; these carry the data dependence between successive iterations |

7.3.1 critical path

We assume the branch predictor predicts the branch as taken, so the program keeps looping.
The critical path is formed among the loop registers, as can be seen in the book's data-flow figure.

Note: the critical path is the main limiting resource.

7.3.2 other performance factors

Besides the critical path, other factors can also limit performance, such as the number of available functional units and the number of data values that can be passed among the functional units in a single step.
The work performed before and after the loop contributes the constant term, often represented by K in the run-time expression \(L \cdot n + K\), where L is the latency (the CPE) and n is the number of elements.
