Excerpted from: http://www.drdobbs.com/parallel/cache-friendly-code-solving-manycores-ne/240012736?pgno=1

"Cache" lines are the chunks of memory handled by the cache, and the size of these chunks is called the "cache line size." Common cache line sizes are 64 or 128 bytes.
The limited number of lines that a cache can hold is determined by the cache size. For example, a 64 KB cache with 64-byte lines has 1024 cache lines.
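
As a rough illustration (not from the article), the mapping from a memory address to a line in a simple direct-mapped cache can be sketched as follows; the cache parameters are the 64 KB/64-byte example above:

#include <stdint.h>
#include <stdio.h>

/* Example parameters from the text: a 64 KB cache with 64-byte lines,
 * modeled here as direct-mapped for simplicity. */
#define LINE_SIZE  64u
#define CACHE_SIZE (64u * 1024u)
#define NUM_LINES  (CACHE_SIZE / LINE_SIZE)   /* 1024 lines */

/* Index of the cache line a given address would map to. */
static unsigned cache_line_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_LINES);
}

int main(void)
{
    static int array[64];
    /* Addresses 64 bytes (16 ints) apart map to adjacent line indices. */
    printf("array[0]  -> line %u\n", cache_line_index((uintptr_t)&array[0]));
    printf("array[16] -> line %u\n", cache_line_index((uintptr_t)&array[16]));
    return 0;
}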

Some Cache Examples

The amount of cache per processor core is limited, but it is much faster than main memory. For example, the quad-core Intel Sandy Bridge (Intel Core i7) has 32 KB of L1 (innermost) cache per core, 256 KB of L2 cache per core, and 2 MB of L3 cache per core. Accessing the L1 cache takes 1.2 ns, the L2 3.6 ns, and the L3 11.7 ns. AMD's 8-core Bulldozer (FX-8150) has 16 KB/1024 KB/1024 KB of L1/L2/L3 cache per core, with slower access times of 1.1 ns/8.4 ns/24.6 ns, respectively. Accessing main memory takes about 60-70 ns with either processor [1].

Consider the following simple routine:

static volatile int array[Size];
static void test_function(void)
{
    for (int i = 0; i < Iterations; i++)
        for (int x = 0; x < Size; x++)
            array[x]++;
}

This routine was run on a system with a 512 KB L2 cache. For comparison purposes, the size of the array was varied while the number of iterations was adjusted to keep the total number of increments constant. When the array was larger than 512 KB, the run took about 50 seconds. When the array was small enough to fit completely within the cache, it ran in about 5 seconds. The magnitude of this effect, approximately a 10x speedup, is consistent with the ratio of cache latency to main-memory latency.
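
For readers who want to reproduce this kind of measurement, a minimal harness might look like the sketch below. The clock()-based timing, the array size, and the total-increment constant are illustrative assumptions, not the article's actual setup:

#include <stdio.h>
#include <time.h>

#define TOTAL_INCREMENTS (1u << 28)  /* held constant across runs (illustrative) */

static volatile int array[1u << 20]; /* 4 MB of ints (illustrative maximum) */

/* Perform size * iterations increments over the first 'size' elements. */
static void test_function(unsigned size, unsigned iterations)
{
    for (unsigned i = 0; i < iterations; i++)
        for (unsigned x = 0; x < size; x++)
            array[x]++;
}

int main(void)
{
    /* Grow the working set; shrink iterations so total work stays constant. */
    for (unsigned size = 1u << 10; size <= 1u << 20; size <<= 2) {
        unsigned iterations = TOTAL_INCREMENTS / size;
        clock_t start = clock();
        test_function(size, iterations);
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
        printf("%8u ints: %.2f s\n", size, secs);
    }
    return 0;
}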

Inefficient Loop Nesting and Blocking

Note: when traversing a multidimensional array, the inner loop should access contiguous elements.

Dense, multidimensional arrays are common datasets in numerical applications. When a program is processing a multidimensional array, such as pixel data in image processing or elements in a finite-difference program, it is important to pay attention to which dimension is assigned to the inner loop and which dimension is assigned to the outer loop.

If an array is traversed along the wrong axis, only one element will be used from each cache line before the program continues to the next cache line. For example, consider this small snippet of C code iterating over a two-dimensional matrix:

double array[SIZE][SIZE];
for (int col = 0; col < SIZE; col++)
    for (int row = 0; row < SIZE; row++)
        array[row][col] = f(row, col);

In one iteration of the outer loop, the inner loop will access one element in each row of a specific column. For each element touched by the loop, a new cache line is accessed. If the array is larger than the cache size, the rows from the top of the matrix may already have been evicted from the cache when the bottom of the matrix is reached. The next iteration of the outer loop then has to reload all of the cache lines again, each time paying the full main-memory latency cost.

If the nesting of the loops is changed so that they traverse the matrix along the rows instead of along the columns, the program receives a lot more benefit from the cache:

double array[SIZE][SIZE];
for (int row = 0; row < SIZE; row++)
    for (int col = 0; col < SIZE; col++)
        array[row][col] = f(row, col);

As the program processes each column along a row, data is moved into the cache and all of the data in that line is used before the cache line ends up being discarded. Accessing data sequentially also allows hardware prefetch logic to engage, placing data in the cache before the processor even asks for it. This maximizes the use of available memory bandwidth.
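
The heading above also mentions blocking (tiling): when a computation must touch a matrix along both axes, such as a transpose, neither loop order alone is cache-friendly, and processing the matrix in small tiles keeps each tile's cache lines resident while they are reused. The following is a minimal sketch; the 64-element tile edge is an illustrative choice, and SIZE is assumed to be a multiple of BLOCK:

#define SIZE  1024
#define BLOCK 64   /* illustrative tile edge; tune so one tile fits in cache */

double in[SIZE][SIZE], out[SIZE][SIZE];

/* Blocked transpose: both arrays are visited one BLOCK x BLOCK tile at a
 * time, so each tile's cache lines are fully used before being evicted. */
void transpose_blocked(void)
{
    for (int rb = 0; rb < SIZE; rb += BLOCK)
        for (int cb = 0; cb < SIZE; cb += BLOCK)
            for (int r = rb; r < rb + BLOCK; r++)
                for (int c = cb; c < cb + BLOCK; c++)
                    out[c][r] = in[r][c];
}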

Padding: The Bane of Cache Line Utilization

Note: structure members are aligned on word boundaries, so the compiler may insert padding bytes between fields.
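
The effect is easy to demonstrate: because each member must sit at an offset matching its alignment, member ordering alone changes how much padding the compiler inserts, and therefore how much of each cache line carries useful data. The sizes in the comments assume a typical ABI with 8-byte-aligned doubles:

#include <stdio.h>

/* Poor ordering: each char is followed by 7 bytes of padding so the
 * next double stays 8-byte aligned. */
struct padded {
    char a;       /* 1 byte + 7 bytes padding */
    double x;
    char b;       /* 1 byte + 7 bytes padding */
    double y;
};                /* typically 32 bytes */

/* Grouping members by size removes most of the padding. */
struct reordered {
    double x, y;
    char a, b;    /* 2 bytes + 6 bytes of tail padding */
};                /* typically 24 bytes */

int main(void)
{
    printf("padded:    %zu bytes\n", sizeof(struct padded));
    printf("reordered: %zu bytes\n", sizeof(struct reordered));
    return 0;
}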

Moving Unnecessary Data

Key point: move frequently accessed fields out of large structures.

Another type of poor cache line utilization can be seen in the following code snippet containing a vector of data objects, each being a structure with a status and a payload:

struct object_def {
    char status;
    double object_var1, object_var2, object_var3, object_var4, object_var5;
};

struct object_def object_vector[Size];

void functionA()
{
    int i;
    for (i = 0; i < Size; i++) {
        if (object_vector[i].status)
            do_something_with_object(object_vector[i]);
    }
}

The program periodically goes through all the elements of the vector and checks each element's status. If the status indicates that the object is active, some action is taken and the object variables are accessed. The status byte and the five object variables are allocated together in a 48-byte chunk (one status byte, seven bytes of alignment padding, and five 8-byte doubles). Each time the status of an object is checked, all 48 bytes are brought into the cache, even though only one byte is needed. A more efficient way to handle this is to keep all the status bytes in a separate vector, as shown below, which reduces the cache footprint and bandwidth needed for the status scanning to one-eighth of the previous code example:

struct object_def {
    double object_var1, object_var2, object_var3, object_var4, object_var5;
};

struct object_def object_vector[Size];
char object_vector_status[Size];

void functionA()
{
    int i;
    for (i = 0; i < Size; i++) {
        if (object_vector_status[i])
            do_something_with_object(object_vector_status[i], object_vector[i]);
    }
}
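
This transformation is a form of hot/cold data splitting, a step toward a structure-of-arrays layout: with 64-byte cache lines, one line of the status vector now holds 64 status bytes instead of one.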

Performance improvements of a factor of two have been achieved for the SPEC CPU2006 benchmark libquantum, and a factor of seven for the open-source application cigar, when similar optimizations were applied.

Other Cache Issues

Other program behaviors that can limit cache effectiveness (and may present an opportunity for optimization) include:

  • Using random access patterns, such as chasing pointers through linked lists
  • Using unnecessarily large and sparse hash tables
  • Fetching or writing data into the cache even though it is unlikely to be reused
  • Updating only a single byte of a cache line
  • Prefetching data that will never be used
  • Prefetching data into the cache too early or too late
  • Separating computations on large arrays into different loops when there is no data dependency that requires one loop to complete before the other starts (a fusion sketch follows this list)
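
As a hedged illustration of that last point, fusing two independent passes over the same arrays lets each cache line serve both operations while it is still resident; the arrays and arithmetic here are invented for the example:

#define N (1 << 20)
double a[N], b[N];

/* Unfused: if the arrays exceed the cache, the second loop re-fetches
 * every cache line that the first loop already brought in and evicted. */
void two_passes(void)
{
    for (int i = 0; i < N; i++)
        a[i] = a[i] * 2.0;
    for (int i = 0; i < N; i++)
        b[i] = b[i] + a[i];
}

/* Fused: each a[i] is updated and consumed while its cache line is
 * resident; no data dependency prevents merging the loops. */
void one_pass(void)
{
    for (int i = 0; i < N; i++) {
        a[i] = a[i] * 2.0;
        b[i] = b[i] + a[i];
    }
}
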
posted on 2012-11-15 15:37 by Oceanedge