
From Humus's website.


Shader programming tips #1
Thursday, January 29, 2009 | Permalink

DX9 generation hardware was largely vector based. DX10 generation hardware, on the other hand, is generally scalar based. This is true for both ATI and Nvidia cards. The Nvidia chips are fully scalar, and while the ATI chips still have explicit parallelism, the 5 scalars within an instruction slot don't need to perform the same operation or operate on the same registers. This is important to remember and should affect how you write shader code. Take for instance this simple diffuse lighting computation:

float3 lightVec = normalize(In.lightVec);
float3 normal = normalize(In.normal);
float diffuse = saturate(dot(lightVec, normal));


A normalize is essentially a DP3-RSQ-MUL sequence. DP3 and MUL are 3-way vector instructions and RSQ is scalar. The shader above is thus 3 x DP3 + 2 x MUL + 2 x RSQ (the two normalizes plus the final dot product), which counted in scalar slots is 9 + 6 + 2 = 17 scalar operations.
Now instead of multiplying the RSQ values into the vectors, why don't we just multiply those scalars into the final scalar instead? Then we would get this shader:

float lightVecRSQ = rsqrt(dot(In.lightVec, In.lightVec));
float normalRSQ = rsqrt(dot(In.normal, In.normal));
float diffuse = saturate(dot(In.lightVec, In.normal) * lightVecRSQ * normalRSQ);


This replaces two vector multiplications with two scalar multiplications, saving us 4 scalar operations. The math-savvy may also recognize that rsqrt(x) * rsqrt(y) = rsqrt(x * y). So we can simplify it to:

float lightVecSQ = dot(In.lightVec, In.lightVec);
float normalSQ = dot(In.normal, In.normal);
float diffuse = saturate(dot(In.lightVec, In.normal) * rsqrt(lightVecSQ * normalSQ));


We are now down to 12 operations (3 x DP3 = 9, plus one RSQ and two scalar MULs) instead of 17. Checking things out in GPU ShaderAnalyzer showed that the final instruction count is 5 in both cases, but the latter shader leaves more empty scalar slots that you can fill with other useful work.

It should be mentioned that while this gives the biggest benefit on modern DX10 cards, it was always good to do these kinds of scalarizations. It often helps older cards too. For instance, on the R300-R580 generation (which had vec3+scalar ALUs) it often meant more work could fit into the scalar pipe instead of occupying the vector pipe.


Shader programming tips #2
Wednesday, February 4, 2009 | Permalink

Closely related to what I mentioned in tips #1, it's of great importance to use parentheses properly. HLSL and GLSL evaluate expressions left to right, just like C/C++. If you're multiplying vectors and scalars together, the number of operations generated may differ a lot. Consider this code:

float4 result = In.color.rgba * In.intensity * 1.7;


This will result in the color vector being multiplied with the intensity scalar, which is 4 scalar operations. The result is then multiplied with 1.7, which is another 4 scalar operations, for a total of 8. Now try this:

float4 result = In.color.rgba * (In.intensity * 1.7);


Intensity is now multiplied by 1.7, which is a single operation, and then the result is multiplied with color, which is 4 more, for a total of five scalar operations. A saving of three scalar operations merely by placing parentheses in the code.

Shouldn't the compiler be smart enough to figure this out by itself? Not really. HLSL will sometimes merge constants when it considers it safe to do so. However, when dealing with variables whose values have an unknown range, the compiler cannot assume that multiplying in another order will give the same result. For instance, 1e-20 * 1e30 * 1e10 will result in 1e20 if you multiply left to right, whereas 1e-20 * (1e30 * 1e10) will overflow and return INF.

In general I recommend that you place parentheses even around compile-time constants to make sure the compiler merges them when appropriate.
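As a hypothetical illustration of that last point, the inner parentheses below let the compiler fold the two literals into a single constant at compile time (this is just the earlier 1.7 written as a product of two compile-time constants):

float4 result = In.color.rgba * (In.intensity * (2.0 * 0.85));

Written as In.intensity * 2.0 * 0.85 instead, the compiler would have to reassociate across a variable to merge the literals, which, as explained above, it won't do.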


Shader programming tips #3
Monday, February 9, 2009 | Permalink

Multiplying a vector by a matrix is one of the most common tasks in graphics programming. A full matrix multiplication is generally four float4 vector instructions. Depending on whether you have a row-major or column-major matrix, and whether you multiply the vector from the left or right, the result is either a DP4-DP4-DP4-DP4 or a MUL-MAD-MAD-MAD sequence.

In the vast majority of cases the w component of the vector is 1.0, and in this case you can optimize it down to three instructions. For this to work, declare your matrix as row_major. If you were previously passing a column-major matrix you'll need to transpose it before passing it to the shader. You need a matrix that works with mul(vertex, matrix) when declared as row_major. Then you can do something like this to accomplish the transformation in three MAD instructions:

float4 pos;
pos = view_proj[2] * vertex.z + view_proj[3];
pos += view_proj[0] * vertex.x;
pos += view_proj[1] * vertex.y;


It should be mentioned that vs_3_0 has a read port limitation, and since the first line uses two different constant registers, HLSL will insert a MOV instruction as well. But the hardware can be more flexible (ATI cards are, for instance). In vs_4_0 there's no such limitation, and HLSL will generate a three-instruction sequence.
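Put together, a minimal sketch could look like this (view_proj and TransformPos are placeholder names, and vertex.w is assumed to be 1.0):

row_major float4x4 view_proj;

float4 TransformPos(float3 vertex)
{
    // Equivalent to mul(float4(vertex, 1.0), view_proj), but since w is known
    // to be 1.0 the translation row is folded into the first MAD, giving a
    // MAD-MAD-MAD sequence instead of MUL-MAD-MAD-MAD.
    float4 pos = view_proj[2] * vertex.z + view_proj[3];
    pos += view_proj[0] * vertex.x;
    pos += view_proj[1] * vertex.y;
    return pos;
}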


Shader programming tips #4
Monday, April 27, 2009 | Permalink

The depth buffer is increasingly being used for more than just hidden surface removal. One of the more interesting uses is to find the position of already rendered geometry, for instance in deferred rendering, but also in plain old forward rendering. The easiest way to accomplish this is something like this:

float4 pos = float4(In.Position.xy, sampled_depth, 1.0);
float4 cpos = mul(pos, inverse_view_proj_scale_bias);
float3 world_pos = cpos.xyz / cpos.w;

The inverse_view_proj_scale_bias matrix is the inverse of the view_proj matrix multiplied with a scale_bias matrix that brings In.Position.xy from [0..w, 0..h] into [-1..1, -1..1] range.
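As a sketch of what that scale_bias could look like, assuming D3D conventions (screen-space y points down while clip-space y points up; D3D9's half-pixel offset ignored) and the row-vector mul(pos, matrix) convention, with w and h being the render target dimensions:

// Maps x: [0..w] -> [-1..1] and y: [0..h] -> [1..-1]; z (the depth) passes through
float4x4 scale_bias = float4x4(
    2.0 / w,  0.0,      0.0, 0.0,
    0.0,     -2.0 / h,  0.0, 0.0,
    0.0,      0.0,      1.0, 0.0,
   -1.0,      1.0,      0.0, 1.0);

The application would compose inverse_view_proj_scale_bias = scale_bias * inverse(view_proj) once on the CPU (in row-vector order), so the shader only pays for the single mul.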

The same technique can of course also be used to compute the view position instead of the world position. In many cases you're only interested in the view Z coordinate though, for instance for fading soft particles, fog distance computations, depth of field etc. While you could execute the above code and just use the Z coordinate this is more work than necessary in most cases. Unless you have a non-standard projection matrix you can do this in just two scalar instructions:

float view_z = 1.0 / (sampled_depth * ZParams.x + ZParams.y);

ZParams is a float2 constant you pass from the application containing the following values:

ZParams.x = 1.0 / far - 1.0 / near;
ZParams.y = 1.0 / near;

If you're using a reversed projection matrix with Z=0 at far plane and Z=1 at near plane you can just swap near and far in the above computation.
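To see why this works, assuming the standard projection matrix: the depth buffer receives depth = far * (view_z - near) / (view_z * (far - near)). Solving for view_z gives

1 / view_z = depth * (1/far - 1/near) + 1/near

which is exactly sampled_depth * ZParams.x + ZParams.y: one MAD followed by one RCP.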

Shader programming tips #5
Friday, May 1, 2009 | Permalink

In the comments to Shader programming tips #4, Java Cool Dude mentioned that for fullscreen passes you can pass an interpolated direction vector and use the linearized depth to compute the world position. Basically, with view_z computed using the math in #4, you compute:

float3 world_pos = cam_pos + In.dir * view_z;

This amounts to only two scalar operations and a single float3 MAD.

What about regular non-fullscreen passes? It turns out you can do that as well, using a nice DX10 feature. The problem is that we need to interpolate the direction vector in screen space, rather than with perspective correction. For screen-aligned primitives it's the same thing, so it works out in this case, but for "normal" mesh data it's a completely different story. In DX10, and also in GLSL, there's now a noperspective keyword you can add to your interpolator, which changes the interpolation mode to eliminate the perspective correction, giving you an interpolation that's linear in screen space instead.

How do we compute the direction vector? Just take the position you're writing out from the vertex shader and push it to the far plane. Depending on whether you're using a reversed projection matrix or not, you want either Z=0 or Z=1, which can be done using float4(Out.Position.xy, 0, Out.Position.w) or Out.Position.xyww respectively in homogeneous coordinates. Transform this vector with the inverse view_proj matrix to get the world-space position of the far-plane equivalent of the point. Now subtract cam_pos from this and that's the direction vector. Instead of subtracting cam_pos in the vertex shader you can just bake that into the same matrix and get it for free. The resulting vertex shader snippet is something like this:

float4 dir = mul(view_proj_inv, Out.Position.xyww);
Out.dir = dir.xyz / dir.w;

Finally, note that view_z as computed in #4 goes from 0 at the camera to far_plane at the far clipping plane. For this computation we need it to be 1.0 at the far clipping plane, which can be done by simply multiplying ZParams by far_plane. Alternatively, you can divide Out.dir by far_plane in the vertex shader.
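Putting the pieces together, here's a minimal sketch of both stages (the names ZParamsScaled, cam_pos, view_proj_inv, ComputeDir and GetWorldPos are made up for illustration; view_proj_inv is assumed to have the -cam_pos translation baked in, and a standard non-reversed projection is assumed):

float4x4 view_proj_inv;  // inverse view_proj with -cam_pos baked into the translation
float3 cam_pos;
float2 ZParamsScaled;    // the ZParams from #4, multiplied by far_plane

struct VsOut
{
    float4 Position : SV_Position;
    noperspective float3 dir : TEXCOORD0;  // interpolated linearly in screen space
};

// Vertex shader: after Out.Position is computed, push it to the far plane
// and turn it into a direction vector.
void ComputeDir(inout VsOut Out)
{
    float4 dir = mul(view_proj_inv, Out.Position.xyww);
    Out.dir = dir.xyz / dir.w;
}

// Pixel shader: reconstruct the world position from the sampled depth.
float3 GetWorldPos(VsOut In, float sampled_depth)
{
    // view_z normalized to hit 1.0 at the far clipping plane
    float view_z = 1.0 / (sampled_depth * ZParamsScaled.x + ZParamsScaled.y);
    return cam_pos + In.dir * view_z;
}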

