Wednesday, 9 April 2014

Tangent Basis in Compact Form

Recently at Flying Wild Hog my task was to change the vertex format of meshes. Both static and skinned meshes used 12 bytes for storing the normal, tangent, bitangent and color of a vertex. It looked something like this:
UByte4N normal;
UByte4N tangent;
UByte4N bitangent;
Each tangent basis vector is stored explicitly in 24 bits, and the last 8 bits of each vector are used to store one component of the vertex color (red in the normal's alpha, green in the tangent's alpha and blue in the bitangent's alpha).

The UByte4N type internally stores one 32-bit unsigned integer (hence the U prefix) and has some helper functions for packing and unpacking four 8-bit values. Since the tangent basis vectors' components are floats in the normalized $[-1, 1]$ interval (hence the N postfix, meaning normalized), there are also functions which convert them to the integer $[0, 255]$ range.

This form of the tangent basis with the color packed alongside is actually quite nice. The precision is not bad and all channels are fully utilized. The biggest problem is that in order to get the color we must read all three vectors. We can do better, though.

The first thing is to get rid of the bitangent vector altogether. The whole tangent basis can be packed nicely with just the normal, the tangent and one value that indicates the handedness of the entire basis. To get the handedness of the basis you can compute the determinant of the matrix formed from that basis - it will be either $1$ or $-1$ for an orthonormal basis. This way we save two values (one handedness value as opposed to three defining the bitangent vector) and we can extract the color to a separate vector:
UByte4N normal; // .w unused
UByte4N tangent; // .w stores handedness
UByte4N color;
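As a side note, here is a minimal sketch of how the handedness flag itself could be computed on the CPU (my own illustration rather than the actual engine code; Vec3, Cross and Dot are assumed helpers):
// Sketch: handedness of the tangent basis, i.e. the sign of the
// determinant of the matrix formed from normal, tangent and bitangent.
float ComputeHandedness( const Vec3& normal, const Vec3& tangent, const Vec3& bitangent )
{
    // +1 or -1 for an orthonormal basis
    return Dot( Cross( normal, tangent ), bitangent ) >= 0.0f ? 1.0f : -1.0f;
}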
Now the color has gained an alpha channel. Also, the $w$ component of the normal vector is unused. And hey, do we really need to waste 8 bits for the handedness? A single bit would suffice. Fortunately, GPUs are equipped with the RGB10A2 format, where 10 bits are devoted to each of the RGB channels and 2 bits to the alpha channel. This way we can store the tangent basis vectors with greater precision. So, our new tangent basis and color part of the vertex structure is this:
UX10Y10Z10W2N normal; // .w unused
UX10Y10Z10W2N tangent; // .w stores handedness
UByte4N color;
Now this one looks scary at first glance. The U prefix again means that the stored values are unsigned - the integer $[0, 1023]$ range for the $x$, $y$ and $z$ components and $[0, 3]$ for the $w$ component. Also, the N postfix means that the input float values to the structure must be normalized, that is, in the $[-1, 1]$ interval.

Finally, some piece of working code:
struct UX10Y10Z10W2N
{
    UInt32 m_value;

    void Set( const float* xyz, int w )
    {
        m_value = 0;

        for ( int i = 0; i < 3; i++ )
        {
            float f = ( xyz[ i ] + 1.0f ) * 511.5f;

            if ( f < 0.0f )
                f = 0.0f;
            else if ( f > 1023.0f )
                f = 1023.0f;

            m_value |= ( ( int )f ) << 10*i;
        }

        m_value |= ( ( UInt32 )w ) << 30; // cast so we don't shift a signed int into the sign bit
    }

    void Get( float& x, float& y, float& z ) const
    {
        x = ( float )( ( m_value >> 0  ) & 1023 );
        y = ( float )( ( m_value >> 10 ) & 1023 );
        z = ( float )( ( m_value >> 20 ) & 1023 );

        x = ( x - 511.0f ) / 512.0f;
        y = ( y - 511.0f ) / 512.0f;
        z = ( z - 511.0f ) / 512.0f;
    }

    void Get( float& x, float& y, float& z, int& w ) const
    {
        Get( x, y, z );
        w = ( ( m_value >> 30 ) & 3 );
    }
};
I'm leaving you without much explanation as it should be quite easy to read. One thing worth noting is that the $w$ component is not passed or read as a float but rather as an int. That is because I decided it does not make much sense to deal with a float here, as we have barely 2 bits. I pass either $0$ or $3$ (which will be converted to $1$ in the shader), depending on the handedness of the tangent basis.
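As a final remark, reconstructing the bitangent from this data is just a cross product flipped by the handedness. A rough sketch (my own illustration, written CPU-side; which packed value maps to which sign depends on your conventions):
// Sketch: rebuild the bitangent from the unpacked normal, tangent and
// the 2-bit handedness (0 or 3, which unpacks to 0 or 1).
Vec3 ReconstructBitangent( const Vec3& normal, const Vec3& tangent, int handedness )
{
    float sign = ( handedness != 0 ) ? 1.0f : -1.0f; // assumed mapping of {0, 1} to {-1, +1}
    return Cross( normal, tangent ) * sign;
}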

Saturday, 22 February 2014

Edge Detection from Depths and Normals

Edge detection is an algorithm that comes in handy from time to time in every graphics programmer's life. There are various approaches to this problem and various tasks that we need edge detection for. For instance, we might use edge detection to perform anti-aliasing (as in here http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter09.html) or to draw some stylish outlines. So how can we detect the edges? Some algorithms compare color differences between the pixel we are currently processing and its neighbours (like the anti-aliasing algorithms FXAA and MLAA). Others, like the one I linked a few words before, use the rendered scene's geometry information, like the depth buffer and normals. In this post I present a variant of that kind of algorithm.

The algorithm I use needs three pixels: the one we're processing and two of its neighbours - for instance, the one to the left and the one above. We read the normal vectors (either in view or world space) and the depths at those pixels. We also recover view- or world-space positions from the sampled depths (I showed how to do this here http://maxestws.blogspot.com/2013/11/recovering-camera-position-from-depth.html).

So, to sum up at this point, we have three (I assume view space) normals ($N$, $N_{left}$ and $N_{top}$) and positions ($P$, $P_{left}$ and $P_{top}$). Here comes the shader code:
bool IsEdge(
    float3 centerNormal,
    float3 leftNormal,
    float3 topNormal,
    float3 centerPosition,
    float3 leftPosition,
    float3 topPosition,
    float normalsDotThreshold,
    float distanceThreshold)
{
    // normals dot criterion
    
    float centerLeftNormalsDot = dot(centerNormal, leftNormal);
    float centerTopNormalsDot = dot(centerNormal, topNormal);
        
    if (centerLeftNormalsDot < normalsDotThreshold ||
        centerTopNormalsDot < normalsDotThreshold)
    {
        return true;
    }
    
    // distances difference criterion
    
    float3 v1 = leftPosition - centerPosition;
    float3 v2 = topPosition - centerPosition;
    
    if (abs(dot(centerNormal, v1)) > distanceThreshold ||
        abs(dot(centerNormal, v2)) > distanceThreshold)
    {
        return true;
    }    
    
    //
        
    return false;
}
A good starting value for normalsDotThreshold is $0.98$, and for distanceThreshold it's $0.01$.

The code uses two criteria to determine whether we are on an edge. The first checks the dot products of the normals. Its purpose is to detect edges that appear on a continuous surface but with varying normals - where, for instance, two perpendicular planes meet (think of the side of a box standing on a floor).

The second criterion checks distances. Imagine two planes parallel to each other but at different heights. When viewed in such a way that an edge of the top plane appears in front of the bottom plane, we clearly have, well, an edge to detect. The normals' dot product won't work because the normals of all pixels here are the same (the planes are parallel). So in this case we need to track vectors from the center pixel to the neighbouring pixels. If they are too far away, then we have an edge. Note that we actually dot those vectors with the center pixel's normal. This is to avoid false positives for pixels that are on the same plane but far away from each other. In that case, obviously, we don't care that they are far apart - they lie on the same plane, so there is no edge.
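In other words, the second criterion measures how far each neighbour lies from the plane passing through the center pixel with normal $N$:
\begin{eqnarray}
d_{left} = |N \cdot (P_{left} - P)| \cr
d_{top} = |N \cdot (P_{top} - P)|
\end{eqnarray}
and flags an edge when either distance exceeds distanceThreshold.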

I know that a picture is worth more than a thousand words. That's why I preferred to describe the code with words instead of putting images here. Wait... what? That's right. Your best bet is to just take this simple piece of code and try it in practice to see exactly how it works. I highly recommend making the following modifications and seeing what happens:
1. Leave just the normals dot criterion.
2. Leave just the distances difference criterion.
3. Normalize vectors $v_1$ and $v_2$ in the distances difference criterium.

Have fun :).

Monday, 18 November 2013

Recovering Camera Position from Depth

Memory bandwidth is a precious resource on current-generation GPU hardware. So precious that storing less data and using packing/unpacking algorithms can often be much faster than storing fat data in unpacked form. The most common example of this that comes to my mind is deferred shading/lighting algorithms.

In deferred shading we usually need to store, in some offscreen buffer, the positions of the pixels that we are going to light. The simplest approach is to store the position coordinates directly, that is, three floats per pixel. That forces us to use either the RGBA32F format or, if we can get away with lower precision, the RGBA16F format. In either case that is a lot of memory.

Can we do better? Well, yes. A common solution to this problem is to use, for instance, the depth buffer and recover the pixels' positions from it. We should be able to do that, right? If we could get from camera space to projection space and store values in the depth buffer, then we can do the reverse. To do that we just need to take the value from the depth buffer and find the $x$ and $y$ normalized device coordinates of the pixel we're processing (those are fairly easy to find). Once we have those values, we form a vector from them (plugging in $w=1$) and multiply it by the inverse projection matrix. That gives us the position of the pixel in camera space.
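Written out (using the row-vector convention used later in this post, with $M_{proj}$ being the projection matrix), the reconstruction is:
\begin{eqnarray}
[x, y, z, w] &=& [x_{ndc}, y_{ndc}, z_{ndc}, 1] \; M_{proj}^{-1} \cr
P_{camera} &=& [\frac{x}{w}, \frac{y}{w}, \frac{z}{w}]
\end{eqnarray}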

So are we done? Technically, yes. There are some flaws to this approach however. The first one is that the depth buffer is not always available on all hardware (mobile devices for instance). That is not so much of a problem as we can generate a "custom" depth buffer that will most likely also be used for other effects, like soft particles and such. The second one is performance. The vector-matrix multiplication must take place in the pixel shader where every single instruction counts. The sole vector-matrix mul is four dot products. We might also need to normalize the output vector so that its $w=1$, which is one division.

Fortunately, there is a better way to recover the camera space position from depth that will take barely one mul in the pixel shader! Let's have a look at the following picture:


The picture shows a side view of the camera's frustum. The point $(y, z)$ lies on the near plane. Our task is to find $y'$ of some point of interest. The value of $z'$ is known - it is just the linear depth value in camera space of that point. We can get it either by converting the normalized device coordinate (NDC) $z$ from the depth buffer to linear space (a word on this at the end of the post) or by utilizing the separate depth buffer we mentioned above. This separate buffer (which we will have to use when we don't have access to the hardware depth buffer) does not have to store the NDC $z$ values but rather linear camera space $z$ values. This is actually even more desirable for many practical applications.

So, we know $z'$. We want to find $y'$, but first we also need to find the $y$ and $z$ values. This pair represents the coordinates of our point of interest projected onto the camera's near plane. Given the point's position in NDC we can calculate the coordinates on the near plane like this (in the vertex shader):
  vec2 position_NDC_normalized = 0.5 * position_NDC.xy / position_NDC.w;
  varying_positionOnNearPlane = vec3(nearPlaneSize.xy * position_NDC_normalized, zNear);
First, we scale the NDC position so that all coordinates are in the $[-0.5, 0.5]$ range. Then we multiply those coordinates by the size (width and height) of the near plane and put the camera's distance to the near plane, zNear, into the $z$ coordinate. The resulting vector (varying_positionOnNearPlane) can now be interpolated and read in the pixel shader.

As we now know $y$, $z$ and $z'$, evaluating $y'$ is easy - we just employ similar triangles:
\begin{eqnarray}
y' = \frac{y * z'}{z}
\end{eqnarray}
The pixel shader code:
  float linearDepth = texture2DProj(gbufferLinearDepthTexture, varying_screenTexCoord).x;
  vec3 position_view = varying_positionOnNearPlane * linearDepth / varying_positionOnNearPlane.z;
We're almost there. I promised there would be only one mul, but at the moment there is one mul and one div. It might not be obvious at first glance, but the $z$ coordinate of varying_positionOnNearPlane is constant for all points - it's just the distance from the camera to the near plane - so this division can be moved to the vertex shader.
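After moving the division, the per-pixel work indeed reduces to a single multiply:
\begin{eqnarray}
\mathbf{P}_{view} = \frac{\mathbf{P}_{near}}{z_{near}} z'
\end{eqnarray}
where $\frac{\mathbf{P}_{near}}{z_{near}}$ is computed in the vertex shader (and interpolated) and $z'$ is the linear depth sampled per pixel.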

And that's it. With a few simple observations and math "tricks" we managed to write an optimal pixel shader for camera space position reconstruction from linear depth.

One last thing that remains untold is how to recover linear depth from the hardware depth buffer. As we know, to get from camera space to projection (NDC) space we use some sort of projection matrix. We can use this fact to derive a simple formula that only transforms the $z$ coordinate from projection space back to camera space. So let's say we have a vector $[x, y, z, 1]$ defining a point in camera space and we multiply it by a projection matrix (matrix taken after http://www.opengl.org/sdk/docs/man2/xhtml/gluPerspective.xml; note that we use the row-vector convention and have collapsed the entries' formulas into the constants $A$, $B$, $C$ and $D$) to get the vector in projection space:
\[
\begin{bmatrix}
x & y & z & 1
\end{bmatrix}
\begin{bmatrix}
A & 0 & 0 & 0 \cr
0 & B & 0 & 0 \cr
0 & 0 & C & -1 \cr
0 & 0 & D & 0
\end{bmatrix}
=
\begin{bmatrix}
xA & yB & zC + D & -z
\end{bmatrix}
\sim
\begin{bmatrix}
\frac{xA}{-z} & \frac{yB}{-z} & -(C + \frac{D}{z}) & 1
\end{bmatrix}
\]
The term that goes into the depth buffer is $-(C + \frac{D}{z})$, where $z$ is the linear depth. Assuming the value stored in the depth buffer that we read is $z'$, we just need to solve the following equation for $z$:
\begin{eqnarray}
-(C + \frac{D}{z}) = z' \cr
z = \frac{-D}{z' + C}
\end{eqnarray}
The constants $C$ and $D$ can be extracted from the projection matrix and passed to the pixel shader.
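As a sketch (my own, assuming the gluPerspective-style matrix above and an OpenGL-style NDC depth in $[-1, 1]$), the CPU-side setup and the conversion could look like this:
// Sketch: constants C and D for the projection matrix used above, and the
// inverse of depthNDC = -(C + D / z). The result is the camera-space z,
// which is negative in a right-handed setup (the camera looks down -Z).
void ComputeDepthLinearizationConstants( float zNear, float zFar, float& C, float& D )
{
    C = ( zFar + zNear ) / ( zNear - zFar );
    D = ( 2.0f * zFar * zNear ) / ( zNear - zFar );
}

float LinearizeDepth( float depthNDC, float C, float D )
{
    return -D / ( depthNDC + C );
}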

DEDICATION: I dedicate this note to Karolina P., who is currently struggling a bit with abstract algebra and C++ programming, but I am sure that her persistence and character will let her get through both these classes and the entire semester :)

Wednesday, 3 October 2012

Fast Texture Projection onto Geometry on the GPU

While implementing various graphical algorithms, a problem that must be solved from time to time is the projection of a texture onto some geometry. This happens, for instance, when we want to render a water reflection or project g-buffer textures onto a light's geometry.

The procedure is pretty straightforward. If we multiply a vertex's position by the whole world-view-projection matrix and perform the perspective division, we get the vertex's position in normalized device coordinates, which are all in the $[-1, 1]$ interval. If we want to project a texture onto geometry, we need to do exactly the same thing (but use the world-view-projection matrix of some sort of "projector", not the game's "camera") and remap the vertex's final $x$ and $y$ coordinates from the $[-1, 1]$ interval to the $[0, 1]$ interval, thus generating texture coordinates which can be used to sample a texture.

Let's say that we have a vertex shader like this:
uniform mat4 worldViewProjTransform;
uniform mat4 projectorWorldViewProjTransform;

attribute vec4 attrib_position;

varying vec4 varying_projectorTexCoord;

void main()
{
  gl_Position = attrib_position * worldViewProjTransform;

  varying_projectorTexCoord = attrib_position * projectorWorldViewProjTransform;
}
In this vertex shader we simply multiply the vertex's position by the projector's matrix. The perspective division and the coordinate remapping do not occur here yet. They will in the fragment shader:
uniform sampler2D projectorSampler;

varying vec4 varying_projectorTexCoord;

void main()
{
  vec4 projectorTexCoord = varying_projectorTexCoord;
  
  projectorTexCoord /= projectorTexCoord.w;
  projectorTexCoord.xy = 0.5 * projectorTexCoord.xy + 0.5;
  
  gl_FragColor = texture2D(projectorSampler, projectorTexCoord.xy);
}
In this pixel shader we first do the perspective division and then, once we know we are in the $[-1, 1]$ interval, we can do the remapping part.

The title of this post contains the word "fast". Is our approach fast? Well, with today's hardware it actually is, but we can do better. The very first thing that should capture your attention is the division in the pixel shader. Can we do something about it? Yup. Basically any real-time shading language has a function like texture2DProj which, before taking a sample from the texture, will first divide all components of the second argument by its 4th component, $w$. So, we can try doing this:
uniform sampler2D projectorSampler;

varying vec4 varying_projectorTexCoord;

void main()
{
  vec4 projectorTexCoord = varying_projectorTexCoord;

  projectorTexCoord.xy = 0.5 * projectorTexCoord.xy + 0.5;
  
  gl_FragColor = texture2DProj(projectorSampler, projectorTexCoord);
}
Will this work? Of course not. That is because we have switched the order of operations. In this version of the code the perspective division happens after the remapping, whereas it should occur before it. So, we need to fix the remapping code.

Remember that the projected coordinates $[x, y, z, w]$ that come into the pixel shader are all in the $[-w, w]$ interval. This means that we want to remap the coordinates from $[-w, w]$ to $[0, w]$. Then, the division can be performed either by dividing manually and calling texture2D, or by calling texture2DProj directly. So, our new pixel shader can look like this:
uniform sampler2D projectorSampler;

varying vec4 varying_projectorTexCoord;

void main()
{
  vec4 projectorTexCoord = varying_projectorTexCoord;

  projectorTexCoord.xy = 0.5 * projectorTexCoord.xy + 0.5 * projectorTexCoord.w;
  
  gl_FragColor = texture2DProj(projectorSampler, projectorTexCoord);
}
Are we done yet? Well, there is actually one more thing we can do. We can shift the "$0.5$" multiplication and addition to the vertex shader. So the final vertex shader is:
uniform mat4 worldViewProjTransform;
uniform mat4 projectorWorldViewProjTransform;

attribute vec4 attrib_position;

varying vec4 varying_projectorTexCoord;

void main()
{
  gl_Position = attrib_position * worldViewProjTransform;
 
  varying_projectorTexCoord = attrib_position * projectorWorldViewProjTransform;
  varying_projectorTexCoord.xy = 0.5*varying_projectorTexCoord.xy + 0.5*varying_projectorTexCoord.w;
}
and the final pixel shader is:
uniform sampler2D projectorSampler;

varying vec4 varying_projectorTexCoord;

void main()
{
  vec4 projectorTexCoord = varying_projectorTexCoord;

  gl_FragColor = texture2DProj(projectorSampler, projectorTexCoord);
}
Yes. Now we are done.

One might wonder if we could do the perspective division in the vertex shader and call texture2D instead of texture2DProj in the pixel shader. The answer is: we can't. The hardware, during rasterization, interpolates all vertex-to-pixel shader varyings. Assuming that varying_projectorTexCoord is the vector $[x, y, z, w]$, the value that is read in the pixel shader is $[\mbox{int}(x), \mbox{int}(y), \mbox{int}(z), \mbox{int}(w)]$, where $\mbox{int}$ is an interpolating function. If we performed the division in the vertex shader, the vector coming into the pixel shader would be of the form $[\mbox{int}(\frac{x}{w}), \mbox{int}(\frac{y}{w}), \mbox{int}(\frac{z}{w}), 1]$. And obviously $\mbox{int}(\frac{x}{w}) \neq \frac{\mbox{int}(x)}{\mbox{int}(w)}$ (same for the other components). Note that the right-hand side of this inequality is what happens when we do the perspective division in the pixel shader.

Instead of doing all this trickery we could include a special remapping matrix in the projector's world-view-projection chain. What this matrix looks like can be found here (this is also a great tutorial about texture projection in general, by the way). The problem with this approach is that you might not always have this matrix injected into the whole chain, and for some reason you either don't want to or even can't do it. You could, naturally, create this remapping matrix in the vertex shader, but that would raise the number of instructions. In that case it is better to use the approach discussed throughout this post, as it will make things run much quicker.
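For reference, with the row-vector convention used in the shaders above and remapping only $x$ and $y$ (as we did here), such a remapping matrix is just a standard scale-and-bias matrix (written from memory, not copied from the linked tutorial):
\[
\begin{bmatrix}
0.5 & 0 & 0 & 0 \cr
0 & 0.5 & 0 & 0 \cr
0 & 0 & 1 & 0 \cr
0.5 & 0.5 & 0 & 1
\end{bmatrix}
\]
Appended to the projector's world-view-projection chain it produces exactly the $0.5 x + 0.5 w$ remapping we performed in the vertex shader.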

Friday, 22 June 2012

Interactive Physics-based Grass Animation

Recently at Madman Theory Games I had quite an interesting task to do. We're working on a Diablo-like hack & slash, and the lead artist wanted grass (or actually some overgrown form of grass...) to deflect when, for instance, the player is wading through it or a fireball is sweeping over it. The general idea is rather trivial: compute deflected vertex positions based on some deflectors. My initial idea was to treat an entity (like the player) as a spherical deflector and displace vertex positions based on the distance from the deflector. So when a grass blade is right next to the player, it is displaced by some fixed maximum amount; at the border of the sphere it is not displaced at all. This worked just fine, but made the grass blades very stiff, with no spring-like motion at all when the deflector stops affecting a blade. Put simply, it was unnatural.

To make the blades swing a little like springs when a deflector stops affecting them, I thought about adding some fake sinusoidal movement. That would, however, require tracking how long a blade has not been affected by any deflector and scaling the sinusoidal deflection based on that time. I didn't try this, but I'm pretty sure this approach would cause problems, both with stability and with realism.

I eventually came to an inevitable conclusion - I needed a simple but physics-based model to control the blades. I figured that if I implemented simple blade movement based on external forces, it would be much easier and more predictable to control. So the first phase was to create a test square in a 2D world and try moving it by attaching external forces to it. This turned out to be a piece of cake.

The test square has a position $\mathbf{p}$ and velocity $\mathbf{v}$. We know that to move the square we should do this each frame:
\begin{eqnarray}
\mathbf{p} = \mathbf{p} + \mathbf{v} \Delta t
\end{eqnarray}
To model external forces affecting the square we need to find a way to modify $\mathbf{v}$. We know that the net force affecting the square has a value $\mathbf{F}$. We also know some of the most important formulas in the world:
\begin{eqnarray}
\mathbf{F} &=& m \mathbf{a} \cr
\mathbf{a} &=& \frac{\mathbf{F}}{m} \cr
\mathbf{a} &=& \frac{\Delta \mathbf{v}}{\Delta t} \cr
\Delta \mathbf{v} &=& \mathbf{a} \Delta t = \frac{\mathbf{F}}{m} \Delta t
\end{eqnarray}
So, we have a formula for $\Delta \mathbf{v}$. This is a value by which we should increase $\mathbf{v}$ each frame.

After all this math, let's see some code:
// do once

  vec3 position = vec3(0.0f, 0.0f, 0.0f);
  vec3 velocity = vec3(0.0f, 0.0f, 0.0f);

// do each frame 

  vec3 force = vec3(0.0f, 0.0f, 0.0f);
  float mass = 1.0f; 

  // modify force here

  vec3 acceleration = force / mass;
  vec3 velocity_delta = acceleration * deltaTime;
  velocity += velocity_delta;
  position += velocity * deltaTime;
This is all we need to have a simple force-based physical simulation.

Given this simple system we enter the second phase: we need to add proper forces to model believable grass motion. First of all we need a force that will deflect the blades. To do this I simply check if a blade is inside a spherical deflector and apply a force vector $bladePosition - deflectorPosition$, normalized and scaled by some strength parameter. I call this the stimulate force, as it stimulates the movement of the blades.

The stimulate force works very nicely, but it's not sufficient. You probably recall the first law of motion, which says that if no force acts on a particle, it stands still or moves with uniform speed. In our case the stimulate force causes movement, and once we turn this force off, the blades would keep deflecting forever because they are already in motion and no force acts on them! So we definitely need something that could stop them - some sort of friction. My long-term friend and colleague at work chomzee, who is much more skilled in physics than I am, suggested a simple solution - simply add a force that acts in the direction opposite to the velocity of a blade. It looks like this:
 force -= friction * velocity;
Note that we use the velocity of a blade/particle from the previous frame to modify the force. This force is then used to compute the new velocity in the current frame.

Okay, so far we have two forces: stimulate and friction. Stimulate will make the grass blades deflect, thus inducing some movement, and friction will eventually bring the blades to a halt. But there is one more piece to this puzzle. The grass blades can't stay deflected once a deflector stops acting on them - they need to get back to their initial positions. Here's where the spring model comes in handy. We can treat our grass blades as springs and use the very simple spring force formula:
\begin{eqnarray}
\mathbf{F} = -k \mathbf{x}
\end{eqnarray}
In this formula $k$ is the resiliency of the spring, and $\mathbf{x}$ is a displacement vector from the initial position. The code that simulates this is:
 force -= resiliency * (position - position_initial);
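Putting all three forces together, a single simulation step for one blade could look roughly like this (a sketch under the same assumptions as the code above; stimulateStrength, friction, resiliency and the deflector parameters are tuning values, and length/normalize are assumed vec3 helpers):
  // gather the forces acting on the blade this frame
  vec3 force = vec3(0.0f, 0.0f, 0.0f);

  // stimulate force: push the blade away from a spherical deflector
  vec3 fromDeflector = position - deflectorPosition;
  if (length(fromDeflector) < deflectorRadius)
    force += normalize(fromDeflector) * stimulateStrength;

  // friction force: oppose the current velocity
  force -= friction * velocity;

  // spring force: pull the blade back towards its initial position
  force -= resiliency * (position - position_initial);

  // Euler integration, exactly as before
  vec3 acceleration = force / mass;
  velocity += acceleration * deltaTime;
  position += velocity * deltaTime;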
Honestly, that is pretty much it! Using a couple of lines of code and some basic understanding of Newton's laws of motion I was able to simulate very, very realistic interactive grass animation. The game we are working on should be finished around the end of this year, so you will have the opportunity to see this system in a commercial product pretty soon :).

There is one last thing I would like to mention. At the beginning of this post I showed the formula we use to accumulate (integrate - so-called Euler integration) a particle's position based on its velocity:
\begin{eqnarray}
\mathbf{p} = \mathbf{p} + \mathbf{v} \Delta t
\end{eqnarray}
This equation is actually just an approximation to the real position of the particle after a simulation step. To be more precise, it is the first-order approximation. We can modify this formula to get a slightly more accurate solution (but still an approximation!):
\begin{eqnarray}
\mathbf{p} = \mathbf{p} + \mathbf{v} \Delta t + \mathbf{a} \frac{{\Delta t}^2}{2}
\end{eqnarray}
Does it look familiar? This is the second-order approximation. I would love to say something more about it, but I'm still a newbie when it comes to physics, so you will have to wait a couple of years to hear a more solid and credible explanation from me :).
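In code, this second-order step differs from the earlier one only in the position update; note that the position must be updated with the velocity from before this step (same assumptions as the code above):
  vec3 acceleration = force / mass;
  position += velocity * deltaTime + 0.5f * acceleration * deltaTime * deltaTime;
  velocity += acceleration * deltaTime;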

Saturday, 28 April 2012

Managing Shaders Combinations

Recently, when working on the engine (BlossomEngine) for a game that I am co-creating (http://gct-game.net), I encountered a problem that ultimately every engine programmer must face - managing shader combinations.

It is often the case that in an engine we have some super-duper shader(s) with many switches. Such shaders are called uber-shaders. The idea of an uber-shader is that we do not write a separate shader (file) for every single surface, but rather have one big shader where we can turn features on and off, like the use of specular lighting on the surface, soft shadows, or anything else that can describe the material.

One way to accomplish this is to use static branching within the shader, driven by constant values supplied to it. This solution has some flaws, however. The first is that we generate more if-then instructions in the shader, which simply makes the shader longer. The other is that we waste shader constants on feeding the shader with the appropriate steering data.

The second way to manage an uber-shader is to generate completely separate shaders by using #ifdefs in the shader, thus avoiding the need to supply the shader with any steering constants. The downside is that we now generate a bunch of shaders which we have to manage somehow in the application. Note that the number of generated shaders grows exponentially with the number of #ifdefs. For example, consider a situation where our shader contains two #ifdefs, A and B. The possible combinations are:

No #ifdef
A
B
AB

Now, if we decide to add a third #ifdef C, we will have to double all shaders that we have so far, with new #ifdef C added to each mix:

No #ifdef
A
B
AB
C
AC
BC
ABC

So the general formula for the number of generated shaders is $2^n$, where $n$ is the number of #ifdefs.

In BlossomEngine the most complex shader I have is a forward-shaded mesh sunlight shader. I must say that at the beginning of development, when I had very few #ifdefs (normal mapping, specular lighting), I didn't bother with any sophisticated manager. So I simply declared each combination as a separate shader, like this:
CShader meshSunlightPixelShader;
CShader meshSunlightPixelShader_NormalMapping;
CShader meshSunlightPixelShader_Specular;
CShader meshSunlightPixelShader_NormalMapping_Specular;
Obviously, as a game engine evolves, you develop more and more complex shaders. This way I ended up with 16 different combinations, which was really troublesome to manage in the application code. I decided to write an automatic system that would generate an array of shaders from given parameters. So I did that. My new class CShadersPack is able to generate an array of CShader objects from a single shader file that contains #ifdefs. A call to the init function of that class can look like this:
meshSunLightPixelShadersPack.init("./BlossomEngine/shaders/mesh_sun_light.ps",
 stPixelShader,
 "-DSPECULAR -DCUBE_MAP_REFLECTIONS -DSOFT_SHADOWS "
 "-DDIFFUSE_SHADING_NORMAL -DDIFFUSE_SHADING_NORMAL_MAPPED "
 "-DAMBIENT_SHADING_NORMAL_MAPPED -DRIM_LIGHTING");
The list of all #ifdefs is passed as one string, with the #ifdefs separated by spaces. The function splits that string and generates all combinations.
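A minimal sketch of how such combinations could be generated (my own illustration, not the actual CShadersPack code; needs <string>, <vector> and <sstream>):
std::vector<std::string> buildArgsCombinations(const std::string& args)
{
 // split the args string on spaces
 std::vector<std::string> defines;
 std::istringstream stream(args);
 std::string define;
 while (stream >> define)
  defines.push_back(define);

 // every bit of the index selects one define, giving 2^n combinations;
 // bit 0 adds 1, bit 1 adds 2, bit 2 adds 4 and so on, which matches
 // the shaderIndex scheme discussed below
 int n = (int)defines.size();
 std::vector<std::string> combinations(1 << n);
 for (int mask = 0; mask < (1 << n); mask++)
  for (int i = 0; i < n; i++)
   if (mask & (1 << i))
    combinations[mask] += defines[i] + " ";

 return combinations;
}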

Now the question is, once you have those shaders generated, how to get the one you need from the CShadersPack object. Say we have the A, B and C #ifdefs, and we want to get the AC shader. One way to do this would be to search through the shaders array in CShadersPack and find the one whose name matches the AC combination. I think I don't have to point out how wasteful in terms of CPU time that would be. There is a much better way. Let's again see how combinations are generated. First, we have #ifdef A. The combinations are:

No #ifdef
A

Now we add B:

No #ifdef
A
B
BA

Now C:

No #ifdef
A
B
BA
C
CA
CB
CBA

So as you can see, there is some regularity. For instance, if we start with shaderIndex = 0 and we want to pick a shader with feature A, we have to add $1$ to shaderIndex. If we also want feature B, we must add $2$; if we want feature C, we must add $4$, and so on... So to get the AC combination we get shaderIndex = 1 + 4 = 5. By looking at the list above we see that indeed, CA is 5th on the list (0-based indexing). Adding a couple of integers and indexing the shaders array will probably be much faster than looping through the array and comparing strings. In real code this looks like this:
int shaderIndex = 0;

if (useSpecular)
 shaderIndex += 1;
if (useReflectionsCubeMap)
 shaderIndex += 2;
if (useSoftShadows)
 shaderIndex += 4;
if (meshesInstances[i]->material.diffuseShadingType != CMeshMaterial::dstNone)
 shaderIndex += 8;
if (meshesInstances[i]->material.diffuseShadingType == CMeshMaterial::dstNormalMapped)
 shaderIndex += 16;
if (meshesInstances[i]->material.ambientShadingNormalMapped)
 shaderIndex += 32;
if (useRimLighting)
 shaderIndex += 64;

Okay, so at this point I had a nice system to manage shader combinations. Unfortunately, I quickly discovered an annoying problem - nested #ifdefs. In my mesh sunlight shader I have an #ifdef called DIFFUSE_SHADING_NORMAL. If it is turned on, the shader performs classical NdotL lighting based on the interpolated vertex normal. However, I also want the ability to perform the shading using a normal from a normal map. So inside DIFFUSE_SHADING_NORMAL I have DIFFUSE_SHADING_NORMAL_MAPPED, which takes the normal from the normal map into the computations. Does this cause a big problem? Yup, it does.

Let's now assume we have a shader with #ifdefs A, B and C, but C is located inside #ifdef B. Something like this:
#ifdef A
 A code
#endif

#ifdef B
 B code
 #ifdef C
  C code
 #endif
 B code
#endif
Combinations generated with my system would still be:

No #ifdef
A
B
BA
C
CA
CB
CBA

But hey, since C is dependent on B, the combination AC will generate exactly the same code as combination A. As we add more and more #ifdefs, we will get more and more duplicated shaders. This is wasteful in terms of compilation/assembling time, as well as GPU memory used to store those duplicated shaders.

A way around this problem would be to, put simply, generate the combinations in a more intelligent way. This raises some new problems, however, like computing a proper index into the shaders array in CShadersPack, or passing the #ifdefs list to CShadersPack::init - the dependencies would have to be marked in some way. This solution can certainly be implemented, but fortunately I figured out a way to handle the problem that is much easier to implement and manage.

First of all, the general idea of generating argument combinations remains the same, so we always have $2^n$ combinations, where $n$ is the number of arguments. However, I do not generate the shaders immediately. Instead, a particular shader combination is generated when a mesh with a particular material is about to be used. So let's say I want to use a shader that only has rim lighting (see the "real" code above). In that case I look up the arguments combination under index $64$. I see that it has not been used yet, so I generate that shader, store the pointer in the shaders array, and keep the index into that shaders array in the arguments combinations list. Here's the code for the CShadersPack class:
class CShadersPack
{
 struct ShaderCombination
 {
  std::string args;
  int indexToShadersList;
 };

private:
 std::string fileName;
 ShaderType shaderType;

 int combinationsNum; // 2 raised to number of args
 int shadersNum; // number of actually generated shaders

 ShaderCombination *combinationsList;
 std::vector< CShader* > shaders;

public:
 void init(const std::string &fileName, ShaderType shaderType, const std::string &args = "", const std::string &argsToAffix = "");
 void free();

 CShader* getShader(int index);
};
As you can see, there is a ShaderCombination structure. The list of that type (combinationsList) is generated a priori and has a size of $2^n$, where $n$ is the number of arguments. indexToShadersList is initialized with the value $-1$, indicating that this particular combination does not have a shader generated yet. Once the shader is generated, this value will index into the shaders vector.

Let's now see what CShadersPack::getShader looks like:
CShader* CShadersPack::getShader(int index)
{
 if (combinationsList[index].indexToShadersList == -1)
 {
  combinationsList[index].indexToShadersList = shaders.size();

  CShader* shader = new CShader();
  shader->init(fileName, shaderType, combinationsList[index].args);
  shaders.push_back(shader);
 }

 return shaders[combinationsList[index].indexToShadersList];
}
So, if indexToShadersList is still $-1$, it means we have to compile/assemble the shader and store it in the shaders list. Finally, a proper pointer to the shader is returned.

This solution is simple, robust, and easy to implement and manage. One flaw is that we cannot generate all the shaders (the actual shaders) when the game is starting, because we don't know yet what materials will be used, so we have to generate them during actual gameplay. This will, however, introduce some slight hitching when an object with a new material combination shows up. But hey, can't we just cache the already used argument combinations on disk and generate shaders for those arguments when the game starts? Yes, we can! And that is what my engine does.
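A possible sketch of such a cache (my own illustration - warmUpFromCache is a hypothetical helper, not part of the actual engine): append every newly used args combination to a text file and replay that file at startup:
// Hypothetical helper: compile all combinations listed in a cache file
// (one args string per line) before gameplay starts. Needs <fstream>.
void CShadersPack::warmUpFromCache(const std::string& cacheFileName)
{
 std::ifstream cache(cacheFileName.c_str());
 std::string line;
 while (std::getline(cache, line))
 {
  // a linear search is fine here, this runs once at startup
  for (int i = 0; i < combinationsNum; i++)
  {
   if (combinationsList[i].args == line)
   {
    getShader(i); // compiles and stores the shader if not generated yet
    break;
   }
  }
 }
}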

Thursday, 3 November 2011

Moving between Coordinate Frames

Perhaps the most common thing one does when working in the field of computer graphics is transformations, and one of them is moving objects from one frame of reference to another. A classical example is the camera transformation. We have our objects defined in local space, or world space, and then, to make the projection transformation simple, we move all objects to camera space. Camera space is a space where all objects are positioned as if the camera were located at the point $(0, 0, 0)$ and looking directly down the $-Z$ axis (right-handed system) or the $+Z$ axis (left-handed system). When I read about how the view transform matrix is constructed, I wasn't too surprised. The idea is rather simple. First, we want to translate the camera to the origin, and then we want to orient it so that it looks down the $Z$ axis. We can do so using two matrix transformations: the first one is a translation matrix (filled with the negated coordinates of the camera's position) and the second one is a rotation matrix (filled with the orientation vectors of the camera). The translation matrix was obviously straightforward to understand, but I could never fully grasp why the rotation matrix looked the way it did. I went through basic vector/matrix math in a couple of books but could find the answer nowhere, with one exception: 3D Game Engine Design (2nd edition) by David H. Eberly. He wrote down a simple formula from which everything immediately became clear. So, let's start.

Let's first define our general problem of moving between coordinate frames (in 2D, for simplicity). Given some point $P$ we want to find its position $P'$ in some alternate frame of reference which is located at point $O$ and is spanned by two orthogonal vectors $R$ and $U$. The crucial condition here is that $P$, $O$, $R$ and $U$ are all defined in the same space, which is usually the case (if not, we can bring them into the same space using the very approach we're describing right now). The magic formula (which is nothing more than a simple linear combination of two vectors) then is:
\begin{eqnarray}
P = O + (P'_x*R) + (P'_y*U)
\end{eqnarray}
You can check empirically that this works. Just sketch a coordinate frame starting at $(0, 0)$ with the two direction vectors $(1, 0)$ and $(0, 1)$. Then choose some point $O$ in this coordinate frame, as well as vectors $R$ and $U$. Finally, pick some point $P$ and see for yourself that the coordinates of $P$ in the starting coordinate frame correspond to the same point in the $(O, R, U)$ coordinate space. Here's a picture:


So let's say that $P = (P_x, P_y) = (5, 4)$, $O = (4.2, 2)$, $R = (\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2})$, $U = (-\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2})$. Then, the coordinates $(P'_x, P'_y)$ in the coordinate frame $(O, R, U)$ are, more or less, $P' = (2, 0.85)$. Let's now see how to find $P' = (P'_x, P'_y)$.

Okay, so we have one formula with two unknowns $P'_x$ and $P'_y$:
\begin{eqnarray}
P = O + (P'_x*R) + (P'_y*U)
\end{eqnarray}
Unsolvable? Not really - the equation above is actually not one equation but two:
\begin{eqnarray}
P_x &=& O_x + (P'_x*R_x) + (P'_y*U_x) \cr
P_y &=& O_y + (P'_x*R_y) + (P'_y*U_y)
\end{eqnarray}
We can now employ any technique to solve this system of equations for $P'_x$ and $P'_y$. However, we can use some assumptions that we have made to simplify the calculations. We assumed that $R$ and $U$ are orthogonal, which is usually the case, but doesn't have to be. If these vectors are not orthogonal then you need to solve the equations above as they are. But here, we take a shortcut, because we know that $R$ and $U$ are orthogonal (orthonormal, to be even more precise), so the following holds:
\begin{eqnarray}
R \cdot R &=& 1 \cr
R \cdot U &=& 0
\end{eqnarray}
Let's see what happens if we apply the dot product with $R$ on both sides:
\begin{eqnarray}
P &=& O + (P'_x*R) + (P'_y*U)\ \ \ / \cdot R \cr
P \cdot R &=& (O \cdot R) + (P'_x*R \cdot R) + (P'_y*U \cdot R) \cr
P \cdot R &=& (O \cdot R) + (P'_x*1) + (P'_y*0) \cr
P'_x &=& (P \cdot R) - (O \cdot R) \cr
P'_x &=& (P - O) \cdot R = (P_x - O_x) * R_x + (P_y - O_y) * R_y
\end{eqnarray}
Doing the same with $\cdot U$ applied to both sides will give:
\begin{eqnarray}
P'_y = (P - O) \cdot U = (P_x - O_x) * U_x + (P_y - O_y) * U_y
\end{eqnarray}
From these two equations we can now easily derive two matrices. The first one should translate the original point $P$ by $-O$. The second one is just filled with $R$ and $U$ vectors:
\[
\begin{bmatrix}
P'_x & P'_y & 1
\end{bmatrix}
=
\begin{bmatrix}
P_x & P_y & 1
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 \cr
0 & 1 & 0 \cr
-O_x & -O_y & 1
\end{bmatrix}
\begin{bmatrix}
R_x & U_x & 0 \cr
R_y & U_y & 0 \cr
0 & 0 & 1
\end{bmatrix}
\]
Note that we're using homogeneous coordinates here. Otherwise we wouldn't be able to "inject" the translation into the matrix.

That's pretty much it.

EDIT 1:
Today is the 7th of November. I actually posted this note four days ago but have been constantly modifying it since then... mostly the LaTeX part. It seems that I finally got it working :).

EDIT 2:
Today is the 8th of November. I made a big mistake in this note because I thought that it was not possible to solve for $P'_x$ and $P'_y$ if the vectors $R$ and $U$ were not orthogonal. Of course it is possible, and the note now explains how to do that.