Vulkan TAA: Implementation and Details

November 28, 2021 · Real-Time Rendering

After a week of work, TAA is finally integrated into the engine.

TAA On:

TAA's anti-aliasing quality is about as good as it gets.

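The theory behind TAA is already covered by plenty of GDC talks and slides, so this post focuses on the details and pitfalls of implementing TAA in Vulkan. The full source is in my repository; only excerpts are shown here for illustration: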

The main points are:

Writing and reading the velocity buffer.
Fixing TAA ghosting.
Fixing TAA blur.
Fixing TAA flickering.

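Writing and Reading the Velocity Buffer

The velocity buffer stores each object's screen-space displacement between two frames. The TAA jitter must not be included when writing velocity.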

// Jitter matrix for the current frame
mat4 curJitterMat = mat4(1.0f);
curJitterMat[3][0] += frameData.jitterData.x; // .xy holds the current frame's jitter offset
curJitterMat[3][1] += frameData.jitterData.y;

outCurPosNoJitter =  frameData.camViewProj * worldPos; 
gl_Position = curJitterMat * outCurPosNoJitter; 

outPrevPosNoJitter = frameData.camViewProjLast * vec4(worldPrevPos, 1);

vec2 curPosNDC = outCurPosNoJitter.xy  /  outCurPosNoJitter.w;
vec2 prePosNDC = outPrevPosNoJitter.xy / outPrevPosNoJitter.w;

outVelocity = (curPosNDC - prePosNDC) * 0.5f;
outVelocity.y *= -1.0f; // Vulkan: the GBuffer pass viewport flips Y

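My approach is to pass the jitter data into the vertex shader and, after the regular perspective transform, apply one more jitter transform to produce the final vertex position used for rasterization.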

outVelocity.y *= -1.0f is there because I flip the Y axis in the GBuffer pass viewport, so that Vulkan uses the same coordinate system as DX12.
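The post does not show how frameData.jitterData is produced on the CPU. Because the jitter matrix adds its offset to the clip-space position scaled by w, the offset ends up in NDC units after the perspective divide. A common way to generate such offsets is a Halton(2, 3) sequence; the sketch below assumes that choice. The helper name taaJitterNdc and the 8-sample cycle are hypothetical, not taken from the repository.

```cpp
#include <cstdint>
#include <utility>

// Radical inverse of `index` in the given base (one Halton sequence element), in [0, 1).
inline float halton(uint32_t index, uint32_t base)
{
    float result = 0.0f;
    float fraction = 1.0f / static_cast<float>(base);
    while (index > 0)
    {
        result += static_cast<float>(index % base) * fraction;
        index /= base;
        fraction /= static_cast<float>(base);
    }
    return result;
}

// Hypothetical helper: jitter offset in NDC units for a given frame.
// A sub-pixel offset in [-0.5, 0.5) pixels is scaled by 2 / resolution,
// because NDC spans 2 units across the viewport.
inline std::pair<float, float> taaJitterNdc(uint32_t frameIndex, uint32_t width, uint32_t height)
{
    const uint32_t sampleIndex = (frameIndex % 8u) + 1u; // assumed 8-sample cycle; index 0 is skipped
    const float px = halton(sampleIndex, 2u) - 0.5f;
    const float py = halton(sampleIndex, 3u) - 0.5f;
    return { 2.0f * px / static_cast<float>(width),
             2.0f * py / static_cast<float>(height) };
}
```

The two returned values would be uploaded as jitterData.x and jitterData.y each frame. Keeping the offset within half a pixel means every sample stays inside the pixel it belongs to.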

When reading the velocity buffer, search a 3x3 neighborhood for the texel closest to the camera and output its velocity:

vec2 getClosestVelocity(in vec2 uv, in vec2 texelSize, out bool isSkyPixel)
{
    vec2 velocity = vec2(0.0f);
    float closestDepth = 0.0f;
    for (int y = -1; y <= 1; ++y)
    {
        for (int x = -1; x <= 1; ++x)
        {
            const vec2 st = uv + vec2(x, y) * texelSize;
            const float depth = texture(inDepth, st).x;
            if (depth >= closestDepth) // reverse-Z: a larger depth is closer to the camera
            {
                velocity = texture(inVelocity, st).xy;
                closestDepth = depth;
            }
        }
    }
    isSkyPixel = (closestDepth <= 0.0f); // flags sky/background pixels; used later when sizing the clamp AABB
    return velocity;
}

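Fixing TAA Ghosting

When the camera moves, stale history pixels also get blended into the image, which causes severe trailing.

The fix is to gather the data around the current pixel and evaluate, by some rule, how far the history deviates from it. If the color or luminance difference is too large, the history pixel is judged to be garbage and discarded (ClampHistory).

The TAA in Uncharted 4 uses groupshared memory: it first applies a blur filter to the whole image and then runs ClampHistory. This greatly improves the accuracy of the clamp (and also helps reduce TAA flickering):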

#define GROUP_SIZE  8
#define RADIUS      1
#define TILE_DIM    (2 * RADIUS + GROUP_SIZE)

shared vec3 Tile[TILE_DIM * TILE_DIM];
layout (local_size_x = GROUP_SIZE, local_size_y = GROUP_SIZE, local_size_z = 1) in;

void main()
{
    //...

    // Gather Current Pixel (Blur First)
    if (gl_LocalInvocationIndex < TILE_DIM * TILE_DIM / 4)
    {
        // ...
        const vec2 uv1 = (coord1 + 0.5f) * texelSize;
        const vec2 uv2 = (coord2 + 0.5f) * texelSize;
        const vec2 uv3 = (coord3 + 0.5f) * texelSize;
        const vec2 uv4 = (coord4 + 0.5f) * texelSize;

        const vec3 color0 = texture(inHdrColor, uv1).xyz;
        const vec3 color1 = texture(inHdrColor, uv2).xyz;
        const vec3 color2 = texture(inHdrColor, uv3).xyz;
        const vec3 color3 = texture(inHdrColor, uv4).xyz;

        Tile[gl_LocalInvocationIndex]                               = color0;
        Tile[gl_LocalInvocationIndex + TILE_DIM * TILE_DIM / 4]     = color1;
        Tile[gl_LocalInvocationIndex + TILE_DIM * TILE_DIM / 2]     = color2;
        Tile[gl_LocalInvocationIndex + TILE_DIM * TILE_DIM * 3 / 4] = color3;
    }
    
    groupMemoryBarrier();
    barrier();
    
    //...

    // Start Evaluate History.
    float wsum = 0.0f;
    vec3 vsum = vec3(0.0f, 0.0f, 0.0f);
    vec3 vsum2 = vec3(0.0f, 0.0f, 0.0f);
    for (float y = -RADIUS; y <= RADIUS; ++y)
    {
        for (float x = -RADIUS; x <= RADIUS; ++x)
        {
            const vec3 neigh = Tap(tilePos + vec2(x, y), texelSize);
            const float w = exp(-3.0f * (x * x + y * y) / ((RADIUS + 1.0f) * (RADIUS + 1.0f)));
            vsum2 += neigh * neigh * w;
            vsum += neigh * w;
            wsum += w;
        }
    }
        
    // NVidia variance clip.
    const vec3 ex = vsum / wsum;
    const vec3 ex2 = vsum2 / wsum;
    const vec3 dev = sqrt(max(ex2 - ex * ex, 0.0f));
    bool isSkyPixel;
    const vec2 velocity = getClosestVelocity(uv, texelSize, isSkyPixel);
    const float boxSize = mix(0.5f, 2.5f, isSkyPixel ? 0.0f : smoothstep(0.02f, 0.0f, length(velocity)));
    const vec3 nmin = ex - dev * boxSize;
    const vec3 nmax = ex + dev * boxSize;
    const vec3 history = sampleHistory(uv - velocity, texelSize);
    const vec3 clampedHistory = clamp(history, nmin, nmax);
}

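When clamping the history, NVIDIA's variance clipping algorithm improves accuracy: history pixels that pass the test end up close to the current frame's pixels. (UE4 also recommends doing this in YCoCg space, but in my tests the result looked no different from doing it directly in RGB.)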
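For readers who want to try the YCoCg variant, here is a minimal CPU-side sketch of the color conversion together with the per-channel variance clamp the shader above performs. The function names and the Vec3 alias are illustrative only. Note that, like the shader, it clamps each channel independently rather than clipping toward the box center.

```cpp
#include <algorithm>
#include <array>
#include <cmath>

using Vec3 = std::array<float, 3>;

// RGB -> YCoCg (Y = luma, Co/Cg = chroma offsets).
inline Vec3 rgbToYCoCg(const Vec3& c)
{
    return Vec3{{  0.25f * c[0] + 0.5f * c[1] + 0.25f * c[2],
                   0.5f  * c[0]               - 0.5f  * c[2],
                  -0.25f * c[0] + 0.5f * c[1] - 0.25f * c[2] }};
}

// YCoCg -> RGB, the exact inverse of rgbToYCoCg.
inline Vec3 yCoCgToRgb(const Vec3& c)
{
    return Vec3{{ c[0] + c[1] - c[2],
                  c[0]        + c[2],
                  c[0] - c[1] - c[2] }};
}

// Clamp `history` to mean +/- boxSize * stddev per channel.
// m1 is the weighted mean of the neighborhood, m2 the weighted mean of its squares.
inline Vec3 varianceClamp(const Vec3& history, const Vec3& m1, const Vec3& m2, float boxSize)
{
    Vec3 out{};
    for (int i = 0; i < 3; ++i)
    {
        const float dev = std::sqrt(std::max(m2[i] - m1[i] * m1[i], 0.0f));
        const float lo = m1[i] - dev * boxSize;
        const float hi = m1[i] + dev * boxSize;
        out[i] = std::min(std::max(history[i], lo), hi);
    }
    return out;
}
```

To work in YCoCg, the neighborhood samples and the history would both be converted before accumulating the moments, and the clamped result converted back afterwards.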

TAA Flickering

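I split this into flickering when static and flickering in motion.

First of all, make sure the history buffer's sampler is set to bilinear.

Static flickering is extremely hard to solve.

In theory, ClampHistory will always reject the differing colors that sub-pixel jitter produces. But we actually need those colors. Once they are rejected, TAA can never accumulate them, and the result flickers.

Lowering the blend weight significantly reduces flicker, but don't overdo it: lower it too much and the image turns blurry.

My approach: once the camera stops moving, count the frames since it stopped and derive a cameraStopFactor from that count. The shader then detects the static case, and for static pixels uses cameraStopFactor to compute a transition value that shrinks the effect of the jitter (in the code below, by lowering the blend factor).

This way, after the camera goes from moving to stopped, the flicker fades gradually; on screen, the noise slowly dies down.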

const bool bStatic = lenVelocity < adoptVelocityDiff && !isSkyPixel && depthDiffH < adoptDepthDiff;
if(bStatic)
{
    float cameraStop = pushConstants.cameraStopFactor;
    cameraStop = smoothstep(0.0f,1.0f,cameraStop);
    float MinBlendFactor = max(1.0f / 16.0f * 0.15f, blendFactor * 0.15f);
    blendFactor = mix(blendFactor, MinBlendFactor, cameraStop);
}
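The shader only consumes pushConstants.cameraStopFactor; how the value is computed on the CPU is not shown. One minimal way to do it, assuming a linear ramp over a fixed number of frames, is sketched below. The struct name, the rampFrames value, and the way camera motion is detected (for example by comparing this frame's view-projection matrix with the last one) are all assumptions.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical CPU-side tracker that turns "frames since the camera stopped"
// into a factor in [0, 1] for the shader.
struct CameraStopTracker
{
    uint32_t framesSinceStop = 0;
    uint32_t rampFrames = 32; // assumed tuning value: frames needed to reach a factor of 1

    // Call once per frame. `cameraMoved` is true if the camera changed this frame.
    // Returns the value to upload as cameraStopFactor.
    float update(bool cameraMoved)
    {
        framesSinceStop = cameraMoved ? 0u : std::min(framesSinceStop + 1u, rampFrames);
        return static_cast<float>(framesSinceStop) / static_cast<float>(rampFrames);
    }
};
```

Since the shader applies smoothstep to the factor, a plain linear ramp on the CPU is enough; any movement resets it to zero so the regular blend factor takes over immediately.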

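Flickering in motion.

TAA usually runs before the tonemapper pass, so it works on HDR values. For distant objects (especially ones smaller than a pixel), TAA jitter can easily make the current frame's luminance differ so much from the history that ClampHistory rejects the history. The luminance then swings wildly from frame to frame, which shows up as flickering.

To combat flickering, UE 4.26.2 applies some blur (a 3x3 kernel) when sampling both depth and SceneColor, so that the samples fed to ClampHistory are as smooth as possible. This alleviates TAA flickering to some extent.

A better approach is to tonemap SceneColor before ClampHistory, mapping the HDR range to LDR (0 to 1.0). This narrows the color and luminance differences further and makes the ClampHistory result more accurate. After the TAA resolve, apply the inverse tonemap to get back to HDR.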
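The post does not say which tonemap operator is used for this round trip. A simple invertible choice is to divide by one plus the largest channel; the sketch below assumes that operator and non-negative input colors. The function names are illustrative.

```cpp
#include <algorithm>
#include <array>

using Vec3 = std::array<float, 3>;

inline float maxComponent(const Vec3& c)
{
    return std::max(c[0], std::max(c[1], c[2]));
}

// HDR -> [0, 1): c / (1 + max(c)). Assumes non-negative input.
inline Vec3 tonemapReversible(const Vec3& c)
{
    const float s = 1.0f / (1.0f + maxComponent(c));
    return Vec3{{ c[0] * s, c[1] * s, c[2] * s }};
}

// Inverse of tonemapReversible: c / (1 - max(c)).
// The denominator is guarded because it approaches zero as max(c) approaches 1.
inline Vec3 tonemapInvert(const Vec3& c)
{
    const float s = 1.0f / std::max(1.0f - maxComponent(c), 1e-6f);
    return Vec3{{ c[0] * s, c[1] * s, c[2] * s }};
}
```

In a TAA resolve, the neighborhood samples and the history would be passed through tonemapReversible before clamping and blending, and tonemapInvert would be applied to the blended result.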

But whether you tonemap or blur, the image ends up blurrier.

TAA Blur

I split this into blur when static and blur in motion:

Static blur is easy to fix: sample the history buffer with a better filter (Catmull-Rom).

vec3 sampleHistoryCatmullRom(in vec2 uv, in vec2 texelSize)
{
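    // Sample position in texel units, and the center of the texel at or just before it.
    vec2 samplePos = uv / texelSize;
    vec2 texPos1 = floor(samplePos - 0.5f) + 0.5f;
    // Fractional offset from that texel center; it drives the Catmull-Rom weights.
    vec2 f = samplePos - texPos1;
    vec2 w0 = f * (-0.5f + f * (1.0f - 0.5f * f));
    vec2 w1 = 1.0f + f * f * (-2.5f + 1.5f * f);
    vec2 w2 = f * (0.5f + f * (2.0f - 1.5f * f));
    vec2 w3 = f * f * (-0.5f + 0.5f * f);
    // Fold the two middle taps into one bilinear fetch: weight w12, offset12 texels past texPos1.
    vec2 w12 = w1 + w2;
    vec2 offset12 = w2 / (w1 + w2);
    vec2 texPos0 = texPos1 - 1.0f;
    vec2 texPos3 = texPos1 + 2.0f;
    vec2 texPos12 = texPos1 + offset12;
    // Convert the tap positions back to UV space.
    texPos0 *= texelSize;
    texPos3 *= texelSize;
    texPos12 *= texelSize;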
    vec3 result = vec3(0.0f, 0.0f, 0.0f);
    result += texture(inHistory, vec2(texPos0.x, texPos0.y)).xyz * w0.x * w0.y;
    result += texture(inHistory, vec2(texPos12.x, texPos0.y)).xyz * w12.x * w0.y;
    result += texture(inHistory, vec2(texPos3.x, texPos0.y)).xyz * w3.x * w0.y;
    result += texture(inHistory, vec2(texPos0.x, texPos12.y)).xyz * w0.x * w12.y;
    result += texture(inHistory, vec2(texPos12.x, texPos12.y)).xyz * w12.x * w12.y;
    result += texture(inHistory, vec2(texPos3.x, texPos12.y)).xyz * w3.x * w12.y;
    result += texture(inHistory, vec2(texPos0.x, texPos3.y)).xyz * w0.x * w3.y;
    result += texture(inHistory, vec2(texPos12.x, texPos3.y)).xyz * w12.x * w3.y;
    result += texture(inHistory, vec2(texPos3.x, texPos3.y)).xyz * w3.x * w3.y;
    return max(result, 0.0f);
}
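The function above gets a 4x4 Catmull-Rom footprint out of 9 fetches by merging the two middle taps on each axis into a single bilinear fetch. A bilinear fetch at offset t between texels c1 and c2 returns (1 - t) * c1 + t * c2, so fetching at t = w2 / (w1 + w2) and scaling by (w1 + w2) yields w1 * c1 + w2 * c2. This is also why the history sampler has to be bilinear. The 1D sketch below reproduces the weights on the CPU so they can be checked numerically; the names are illustrative.

```cpp
#include <array>

// Catmull-Rom weights for the four taps around a sample,
// where f in [0, 1) is the fractional offset from the center of tap 1.
inline std::array<float, 4> catmullRomWeights(float f)
{
    const float w0 = f * (-0.5f + f * (1.0f - 0.5f * f));
    const float w1 = 1.0f + f * f * (-2.5f + 1.5f * f);
    const float w2 = f * (0.5f + f * (2.0f - 1.5f * f));
    const float w3 = f * f * (-0.5f + 0.5f * f);
    return {{ w0, w1, w2, w3 }};
}

// The merged middle tap: one bilinear fetch replaces taps 1 and 2.
struct MergedTap
{
    float weight; // w1 + w2
    float offset; // distance in texels from the center of tap 1
};

inline MergedTap mergeMiddleTaps(const std::array<float, 4>& w)
{
    return MergedTap{ w[1] + w[2], w[2] / (w[1] + w[2]) };
}
```

The four weights always sum to 1, and the outer two are negative or zero, which is what sharpens the result compared with plain bilinear filtering. The negative lobes are also why the shader clamps the final result to zero.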

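Blur in motion happens because objects at different depths have different velocities when the camera moves. Objects farther from the camera have smaller velocities, so ClampHistory almost always passes, and the ring of pixels around the pixel all gets accumulated, which amounts to a forced blur.

See the figure below for the difference with and without the blendFactor computation:

The fix is to factor depth and velocity into the mix factor: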

    // note: use velocity * linear depth as blend factor to solve blur problem when moving.
    float blendFactor = 1.0f;
    {   
        const float threshold   = 0.5f;
        const float base        = 0.5f;
        const float gather      = 0.1666;

        float depth             = getLinearDepth(uv);
        float texel_vel_mag     = length(velocity * vec2(dims.xy)) * depth;
        float subpixel_motion   = clamp(threshold / (texel_vel_mag + FLT_MIN), 0.0f, 1.0f);
        blendFactor *= texel_vel_mag * base + subpixel_motion * gather;

        vec3 color_history = clampedHistory;
        vec3 color_current = center;
        float luminance_history     = luminance(color_history);
        float luminance_current     = luminance(color_current);
        float unbiased_difference   = abs(luminance_current - luminance_history) / ((max(luminance_current, luminance_history) + 0.3));
        blendFactor *= 1.0 - unbiased_difference;

        // Clamp
        blendFactor = clamp(blendFactor, g_taa_blend_min, g_taa_blend_max);
    }
    // vec3 result = mix(clampedHistory, color, 1.0f / 16.0f);
    vec3 result = mix(clampedHistory, color, blendFactor);
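
To get a feel for how this blend factor behaves, the same arithmetic can be evaluated on the CPU. The sketch below copies the constants from the shader and takes the depth-scaled texel velocity and the two luminance values as inputs. The values of g_taa_blend_min and g_taa_blend_max are not given in the post, so they are passed in as parameters, and the numbers used in the note that follows are assumptions.

```cpp
#include <algorithm>
#include <cfloat>
#include <cmath>

inline float clampf(float v, float lo, float hi)
{
    return std::min(std::max(v, lo), hi);
}

// texelVelMag: velocity length in texels multiplied by linear depth, as in the shader.
// blendMin / blendMax stand in for g_taa_blend_min / g_taa_blend_max (values assumed by the caller).
inline float taaBlendFactor(float texelVelMag, float lumCurrent, float lumHistory,
                            float blendMin, float blendMax)
{
    const float threshold = 0.5f;
    const float base      = 0.5f;
    const float gather    = 0.1666f;

    // Close to 1 for sub-pixel motion, falling off as the motion grows.
    const float subpixelMotion = clampf(threshold / (texelVelMag + FLT_MIN), 0.0f, 1.0f);
    float blend = texelVelMag * base + subpixelMotion * gather;

    // Trust the current frame less when its luminance disagrees with the history.
    const float diff = std::fabs(lumCurrent - lumHistory)
                     / (std::max(lumCurrent, lumHistory) + 0.3f);
    blend *= 1.0f - diff;

    return clampf(blend, blendMin, blendMax);
}
```

With an assumed range of 0.05 to 1.0, a static pixel with matching luminance gets about 0.1666, which keeps most of the history. A pixel moving 4 depth-scaled texels saturates at 1.0, so the current frame replaces the history and the forced blur goes away.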