Implementing Sample Distribution Shadow Maps in Vulkan

September 21, 2021 · Real-Time Rendering

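Sample Distribution Shadow Maps (SDSM) work on almost exactly the same principle as Cascaded Shadow Maps. The difference is that SDSM partitions the cascade depth ranges dynamically, and that advantage gives it far better results than conventional Cascaded Shadow Maps.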

As the red box below shows, with no PCSS and a 4x4 PCF shadow filter, the conventional cascade shadow map lacks resolution and the shadows come out blurry.

Below is the same scene with SDSM enabled: every shadow is sharp and accurate.

To implement SDSM in an engine, first adjust the render pipeline so that the GBuffer is rendered before the ShadowDepth pass.

Once the GBuffer is rendered, evaluate the minimum and maximum depth values of the depth texture on the GPU.

No mipmap-style copy chain is needed: with shared memory, the whole evaluation finishes in a single compute shader dispatch.

Comparing min and max values within shared memory is a classic reduction.

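Comparing across work groups, on the other hand, is a concurrent operation and needs atomics. The atomicMin and atomicMax functions in core GLSL only accept int and uint operands, so the depth has to be re-encoded as a uint first and decoded again afterwards.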
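The two steps described above, a tree reduction inside one work group followed by an atomic merge across groups, can be mimicked on the CPU. The following C++ is only an illustrative sketch under stated assumptions, not the shader itself: `reduceTile`, `encodeDepth`, `decodeDepth`, `atomicMinU32` and `atomicMaxU32` are hypothetical helper names, and the scale factor of 4,000,000,000 is borrowed from the shader's constants. The encoding works because multiplying by a positive constant preserves ordering, so the min and max of the encoded values are the encoded min and max.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Same scale as the shader: depth in [0, 1] maps to [0, 4e9], which fits in a uint32.
constexpr float kDepthScale = 4000000000.0f;

struct MinMax { float min; float max; };

// Tree reduction over one tile, mirroring the shared-memory steps:
// at each step, slot i absorbs slot i + stride, then the stride halves.
// The tile size must be a power of two (256 for a 16x16 work group).
MinMax reduceTile(std::vector<MinMax> tile) {
    assert(!tile.empty() && (tile.size() & (tile.size() - 1)) == 0);
    for (std::size_t stride = tile.size() / 2; stride > 0; stride /= 2) {
        for (std::size_t i = 0; i < stride; ++i) {  // each i plays one GPU thread
            tile[i].min = std::min(tile[i].min, tile[i + stride].min);
            tile[i].max = std::max(tile[i].max, tile[i + stride].max);
        }
        // On the GPU, a barrier() separates consecutive steps.
    }
    return tile[0];
}

// Order-preserving float -> uint encoding and its inverse.
std::uint32_t encodeDepth(float depth) {
    assert(depth >= 0.0f && depth <= 1.0f);
    return static_cast<std::uint32_t>(depth * kDepthScale);
}

float decodeDepth(std::uint32_t encoded) {
    return static_cast<float>(encoded) / kDepthScale;
}

// CPU stand-ins for GLSL atomicMin / atomicMax on a uint (compare-and-swap loops).
void atomicMinU32(std::atomic<std::uint32_t>& target, std::uint32_t value) {
    std::uint32_t current = target.load();
    while (value < current && !target.compare_exchange_weak(current, value)) {}
}

void atomicMaxU32(std::atomic<std::uint32_t>& target, std::uint32_t value) {
    std::uint32_t current = target.load();
    while (value > current && !target.compare_exchange_weak(current, value)) {}
}
```

Each simulated work group would call `reduceTile` on its own tile and then merge the encoded result into two shared `std::atomic<std::uint32_t>` values, initialised to a large sentinel for the minimum and to 0 for the maximum.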

The complete shader:

#version 460

#define WORK_TILE_SIZE 16

// 24 bit
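// 0xFFFFFFFF --> 4,294,967,295
// Depth in [0, 1] is scaled by 4,000,000,000; the extra 10u of headroom still fits in 32 bits.
const uint SCALE_UINT_7u = 4000000000u; // the u suffix is required: the value overflows a signed int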
const float SCALE_UINT_7 = 4000000000.0f;
layout (local_size_x = WORK_TILE_SIZE,local_size_y = WORK_TILE_SIZE,local_size_z = 1) in;

layout(set = 0, binding = 0) uniform sampler2D inDepthImage;

struct DepthRange
{
    uint minDepth;
    uint maxDepth;
};

layout(set = 1, binding = 0) buffer DepthRangeBuffer // not writeonly: atomics read and write
{
	DepthRange range;
};

struct PushConstantData
{
    vec2 imageSize;
};

layout(push_constant) uniform block
{
	PushConstantData pushConstant;
};

shared vec2 DepthContainer[WORK_TILE_SIZE * WORK_TILE_SIZE]; // .x is min, .y is max

// TODO: use subgroup.
void main()
{
	uvec2 pos = gl_GlobalInvocationID.xy;

    uint idx = gl_GlobalInvocationID.x + gl_GlobalInvocationID.y * gl_NumWorkGroups.x * gl_WorkGroupSize.x;

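	// NOTE: barrier() only synchronizes invocations inside one work group, so other
	// groups may run their atomicMin/atomicMax before this reset lands. Resetting the
	// buffer before the dispatch (for example in a separate pass) avoids that race.
	if (idx == 0)
	{
		atomicExchange(range.minDepth, SCALE_UINT_7u + 10u);
        atomicExchange(range.maxDepth, 0u);
	}
    barrier();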

    const uint linearIndex = gl_LocalInvocationID.y * WORK_TILE_SIZE + gl_LocalInvocationID.x;

    // we will set depth image with a point clamp edge sampler.
    // So don't care about border case here.
    vec2 sampleUV = (vec2(pos) + vec2(0.5)) / pushConstant.imageSize; 

	float depth = texture(inDepthImage, sampleUV).x;

    if(depth < 0.000001f || depth > 0.999999f)
    {
        DepthContainer[linearIndex] = vec2(SCALE_UINT_7u + 10u,0);
    }
    else
    {
       DepthContainer[linearIndex] = vec2(depth,depth);
    }

    groupMemoryBarrier();
    barrier();

    if (linearIndex < 128) // 16 * 16 = 256 elements to merge
    {
        DepthContainer[linearIndex].x = min(DepthContainer[linearIndex].x, DepthContainer[linearIndex + 128].x);
        DepthContainer[linearIndex].y = max(DepthContainer[linearIndex].y, DepthContainer[linearIndex + 128].y);
    }

    groupMemoryBarrier();
    barrier();

    if (linearIndex < 64) 
    {
        DepthContainer[linearIndex].x = min(DepthContainer[linearIndex].x, DepthContainer[linearIndex + 64].x);
        DepthContainer[linearIndex].y = max(DepthContainer[linearIndex].y, DepthContainer[linearIndex + 64].y);
    }

    groupMemoryBarrier();
    barrier();

    if (linearIndex < 32) 
    {
        DepthContainer[linearIndex].x = min(DepthContainer[linearIndex].x, DepthContainer[linearIndex + 32].x);
        DepthContainer[linearIndex].y = max(DepthContainer[linearIndex].y, DepthContainer[linearIndex + 32].y);
    }

    groupMemoryBarrier();
    barrier();

    if (linearIndex < 16) 
    {
        DepthContainer[linearIndex].x = min(DepthContainer[linearIndex].x, DepthContainer[linearIndex + 16].x);
        DepthContainer[linearIndex].y = max(DepthContainer[linearIndex].y, DepthContainer[linearIndex + 16].y);
    }

    groupMemoryBarrier();
    barrier();

    if (linearIndex < 8)
    {
        DepthContainer[linearIndex].x = min(DepthContainer[linearIndex].x, DepthContainer[linearIndex + 8].x);
        DepthContainer[linearIndex].y = max(DepthContainer[linearIndex].y, DepthContainer[linearIndex + 8].y);
    }

    groupMemoryBarrier();
    barrier();

    if (linearIndex < 4)
    {
        DepthContainer[linearIndex].x = min(DepthContainer[linearIndex].x, DepthContainer[linearIndex + 4].x);
        DepthContainer[linearIndex].y = max(DepthContainer[linearIndex].y, DepthContainer[linearIndex + 4].y);
    }

    groupMemoryBarrier();
    barrier();

    if (linearIndex < 2)
    {
        DepthContainer[linearIndex].x = min(DepthContainer[linearIndex].x, DepthContainer[linearIndex + 2].x);
        DepthContainer[linearIndex].y = max(DepthContainer[linearIndex].y, DepthContainer[linearIndex + 2].y);
    }

    groupMemoryBarrier();
    barrier();

    if (linearIndex < 1) 
    {
        float tileMin = min(DepthContainer[0].x, DepthContainer[1].x);
        float tileMax = max(DepthContainer[0].y, DepthContainer[1].y);

        // A tile with no valid depth still holds the sentinel (min > max). Skip it:
        // scaling the sentinel would overflow the float-to-uint conversion.
        if (tileMin <= tileMax)
        {
            float minDepthUintNear = max(0.0f, tileMin * SCALE_UINT_7);
            float maxDepthUintNear = min(SCALE_UINT_7, tileMax * SCALE_UINT_7 + 10.0f);

            atomicMin(range.minDepth, uint(minDepthUintNear));
            atomicMax(range.maxDepth, uint(maxDepthUintNear));
        }
    }
}

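Another advantage of SDSM is that it can be fully GPU-driven, and my engine is in fact fully GPU-driven. Once the min and max depth have been evaluated, the next dispatch sets up each cascade's frustum directly, culling follows, and a single DrawIndirect call then draws all the shadow casters.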
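As a rough illustration of how the evaluated depth range could drive the cascade setup, the C++ sketch below decodes the two uint values, converts them to view-space distances and places the cascade boundaries logarithmically between them. Everything here is an assumption for illustration: the post does not say which split scheme the engine uses, `linearizeDepth` assumes a standard (non-reversed) perspective projection with depth in [0, 1], and `computeCascadeSplits` is a hypothetical name. In the GPU-driven setup described above, the same arithmetic would run in a compute shader rather than on the CPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Same scale as the min/max shader: encoded = depth * 4e9.
constexpr float kDepthScale = 4000000000.0f;

// Assumption: standard perspective projection, depth 0 at zNear and 1 at zFar.
float linearizeDepth(float depth, float zNear, float zFar) {
    return zNear * zFar / (zFar - depth * (zFar - zNear));
}

// Returns the far distance of each cascade, spaced logarithmically between the
// nearest and farthest visible samples (an assumed split scheme).
std::vector<float> computeCascadeSplits(std::uint32_t minEncoded,
                                        std::uint32_t maxEncoded,
                                        float zNear, float zFar,
                                        int cascadeCount) {
    assert(cascadeCount > 0 && zNear > 0.0f && zFar > zNear);
    assert(minEncoded <= maxEncoded);

    // Decode back to [0, 1]; clamp in case the encoded max carries a small bias.
    const float depthMin = static_cast<float>(minEncoded) / kDepthScale;
    const float depthMax = std::min(1.0f, static_cast<float>(maxEncoded) / kDepthScale);

    const float viewMin = linearizeDepth(depthMin, zNear, zFar);
    const float viewMax = linearizeDepth(depthMax, zNear, zFar);

    std::vector<float> splits;
    splits.reserve(static_cast<std::size_t>(cascadeCount));
    for (int i = 1; i <= cascadeCount; ++i) {
        const float t = static_cast<float>(i) / static_cast<float>(cascadeCount);
        splits.push_back(viewMin * std::pow(viewMax / viewMin, t));
    }
    return splits;
}
```

With a tight depth range the boundaries bunch up around the visible geometry, which is where the extra shadow resolution comes from.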

Performance Tips

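Because SDSM partitions the cascades so aggressively, every cascade has to be updated every frame. With conventional Cascaded Shadow Maps, the larger cascades usually sit far from the center of the view, so they can be updated only every 2 or 3 frames as an optimization. SDSM therefore tends to cost more than conventional Cascaded Shadow Maps.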
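The update-frequency trade-off can be written down as a small scheduling rule. This is a hypothetical policy for illustration only: the intervals, the four-cascade limit and the names `updateInterval` and `shouldUpdateCascade` are assumptions, not something taken from a particular engine.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kMaxCascades = 4;  // assumed cascade count

// Assumed policy for conventional cascades: the two nearest cascades refresh
// every frame, the third every 2 frames, the fourth every 3 frames.
std::uint32_t updateInterval(std::uint32_t cascadeIndex) {
    assert(cascadeIndex < kMaxCascades);
    if (cascadeIndex < 2) return 1;
    return cascadeIndex == 2 ? 2u : 3u;
}

bool shouldUpdateCascade(std::uint32_t cascadeIndex,
                         std::uint64_t frameIndex,
                         bool sdsmEnabled) {
    // With SDSM the split planes move every frame, so a cached cascade is stale.
    if (sdsmEnabled) return true;
    return frameIndex % updateInterval(cascadeIndex) == 0;
}
```

With SDSM enabled the function always returns true, which is exactly the extra cost described above.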

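Also, if the engine's shadow depth pass is not GPU-driven, SDSM performance is a concern: the depth min/max result has to be consumed one frame late, otherwise the CPU readback stalls the whole GPU pipeline. I previously implemented SDSM in UE4.26.2 and the performance was not ideal (the CPU readback took about as long as the SSAO pass, roughly 1.2 to 2.0 ms).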
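One way to picture the one-frame delay is a small ring of result slots indexed by frame number: frame N writes its min/max, and the CPU only ever reads what frame N-1 produced. The sketch below uses plain memory and a hypothetical `DelayedDepthRange` class; in a real renderer each slot would be a host-visible buffer guarded by a fence, which is omitted here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

struct DepthRange {
    std::uint32_t minDepth;
    std::uint32_t maxDepth;
};

// Ring of per-frame results; FramesInFlight slots are reused round-robin.
template <std::size_t FramesInFlight>
class DelayedDepthRange {
public:
    // Record the result produced by frame `frameIndex`.
    void store(std::uint64_t frameIndex, DepthRange range) {
        Slot& slot = slots_[frameIndex % FramesInFlight];
        slot.frame = frameIndex;
        slot.range = range;
        slot.valid = true;
    }

    // Fetch the result of the previous frame, if that frame stored one.
    std::optional<DepthRange> fetchPrevious(std::uint64_t currentFrame) const {
        if (currentFrame == 0) return std::nullopt;
        const Slot& slot = slots_[(currentFrame - 1) % FramesInFlight];
        if (!slot.valid || slot.frame != currentFrame - 1) return std::nullopt;
        return slot.range;
    }

private:
    struct Slot {
        std::uint64_t frame = 0;
        DepthRange range{0u, 0u};
        bool valid = false;
    };
    std::array<Slot, FramesInFlight> slots_{};
};
```

When `fetchPrevious` returns nothing (the first frame, for example), the caller would fall back to a fixed cascade partition.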

UE5, by contrast, is GPU-driven, but it also ships Virtual Shadow Maps, so there seems to be little reason left to implement SDSM there.