Implementing Sample Distribution Shadow Maps in Vulkan
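Sample Distribution Shadow Maps (SDSM) work on almost exactly the same principle as Cascaded Shadow Maps (CSM). The difference is that SDSM partitions the cascade depth ranges dynamically, and with that advantage it looks far better than plain CSM.
As the red box below shows, with no PCSS and a 4x4 PCF shadow filter, the ordinary cascades do not have enough shadow-map resolution, so the shadows turn blurry.
Below that, SDSM is enabled, and every shadow is sharp and accurate.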
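To implement SDSM in an engine, first adjust the render pipeline so that the GBuffer is rendered before the ShadowDepth pass.
Once the GBuffer is rendered, the minimum and maximum depth values of the depth texture are evaluated on the GPU.
There is no need for mipmap-style copy passes: with shared memory, the whole evaluation is done in a single compute shader dispatch.
Comparing min and max within shared memory is a classic reduce operation.
Comparing across groups, on the other hand, is a concurrent operation and needs atomics. The core GLSL atomic functions only operate on integer types (int and uint), not on float, so the depth is re-encoded as a uint for the atomics and converted back afterwards.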
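As a minimal sketch of that re-encoding, assuming the same 4,000,000,000 scale factor the shader in this article uses: depth in [0, 1] is scaled to a uint, and because the scale is a positive constant, comparing the uints gives the same order as comparing the floats. The helper names encodeDepth and decodeDepth and the constant depth value are made up for illustration only.

```glsl
#version 460
layout(local_size_x = 1) in;

// Assumed scale: 1.0 maps to 4,000,000,000, below the uint maximum 4,294,967,295.
const float DEPTH_SCALE = 4000000000.0f;

layout(set = 0, binding = 0) buffer DepthRangeBuffer
{
    uint minDepth;
    uint maxDepth;
} range;

// Hypothetical helper: float depth in [0, 1] -> fixed-point uint.
uint encodeDepth(float depth)
{
    return uint(clamp(depth, 0.0f, 1.0f) * DEPTH_SCALE);
}

// Hypothetical helper: fixed-point uint -> float depth.
float decodeDepth(uint encoded)
{
    return float(encoded) / DEPTH_SCALE;
}

void main()
{
    float depth = 0.25f; // stand-in for a depth value sampled from the depth texture
    uint encoded = encodeDepth(depth);

    // Integer atomics work on the encoded value.
    atomicMin(range.minDepth, encoded);
    atomicMax(range.maxDepth, encoded);

    // A later pass would read the result back as a float, for example:
    // float sceneMinDepth = decodeDepth(range.minDepth);
}
```

A float only has a 24-bit significand, so the encoded value is quantized. That is one reason to pad the maximum slightly, as the full shader does.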
The complete shader:
#version 460

#define WORK_TILE_SIZE 16

// Depth in [0, 1] is encoded as a uint by scaling it with 4,000,000,000.
// The uint maximum is 0xFFFFFFFF = 4,294,967,295, so the scaled value and the
// "SCALE_UINT_7u + 10u" reset value both fit.
// A 32-bit float has a 24-bit significand, so the encoded value is quantized.
const uint  SCALE_UINT_7u = 4000000000u;
const float SCALE_UINT_7  = 4000000000.0f;

layout (local_size_x = WORK_TILE_SIZE, local_size_y = WORK_TILE_SIZE, local_size_z = 1) in;

layout(set = 0, binding = 0) uniform sampler2D inDepthImage;

struct DepthRange
{
    uint minDepth;
    uint maxDepth;
};

// Not writeonly: atomicMin/atomicMax read and modify the stored values.
layout(set = 1, binding = 0) buffer DepthRangeBuffer
{
    DepthRange range;
};

struct PushConstantData
{
    vec2 imageSize;
};

layout(push_constant) uniform block
{
    PushConstantData pushConstant;
};

shared vec2 DepthContainer[WORK_TILE_SIZE * WORK_TILE_SIZE]; // .x is min, .y is max

// TODO: use subgroup.
void main()
{
    // range must be reset to { SCALE_UINT_7u + 10u, 0u } before this dispatch,
    // for example with vkCmdUpdateBuffer followed by a buffer memory barrier.
    // Resetting it here from a single invocation would race with the atomics of
    // other work groups, because barrier() only synchronizes one work group.
    uvec2 pos = gl_GlobalInvocationID.xy;
    const uint linearIndex = gl_LocalInvocationID.y * WORK_TILE_SIZE + gl_LocalInvocationID.x;

    // The depth image is bound with a point, clamp-to-edge sampler,
    // so the border case needs no special handling here.
    vec2 sampleUV = (vec2(pos) + vec2(0.5)) / pushConstant.imageSize;
    float depth = texture(inDepthImage, sampleUV).x;

    if (depth < 0.000001f || depth > 0.999999f)
    {
        // Invalid pixel (for example sky): store neutral values for min and max.
        DepthContainer[linearIndex] = vec2(1.0f, 0.0f);
    }
    else
    {
        DepthContainer[linearIndex] = vec2(depth, depth);
    }
    groupMemoryBarrier();
    barrier();

    if (linearIndex < 128) // 16 * 16 = 256 elements to merge
    {
        DepthContainer[linearIndex].x = min(DepthContainer[linearIndex].x, DepthContainer[linearIndex + 128].x);
        DepthContainer[linearIndex].y = max(DepthContainer[linearIndex].y, DepthContainer[linearIndex + 128].y);
    }
    groupMemoryBarrier();
    barrier();

    if (linearIndex < 64)
    {
        DepthContainer[linearIndex].x = min(DepthContainer[linearIndex].x, DepthContainer[linearIndex + 64].x);
        DepthContainer[linearIndex].y = max(DepthContainer[linearIndex].y, DepthContainer[linearIndex + 64].y);
    }
    groupMemoryBarrier();
    barrier();

    if (linearIndex < 32)
    {
        DepthContainer[linearIndex].x = min(DepthContainer[linearIndex].x, DepthContainer[linearIndex + 32].x);
        DepthContainer[linearIndex].y = max(DepthContainer[linearIndex].y, DepthContainer[linearIndex + 32].y);
    }
    groupMemoryBarrier();
    barrier();

    if (linearIndex < 16)
    {
        DepthContainer[linearIndex].x = min(DepthContainer[linearIndex].x, DepthContainer[linearIndex + 16].x);
        DepthContainer[linearIndex].y = max(DepthContainer[linearIndex].y, DepthContainer[linearIndex + 16].y);
    }
    groupMemoryBarrier();
    barrier();

    if (linearIndex < 8)
    {
        DepthContainer[linearIndex].x = min(DepthContainer[linearIndex].x, DepthContainer[linearIndex + 8].x);
        DepthContainer[linearIndex].y = max(DepthContainer[linearIndex].y, DepthContainer[linearIndex + 8].y);
    }
    groupMemoryBarrier();
    barrier();

    if (linearIndex < 4)
    {
        DepthContainer[linearIndex].x = min(DepthContainer[linearIndex].x, DepthContainer[linearIndex + 4].x);
        DepthContainer[linearIndex].y = max(DepthContainer[linearIndex].y, DepthContainer[linearIndex + 4].y);
    }
    groupMemoryBarrier();
    barrier();

    if (linearIndex < 2)
    {
        DepthContainer[linearIndex].x = min(DepthContainer[linearIndex].x, DepthContainer[linearIndex + 2].x);
        DepthContainer[linearIndex].y = max(DepthContainer[linearIndex].y, DepthContainer[linearIndex + 2].y);
    }
    groupMemoryBarrier();
    barrier();

    if (linearIndex < 1)
    {
        float groupMin = min(DepthContainer[0].x, DepthContainer[1].x);
        float groupMax = max(DepthContainer[0].y, DepthContainer[1].y);

        // A group without any valid pixel still holds the neutral values
        // (min > max) and must not touch the global range.
        if (groupMin <= groupMax)
        {
            float minDepthUintNear = max(0.0f, groupMin * SCALE_UINT_7);
            float maxDepthUintNear = min(SCALE_UINT_7, groupMax * SCALE_UINT_7 + 10.0f);
            atomicMin(range.minDepth, uint(minDepthUintNear));
            atomicMax(range.maxDepth, uint(maxDepthUintNear));
        }
    }
}
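The TODO about subgroups could look roughly like the sketch below. It is not the shader I ship, only an illustration under these assumptions: Vulkan 1.1 subgroup arithmetic is supported in the compute stage, the range buffer has already been reset to a large min and a zero max before the dispatch, and the shared variables groupMinDepth and groupMaxDepth are names invented here. Each subgroup reduces with subgroupMin/subgroupMax, one elected invocation per subgroup merges into shared memory, and one invocation per work group merges into the global buffer.

```glsl
#version 460
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_KHR_shader_subgroup_arithmetic : enable

#define WORK_TILE_SIZE 16
const float SCALE_UINT_7 = 4000000000.0f;

layout(local_size_x = WORK_TILE_SIZE, local_size_y = WORK_TILE_SIZE, local_size_z = 1) in;

layout(set = 0, binding = 0) uniform sampler2D inDepthImage;
layout(set = 1, binding = 0) buffer DepthRangeBuffer
{
    uint minDepth;
    uint maxDepth;
} range;
layout(push_constant) uniform block
{
    vec2 imageSize;
} pushConstant;

// Per-work-group partial results (hypothetical names).
shared uint groupMinDepth;
shared uint groupMaxDepth;

void main()
{
    if (gl_LocalInvocationIndex == 0u)
    {
        groupMinDepth = 0xFFFFFFFFu;
        groupMaxDepth = 0u;
    }
    barrier();

    vec2 sampleUV = (vec2(gl_GlobalInvocationID.xy) + vec2(0.5)) / pushConstant.imageSize;
    float depth = texture(inDepthImage, sampleUV).x;
    bool valid = depth >= 0.000001f && depth <= 0.999999f;

    // Invalid pixels contribute neutral values: 1.0 for min, 0.0 for max.
    float sgMin = subgroupMin(valid ? depth : 1.0f);
    float sgMax = subgroupMax(valid ? depth : 0.0f);

    // One invocation per subgroup merges into shared memory,
    // skipping subgroups that saw no valid pixel (min > max).
    if (subgroupElect() && sgMin <= sgMax)
    {
        atomicMin(groupMinDepth, uint(sgMin * SCALE_UINT_7));
        atomicMax(groupMaxDepth, uint(min(SCALE_UINT_7, sgMax * SCALE_UINT_7 + 10.0f)));
    }
    barrier();

    // One invocation per work group merges into the global range.
    if (gl_LocalInvocationIndex == 0u && groupMinDepth <= groupMaxDepth)
    {
        atomicMin(range.minDepth, groupMinDepth);
        atomicMax(range.maxDepth, groupMaxDepth);
    }
}
```

Compared with the shared-memory tree, this removes most of the barriers. Whether it is actually faster depends on the hardware subgroup size, and I have not measured it here.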
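Another advantage of SDSM is that it can be fully GPU-driven, and my engine is in fact fully GPU-driven. Once the min and max have been evaluated, the next dispatch sets up the frustum of each shadow cascade directly, culling follows, and DrawIndirect then draws all shadow casters in one go.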
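To make the "next dispatch" step more concrete, here is a rough sketch of turning the encoded depth range into cascade split distances on the GPU. Everything specific in it is an assumption for illustration, not my engine's code: the standard (non-reversed) [0, 1] perspective depth mapping, the logarithmic partition between the measured min and max, four cascades, and the names CascadeSplitBuffer, cameraNear and cameraFar. Building the actual light-space frusta from these splits is left out.

```glsl
#version 460
#define CASCADE_COUNT 4

// One invocation per cascade.
layout(local_size_x = CASCADE_COUNT) in;

const float SCALE_UINT_7 = 4000000000.0f;

layout(set = 0, binding = 0) readonly buffer DepthRangeBuffer
{
    uint minDepth;
    uint maxDepth;
} range;

// Hypothetical output: .x = near split, .y = far split, as view-space distances.
layout(set = 0, binding = 1) writeonly buffer CascadeSplitBuffer
{
    vec2 splits[CASCADE_COUNT];
} cascades;

layout(push_constant) uniform block
{
    float cameraNear;
    float cameraFar;
} pushConstant;

// Assumes depth 0 at the near plane and 1 at the far plane (no reversed Z).
float linearizeDepth(float d)
{
    float n = pushConstant.cameraNear;
    float f = pushConstant.cameraFar;
    return n * f / (f - d * (f - n));
}

void main()
{
    uint i = gl_LocalInvocationID.x;

    // Decode the fixed-point range written by the min/max pass.
    float minD = clamp(float(range.minDepth) / SCALE_UINT_7, 0.0f, 1.0f);
    float maxD = clamp(float(range.maxDepth) / SCALE_UINT_7, 0.0f, 1.0f);

    // No valid pixel was found: fall back to the full depth range.
    if (minD >= maxD)
    {
        minD = 0.0f;
        maxD = 1.0f;
    }

    float zMin = linearizeDepth(minD);
    float zMax = linearizeDepth(maxD);

    // Logarithmic partition of [zMin, zMax] into CASCADE_COUNT slices.
    float ratio = zMax / zMin;
    float nearSplit = zMin * pow(ratio, float(i) / float(CASCADE_COUNT));
    float farSplit  = zMin * pow(ratio, float(i + 1u) / float(CASCADE_COUNT));

    cascades.splits[i] = vec2(nearSplit, farSplit);
}
```

Because the splits never leave the GPU, the culling and DrawIndirect passes that follow can read them straight from the buffer without any CPU readback.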
Performance tips
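SDSM partitions the cascades very aggressively, so every cascade has to be updated every frame. With ordinary CSM, the larger cascades are usually far from the center of view and can be updated only every second or third frame as an optimization. SDSM therefore performs worse than conventional CSM.
Also, if the engine's shadow depth pass is not GPU-driven, SDSM performance is a concern. The depth min/max result has to be used one frame late, otherwise the CPU readback stalls the whole GPU pipeline. I previously implemented SDSM in UE4.26.2, and the performance was not great: the CPU readback took about as long as the SSAO evaluation, roughly 1.2 to 2.0 ms.
UE5 is GPU-driven, but it also implements Virtual Shadow Maps, so there seems to be little reason left to implement SDSM there.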