16. Instancing의 응용과 한계

이 챕터에서 정복할 CG 개념

LOD는 인스턴스마다 카메라 거리로 mesh 단순도를 선택하는 디스패치 결정이다.
Frustum culling은 인스턴스 단위 가시성 비트 attribute를 만들고, 보이지 않는 것을 빼는 작업이다.
Compaction (살아남은 인스턴스만 남기기)은 본질적으로 prefix sum (exclusive scan)이다.
거리순 정렬은 GPU radix sort 또는 bitonic sort. Sort POP은 그 노드 한 줄짜리 래퍼다.
한계: instancing은 draw call을 줄이지만 vertex 작업·메모리 대역폭은 그대로다. 그래서 LOD·culling이 필요하다.

왜 이게 중요한가

15장의 1024 박스가 100만 개가 되면 draw call 한 번이라도 vertex shader는 mesh의 모든 vertex × 100만 번 돈다. 카메라 뒤의 인스턴스도, 멀리 한 픽셀에 모인 인스턴스도 모두. 16장의 질문: 어느 인스턴스가 화면에 보이는가, 어느 단순도의 mesh로 그릴 것인가, 반투명이면 어느 순서로 그릴 것인가. 답의 공통 인프라는 dispatch + prefix sum이다. POP 측에서는 GLSL POP / Advanced POP이 그 dispatch를 직접 쓸 자리, Sort POP과 Delete POP이 노드 형태의 답이다.

POP에서의 노출 지점

GLSL POP: per-instance distance, frustum test 결과, LOD 인덱스 같은 attribute 작성.
GLSL Advanced POP: element 수 변경 (Output 페이지의 pointcountmode, maxpoints). compaction 결과의 압축 buffer 출력 가능.
Sort POP: point/primitive list를 P 또는 임의 attribute 성분으로 정렬. wiki: "It implements https://gpuopen.com/fidelityfx-parallel-sort/" — AMD FidelityFX Parallel Sort = GPU radix sort. 파라미터: pointrev, primrev, Shift.
Delete POP: Attribute / Thin / Pattern / Group / Bounding 다섯 모드. Attribute 모드는 <, <=, >, >=, ==, !=와 And/Or/Xor/Not And/Not Or 결합 지원.
GLSL MAT: TDInstanceID()로 살아남은 인스턴스의 transform·color·custom attribute 접근.

이론

LOD: per-instance distance → mesh index

d_i = length(P_i - C)로 거리를 재고 임계값으로 lod_i ∈ {0,1,2}를 정한다. GLSL MAT 안에서 mesh를 바꾸기 어려우므로 실용적 패턴은: GLSL POP이 lod attribute를 만든다 → lod별로 부분집합을 만들어 (Delete POP 세 번 또는 Advanced POP의 compaction) → 세 개의 Geometry COMP에 각각 다른 단순도의 mesh를 둔다 (geodesic frequency 4 / 2 / 1 식).

Frustum culling: clip-space 검사

clip = P_proj * V * vec4(P_i, 1.0), visible = all(abs(clip.xyz) <= clip.w). clip.w로 나누면 NDC가 되고 [-1,+1] 안이면 frustum 내부 — 위 식은 나눗셈 없이 같은 판정. mesh의 AABB가 있으면 8 코너를 검사해야 정확하다. 인스턴스 단위는 보통 점 한 개 또는 bounding sphere center로 근사한다.

Compaction: prefix sum

visible_i ∈ {0,1} 비트 배열에서 살아남은 인스턴스만 모은 출력 인덱스는 exclusive scan으로 얻는다.

input visible : 1 0 1 1 0 1 0 1
exclusive scan: 0 1 1 2 3 3 4 4
출력 길이 = scan[7] + visible[7] = 4 + 1 = 5

visible_i==1인 자리만 out[scan[i]] = in[i]로 옮기면 압축이 끝난다. prefix sum은 work-efficient 알고리즘 (Hillis-Steele, Blelloch)이 표준이다. Harris/Sengupta/Owens의 GPU Gems 3 Ch.39가 표준 참고.

Sorting: 반투명 인스턴스의 거리순

GPU sort의 두 표준 — Radix sort: 키의 비트 단위로 패스마다 0/1 split + prefix scan. log2(bits) 패스. Sort POP 내부 (wiki: FidelityFX Parallel Sort). Bitonic sort: 비교 네트워크. log²(N) 패스. 투명도 정렬에는 거리 키 32-bit float면 충분하다.

한계

instancing이 줄이는 비용은 draw call이지 vertex 작업도 픽셀 작업도 아니다. 따라서 큰 N에서 병목은 다음으로 옮겨 간다.

vertex shader 시간 → LOD로 mesh 단순화.
fillrate / overdraw → frustum culling, occlusion culling, 정렬.
메모리 대역폭 → attribute 압축, 16-bit float.

per-instance branching의 비용도 크다. SIMT 그룹 안에서 인스턴스마다 다른 분기를 타면 양쪽이 다 실행된다. LOD를 GLSL MAT 안의 if-else 대신 인스턴스 그룹 분리로 처리하는 이유가 이것이다.

손작업 (Hands-on)

시작 파일

15장의 1024-박스 그리드를 이어 쓴다.

Box POP1 (mesh).
Grid POP1 (1024 점).
Geometry COMP1 (instancing on, instanceop=/grid1, instancetx/y/z=P).
Camera COMP1.
GLSL MAT1 + vertex/pixel shader.

단계 1 — per-instance distance와 LOD attribute

Grid POP1과 Geometry COMP 사이에 GLSL POP1을 끼운다. Attribute Class: point. Output Attributes: lod. Create Attribs: lod, int 1컴포넌트. Vectors 페이지에 camPos vec3을 추가하고 Camera COMP의 위치를 expression으로 바인딩한다 (wiki 미확정 — 정확한 expression 형태는 실험 필요).

uniform vec3 camPos;
uniform float dNear;  // 예: 3.0
uniform float dFar;   // 예: 7.0

void main() {
    const uint id = TDIndex();
    if (id >= TDNumElements()) return;

    float d = length(TDIn_P() - camPos);
    int l = (d < dNear) ? 0 : (d < dFar ? 1 : 2);
    lod[id] = l;
}

이 코드가 GPU에서 하는 일: 1024 thread가 각자 한 점의 거리를 재고 두 임계값으로 분류해 lod SSBO에 정수를 쓴다. 같은 워크그룹의 thread가 보통 비슷한 거리이므로 SIMT 발산은 작다.

시각화: Geometry COMP의 Instance 3 페이지에서 instance0customop=glsl_pop1, instance0customx=lod(0)로 lod를 custom attribute 0번에 매핑한다. fragment에서:

vec3 lodCol[3] = vec3[3](
    vec3(1.0, 0.3, 0.3),  // LOD 0
    vec3(0.3, 1.0, 0.3),  // LOD 1
    vec3(0.3, 0.3, 1.0)   // LOD 2
);
int l = clamp(int(TDInstanceCustomAttrib0().x), 0, 2);
fragColor = TDOutputSwizzle(vec4(lodCol[l], 1.0));

보게 될 것: 카메라 가까이 빨강, 중간 초록, 멀리 파랑. 카메라가 움직이면 분류가 갱신된다.

단계 2 — Frustum culling: visibility bit

GLSL POP2 (point class, Output Attributes: visible). Matrices 페이지에 viewProj 4×4 — Camera COMP의 view·projection 합성을 expression으로 (wiki 미확정 — 정확한 멤버명·expression은 실험 필요).

uniform mat4 viewProj;

void main() {
    const uint id = TDIndex();
    if (id >= TDNumElements()) return;

    vec4 clip = viewProj * vec4(TDIn_P(), 1.0);
    bool inside = all(lessThanEqual(abs(clip.xyz), vec3(clip.w)));
    visible[id] = inside ? 1 : 0;
}

이 코드가 GPU에서 하는 일: vertex shader가 gl_Position을 만드는 것과 같은 곱셈을 compute 측에서 한다. clip 좌표가 -w..+w 안이면 frustum 안. 인스턴스가 mesh이면 bounding sphere 반경 보정이 필요하다.

단계 3 — Culling 결과 적용

경로 A — Delete POP: GLSL POP2 → Delete POP. Mode: Attribute, attribute visible, condition == 0. 결과는 보이는 점만 남고, 이 출력을 Geometry COMP의 instanceop에 꽂는다.

경로 B — GLSL Advanced POP: Output 페이지의 pointcountmode=set으로 point count를 동적으로 줄인다. multi-pass — pass 1: visible의 exclusive scan을 prefix attribute에 채움. pass 2: scatter.

// pass 2 (scatter) 개념
void main() {
    const uint id = TDIndex();
    if (id >= TDNumElements()) return;

    uint v = uint(TDInPoint_visible(0, id));
    if (v == 0u) return;

    uint outIdx = uint(TDInPoint_prefix(0, id));
    oTDPoint_P[outIdx] = TDInPoint_P(0, id);
}

실제 scan은 Harris/Sengupta/Owens의 up-sweep / down-sweep 알고리즘 (shared memory)이다. wiki 미확정 — Advanced POP의 Passes / Copy-Previous-Pass-Output-to-Input이 이 패턴에 정확히 매핑되는 방식은 glsl_pop_deep.md §6에 wiki 충돌이 있음. 실험 필요.

보게 될 것: 카메라를 회전시켜 grid가 화면 밖으로 나가면 그 부분의 박스가 사라진다. draw call은 여전히 한 번이지만 vertex 수가 줄어든다.

단계 4 — Sort POP으로 거리순 정렬

GLSL POP3 (Output Attributes: dist)에서 dist[id] = length(TDIn_P() - camPos); 한 줄. 다음에 Sort POP1, attribute dist, pointrev On이면 멀리부터 (back-to-front). 출력을 Geometry COMP의 instanceop에 꽂는다.

보게 될 것: 반투명 박스라면 정렬 효과가 명백하다. Constant MAT Alpha 0.5로 정렬 전/후를 비교하면 깊이 오류가 사라진다.

노드 ↔︎ GLSL 매핑

Culling — 노드: Grid POP → GLSL POP (visible) → Delete POP (visible==0) → Geometry COMP의 instanceop. Culling — Advanced POP: Grid POP → GLSL Advanced POP (scan + scatter, point count 변경) → Geometry COMP의 instanceop. 두 노드가 한 노드로 합쳐지지만 셰이더가 길어진다.

Sort — 노드: Sort POP 한 줄. 내부는 GPU radix sort.

Manual radix sort 의사 코드 (32-bit key, 8-bit 4 pass):

// per pass:
//   1) histogram: 각 thread가 키의 해당 byte → 256-bin 히스토그램
//   2) prefix scan: 히스토그램을 exclusive scan → bin별 시작 위치
//   3) scatter: scan 결과로 데이터를 출력 위치에 재배치
// pass 4번 → 32-bit 다 정렬

FidelityFX Parallel Sort는 위 의사 코드를 GPU 최적화한 구현이다. Sort POP은 이 100줄을 노드 한 줄로 압축한다.

확인 질문 (Self-check)

instancing이 vertex shader 호출 수를 줄이지 않는다면, 인스턴스 수가 커질 때의 다음 병목은 어디인가? 각 병목에 대응하는 기법은 무엇인가?
exclusive prefix scan이 compaction에 필요한 이유는? scan의 출력값이 무엇을 의미하는가?
frustum 안인지 판정하는 식 abs(clip.xyz) <= vec3(clip.w)가 왜 NDC 비교(/w 후 [-1,+1])와 같은가?
Sort POP의 내부 알고리즘이 GPU radix sort라는 사실이 노드를 쓰는 사용자에게 무엇을 의미하는가 (성능 특성, N에 대한 스케일링)?
Advanced POP으로 element 수를 줄였을 때 unchanged input attribute의 reference pass-through가 어떻게 되는가? (glsl_pop_deep.md §4 마지막 인용 참조)

연결 고리

GPU Gems 3 Ch.39 "Parallel Prefix Sum (Scan) with CUDA" (Harris, Sengupta, Owens) — compaction의 핵심 알고리즘.
GPU Gems 3 Ch.1 "Generating Complex Procedural Terrains Using the GPU" (Geiss) — 백만 개 인스턴스를 GPU dispatch + cull로 다루는 구체적 예.
Real-Time Rendering 4ed §18 "Pipeline Optimization": 어디가 병목이고 어디를 줄여야 하는지의 진단표.
Real-Time Rendering 4ed §19 "Acceleration Algorithms": LOD, frustum/occlusion culling, BVH의 책 측 어휘.
Real-Time Rendering 4ed §20 "Efficient Shading": deferred / clustered / Z-pre-pass와 인스턴스 수의 상호작용.

이 챕터의 한 줄 명제

LOD·frustum culling·sorting을 GPU에 옮기면 결국 dispatch와 prefix sum의 조합으로 환원된다. Sort POP과 Delete POP은 그 환원을 노드 한 줄로 노출한 인터페이스다.