wgpu/docs/api-specs/mesh_shading.md

# Mesh Shader Extensions

🧪Experimental🧪

`wgpu` supports an experimental version of mesh shading when `Features::EXPERIMENTAL_MESH_SHADER` is enabled.
The status of the implementation is documented in [the mesh-shading issue](https://github.com/gfx-rs/wgpu/issues/7197).

**Note**: The features documented here may have major bugs in them and are expected to be subject
to breaking changes. Suggestions for the API exposed by this should be posted on the issue above.

## Mesh shaders overview

### What are mesh shaders?

Mesh shaders are a new kind of rasterization pipeline intended to address some of the shortfalls with the vertex shader pipeline. The core idea of mesh shaders is that the GPU decides how to render the many small parts of a scene instead of the CPU issuing a draw call for every small part or issuing an inefficient monolithic draw call for a large part of the scene.

Mesh shaders are specifically designed to be used with **meshlet rendering**, a technique where every object is split into many subobjects called meshlets that are each rendered with their own parameters. With the standard vertex pipeline, each draw call specifies an exact number of primitives to render and the same parameters for all vertex shaders on an entire object (or even multiple objects). This doesn't leave room for different LODs for different parts of an object, for example a closer part having more detail, nor does it allow culling smaller sections (or primitives) of objects. With mesh shaders, each task workgroup might get assigned to a single object. It can then analyze the different meshlets(sections) of that object, determine which are visible and should actually be rendered, and for those meshlets determine what LOD to use based on the distance from the camera. It can then dispatch a mesh workgroup for each meshlet, with each mesh workgroup then reading the data for that specific LOD of its meshlet, determining which and how many vertices and primitives to output, determining which remaining primitives need to be culled, and passing the resulting primitives to the rasterizer.

Mesh shaders are most effective in scenes with many polygons. They can allow skipping processing of entire groups of primitives that are facing away from the camera or otherwise occluded, which reduces the number of primitives that need to be processed by more than half in most cases, and they can reduce the number of primitives that need to be processed for more distant objects. Scenes that are not bottlenecked by geometry (perhaps instead by fragment processing or post processing) will not see much benefit from using them.

Mesh shaders were first shown off in [NVIDIA's asteroids demo](https://www.youtube.com/watch?v=CRfZYJ_sk5E). Now, they form the basis for [Unreal Engine's Nanite](https://www.unrealengine.com/en-US/blog/unreal-engine-5-is-now-available-in-preview#Nanite).

### Mesh shader pipeline

With the current pipeline set to a mesh pipeline, a draw command like
`render_pass.draw_mesh_tasks(x, y, z)` takes the following steps:

* If the pipeline has a task shader stage:

    * Dispatch a grid of task shader workgroups, where `x`, `y`, and `z` give
      the number of workgroups along each axis of the grid. Each task shader
      workgroup produces a mesh shader workgroup grid size `(mx, my, mz)` and a
      task payload value `mp`.

    * For each task shader workgroup, dispatch a grid of mesh shader workgroups,
      where `mx`, `my`, and `mz` give the number of workgroups along each axis
      of the grid. Pass `mp` to each of these workgroup's mesh shader
      invocations.

* Alternatively, if the pipeline does not have a task shader stage:

    * Dispatch a single grid of mesh shader workgroups, where `x`, `y`, and `z`
      give the number of workgroups along each axis of the grid. These mesh
      shaders receive no task payload value.

* Each mesh shader workgroup produces a list of output vertices, and a list of
  primitives built from those vertices. The workgroup can supply per-primitive
  values as well, if needed. Each primitive selects its vertices by index, like
  an indexed draw call, from among the vertices generated by this workgroup.

  Unlike a grid of ordinary compute shader workgroups collaborating to build
  vertex and index data in common storage buffers, the vertices and primitives
  produced by a mesh shader workgroup are entirely private to that workgroup,
  and are not accessible by other workgroups.

* Primitives produced by a mesh shader workgroup can have a culling flag. If a
  primitive's culling flag is false, it is skipped during rasterization.

* The primitives produced by all mesh shader workgroups are then rasterized in
  the usual way, with each fragment shader invocation handling one pixel.

  Attributes from the vertices produced by the mesh shader workgroup are
  provided to the fragment shader with interpolation applied as appropriate.

  If the mesh shader workgroup supplied per-primitive values, these are
  available to each primitive's fragment shader invocations. Per-primitive
  values are never interpolated; fragment shaders simply receive the values
  the mesh shader workgroup associated with their primitive.

## `wgpu` API

### New `wgpu` functions

`Device::create_mesh_pipeline` - Creates a mesh shader pipeline. This is very similar to creating a standard render pipeline, except that it takes a mesh shader state and optional task shader state instead of a vertex state. If the task state is omitted, during rendering the number of workgroups is passed directly from the draw call to the mesh shader state, with an empty payload.

`RenderPass::draw_mesh_tasks` - Dispatches the mesh shader pipeline. This ignores render pipeline specific information, such as vertex buffer bindings and index buffer bindings. The dispatch size must adhere to the limits described below.

`RenderPass::draw_mesh_tasks_indirect`, `RenderPass::multi_draw_mesh_tasks_indirect` and `RenderPass::multi_draw_mesh_tasks_indirect_count` - Dispatches the mesh shader pipeline with dispatch size taken from a buffer. This ignores render pipeline specific information, such as vertex buffer bindings and index buffer bindings. The dispatch size must adhere to the limits described below. Analogous to `draw_indirect`, `multi_draw_indirect` and `multi_draw_indirect_count`. Requires the corresponding indirect feature to be enabled.

An example of using mesh shaders to render a single triangle can be seen [here](../../examples/features/src/mesh_shader).

### Features
* Using mesh shaders requires enabling `Features::EXPERIMENTAL_MESH_SHADER`.
* Using mesh shaders with multiview requires enabling `Features::EXPERIMENTAL_MESH_SHADER_MULTIVIEW`.
* Using mesh shaders with point primitives requires enabling `Features::EXPERIMENTAL_MESH_SHADER_POINTS`.
* Queries are unsupported
* Primitive index support will be added once support lands in for them in general.

### Limits

> **NOTE**: More limits will be added when support is added to `naga`.

* `Limits::max_task_workgroup_total_count` - the maximum total number of workgroups from a `draw_mesh_tasks` command or similar. The dimensions passed must be less than or equal to this limit when multiplied together.
* `Limits::max_task_workgroups_per_dimension` - the maximum for each of the 3 workgroup dimensions in a `draw_mesh_tasks` command. Each dimension passed must be less than or equal to this limit.
* `max_mesh_multiview_count` - The maximum number of views used when multiview rendering with a mesh shader pipeline.
* `max_mesh_output_layers` - the maximum number of output layers for a mesh shader pipeline.

## Naga implementation

### Supported frontends
* 🛠️ WGSL
* ❌ SPIR-V
* 🚫 GLSL

### Supported backends
* 🛠️ SPIR-V
* 🛠️ HLSL
* ❌ MSL
* 🚫 GLSL
* 🚫 WGSL

✔️ = Complete
🛠️ = In progress
❌ = Planned
🚫 = Unplanned/impossible

## `WGSL` extension specification

The majority of changes relating to mesh shaders will be in WGSL and `naga`.

Using any of these features in a `wgsl` program will require adding the `enable wgpu_mesh_shader;` directive to the top of a program.

Two new shader stages will be added to `WGSL`. Fragment shaders are also modified slightly. Both task shaders and mesh shaders are allowed to use any compute-available functionality, including subgroup operations.

### Task shader

A function with the `@task` attribute is a **task shader entry point**. A mesh shader pipeline may optionally specify a task shader entry point, and if it does, mesh draw commands using that pipeline dispatch a **task shader grid** of workgroups running the task shader entry point. Like compute shader dispatches, the three-component size passed to `draw_mesh_tasks`, or drawn from the indirect buffer for its indirect variants, specifies the size of the task shader grid as the number of workgroups along each of the grid's three axes.

A task shader entry point must have a `@workgroup_size` attribute, meeting the same requirements as one appearing on a compute shader entry point.

A task shader entry point must also have a `@payload(G)` property, where `G` is the name of a global variable in the `task_payload` address space. Each task shader workgroup has its own instance of this variable, visible to all invocations in the workgroup. Whatever value the workgroup collectively stores in that global variable becomes the **task payload**, and is provided to all invocations in the mesh shader grid dispatched for the workgroup. A task payload variable must be at least 4 bytes in size.

A task shader entry point must return a `vec3<u32>` value. The return value of each workgroup's first invocation (that is, the one whose `local_invocation_index` is `0`) is taken as the size of a **mesh shader grid** to dispatch, measured in workgroups. (If the task shader entry point returns `vec3(0, 0, 0)`, then no mesh shaders are dispatched.) Mesh shader grids are described in the next section.

Each task shader workgroup dispatches an independent mesh shader grid: in mesh shader invocations, `@builtin` values like `workgroup_id` and `global_invocation_id` describe the position of the workgroup and invocation within that grid;
and `@builtin(num_workgroups)` matches the task shader workgroup's return value. Mesh shaders dispatched for other task shader workgroups are not included in the count. If it is necessary for a mesh shader to know which task shader workgroup dispatched it, the task shader can include its own workgroup id in the task payload.

Task shaders must return a value of type `vec3<u32>` decorated with `@builtin(mesh_task_size)`.

Task shaders can use compute and subgroup builtin inputs, in addition to `view_index` and `draw_id`.

### Mesh shader

A function with the `@mesh` attribute is a **mesh shader entry point**. Mesh shaders must not return anything.

Like compute shaders, mesh shaders are invoked in a grid of workgroups, called a **mesh shader grid**. If the mesh shader pipeline has a task shader, then each task shader workgroup determines the size of a mesh shader grid to be dispatched, as described above. Otherwise, the three-component size passed to `draw_mesh_tasks`, or drawn from the indirect buffer for its indirect variants, specifies the size of the mesh shader grid directly, as the number of workgroups along each of the grid's three axes.

If the mesh shader pipeline has a task shader entry point, then the pipeline's mesh shader entry point must also have a `@payload(G)` attribute, and the sizes of the variables must match. Mesh shader invocations can read from, but not write to, this variable, which is initialized to whatever value was written to it by the task shader workgroup that dispatched this mesh shader grid.

If the mesh shader pipeline does not have a task shader entry point, then the mesh shader entry point must not have any `@payload` attribute.

A mesh shader entry point must have the following attributes:

- `@workgroup_size`: this has the same meaning as when it appears on a compute shader entry point.

- `@mesh(VAR)`: Here, `VAR` represents a workgroup variable storing the output information.

All mesh shader outputs are per-workgroup, and taken from the workgroup variable specified above. The type must have exactly 4 fields:
- A field decorated with `@builtin(vertex_count)`, with type `u32`: this field represents the number of vertices that will be drawn
- A field decorated with `@builtin(primitive_count)`, with type `u32`: this field represents the number of primitives that will be drawn
- A field decorated with `@builtin(vertices)`, typed as an array of `V`, where `V` is the vertex output type as specified below
- A field decorated with `@builtin(primitives)`, typed as an array of `P`, where `P` is the primitive output type as specified below

For a vertex count `NV`, the first `NV` elements of the vertex array above are outputted. Therefore, `NV` must be less than or equal to the size of the vertex array. The same is true for primitives with `NP`.

The vertex output type `V` must meet the same requirements as a struct type returned by a `@vertex` entry point: all members must have either `@builtin` or `@location` attributes, there must be a `@builtin(position)`, and so on.

The primitive output type `P` must be a struct type, every member of which either has a `@location` or `@builtin` attribute. All members decorated with `@location` must also be decorated with `@per_primitive`, as must the corresponding fragment input. The `@per_primitive` decoration may only be applied to members decorated with `@location`. The following `@builtin` attributes are allowed:

- `triangle_indices`, `line_indices`, or `point_index`: The annotated member must be of type `vec3<u32>`, `vec2<u32>`, or `u32`.

    The member's components are indices (or, its value is an index) into the list of vertices generated by this workgroup, identifying the vertices of the primitive to be drawn. These indices must be less than the value of `numVertices` passed to `setMeshOutputs`.

    The type `P` must contain exactly one member with one of these attributes, determining what sort of primitives the mesh shader generates.

- `cull_primitive`: The annotated member must be of type `bool`. If it is true, then the primitive is skipped during rendering.

The `@location` attributes of `P` and `V` must not overlap, since they are merged to produce the user-defined inputs to the fragment shader.

Mesh shaders may write to the `primitive_index` builtin. This is treated just like a field decorated with `@location`, so if the mesh shader outputs `primitive_index` the fragment shader must input it, and if the fragment shader inputs it, the mesh shader must write it (unlike vertex shader pipelines).

Mesh shaders can use compute and mesh shader builtin inputs, in addition to `view_index`, and if no task shader is present, `draw_id`.

### Fragment shader

Fragment shaders can access vertex output data as if it is from a vertex shader. They can also access primitive output data, provided the input is decorated with `@per_primitive`. The `@per_primitive` decoration may only be applied to inputs or struct members decorated with `@location`.

The primitive state is part of the fragment input and must match the output of the mesh shader in the pipeline. Using `@per_primitive` also requires enabling the mesh shader extension. Additionally, the locations of vertex and primitive input cannot overlap.

### Full example

The following is a full example of WGSL shaders that could be used to create a mesh shader pipeline, showing off many of the features.

```wgsl
enable wgpu_mesh_shader;

const positions = array(
    vec4(0., 1., 0., 1.),
    vec4(-1., -1., 0., 1.),
    vec4(1., -1., 0., 1.)
);
const colors = array(
    vec4(0., 1., 0., 1.),
    vec4(0., 0., 1., 1.),
    vec4(1., 0., 0., 1.)
);

struct TaskPayload {
    colorMask: vec4<f32>,
    visible: bool,
}
struct VertexOutput {
    @builtin(position) position: vec4<f32>,
    @location(0) color: vec4<f32>,
}
struct PrimitiveOutput {
    @builtin(triangle_indices) indices: vec3<u32>,
    @builtin(cull_primitive) cull: bool,
    @per_primitive @location(1) colorMask: vec4<f32>,
}
struct PrimitiveInput {
    @per_primitive @location(1) colorMask: vec4<f32>,
}

var<task_payload> taskPayload: TaskPayload;
var<workgroup> workgroupData: f32;

@task
@payload(taskPayload)
@workgroup_size(1)
fn ts_main() -> @builtin(mesh_task_size) vec3<u32> {
    workgroupData = 1.0;
    taskPayload.colorMask = vec4(1.0, 1.0, 0.0, 1.0);
    taskPayload.visible = true;
    return vec3(1, 1, 1);
}

struct MeshOutput {
    @builtin(vertices) vertices: array<VertexOutput, 3>,
    @builtin(primitives) primitives: array<PrimitiveOutput, 1>,
    @builtin(vertex_count) vertex_count: u32,
    @builtin(primitive_count) primitive_count: u32,
}
var<workgroup> mesh_output: MeshOutput;

@mesh(mesh_output)
@payload(taskPayload)
@workgroup_size(1)
fn ms_main(@builtin(local_invocation_index) index: u32, @builtin(global_invocation_id) id: vec3<u32>) {
    mesh_output.vertex_count = 3;
    mesh_output.primitive_count = 1;
    workgroupData = 2.0;

    mesh_output.vertices[0].position = positions[0];
    mesh_output.vertices[0].color = colors[0] * taskPayload.colorMask;

    mesh_output.vertices[1].position = positions[1];
    mesh_output.vertices[1].color = colors[1] * taskPayload.colorMask;

    mesh_output.vertices[2].position = positions[2];
    mesh_output.vertices[2].color = colors[2] * taskPayload.colorMask;

    mesh_output.primitives[0].indices = vec3<u32>(0, 1, 2);
    mesh_output.primitives[0].cull = !taskPayload.visible;
    mesh_output.primitives[0].colorMask = vec4<f32>(1.0, 0.0, 1.0, 1.0);
}

@fragment
fn fs_main(vertex: VertexOutput, primitive: PrimitiveInput) -> @location(0) vec4<f32> {
    return vertex.color * primitive.colorMask;
}
```