# Mesh Shader Extensions

🧪Experimental🧪

`wgpu` supports an experimental version of mesh shading when `Features::EXPERIMENTAL_MESH_SHADER` is enabled.
Currently `naga` has no support for parsing or writing mesh shaders.
For this reason, all shaders must be created with `Device::create_shader_module_passthrough`.

**Note**: The features documented here may have major bugs and are expected to be subject
to breaking changes. Suggestions for the API exposed by this should be posted on [the mesh-shading issue](https://github.com/gfx-rs/wgpu/issues/7197).

## Mesh shaders overview

### What are mesh shaders?

Mesh shaders are a new kind of rasterization pipeline intended to address some of the shortfalls with the vertex shader pipeline. The core idea of mesh shaders is that the GPU decides how to render the many small parts of a scene instead of the CPU issuing a draw call for every small part or issuing an inefficient monolithic draw call for a large part of the scene.

Mesh shaders are specifically designed to be used with **meshlet rendering**, a technique where every object is split into many subobjects called meshlets that are each rendered with their own parameters. With the standard vertex pipeline, each draw call specifies an exact number of primitives to render and the same parameters for all vertex shaders on an entire object (or even multiple objects). This doesn't leave room for different levels of detail (LODs) for different parts of an object, for example a closer part having more detail, nor does it allow culling smaller sections (or primitives) of objects.

With mesh shaders, each task workgroup might get assigned to a single object. It can then analyze the different meshlets (sections) of that object, determine which are visible and should actually be rendered, and for those meshlets determine what LOD to use based on the distance from the camera. It can then dispatch a mesh workgroup for each meshlet. Each mesh workgroup reads the data for that specific LOD of its meshlet, determines which and how many vertices and primitives to output, determines which remaining primitives need to be culled, and passes the resulting primitives to the rasterizer.

Mesh shaders are most effective in scenes with many polygons. They can allow skipping processing of entire groups of primitives that are facing away from the camera or otherwise occluded, which reduces the number of primitives that need to be processed by more than half in most cases, and they can reduce the number of primitives that need to be processed for more distant objects. Scenes that are not bottlenecked by geometry (but instead by fragment processing or post-processing, for example) will not see much benefit from using them.

Mesh shaders were first shown off in [NVIDIA's asteroids demo](https://www.youtube.com/watch?v=CRfZYJ_sk5E). Now, they form the basis for [Unreal Engine's Nanite](https://www.unrealengine.com/en-US/blog/unreal-engine-5-is-now-available-in-preview#Nanite).

### Mesh shader pipeline

With the current pipeline set to a mesh pipeline, a draw command like
`render_pass.draw_mesh_tasks(x, y, z)` takes the following steps:

* If the pipeline has a task shader stage:

    * Dispatch a grid of task shader workgroups, where `x`, `y`, and `z` give
      the number of workgroups along each axis of the grid. Each task shader
      workgroup produces a mesh shader workgroup grid size `(mx, my, mz)` and a
      task payload value `mp`.

    * For each task shader workgroup, dispatch a grid of mesh shader workgroups,
      where `mx`, `my`, and `mz` give the number of workgroups along each axis
      of the grid. Pass `mp` to each of these workgroups' mesh shader
      invocations.

* Alternatively, if the pipeline does not have a task shader stage:

    * Dispatch a single grid of mesh shader workgroups, where `x`, `y`, and `z`
      give the number of workgroups along each axis of the grid. These mesh
      shaders receive no task payload value.

* Each mesh shader workgroup produces a list of output vertices, and a list of
  primitives built from those vertices. The workgroup can supply per-primitive
  values as well, if needed. Each primitive selects its vertices by index, like
  an indexed draw call, from among the vertices generated by this workgroup.

  Unlike a grid of ordinary compute shader workgroups collaborating to build
  vertex and index data in common storage buffers, the vertices and primitives
  produced by a mesh shader workgroup are entirely private to that workgroup,
  and are not accessible by other workgroups.

* Primitives produced by a mesh shader workgroup can have a culling flag. If a
  primitive's culling flag is true, it is skipped during rasterization.

* The primitives produced by all mesh shader workgroups are then rasterized in
  the usual way, with each fragment shader invocation handling one pixel.

  Attributes from the vertices produced by the mesh shader workgroup are
  provided to the fragment shader with interpolation applied as appropriate.

  If the mesh shader workgroup supplied per-primitive values, these are
  available to each primitive's fragment shader invocations. Per-primitive
  values are never interpolated; fragment shaders simply receive the values
  the mesh shader workgroup associated with their primitive.
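
The dispatch steps above can be modelled on the CPU. The following is a minimal sketch in plain Rust, not `wgpu` code: `mesh_workgroups_launched` and the `task_shader` closure are hypothetical names used only for illustration. It counts how many mesh shader workgroups a `draw_mesh_tasks(x, y, z)` call would launch, with and without a task shader stage.

```rust
/// Hypothetical CPU model of `draw_mesh_tasks(x, y, z)`; not part of `wgpu`.
/// `task_shader` stands in for a task shader: given a task workgroup id,
/// it returns the mesh shader grid size `(mx, my, mz)` for that workgroup.
fn mesh_workgroups_launched(
    draw: [u32; 3],
    task_shader: Option<&dyn Fn([u32; 3]) -> [u32; 3]>,
) -> u64 {
    // Number of workgroups in a grid, widened to u64 before multiplying.
    let volume = |g: [u32; 3]| g.iter().map(|&d| u64::from(d)).product::<u64>();
    match task_shader {
        // No task stage: the draw size is the mesh shader grid size.
        None => volume(draw),
        // Task stage: every task workgroup dispatches its own mesh grid.
        Some(ts) => {
            let mut total = 0u64;
            for z in 0..draw[2] {
                for y in 0..draw[1] {
                    for x in 0..draw[0] {
                        total += volume(ts([x, y, z]));
                    }
                }
            }
            total
        }
    }
}

fn main() {
    // Without a task shader, the 4x2x1 draw size is used directly.
    println!("{}", mesh_workgroups_launched([4, 2, 1], None));
    // A stand-in task shader that dispatches nothing for odd x coordinates
    // and a 3x1x1 mesh grid for even ones.
    let ts = |id: [u32; 3]| -> [u32; 3] {
        if id[0] % 2 == 0 { [3, 1, 1] } else { [0, 0, 0] }
    };
    println!("{}", mesh_workgroups_launched([4, 2, 1], Some(&ts)));
}
```

Returning `[0, 0, 0]` from the stand-in task shader models a task workgroup that culls everything it was responsible for, so no mesh workgroups are launched for it.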

## `wgpu` API

### New `wgpu` functions

`Device::create_mesh_pipeline` - Creates a mesh shader pipeline. This is very similar to creating a standard render pipeline, except that it takes a mesh shader state and optional task shader state instead of a vertex state. If the task state is omitted, during rendering the number of workgroups is passed directly from the draw call to the mesh shader state, with an empty payload.

`RenderPass::draw_mesh_tasks` - Dispatches the mesh shader pipeline. This ignores render pipeline specific information, such as vertex buffer bindings and index buffer bindings. The dispatch size must adhere to the limits described below.

`RenderPass::draw_mesh_tasks_indirect`, `RenderPass::multi_draw_mesh_tasks_indirect` and `RenderPass::multi_draw_mesh_tasks_indirect_count` - Dispatches the mesh shader pipeline with dispatch size taken from a buffer. This ignores render pipeline specific information, such as vertex buffer bindings and index buffer bindings. The dispatch size must adhere to the limits described below. Analogous to `draw_indirect`, `multi_draw_indirect` and `multi_draw_indirect_count`. Requires the corresponding indirect feature to be enabled.

An example of using mesh shaders to render a single triangle can be seen [here](../../examples/features/src/mesh_shader).

### Features
* Using mesh shaders requires enabling `Features::EXPERIMENTAL_MESH_SHADER`.
* Using mesh shaders with multiview requires enabling `Features::EXPERIMENTAL_MESH_SHADER_MULTIVIEW`.
* Currently, only triangle rendering is tested.
* Line rendering is supported but untested.
* Point rendering is supported on Vulkan. It is impossible on DirectX. Metal support hasn't been checked.
* Queries are unsupported.

### Limits

> **NOTE**: More limits will be added when support is added to `naga`.

* `Limits::max_task_workgroup_total_count` - the maximum total number of workgroups from a `draw_mesh_tasks` command or similar. The dimensions passed must be less than or equal to this limit when multiplied together.
* `Limits::max_task_workgroups_per_dimension` - the maximum for each of the 3 workgroup dimensions in a `draw_mesh_tasks` command. Each dimension passed must be less than or equal to this limit.
* `Limits::max_mesh_multiview_count` - the maximum number of views used when multiview rendering with a mesh shader pipeline.
* `Limits::max_mesh_output_layers` - the maximum number of output layers for a mesh shader pipeline.
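
The two dispatch-size limits can be checked before issuing a draw. The sketch below is plain Rust and makes no `wgpu` calls: `check_dispatch_size` is a hypothetical helper, and the limit values in `main` are made up for the example rather than taken from any real device.

```rust
/// Hypothetical helper, not part of `wgpu`: checks a `draw_mesh_tasks`
/// size against the two dispatch limits described above.
fn check_dispatch_size(
    size: [u32; 3],
    max_total_count: u32,
    max_per_dimension: u32,
) -> Result<(), String> {
    // Each dimension must be less than or equal to the per-dimension limit.
    if let Some(d) = size.iter().find(|&&d| d > max_per_dimension) {
        return Err(format!("dimension {d} exceeds {max_per_dimension}"));
    }
    // The product is computed in u128 so three u32 values cannot overflow.
    let total: u128 = size.iter().map(|&d| u128::from(d)).product();
    if total > u128::from(max_total_count) {
        return Err(format!("total {total} exceeds {max_total_count}"));
    }
    Ok(())
}

fn main() {
    // Illustrative limits only: total count 1024, per-dimension 64.
    println!("{:?}", check_dispatch_size([8, 8, 4], 1024, 64));
    println!("{:?}", check_dispatch_size([64, 64, 1], 1024, 64));
    println!("{:?}", check_dispatch_size([128, 1, 1], 1024, 64));
}
```

Note that a size can satisfy the per-dimension limit and still fail the total-count limit, as the second call shows.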

### Backend specific information
* Only Vulkan is currently supported.
* DirectX 12 support is planned. Note that DirectX 12 doesn't support point rendering.
* Metal support is desired but not currently planned.


## Naga implementation


### Supported frontends
* 🛠️ WGSL
* ❌ SPIR-V
* 🚫 GLSL

### Supported backends
* 🛠️ SPIR-V
* ❌ HLSL
* ❌ MSL
* 🚫 GLSL
* 🚫 WGSL

✔️ = Complete
🛠️ = In progress
❌ = Planned
🚫 = Unplanned/impossible

## `WGSL` extension specification

The majority of changes relating to mesh shaders will be in WGSL and `naga`.

Using any of these features in a WGSL program will require adding the `enable mesh_shading` directive to the top of the program.

Two new shader stages will be added to `WGSL`. Fragment shaders are also modified slightly. Both task shaders and mesh shaders are allowed to use any compute-specific functionality, such as subgroup operations.

### Task shader

A function with the `@task` attribute is a **task shader entry point**. A mesh shader pipeline may optionally specify a task shader entry point, and if it does, mesh draw commands using that pipeline dispatch a **task shader grid** of workgroups running the task shader entry point. Like compute shader dispatches, the three-component size passed to `draw_mesh_tasks`, or drawn from the indirect buffer for its indirect variants, specifies the size of the task shader grid as the number of workgroups along each of the grid's three axes.

A task shader entry point must have a `@workgroup_size` attribute, meeting the same requirements as one appearing on a compute shader entry point.

A task shader entry point must also have a `@payload(G)` attribute, where `G` is the name of a global variable in the `task_payload` address space. Each task shader workgroup has its own instance of this variable, visible to all invocations in the workgroup. Whatever value the workgroup collectively stores in that global variable becomes the **task payload**, and is provided to all invocations in the mesh shader grid dispatched for the workgroup.

A task shader entry point must return a `vec3<u32>` value. The return value of each workgroup's first invocation (that is, the one whose `local_invocation_index` is `0`) is taken as the size of a **mesh shader grid** to dispatch, measured in workgroups. (If the task shader entry point returns `vec3(0, 0, 0)`, then no mesh shaders are dispatched.) Mesh shader grids are described in the next section.

Each task shader workgroup dispatches an independent mesh shader grid: in mesh shader invocations, `@builtin` values like `workgroup_id` and `global_invocation_id` describe the position of the workgroup and invocation within that grid;
and `@builtin(num_workgroups)` matches the task shader workgroup's return value. Mesh shaders dispatched for other task shader workgroups are not included in the count. If it is necessary for a mesh shader to know which task shader workgroup dispatched it, the task shader can include its own workgroup id in the task payload.
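
To illustrate how these builtin values are scoped, here is a small CPU model in plain Rust. It is a sketch only: `MeshWorkgroupView` and `mesh_grid_for` are hypothetical names, and the `task_workgroup_id` field models a value that the task shader chose to store in its payload.

```rust
/// Hypothetical model of what a mesh shader workgroup observes; not `wgpu` API.
#[derive(Debug, PartialEq)]
struct MeshWorkgroupView {
    task_workgroup_id: [u32; 3], // carried in the task payload by choice
    workgroup_id: [u32; 3],      // position within this task workgroup's grid
    num_workgroups: [u32; 3],    // the task workgroup's return value
}

/// Enumerates the mesh workgroups dispatched by one task workgroup.
fn mesh_grid_for(task_workgroup_id: [u32; 3], grid: [u32; 3]) -> Vec<MeshWorkgroupView> {
    let mut views = Vec::new();
    for z in 0..grid[2] {
        for y in 0..grid[1] {
            for x in 0..grid[0] {
                views.push(MeshWorkgroupView {
                    task_workgroup_id,
                    workgroup_id: [x, y, z],
                    num_workgroups: grid,
                });
            }
        }
    }
    views
}

fn main() {
    // Two task workgroups, each returning a different grid size.
    let a = mesh_grid_for([0, 0, 0], [2, 1, 1]);
    let b = mesh_grid_for([1, 0, 0], [3, 1, 1]);
    // Both grids start numbering at workgroup_id (0, 0, 0); only the
    // payload-carried task workgroup id tells them apart.
    println!("{:?}", a[0]);
    println!("{:?}", b[0]);
}
```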

### Mesh shader

A function with the `@mesh` attribute is a **mesh shader entry point**. Mesh shaders must not return anything.

Like compute shaders, mesh shaders are invoked in a grid of workgroups, called a **mesh shader grid**. If the mesh shader pipeline has a task shader, then each task shader workgroup determines the size of a mesh shader grid to be dispatched, as described above. Otherwise, the three-component size passed to `draw_mesh_tasks`, or drawn from the indirect buffer for its indirect variants, specifies the size of the mesh shader grid directly, as the number of workgroups along each of the grid's three axes.

If the mesh shader pipeline has a task shader entry point, then the pipeline's mesh shader entry point must also have a `@payload(G)` attribute, naming the same variable, and the sizes must match. Mesh shader invocations can read, but not write, this variable, which is initialized to whatever value was written to it by the task shader workgroup that dispatched this mesh shader grid.

If the mesh shader pipeline does not have a task shader entry point, then the mesh shader entry point must not have any `@payload` attribute.

A mesh shader entry point must have the following attributes:

- `@workgroup_size`: this has the same meaning as when it appears on a compute shader entry point.

- `@vertex_output(V, NV)`: This indicates that the mesh shader workgroup will generate at most `NV` vertex values, each of type `V`.

- `@primitive_output(P, NP)`: This indicates that the mesh shader workgroup will generate at most `NP` primitives, each of type `P`.

Each mesh shader entry point invocation must call the `setMeshOutputs(numVertices: u32, numPrimitives: u32)` builtin function at least once. The values passed by each workgroup's first invocation (that is, the one whose `local_invocation_index` is `0`) determine how many vertices (values of type `V`) and primitives (values of type `P`) the workgroup must produce. Invocations can still write to indices at or beyond these counts, but those values are not used in the output.

The `numVertices` and `numPrimitives` arguments must be no greater than `NV` and `NP` from the `@vertex_output` and `@primitive_output` attributes.

To produce vertex data, the workgroup as a whole must make `numVertices` calls to the `setVertex(i: u32, vertex: V)` builtin function. This establishes `vertex` as the value of the `i`'th vertex, where `i` is less than the maximum number of output vertices in the `@vertex_output` attribute. `V` is the type given in the `@vertex_output` attribute. `V` must meet the same requirements as a struct type returned by a `@vertex` entry point: all members must have either `@builtin` or `@location` attributes, there must be a `@builtin(position)`, and so on.

To produce primitives, the workgroup as a whole must make `numPrimitives` calls to the `setPrimitive(i: u32, primitive: P)` builtin function. This establishes `primitive` as the value of the `i`'th primitive, where `i` is less than the maximum number of output primitives in the `@primitive_output` attribute. `P` is the type given in the `@primitive_output` attribute. `P` must be a struct type, every member of which either has a `@location` or `@builtin` attribute. The following `@builtin` attributes are allowed:

- `triangle_indices`, `line_indices`, or `point_index`: The annotated member must be of type `vec3<u32>`, `vec2<u32>`, or `u32`, respectively.

    The member's components are indices (or, its value is an index) into the list of vertices generated by this workgroup, identifying the vertices of the primitive to be drawn. These indices must be less than the value of `numVertices` passed to `setMeshOutputs`.

    The type `P` must contain exactly one member with one of these attributes, determining what sort of primitives the mesh shader generates.

- `cull_primitive`: The annotated member must be of type `bool`. If it is true, then the primitive is skipped during rendering.

Every member of `P` with a `@location` attribute must either have a `@per_primitive` attribute, or be part of a struct type that appears in the primitive data as a struct member with the `@per_primitive` attribute.

The `@location` attributes of `P` and `V` must not overlap, since they are merged to produce the user-defined inputs to the fragment shader.

It is possible to write to the same vertex or primitive index repeatedly. Since the implicit arrays written by `setVertex` and `setPrimitive` are shared by the workgroup, data races on writes to the same index for a given type are undefined behavior.
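
The output rules in this section can be summarized with a CPU-side check. This is a sketch in plain Rust that covers triangle primitives only. `Triangle` and `drawable_triangles` are hypothetical names, and the function models the stated constraints rather than any actual `naga` validation code.

```rust
/// Hypothetical CPU-side model of one triangle primitive; not `wgpu` API.
struct Triangle {
    indices: [u32; 3], // models `@builtin(triangle_indices)`
    cull: bool,        // models `@builtin(cull_primitive)`
}

/// Checks the rules stated above and returns the triangles that would
/// reach the rasterizer. `nv` and `np` model `NV` and `NP`; the length of
/// `primitives` models the `numPrimitives` argument of `setMeshOutputs`.
fn drawable_triangles(
    nv: u32,
    np: u32,
    num_vertices: u32,
    primitives: &[Triangle],
) -> Result<Vec<[u32; 3]>, String> {
    let num_primitives = primitives.len() as u32;
    if num_vertices > nv || num_primitives > np {
        return Err("setMeshOutputs arguments exceed NV or NP".to_string());
    }
    let mut out = Vec::new();
    for (i, p) in primitives.iter().enumerate() {
        // Every index must refer to a vertex below `numVertices`.
        if p.indices.iter().any(|&v| v >= num_vertices) {
            return Err(format!("primitive {i} has an out-of-range index"));
        }
        // A true `cull_primitive` value skips the primitive.
        if !p.cull {
            out.push(p.indices);
        }
    }
    Ok(out)
}

fn main() {
    let tris = [
        Triangle { indices: [0, 1, 2], cull: false },
        Triangle { indices: [2, 1, 0], cull: true },
    ];
    // Declared maxima NV = 3 and NP = 2; the workgroup reports 3 vertices.
    println!("{:?}", drawable_triangles(3, 2, 3, &tris));
}
```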

### Fragment shader

Fragment shaders can access vertex output data as if it is from a vertex shader. They can also access primitive output data, provided the input is decorated with `@per_primitive`. The `@per_primitive` attribute can be applied to a value directly, such as `@per_primitive @location(1) value: vec4<f32>`, to a struct such as `@per_primitive primitive_input: PrimitiveInput` where `PrimitiveInput` is a struct containing fields decorated with `@location` and `@builtin`, or to members of a struct that are themselves decorated with `@location` or `@builtin`.

The primitive state is part of the fragment input and must match the output of the mesh shader in the pipeline. Using `@per_primitive` also requires enabling the mesh shader extension. Additionally, the locations of vertex and primitive input cannot overlap.
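
The non-overlap rule for `@location` values is a simple set intersection. The following plain Rust sketch uses a hypothetical helper, `overlapping_locations`, that takes the location numbers of the vertex and primitive types as slices. It does not parse WGSL.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: returns the `@location` numbers used by both the
/// vertex type and the primitive type, in ascending order.
fn overlapping_locations(vertex: &[u32], primitive: &[u32]) -> Vec<u32> {
    let v: BTreeSet<u32> = vertex.iter().copied().collect();
    let p: BTreeSet<u32> = primitive.iter().copied().collect();
    v.intersection(&p).copied().collect()
}

fn main() {
    // In this document's full example, `VertexOutput` uses location 0 and
    // `PrimitiveOutput` uses location 1, so there is no overlap.
    println!("{:?}", overlapping_locations(&[0], &[1]));
    // Reusing location 1 on both sides would violate the rule.
    println!("{:?}", overlapping_locations(&[0, 1], &[1, 2]));
}
```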

### Full example

The following is a full example of WGSL shaders that could be used to create a mesh shader pipeline, showing off many of the features.

```wgsl
enable mesh_shading;

const positions = array(
	vec4(0.,1.,0.,1.),
	vec4(-1.,-1.,0.,1.),
	vec4(1.,-1.,0.,1.)
);
const colors = array(
	vec4(0.,1.,0.,1.),
	vec4(0.,0.,1.,1.),
	vec4(1.,0.,0.,1.)
);
struct TaskPayload {
	colorMask: vec4<f32>,
	visible: bool,
}
var<task_payload> taskPayload: TaskPayload;
var<workgroup> workgroupData: f32;
struct VertexOutput {
	@builtin(position) position: vec4<f32>,
	@location(0) color: vec4<f32>,
}
struct PrimitiveOutput {
	@builtin(triangle_indices) index: vec3<u32>,
	@builtin(cull_primitive) cull: bool,
	@per_primitive @location(1) colorMask: vec4<f32>,
}
struct PrimitiveInput {
	@per_primitive @location(1) colorMask: vec4<f32>,
}

@task
@payload(taskPayload)
@workgroup_size(1)
fn ts_main() -> @builtin(mesh_task_size) vec3<u32> {
	workgroupData = 1.0;
	taskPayload.colorMask = vec4(1.0, 1.0, 0.0, 1.0);
	taskPayload.visible = true;
	return vec3(3, 1, 1);
}
@mesh
@payload(taskPayload)
@vertex_output(VertexOutput, 3) @primitive_output(PrimitiveOutput, 1)
@workgroup_size(1)
fn ms_main(@builtin(local_invocation_index) index: u32, @builtin(global_invocation_id) id: vec3<u32>) {
	setMeshOutputs(3, 1);
	workgroupData = 2.0;
	var v: VertexOutput;

	v.position = positions[0];
	v.color = colors[0] * taskPayload.colorMask;
	setVertex(0, v);

	v.position = positions[1];
	v.color = colors[1] * taskPayload.colorMask;
	setVertex(1, v);

	v.position = positions[2];
	v.color = colors[2] * taskPayload.colorMask;
	setVertex(2, v);

	var p: PrimitiveOutput;
	p.index = vec3<u32>(0, 1, 2);
	p.cull = !taskPayload.visible;
	p.colorMask = vec4<f32>(1.0, 0.0, 1.0, 1.0);
	setPrimitive(0, p);
}
@fragment
fn fs_main(vertex: VertexOutput, primitive: PrimitiveInput) -> @location(0) vec4<f32> {
	return vertex.color * primitive.colorMask;
}
```