wgpu/docs/api-specs/mesh_shading.md
Inner Daemons bf9f7524ec
Add mesh shading info to naga IR (#8104)
Co-authored-by: Jim Blandy <jimb@red-bean.com>
Co-authored-by: SupaMaggie70Incorporated <85136135+SupaMaggie70Incorporated@users.noreply.github.com>
2025-10-29 14:49:51 -04:00

Mesh Shader Extensions

🧪Experimental🧪

wgpu supports an experimental version of mesh shading when Features::EXPERIMENTAL_MESH_SHADER is enabled. Currently naga has no support for parsing or writing mesh shaders, so all mesh shader modules must be created with Device::create_shader_module_passthrough.

Note: The features documented here may have major bugs and are expected to be subject to breaking changes. Suggestions for the API exposed here should be posted on the mesh-shading issue.

Mesh shaders overview

What are mesh shaders?

Mesh shaders are a new kind of rasterization pipeline intended to address some of the shortfalls with the vertex shader pipeline. The core idea of mesh shaders is that the GPU decides how to render the many small parts of a scene instead of the CPU issuing a draw call for every small part or issuing an inefficient monolithic draw call for a large part of the scene.

Mesh shaders are specifically designed to be used with meshlet rendering, a technique where every object is split into many subobjects called meshlets that are each rendered with their own parameters. With the standard vertex pipeline, each draw call specifies an exact number of primitives to render and the same parameters for all vertex shader invocations on an entire object (or even multiple objects). This leaves no room for using different LODs for different parts of an object (for example, more detail for a closer part), nor does it allow culling smaller sections (or individual primitives) of objects. With mesh shaders, each task workgroup might get assigned to a single object. It can then analyze the different meshlets (sections) of that object, determine which are visible and should actually be rendered, and for those meshlets determine what LOD to use based on the distance from the camera. It can then dispatch a mesh workgroup for each meshlet. Each mesh workgroup reads the data for that specific LOD of its meshlet, determines which and how many vertices and primitives to output, determines which remaining primitives need to be culled, and passes the resulting primitives to the rasterizer.
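The per-object decision described above can be modelled on the CPU. The following Rust sketch is only an illustration of the idea, not part of wgpu: the Meshlet struct, the facing-direction visibility test, and the distance thresholds are all hypothetical choices made for this example.

```rust
/// Hypothetical meshlet summary a task workgroup might inspect.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Meshlet {
    /// Center of the meshlet in world space.
    center: [f32; 3],
    /// Rough direction the meshlet's surface faces.
    facing: [f32; 3],
}

fn sub(a: [f32; 3], b: [f32; 3]) -> [f32; 3] {
    [a[0] - b[0], a[1] - b[1], a[2] - b[2]]
}

fn dot(a: [f32; 3], b: [f32; 3]) -> f32 {
    a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
}

/// LOD 0 is the most detailed; each threshold exceeded drops one level of detail.
fn select_lod(distance: f32, lod_distances: &[f32]) -> u32 {
    lod_distances.iter().filter(|&&d| distance > d).count() as u32
}

/// Returns (meshlet index, LOD) for every meshlet that should get a mesh workgroup.
/// Meshlets facing away from the camera are skipped entirely.
fn plan_mesh_workgroups(
    camera: [f32; 3],
    meshlets: &[Meshlet],
    lod_distances: &[f32],
) -> Vec<(usize, u32)> {
    meshlets
        .iter()
        .enumerate()
        .filter_map(|(index, meshlet)| {
            let to_camera = sub(camera, meshlet.center);
            if dot(meshlet.facing, to_camera) <= 0.0 {
                return None; // facing away: no mesh workgroup is dispatched
            }
            let distance = dot(to_camera, to_camera).sqrt();
            Some((index, select_lod(distance, lod_distances)))
        })
        .collect()
}
```

In a real pipeline this logic would run inside the task shader, and the length of the returned list would become the size of the mesh shader grid.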

Mesh shaders are most effective in scenes with many polygons. They can allow skipping processing of entire groups of primitives that are facing away from the camera or otherwise occluded, which reduces the number of primitives that need to be processed by more than half in most cases, and they can reduce the number of primitives that need to be processed for more distant objects. Scenes that are not bottlenecked by geometry (perhaps instead by fragment processing or post processing) will not see much benefit from using them.

Mesh shaders were first shown off in NVIDIA's asteroids demo. Now, they form the basis for Unreal Engine's Nanite.

Mesh shader pipeline

With the current pipeline set to a mesh pipeline, a draw command like render_pass.draw_mesh_tasks(x, y, z) takes the following steps:

  • If the pipeline has a task shader stage:

    • Dispatch a grid of task shader workgroups, where x, y, and z give the number of workgroups along each axis of the grid. Each task shader workgroup produces a mesh shader workgroup grid size (mx, my, mz) and a task payload value mp.

    • For each task shader workgroup, dispatch a grid of mesh shader workgroups, where mx, my, and mz give the number of workgroups along each axis of the grid. Pass mp to each of these workgroup's mesh shader invocations.

  • Alternatively, if the pipeline does not have a task shader stage:

    • Dispatch a single grid of mesh shader workgroups, where x, y, and z give the number of workgroups along each axis of the grid. These mesh shaders receive no task payload value.

  • Each mesh shader workgroup produces a list of output vertices, and a list of primitives built from those vertices. The workgroup can supply per-primitive values as well, if needed. Each primitive selects its vertices by index, like an indexed draw call, from among the vertices generated by this workgroup.

    Unlike a grid of ordinary compute shader workgroups collaborating to build vertex and index data in common storage buffers, the vertices and primitives produced by a mesh shader workgroup are entirely private to that workgroup, and are not accessible by other workgroups.

  • Primitives produced by a mesh shader workgroup can have a culling flag. If a primitive's culling flag is true, it is skipped during rasterization.

  • The primitives produced by all mesh shader workgroups are then rasterized in the usual way, with each fragment shader invocation handling one pixel.

    Attributes from the vertices produced by the mesh shader workgroup are provided to the fragment shader with interpolation applied as appropriate.

    If the mesh shader workgroup supplied per-primitive values, these are available to each primitive's fragment shader invocations. Per-primitive values are never interpolated; fragment shaders simply receive the values the mesh shader workgroup associated with their primitive.
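The steps above can be sketched as a small CPU simulation. This Rust model is an assumption-laden illustration, not wgpu code: the names TaskOutput, Primitive, TaskShader, MeshShader, example_task and example_mesh are hypothetical, the payload is simplified to a single u32, and vertices are omitted so that only the dispatch structure and culling are shown.

```rust
/// One output primitive: indices into the workgroup's own vertices plus a cull flag.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Primitive {
    indices: [u32; 3],
    cull: bool,
}

/// What one task workgroup produces: a mesh grid size and a payload.
struct TaskOutput {
    mesh_grid: [u32; 3],
    payload: u32,
}

type TaskShader = fn([u32; 3]) -> TaskOutput;
type MeshShader = fn([u32; 3], Option<u32>) -> Vec<Primitive>;

/// Enumerates every workgroup id in a grid of the given size.
fn grid_ids(size: [u32; 3]) -> Vec<[u32; 3]> {
    let mut ids = Vec::new();
    for z in 0..size[2] {
        for y in 0..size[1] {
            for x in 0..size[0] {
                ids.push([x, y, z]);
            }
        }
    }
    ids
}

/// Models a mesh draw: returns the primitives that would reach the rasterizer.
fn draw_mesh_tasks(grid: [u32; 3], task: Option<TaskShader>, mesh: MeshShader) -> Vec<Primitive> {
    let mut rasterized = Vec::new();
    let mut run_mesh_grid = |mesh_grid: [u32; 3], payload: Option<u32>| {
        for id in grid_ids(mesh_grid) {
            // Primitives whose cull flag is true are skipped.
            rasterized.extend(mesh(id, payload).into_iter().filter(|p| !p.cull));
        }
    };
    match task {
        // With a task stage, each task workgroup dispatches its own mesh grid.
        Some(task) => {
            for task_id in grid_ids(grid) {
                let out = task(task_id);
                run_mesh_grid(out.mesh_grid, Some(out.payload));
            }
        }
        // Without one, the draw size is the mesh grid size and there is no payload.
        None => run_mesh_grid(grid, None),
    }
    rasterized
}

/// Example task stage: two mesh workgroups each, payload is the task workgroup's x id.
fn example_task(workgroup_id: [u32; 3]) -> TaskOutput {
    TaskOutput { mesh_grid: [2, 1, 1], payload: workgroup_id[0] }
}

/// Example mesh stage: one triangle per workgroup, culled when the payload is odd.
fn example_mesh(_workgroup_id: [u32; 3], payload: Option<u32>) -> Vec<Primitive> {
    let cull = payload.map_or(false, |p| p % 2 == 1);
    vec![Primitive { indices: [0, 1, 2], cull }]
}
```

Calling draw_mesh_tasks with a grid of [3, 1, 1] and both example stages runs three task workgroups and six mesh workgroups, and the two mesh workgroups that receive the odd payload contribute nothing to rasterization.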

wgpu API

New wgpu functions

Device::create_mesh_pipeline - Creates a mesh shader pipeline. This is very similar to creating a standard render pipeline, except that it takes a mesh shader state and optional task shader state instead of a vertex state. If the task state is omitted, during rendering the number of workgroups is passed directly from the draw call to the mesh shader state, with an empty payload.

RenderPass::draw_mesh_tasks - Dispatches the mesh shader pipeline. This ignores render pipeline specific information, such as vertex buffer bindings and index buffer bindings. The dispatch size must adhere to the limits described below.

RenderPass::draw_mesh_tasks_indirect, RenderPass::multi_draw_mesh_tasks_indirect and RenderPass::multi_draw_mesh_tasks_indirect_count - Dispatch the mesh shader pipeline with the dispatch size taken from a buffer. Like draw_mesh_tasks, these ignore render pipeline specific information, such as vertex buffer bindings and index buffer bindings, and the dispatch size must adhere to the limits described below. They are analogous to draw_indirect, multi_draw_indirect and multi_draw_indirect_count, and each requires the corresponding indirect feature to be enabled.
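As a rough illustration of what an indirect buffer for these calls might contain, the Rust sketch below packs workgroup counts into bytes. The layout is an assumption, since this document does not specify it: each draw is taken to be three consecutive u32 workgroup counts (x, y, z), as in Vulkan's VkDrawMeshTasksIndirectCommandEXT, with multiple draws tightly packed. The helper name mesh_tasks_indirect_bytes is hypothetical.

```rust
/// Hypothetical helper: serializes one (x, y, z) workgroup count triple per draw.
/// Assumed layout: three native-endian u32 values per draw, tightly packed.
fn mesh_tasks_indirect_bytes(draws: &[[u32; 3]]) -> Vec<u8> {
    draws
        .iter()
        .flatten() // x, y, z of each draw in order
        .flat_map(|count| count.to_ne_bytes())
        .collect()
}
```

The resulting bytes would then be uploaded to a buffer created with indirect usage before recording the draw.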

An example of using mesh shaders to render a single triangle can be seen here.

Features

  • Using mesh shaders requires enabling Features::EXPERIMENTAL_MESH_SHADER.
  • Using mesh shaders with multiview requires enabling Features::EXPERIMENTAL_MESH_SHADER_MULTIVIEW.
  • Currently, only triangle rendering is tested.
  • Line rendering is supported but untested.
  • Point rendering is supported on Vulkan. It is impossible on DirectX. Metal support hasn't been checked.
  • Queries are unsupported.

Limits

Note: More limits will be added when support is added to naga.

  • Limits::max_task_workgroup_total_count - the maximum total number of workgroups from a draw_mesh_tasks command or similar. The product of the three dimensions passed must be less than or equal to this limit.
  • Limits::max_task_workgroups_per_dimension - the maximum for each of the 3 workgroup dimensions in a draw_mesh_tasks command. Each dimension passed must be less than or equal to this limit.
  • Limits::max_mesh_multiview_count - the maximum number of views used when multiview rendering with a mesh shader pipeline.
  • Limits::max_mesh_output_layers - the maximum number of output layers for a mesh shader pipeline.
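The two dispatch-size limits can be checked before issuing a draw. The following Rust sketch shows one way to do so. It is not wgpu's validation code: MeshLimits, DispatchError and check_dispatch are hypothetical names, and only the two field names are taken from the limits listed above.

```rust
/// Hypothetical subset of the limits that apply to a mesh draw's dispatch size.
struct MeshLimits {
    max_task_workgroup_total_count: u32,
    max_task_workgroups_per_dimension: u32,
}

#[derive(Debug, PartialEq)]
enum DispatchError {
    DimensionTooLarge { axis: usize, value: u32, limit: u32 },
    TotalTooLarge { total: u128, limit: u32 },
}

/// Checks an (x, y, z) dispatch size against both limits.
fn check_dispatch(size: [u32; 3], limits: &MeshLimits) -> Result<(), DispatchError> {
    // Each dimension must individually fit the per-dimension limit.
    for (axis, &value) in size.iter().enumerate() {
        let limit = limits.max_task_workgroups_per_dimension;
        if value > limit {
            return Err(DispatchError::DimensionTooLarge { axis, value, limit });
        }
    }
    // The product of the dimensions must fit the total limit.
    // u128 cannot overflow for three u32 factors.
    let total: u128 = size.iter().map(|&v| u128::from(v)).product();
    let limit = limits.max_task_workgroup_total_count;
    if total > u128::from(limit) {
        return Err(DispatchError::TotalTooLarge { total, limit });
    }
    Ok(())
}
```

Note that a dispatch can satisfy the per-dimension limit on every axis and still exceed the total count, which is why both checks are needed.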

Backend specific information

  • Only Vulkan is currently supported.
  • DirectX 12 doesn't support point rendering.
  • DirectX 12 support is planned.
  • Metal support is desired but not currently planned.

Naga implementation

Supported frontends

  • 🛠️ WGSL
  • SPIR-V
  • 🚫 GLSL

Supported backends

  • 🛠️ SPIR-V
  • HLSL
  • MSL
  • 🚫 GLSL
  • 🚫 WGSL

✔️ = Complete, 🛠️ = In progress, no marker = Planned, 🚫 = Unplanned/impossible

WGSL extension specification

The majority of changes relating to mesh shaders will be in WGSL and naga.

Using any of these features in a WGSL program will require adding the enable mesh_shading directive to the top of the program.

Two new shader stages will be added to WGSL. Fragment shaders are also modified slightly. Both task shaders and mesh shaders are allowed to use any compute-specific functionality, such as subgroup operations.

Task shader

A function with the @task attribute is a task shader entry point. A mesh shader pipeline may optionally specify a task shader entry point, and if it does, mesh draw commands using that pipeline dispatch a task shader grid of workgroups running the task shader entry point. Like compute shader dispatches, the three-component size passed to draw_mesh_tasks, or drawn from the indirect buffer for its indirect variants, specifies the size of the task shader grid as the number of workgroups along each of the grid's three axes.

A task shader entry point must have a @workgroup_size attribute, meeting the same requirements as one appearing on a compute shader entry point.

A task shader entry point must also have a @payload(G) attribute, where G is the name of a global variable in the task_payload address space. Each task shader workgroup has its own instance of this variable, visible to all invocations in the workgroup. Whatever value the workgroup collectively stores in that global variable becomes the task payload, and is provided to all invocations in the mesh shader grid dispatched for the workgroup.

A task shader entry point must return a vec3<u32> value. The return value of each workgroup's first invocation (that is, the one whose local_invocation_index is 0) is taken as the size of a mesh shader grid to dispatch, measured in workgroups. (If the task shader entry point returns vec3(0, 0, 0), then no mesh shaders are dispatched.) Mesh shader grids are described in the next section.

Each task shader workgroup dispatches an independent mesh shader grid: in mesh shader invocations, @builtin values like workgroup_id and global_invocation_id describe the position of the workgroup and invocation within that grid; and @builtin(num_workgroups) matches the task shader workgroup's return value. Mesh shaders dispatched for other task shader workgroups are not included in the count. If it is necessary for a mesh shader to know which task shader workgroup dispatched it, the task shader can include its own workgroup id in the task payload.
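Because each mesh shader grid is numbered independently, the position builtins can be computed from the workgroup's id within its own grid alone. The Rust sketch below assumes the usual WGSL compute-shader definitions of global_invocation_id and local_invocation_index carry over unchanged to mesh shaders, as the paragraph above suggests. The function names simply mirror the builtins.

```rust
/// global_invocation_id = workgroup_id * workgroup_size + local_invocation_id,
/// where workgroup_id is relative to the grid dispatched by one task workgroup.
fn global_invocation_id(
    workgroup_id: [u32; 3],
    workgroup_size: [u32; 3],
    local_invocation_id: [u32; 3],
) -> [u32; 3] {
    [0usize, 1, 2].map(|i| workgroup_id[i] * workgroup_size[i] + local_invocation_id[i])
}

/// Linearized index of an invocation within its workgroup (x varies fastest).
fn local_invocation_index(workgroup_size: [u32; 3], local_invocation_id: [u32; 3]) -> u32 {
    local_invocation_id[0]
        + local_invocation_id[1] * workgroup_size[0]
        + local_invocation_id[2] * workgroup_size[0] * workgroup_size[1]
}
```

Two mesh workgroups dispatched by different task workgroups can therefore report identical ids, which is why the task workgroup id has to travel in the payload if the mesh shader needs it.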

Mesh shader

A function with the @mesh attribute is a mesh shader entry point. Mesh shaders must not return anything.

Like compute shaders, mesh shaders are invoked in a grid of workgroups, called a mesh shader grid. If the mesh shader pipeline has a task shader, then each task shader workgroup determines the size of a mesh shader grid to be dispatched, as described above. Otherwise, the three-component size passed to draw_mesh_tasks, or drawn from the indirect buffer for its indirect variants, specifies the size of the mesh shader grid directly, as the number of workgroups along each of the grid's three axes.

If the mesh shader pipeline has a task shader entry point, then the pipeline's mesh shader entry point must also have a @payload(G) attribute, naming the same variable, and the sizes must match. Mesh shader invocations can read, but not write, this variable, which is initialized to whatever value was written to it by the task shader workgroup that dispatched this mesh shader grid.

If the mesh shader pipeline does not have a task shader entry point, then the mesh shader entry point must not have any @payload attribute.

A mesh shader entry point must have the following attributes:

  • @workgroup_size: this has the same meaning as when it appears on a compute shader entry point.

  • @vertex_output(V, NV): This indicates that the mesh shader workgroup will generate at most NV vertex values, each of type V.

  • @primitive_output(P, NP): This indicates that the mesh shader workgroup will generate at most NP primitives, each of type P.

Each mesh shader entry point invocation must call the setMeshOutputs(numVertices: u32, numPrimitives: u32) builtin function at least once. The values passed by each workgroup's first invocation (that is, the one whose local_invocation_index is 0) determine how many vertices (values of type V) and primitives (values of type P) the workgroup must produce. Invocations can still write vertices or primitives at indices at or beyond these counts (up to the declared maximums), but those entries are not used in the output.

The numVertices and numPrimitives arguments must be no greater than NV and NP from the @vertex_output and @primitive_output attributes.

To produce vertex data, the workgroup as a whole must make numVertices calls to the setVertex(i: u32, vertex: V) builtin function. This establishes vertex as the value of the i'th vertex, where i is less than the maximum number of output vertices in the @vertex_output attribute. V is the type given in the @vertex_output attribute. V must meet the same requirements as a struct type returned by a @vertex entry point: all members must have either @builtin or @location attributes, there must be a @builtin(position), and so on.

To produce primitives, the workgroup as a whole must make numPrimitives calls to the setPrimitive(i: u32, primitive: P) builtin function. This establishes primitive as the value of the i'th primitive, where i is less than the maximum number of output primitives in the @primitive_output attribute. P is the type given in the @primitive_output attribute. P must be a struct type, every member of which either has a @location or @builtin attribute. The following @builtin attributes are allowed:

  • triangle_indices, line_indices, or point_index: The annotated member must be of type vec3<u32>, vec2<u32>, or u32, respectively.

    The member's components are indices (or, its value is an index) into the list of vertices generated by this workgroup, identifying the vertices of the primitive to be drawn. These indices must be less than the value of numVertices passed to setMeshOutputs.

    The type P must contain exactly one member with one of these attributes, determining what sort of primitives the mesh shader generates.

  • cull_primitive: The annotated member must be of type bool. If it is true, then the primitive is skipped during rendering.
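The constraints on counts, indices, and culling can be summarized with a small CPU-side check. This Rust sketch is illustrative only and covers triangles alone. Triangle, MeshOutputError and rasterized_triangles are hypothetical names, and treating violations as errors is a modelling choice here, not a statement about how wgpu or naga report them.

```rust
/// One output triangle: its triangle_indices value and its cull_primitive flag.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Triangle {
    indices: [u32; 3],
    cull: bool,
}

#[derive(Debug, PartialEq)]
enum MeshOutputError {
    TooManyVertices { requested: u32, max: u32 },
    TooManyPrimitives { requested: u32, max: u32 },
    IndexOutOfRange { primitive: usize, index: u32 },
}

/// Validates one workgroup's output against NV and NP, then drops culled triangles.
/// `max_vertices` and `max_primitives` stand for NV and NP from the attributes.
fn rasterized_triangles(
    max_vertices: u32,
    max_primitives: u32,
    num_vertices: u32,
    primitives: &[Triangle],
) -> Result<Vec<[u32; 3]>, MeshOutputError> {
    if num_vertices > max_vertices {
        return Err(MeshOutputError::TooManyVertices { requested: num_vertices, max: max_vertices });
    }
    let num_primitives = primitives.len() as u32;
    if num_primitives > max_primitives {
        return Err(MeshOutputError::TooManyPrimitives { requested: num_primitives, max: max_primitives });
    }
    let mut visible = Vec::new();
    for (primitive, triangle) in primitives.iter().enumerate() {
        // Every index must refer to a vertex below numVertices, culled or not.
        if let Some(&index) = triangle.indices.iter().find(|&&i| i >= num_vertices) {
            return Err(MeshOutputError::IndexOutOfRange { primitive, index });
        }
        if !triangle.cull {
            visible.push(triangle.indices);
        }
    }
    Ok(visible)
}
```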

Every member of P with a @location attribute must either have a @per_primitive attribute, or be part of a struct type that appears in the primitive data as a struct member with the @per_primitive attribute.

The @location attributes of P and V must not overlap, since they are merged to produce the user-defined inputs to the fragment shader.
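A check for this rule could look like the following Rust sketch, which takes the @location numbers used by V and by P and reports any that appear in both. The function name overlapping_locations is hypothetical, and the sketch does not reflect how naga actually validates this.

```rust
use std::collections::BTreeSet;

/// Returns the @location values used by both the vertex type V and the
/// primitive type P, in ascending order. An empty result means no overlap.
fn overlapping_locations(vertex_locations: &[u32], primitive_locations: &[u32]) -> Vec<u32> {
    let vertex: BTreeSet<u32> = vertex_locations.iter().copied().collect();
    let primitive: BTreeSet<u32> = primitive_locations.iter().copied().collect();
    vertex.intersection(&primitive).copied().collect()
}
```

For example, a V that uses location 0 and a P that uses location 1 pass the check, since the two sets are disjoint.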

It is possible to write to the same vertex or primitive index repeatedly. Since the implicit arrays written by setVertex and setPrimitive are shared by the workgroup, data races on writes to the same index for a given type are undefined behavior.

Fragment shader

Fragment shaders can access vertex output data as if it is from a vertex shader. They can also access primitive output data, provided the input is decorated with @per_primitive. The @per_primitive attribute can be applied to a value directly, such as @per_primitive @location(1) value: vec4<f32>, to a struct such as @per_primitive primitive_input: PrimitiveInput where PrimitiveInput is a struct containing fields decorated with @location and @builtin, or to members of a struct that are themselves decorated with @location or @builtin.

The primitive state is part of the fragment input and must match the output of the mesh shader in the pipeline. Using @per_primitive also requires enabling the mesh shader extension. Additionally, the locations of vertex and primitive input cannot overlap.

Full example

The following is a full example of WGSL shaders that could be used to create a mesh shader pipeline, showing off many of the features.

enable mesh_shading;

const positions = array(
	vec4(0.,1.,0.,1.),
	vec4(-1.,-1.,0.,1.),
	vec4(1.,-1.,0.,1.)
);
const colors = array(
	vec4(0.,1.,0.,1.),
	vec4(0.,0.,1.,1.),
	vec4(1.,0.,0.,1.)
);
struct TaskPayload {
	colorMask: vec4<f32>,
	visible: bool,
}
var<task_payload> taskPayload: TaskPayload;
var<workgroup> workgroupData: f32;
struct VertexOutput {
	@builtin(position) position: vec4<f32>,
	@location(0) color: vec4<f32>,
}
struct PrimitiveOutput {
	@builtin(triangle_indices) index: vec3<u32>,
	@builtin(cull_primitive) cull: bool,
	@per_primitive @location(1) colorMask: vec4<f32>,
}
struct PrimitiveInput {
	@per_primitive @location(1) colorMask: vec4<f32>,
}

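@task
@payload(taskPayload)
@workgroup_size(1)
fn ts_main() -> @builtin(mesh_task_size) vec3<u32> {
	workgroupData = 1.0;
	// Whatever is stored in taskPayload is provided to every mesh shader invocation.
	taskPayload.colorMask = vec4(1.0, 1.0, 0.0, 1.0);
	taskPayload.visible = true;
	// Dispatch a 3x1x1 grid of mesh shader workgroups.
	return vec3(3, 1, 1);
}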
@mesh
@payload(taskPayload)
@vertex_output(VertexOutput, 3) @primitive_output(PrimitiveOutput, 1)
@workgroup_size(1)
fn ms_main(@builtin(local_invocation_index) index: u32, @builtin(global_invocation_id) id: vec3<u32>) {
	// This workgroup outputs 3 vertices and 1 primitive.
	setMeshOutputs(3, 1);
	workgroupData = 2.0;
	var v: VertexOutput;

	v.position = positions[0];
	v.color = colors[0] * taskPayload.colorMask;
	setVertex(0, v);

	v.position = positions[1];
	v.color = colors[1] * taskPayload.colorMask;
	setVertex(1, v);

	v.position = positions[2];
	v.color = colors[2] * taskPayload.colorMask;
	setVertex(2, v);

	var p: PrimitiveOutput;
	p.index = vec3<u32>(0, 1, 2);
	p.cull = !taskPayload.visible;
	p.colorMask = vec4<f32>(1.0, 0.0, 1.0, 1.0);
	setPrimitive(0, p);
}
@fragment
fn fs_main(vertex: VertexOutput, primitive: PrimitiveInput) -> @location(0) vec4<f32> {
	return vertex.color * primitive.colorMask;
}