Mojo is a Python-family language built for high-performance computing. It allows you to write custom algorithms for GPUs without the use of CUDA or other vendor-specific libraries. The examples in this directory show a few different ways you can compile and dispatch Mojo functions on a GPU.
These examples complement the other examples in this repository that show how to write custom graph ops that run on both CPUs and GPUs.
> [!IMPORTANT]
> These examples require a compatible GPU.
The examples include the following:
- `vector_addition.mojo`: A common "hello world" example for GPU programming, this adds two vectors together in the same way as seen in Chapter 2 of *Programming Massively Parallel Processors*.
- `grayscale.mojo`: The parallelized conversion of an RGB image to grayscale, as seen in Chapter 3 of *Programming Massively Parallel Processors*.
- `naive_matrix_multiplication.mojo`: An implementation of naive matrix multiplication, again inspired by Chapter 3 of *Programming Massively Parallel Processors*.
- `mandelbrot.mojo`: A parallel calculation of the number of iterations to escape in the Mandelbrot set. An example of the same computation performed as a custom graph operation can be found here.
- `reduction.mojo`: A highly performant reduction kernel. For a detailed explanation, see this blog post.
- Make sure your system includes a compatible GPU.

- If you don't have pixi, install it:

  ```sh
  curl -fsSL https://pixi.sh/install.sh | sh
  ```

- Clone this repo:

  ```sh
  git clone https://github.com/modular/modular.git
  ```

- Navigate to these examples:

  ```sh
  cd modular/examples/mojo/gpu-functions
  ```
You can run each of the Mojo files using `mojo` via the `pixi run` command. For example:

```sh
pixi run mojo vector_addition.mojo
```

Or enter the Pixi shell and run each directly:

```sh
pixi shell
mojo vector_addition.mojo
mojo grayscale.mojo
mojo naive_matrix_multiplication.mojo
mojo mandelbrot.mojo
mojo reduction.mojo
```

Writing individual thread-based functions in Mojo is powered by the `gpu`
module, which handles all the hardware-specific details of allocating and
transferring memory between host and accelerator, as well as compilation and
execution of accelerator-targeted functions.
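For instance, here is a minimal sketch of that workflow (the function name is illustrative, and this assumes a recent Mojo version where `print` is supported inside GPU functions):

```mojo
from gpu import block_idx, thread_idx
from gpu.host import DeviceContext


fn print_thread_ids():
    # Runs once per GPU thread, reporting that thread's coordinates.
    print("block:", block_idx.x, "thread:", thread_idx.x)


def main():
    ctx = DeviceContext()
    # Dispatch 2 blocks of 4 threads each: 8 threads run in total.
    ctx.enqueue_function[print_thread_ids](grid_dim=2, block_dim=4)
    ctx.synchronize()
```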
The first three examples that follow are common starting points for thread-based GPU programming. They follow the first three examples in the popular GPU programming textbook *Programming Massively Parallel Processors*:
- Parallel addition of two vectors
- Conversion of a red-green-blue image to grayscale
- Naive matrix multiplication, with no hardware-specific optimization
The following sections explain some of these examples a bit further.
> [!CAUTION]
> The following code in this README could be out of date. Read the corresponding source files for the latest example code.
The common "hello world" example used for data-parallel programming is the
addition of each element in two vectors. Here's how it works in our
vector_addition.mojo example:
- Define the vector addition function.

  The function itself is very simple: it runs once per thread, adds the elements of the two input vectors that correspond to that thread ID, and stores the result at the matching location in the output vector.

  ```mojo
  fn vector_addition(
      lhs_tensor: LayoutTensor[mut=True, float_dtype, layout],
      rhs_tensor: LayoutTensor[mut=True, float_dtype, layout],
      out_tensor: LayoutTensor[mut=True, float_dtype, layout],
  ):
      tid = thread_idx.x
      out_tensor[tid] = lhs_tensor[tid] + rhs_tensor[tid]
  ```
- Obtain a reference to the accelerator (GPU) context.

  ```mojo
  ctx = DeviceContext()
  ```
- Allocate input and output vectors.

  Buffers for the left-hand-side and right-hand-side vectors need to be allocated on the GPU and initialized with values.

  ```mojo
  alias float_dtype = DType.float32
  alias VECTOR_WIDTH = 10

  lhs_buffer = ctx.enqueue_create_buffer[float_dtype](VECTOR_WIDTH)
  rhs_buffer = ctx.enqueue_create_buffer[float_dtype](VECTOR_WIDTH)

  _ = lhs_buffer.enqueue_fill(1.25)
  _ = rhs_buffer.enqueue_fill(2.5)

  # Wrap the device buffers in LayoutTensors so the GPU function can index them.
  lhs_tensor = LayoutTensor[float_dtype, layout](lhs_buffer)
  rhs_tensor = LayoutTensor[float_dtype, layout](rhs_buffer)
  ```

  A buffer to hold the result of the calculation is allocated on the GPU and likewise wrapped in a LayoutTensor:

  ```mojo
  out_buffer = ctx.enqueue_create_buffer[float_dtype](VECTOR_WIDTH)
  out_tensor = LayoutTensor[float_dtype, layout](out_buffer)
  ```
- Compile and dispatch the function.

  The `vector_addition()` function we want to run on the GPU is compiled and dispatched across a grid, divided into blocks of threads. All arguments to this GPU function are provided here, in an order that corresponds to their location in the function signature. Note that in Mojo, the GPU function is compiled for the GPU when the Mojo file containing it is compiled.

  ```mojo
  ctx.enqueue_function[vector_addition](
      lhs_tensor,
      rhs_tensor,
      out_tensor,
      grid_dim=1,
      block_dim=VECTOR_WIDTH,
  )
  ```
- Return the results.

  Finally, the results of the calculation are moved from the GPU back to the host to be examined:

  ```mojo
  with out_buffer.map_to_host() as host_buffer:
      host_tensor = LayoutTensor[float_dtype, layout](host_buffer)
      print("Resulting vector:", host_tensor)
  ```
To try out this example yourself, run it using the following command:

```sh
pixi run mojo vector_addition.mojo
```

For this initial example, the output you see should be a vector where all the
elements are 3.75. Experiment with changing the vector length, the block size,
and other parameters to see how the calculation scales.
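If you grow `VECTOR_WIDTH` well beyond the size of a single block, one block of threads no longer covers the whole vector. A common pattern for scaling up, shown here as a sketch that reuses the aliases from the example above (the function name and `BLOCK_SIZE` value are illustrative), is to dispatch multiple blocks and guard against threads that fall past the end of the vector:

```mojo
from math import ceildiv
from gpu import global_idx

alias BLOCK_SIZE = 256

fn vector_addition_multi_block(
    lhs_tensor: LayoutTensor[mut=True, float_dtype, layout],
    rhs_tensor: LayoutTensor[mut=True, float_dtype, layout],
    out_tensor: LayoutTensor[mut=True, float_dtype, layout],
):
    # global_idx combines the block and thread indices into one flat index.
    i = global_idx.x
    # The last block may run past the end of the vector, so guard the access.
    if i < VECTOR_WIDTH:
        out_tensor[i] = lhs_tensor[i] + rhs_tensor[i]

# Dispatched with enough blocks to cover every element:
# ctx.enqueue_function[vector_addition_multi_block](
#     lhs_tensor,
#     rhs_tensor,
#     out_tensor,
#     grid_dim=ceildiv(VECTOR_WIDTH, BLOCK_SIZE),
#     block_dim=BLOCK_SIZE,
# )
```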
The `grayscale.mojo` example shows how to convert a red-green-blue (RGB) color
image into grayscale. This uses a rank-3 tensor to host the 2-D image and the
color channels at each pixel. The inputs start with three color channels, and
the output has only a single grayscale channel.

The calculation performed is a common reduction to luminance using weighted values for the three channels:

```
gray = 0.21 * red + 0.71 * green + 0.07 * blue
```

And here is the per-thread function to perform this on the GPU:
```mojo
fn color_to_grayscale(
    rgb_tensor: LayoutTensor[mut=True, int_dtype, rgb_layout],
    gray_tensor: LayoutTensor[mut=True, int_dtype, gray_layout],
):
    row = global_idx.y
    col = global_idx.x
    if col < WIDTH and row < HEIGHT:
        red = rgb_tensor[row, col, 0].cast[float_dtype]()
        green = rgb_tensor[row, col, 1].cast[float_dtype]()
        blue = rgb_tensor[row, col, 2].cast[float_dtype]()
        gray = 0.21 * red + 0.71 * green + 0.07 * blue
        gray_tensor[row, col, 0] = gray.cast[int_dtype]()
```

The setup, compilation, and execution of this function are much the same as in the previous example, but in this case we're using rank-3 instead of rank-1 buffers to hold the values. Also, we dispatch the function over a 2-D grid of blocks, which looks like the following:
```mojo
alias BLOCK_SIZE = 16
num_col_blocks = ceildiv(WIDTH, BLOCK_SIZE)
num_row_blocks = ceildiv(HEIGHT, BLOCK_SIZE)
ctx.enqueue_function[color_to_grayscale](
    rgb_tensor,
    gray_tensor,
    grid_dim=(num_col_blocks, num_row_blocks),
    block_dim=(BLOCK_SIZE, BLOCK_SIZE),
)
```

To run this example, use this command:
```sh
pixi run mojo grayscale.mojo
```

This will show a grid of numbers representing the grayscale values for a single color broadcast across a simple input image. Try changing the image and block sizes to see how this scales on the GPU.
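One detail worth noting in the dispatch above: because `ceildiv` rounds up, the grid usually covers slightly more threads than there are pixels, which is exactly why `color_to_grayscale` guards its work with `if col < WIDTH and row < HEIGHT`. A standalone sketch of the arithmetic (the values here are illustrative, not the ones used in `grayscale.mojo`):

```mojo
from math import ceildiv


def main():
    alias WIDTH = 100       # Illustrative image width.
    alias BLOCK_SIZE = 16   # Threads per block along one dimension.
    num_col_blocks = ceildiv(WIDTH, BLOCK_SIZE)  # ceildiv(100, 16) == 7
    total_threads = num_col_blocks * BLOCK_SIZE  # 7 * 16 == 112
    # 112 threads cover 100 columns; the extra 12 threads fail the
    # bounds check in the GPU function and simply do nothing.
    print(num_col_blocks, total_threads)
```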
The `naive_matrix_multiplication.mojo` example performs a very basic matrix
multiplication, with no optimizations to take advantage of hardware resources.
The GPU function for this looks like the following:
```mojo
fn naive_matrix_multiplication(
    m: LayoutTensor[mut=True, float_dtype, m_layout],
    n: LayoutTensor[mut=True, float_dtype, n_layout],
    p: LayoutTensor[mut=True, float_dtype, p_layout],
):
    row = global_idx.y
    col = global_idx.x
    m_dim = p.dim(0)
    n_dim = p.dim(1)
    k_dim = m.dim(1)
    if row < m_dim and col < n_dim:
        for j_index in range(k_dim):
            p[row, col] = p[row, col] + m[row, j_index] * n[j_index, col]
```

The overall setup and execution of this function are extremely similar to the previous example, with the primary change being the function that is run on the GPU.
To try out this example, run this command:

```sh
pixi run mojo naive_matrix_multiplication.mojo
```

You will see the two input matrices printed to the console, as well as the result of their multiplication. As with the previous examples, try changing the sizes of the matrices and how they are dispatched on the GPU.
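To convince yourself that the GPU result is correct, one option is to compute the same product on the host and compare. Here's a sketch of such a reference (this helper is hypothetical, not part of the example; it stores matrices as flat row-major lists):

```mojo
fn matmul_reference(
    m: List[Float32],
    n: List[Float32],
    M: Int,
    N: Int,
    K: Int,
) -> List[Float32]:
    # m is M x K, n is K x N, and the result p is M x N, all row-major.
    var p = List[Float32]()
    for _ in range(M * N):
        p.append(0)
    for row in range(M):
        for col in range(N):
            for k in range(K):
                p[row * N + col] += m[row * K + k] * n[k * N + col]
    return p
```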
The `mandelbrot.mojo` example shows a slightly more complex calculation:
the Mandelbrot set fractal.

This GPU function takes no input tensors; the region and resolution are
provided as compile-time constants, and it fills a 2-D matrix of integer
values representing the number of iterations it took to escape at that
location in complex number space.
The per-thread GPU function for this is as follows:
```mojo
fn mandelbrot(
    tensor: LayoutTensor[mut=True, int_dtype, layout],
):
    row = global_idx.y
    col = global_idx.x
    alias SCALE_X = (MAX_X - MIN_X) / GRID_WIDTH
    alias SCALE_Y = (MAX_Y - MIN_Y) / GRID_HEIGHT
    cx = MIN_X + col * SCALE_X
    cy = MIN_Y + row * SCALE_Y
    c = ComplexSIMD[float_dtype, 1](cx, cy)
    z = ComplexSIMD[float_dtype, 1](0, 0)
    iters = Scalar[int_dtype](0)
    var in_set_mask: Scalar[DType.bool] = True
    for _ in range(MAX_ITERATIONS):
        if not any(in_set_mask):
            break
        in_set_mask = z.squared_norm() <= 4
        iters = in_set_mask.select(iters + 1, iters)
        z = z.squared_add(c)
    tensor[row, col] = iters
```

This begins by calculating the complex number `C` that represents a given
location in the output grid. Then, starting from `Z = 0`, the calculation
`Z = Z^2 + C` is iterated until the squared magnitude of `Z` exceeds 4, the
threshold we're using to decide that `Z` will escape the set. The loop runs
up to a maximum number of iterations, and the number of iterations to escape
(or the maximum, if it never escapes) is stored for each location in the grid.
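For reference, here is the same escape-time loop written as a plain scalar function (an illustrative sketch only; the actual example uses the masked SIMD form shown above):

```mojo
fn escape_iterations(cx: Float32, cy: Float32, max_iterations: Int) -> Int:
    # Iterate z = z^2 + c, counting iterations until |z|^2 exceeds 4.
    var zx: Float32 = 0
    var zy: Float32 = 0
    for i in range(max_iterations):
        if zx * zx + zy * zy > 4:
            return i
        # (zx + zy*i)^2 + c = (zx^2 - zy^2 + cx) + (2*zx*zy + cy)*i
        next_zx = zx * zx - zy * zy + cx
        zy = 2 * zx * zy + cy
        zx = next_zx
    return max_iterations
```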
The area to examine in complex space, the resolution of the grid, and the maximum number of iterations are all provided as constants:
```mojo
alias MIN_X: Scalar[float_dtype] = -2.0
alias MAX_X: Scalar[float_dtype] = 0.7
alias MIN_Y: Scalar[float_dtype] = -1.12
alias MAX_Y: Scalar[float_dtype] = 1.12
alias SCALE_X = (MAX_X - MIN_X) / GRID_WIDTH
alias SCALE_Y = (MAX_Y - MIN_Y) / GRID_HEIGHT
alias MAX_ITERATIONS = 100
```

Try it with this command:
```sh
pixi run mojo mandelbrot.mojo
```

The result should be an ASCII art depiction of the region covered by the calculation:
```
...................................,,,,c@8cc,,,.............
...............................,,,,,,cc8M @Mjc,,,,..........
............................,,,,,,,ccccM@aQaM8c,,,,,........
..........................,,,,,,,ccc88g.o. Owg8ccc,,,,......
.......................,,,,,,,,c8888M@j, ,wMM8cccc,,.....
.....................,,,,,,cccMQOPjjPrgg, OrwrwMMMjjc,....
..................,,,,cccccc88MaP @ ,pGa.g8c,...
...............,,cccccccc888MjQp. o@8cc,..
..........,,,,c8jjMMMMMMMMM@@w. aj8c,,.
.....,,,,,,ccc88@QEJwr.wPjjjwG w8c,,.
..,,,,,,,cccccMMjwQ EpQ .8c,,,
.,,,,,,cc888MrajwJ MMcc,,,
.cc88jMMM@@jaG. oM8cc,,,
.cc88jMMM@@jaG. oM8cc,,,
.,,,,,,cc888MrajwJ MMcc,,,
..,,,,,,,cccccMMjwQ EpQ .8c,,,
.....,,,,,,ccc88@QEJwr.wPjjjwG w8c,,.
..........,,,,c8jjMMMMMMMMM@@w. aj8c,,.
...............,,cccccccc888MjQp. o@8cc,..
..................,,,,cccccc88MaP @ ,pGa.g8c,...
.....................,,,,,,cccMQOEjjPrgg, OrwrwMMMjjc,....
.......................,,,,,,,,c8888M@j, ,wMM8cccc,,.....
..........................,,,,,,,ccc88g.o. Owg8ccc,,,,......
............................,,,,,,,ccccM@aQaM8c,,,,,........
...............................,,,,,,cc8M @Mjc,,,,..........
```
Try changing the various parameters above to produce different resolution grids, or explore different areas of the complex number space.
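For example, zooming in on a smaller window of the complex plane only requires changing the constants (the values below are illustrative, not from the example):

```mojo
# Hypothetical zoomed-in window; a higher iteration cap reveals more
# detail near the boundary of the set.
alias MIN_X: Scalar[float_dtype] = -0.80
alias MAX_X: Scalar[float_dtype] = -0.70
alias MIN_Y: Scalar[float_dtype] = 0.05
alias MAX_Y: Scalar[float_dtype] = 0.15
alias MAX_ITERATIONS = 500
```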
- Learn GPU programming by solving increasingly challenging GPU puzzles.