Mojo is a Python-family language built for high-performance computing. It allows you to write custom algorithms for GPUs without the use of CUDA or other vendor-specific libraries. The examples in this directory show a few different ways you can compile and dispatch Mojo functions on a GPU.
These examples complement the other examples in this repository that show how to write custom graph ops that run on both CPUs and GPUs.
> [!IMPORTANT]
> These examples require a compatible GPU.
The examples include the following:
- `vector_addition.mojo`: A common "hello world" example for GPU programming, this adds two vectors together in the same way as seen in Chapter 2 of *Programming Massively Parallel Processors*.
- `grayscale.mojo`: The parallelized conversion of an RGB image to grayscale, as seen in Chapter 3 of *Programming Massively Parallel Processors*.
- `naive_matrix_multiplication.mojo`: An implementation of naive matrix multiplication, again inspired by Chapter 3 of *Programming Massively Parallel Processors*.
- `mandelbrot.mojo`: A parallel calculation of the number of iterations to escape in the Mandelbrot set. An example of the same computation performed as a custom graph operation can be found here.
- `reduction.mojo`: A highly performant reduction kernel. For a detailed explanation, see this blog post.
- Make sure your system includes a compatible GPU.

- If you don't have pixi, install it:

  ```sh
  curl -fsSL https://pixi.sh/install.sh | sh
  ```

- Clone this repo:

  ```sh
  git clone https://github.com/modular/modular.git
  ```

- Navigate to these examples:

  ```sh
  cd modular/examples/mojo/gpu-functions
  ```
You can run each of the Mojo files using `mojo` via the `pixi run` command. For example:

```sh
pixi run mojo vector_addition.mojo
```

Or enter the Pixi shell and run each directly:

```sh
pixi shell
mojo vector_addition.mojo
mojo grayscale.mojo
mojo naive_matrix_multiplication.mojo
mojo mandelbrot.mojo
mojo reduction.mojo
```

Writing individual thread-based functions in Mojo is powered by the `gpu`
module, which handles all the hardware-specific details of allocating and
transferring memory between host and accelerator, as well as compilation and
execution of accelerator-targeted functions.
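For instance, here is a minimal sketch of that workflow (the function name is illustrative, and this assumes a recent Mojo version where `print` is supported inside GPU functions):

```mojo
from gpu import block_idx, thread_idx
from gpu.host import DeviceContext


fn print_thread_ids():
    # Runs once per GPU thread, reporting that thread's coordinates.
    print("block:", block_idx.x, "thread:", thread_idx.x)


def main():
    ctx = DeviceContext()
    # Dispatch 2 blocks of 4 threads each: 8 threads run in total.
    ctx.enqueue_function[print_thread_ids](grid_dim=2, block_dim=4)
    ctx.synchronize()
```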
The first three examples that follow are common starting points for thread-based GPU programming. They follow the first three examples in the popular GPU programming textbook *Programming Massively Parallel Processors*:
- Parallel addition of two vectors
- Conversion of a red-green-blue image to grayscale
- Naive matrix multiplication, with no hardware-specific optimization
The following sections explain some of these examples a bit further.
> [!CAUTION]
> The following code in this README could be out of date. Read the corresponding source files for the latest example code.
The common "hello world" example used for data-parallel programming is the
addition of each element in two vectors. Here's how it works in our
vector_addition.mojo example:
- Define the vector addition function.

  The function itself is very simple: it runs once per thread, adds the elements of the two input vectors that correspond to that thread ID, and stores the result at the matching location in the output vector.

  ```mojo
  fn vector_addition(
      lhs_tensor: LayoutTensor[mut=True, float_dtype, layout],
      rhs_tensor: LayoutTensor[mut=True, float_dtype, layout],
      out_tensor: LayoutTensor[mut=True, float_dtype, layout],
  ):
      tid = thread_idx.x
      out_tensor[tid] = lhs_tensor[tid] + rhs_tensor[tid]
  ```
- Obtain a reference to the accelerator (GPU) context.

  ```mojo
  ctx = DeviceContext()
  ```
- Allocate input and output vectors.

  Buffers for the left-hand-side and right-hand-side vectors need to be allocated on the GPU and initialized with values.

  ```mojo
  alias float_dtype = DType.float32
  alias VECTOR_WIDTH = 10

  lhs_buffer = ctx.enqueue_create_buffer[float_dtype](VECTOR_WIDTH)
  rhs_buffer = ctx.enqueue_create_buffer[float_dtype](VECTOR_WIDTH)

  _ = lhs_buffer.enqueue_fill(1.25)
  _ = rhs_buffer.enqueue_fill(2.5)

  # Wrap the device buffers in LayoutTensors so the GPU function can index them.
  lhs_tensor = LayoutTensor[float_dtype, layout](lhs_buffer)
  rhs_tensor = LayoutTensor[float_dtype, layout](rhs_buffer)
  ```

  A buffer to hold the result of the calculation is allocated on the GPU and likewise wrapped in a LayoutTensor:

  ```mojo
  out_buffer = ctx.enqueue_create_buffer[float_dtype](VECTOR_WIDTH)
  out_tensor = LayoutTensor[float_dtype, layout](out_buffer)
  ```
- Compile and dispatch the function.

  The `vector_addition()` function we want to run on the GPU is compiled and dispatched across a grid, divided into blocks of threads. All arguments to this GPU function are provided here, in an order that corresponds to their location in the function signature. Note that in Mojo, the GPU function is compiled for the GPU when the Mojo file containing it is compiled.

  ```mojo
  ctx.enqueue_function[vector_addition](
      lhs_tensor,
      rhs_tensor,
      out_tensor,
      grid_dim=1,
      block_dim=VECTOR_WIDTH,
  )
  ```
- Return the results.

  Finally, the results of the calculation are moved from the GPU back to the host to be examined:

  ```mojo
  with out_buffer.map_to_host() as host_buffer:
      host_tensor = LayoutTensor[float_dtype, layout](host_buffer)
      print("Resulting vector:", host_tensor)
  ```
To try out this example yourself, run it using the following command:

```sh
pixi run mojo vector_addition.mojo
```

For this initial example, the output you see should be a vector where all the
elements are 3.75. Experiment with changing the vector length, the block size,
and other parameters to see how the calculation scales.
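If you grow `VECTOR_WIDTH` well beyond the size of a single block, one block of threads no longer covers the whole vector. A common pattern for scaling up, shown here as a sketch that reuses the aliases from the example above (the function name and `BLOCK_SIZE` value are illustrative), is to dispatch multiple blocks and guard against threads that fall past the end of the vector:

```mojo
from math import ceildiv
from gpu import global_idx

alias BLOCK_SIZE = 256

fn vector_addition_multi_block(
    lhs_tensor: LayoutTensor[mut=True, float_dtype, layout],
    rhs_tensor: LayoutTensor[mut=True, float_dtype, layout],
    out_tensor: LayoutTensor[mut=True, float_dtype, layout],
):
    # global_idx combines the block and thread indices into one flat index.
    i = global_idx.x
    # The last block may run past the end of the vector, so guard the access.
    if i < VECTOR_WIDTH:
        out_tensor[i] = lhs_tensor[i] + rhs_tensor[i]

# Dispatched with enough blocks to cover every element:
# ctx.enqueue_function[vector_addition_multi_block](
#     lhs_tensor,
#     rhs_tensor,
#     out_tensor,
#     grid_dim=ceildiv(VECTOR_WIDTH, BLOCK_SIZE),
#     block_dim=BLOCK_SIZE,
# )
```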
The `grayscale.mojo` example shows how to convert a red-green-blue (RGB) color
image into grayscale. This uses a rank-3 tensor to host the 2-D image and the
color channels at each pixel. The inputs start with three color channels, and
the output has only a single grayscale channel.

The calculation performed is a common reduction to luminance using weighted values for the three channels:

```
gray = 0.21 * red + 0.71 * green + 0.07 * blue
```

And here is the per-thread function to perform this on the GPU:
```mojo
fn color_to_grayscale(
    rgb_tensor: LayoutTensor[mut=True, int_dtype, rgb_layout],
    gray_tensor: LayoutTensor[mut=True, int_dtype, gray_layout],
):
    row = global_idx.y
    col = global_idx.x
    if col < WIDTH and row < HEIGHT:
        red = rgb_tensor[row, col, 0].cast[float_dtype]()
        green = rgb_tensor[row, col, 1].cast[float_dtype]()
        blue = rgb_tensor[row, col, 2].cast[float_dtype]()
        gray = 0.21 * red + 0.71 * green + 0.07 * blue
        gray_tensor[row, col, 0] = gray.cast[int_dtype]()
```

The setup, compilation, and execution of this function are much the same as in the previous example, but in this case we're using rank-3 instead of rank-1 buffers to hold the values. Also, we dispatch the function over a 2-D grid of blocks, which looks like the following:
```mojo
alias BLOCK_SIZE = 16
num_col_blocks = ceildiv(WIDTH, BLOCK_SIZE)
num_row_blocks = ceildiv(HEIGHT, BLOCK_SIZE)
ctx.enqueue_function[color_to_grayscale](
    rgb_tensor,
    gray_tensor,
    grid_dim=(num_col_blocks, num_row_blocks),
    block_dim=(BLOCK_SIZE, BLOCK_SIZE),
)
```

To run this example, use this command:
```sh
pixi run mojo grayscale.mojo
```

This will show a grid of numbers representing the grayscale values for a single color broadcast across a simple input image. Try changing the image and block sizes to see how this scales on the GPU.
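One detail worth noting in the dispatch above: because `ceildiv` rounds up, the grid usually covers slightly more threads than there are pixels, which is exactly why `color_to_grayscale` guards its work with `if col < WIDTH and row < HEIGHT`. A standalone sketch of the arithmetic (the values here are illustrative, not the ones used in `grayscale.mojo`):

```mojo
from math import ceildiv


def main():
    alias WIDTH = 100       # Illustrative image width.
    alias BLOCK_SIZE = 16   # Threads per block along one dimension.
    num_col_blocks = ceildiv(WIDTH, BLOCK_SIZE)  # ceildiv(100, 16) == 7
    total_threads = num_col_blocks * BLOCK_SIZE  # 7 * 16 == 112
    # 112 threads cover 100 columns; the extra 12 threads fail the
    # bounds check in the GPU function and simply do nothing.
    print(num_col_blocks, total_threads)
```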
The `naive_matrix_multiplication.mojo` example performs a very basic matrix
multiplication, with no optimizations to take advantage of hardware resources.
The GPU function for this looks like the following:
```mojo
fn naive_matrix_multiplication(
    m: LayoutTensor[mut=True, float_dtype, m_layout],
    n: LayoutTensor[mut=True, float_dtype, n_layout],
    p: LayoutTensor[mut=True, float_dtype, p_layout],
):
    row = global_idx.y
    col = global_idx.x
    m_dim = p.dim(0)
    n_dim = p.dim(1)
    k_dim = m.dim(1)
    if row < m_dim and col < n_dim:
        for j_index in range(k_dim):
            p[row, col] = p[row, col] + m[row, j_index] * n[j_index, col]
```

The overall setup and execution of this function are extremely similar to the previous example, with the primary change being the function that is run on the GPU.
To try out this example, run this command:

```sh
pixi run mojo naive_matrix_multiplication.mojo
```

You will see the two input matrices printed to the console, as well as the result of their multiplication. As with the previous examples, try changing the sizes of the matrices and how they are dispatched on the GPU.
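To convince yourself that the GPU result is correct, one option is to compute the same product on the host and compare. Here's a sketch of such a reference (this helper is hypothetical, not part of the example; it stores matrices as flat row-major lists):

```mojo
fn matmul_reference(
    m: List[Float32],
    n: List[Float32],
    M: Int,
    N: Int,
    K: Int,
) -> List[Float32]:
    # m is M x K, n is K x N, and the result p is M x N, all row-major.
    var p = List[Float32]()
    for _ in range(M * N):
        p.append(0)
    for row in range(M):
        for col in range(N):
            for k in range(K):
                p[row * N + col] += m[row * K + k] * n[k * N + col]
    return p
```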
The `mandelbrot.mojo` example shows a slightly more complex calculation:
the Mandelbrot set fractal.

This GPU function takes no input tensors; the region and resolution are
provided as compile-time constants, and it fills a 2-D matrix of integer
values representing the number of iterations it took to escape at that
location in complex number space.
The per-thread GPU function for this is as follows:
```mojo
fn mandelbrot(
    tensor: LayoutTensor[mut=True, int_dtype, layout],
):
    row = global_idx.y
    col = global_idx.x
    alias SCALE_X = (MAX_X - MIN_X) / GRID_WIDTH
    alias SCALE_Y = (MAX_Y - MIN_Y) / GRID_HEIGHT
    cx = MIN_X + col * SCALE_X
    cy = MIN_Y + row * SCALE_Y
    c = ComplexSIMD[float_dtype, 1](cx, cy)
    z = ComplexSIMD[float_dtype, 1](0, 0)
    iters = Scalar[int_dtype](0)
    var in_set_mask: Scalar[DType.bool] = True
    for _ in range(MAX_ITERATIONS):
        if not any(in_set_mask):
            break
        in_set_mask = z.squared_norm() <= 4
        iters = in_set_mask.select(iters + 1, iters)
        z = z.squared_add(c)
    tensor[row, col] = iters
```

This begins by calculating the complex number `C` that represents a given
location in the output grid. Then, starting from `Z = 0`, the calculation
`Z = Z^2 + C` is iterated until the squared magnitude of `Z` exceeds 4, the
threshold we're using to decide that `Z` will escape the set. The loop runs
up to a maximum number of iterations, and the number of iterations to escape
(or the maximum, if it never escapes) is stored for each location in the grid.
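For reference, here is the same escape-time loop written as a plain scalar function (an illustrative sketch only; the actual example uses the masked SIMD form shown above):

```mojo
fn escape_iterations(cx: Float32, cy: Float32, max_iterations: Int) -> Int:
    # Iterate z = z^2 + c, counting iterations until |z|^2 exceeds 4.
    var zx: Float32 = 0
    var zy: Float32 = 0
    for i in range(max_iterations):
        if zx * zx + zy * zy > 4:
            return i
        # (zx + zy*i)^2 + c = (zx^2 - zy^2 + cx) + (2*zx*zy + cy)*i
        next_zx = zx * zx - zy * zy + cx
        zy = 2 * zx * zy + cy
        zx = next_zx
    return max_iterations
```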
The area to examine in complex space, the resolution of the grid, and the maximum number of iterations are all provided as constants:
```mojo
alias MIN_X: Scalar[float_dtype] = -2.0
alias MAX_X: Scalar[float_dtype] = 0.7
alias MIN_Y: Scalar[float_dtype] = -1.12
alias MAX_Y: Scalar[float_dtype] = 1.12
alias SCALE_X = (MAX_X - MIN_X) / GRID_WIDTH
alias SCALE_Y = (MAX_Y - MIN_Y) / GRID_HEIGHT
alias MAX_ITERATIONS = 100
```

Try it with this command:
```sh
pixi run mojo mandelbrot.mojo
```

The result should be an ASCII art depiction of the region covered by the calculation:
```
...................................,,,,c@8cc,,,.............
...............................,,,,,,cc8M @Mjc,,,,..........
............................,,,,,,,ccccM@aQaM8c,,,,,........
..........................,,,,,,,ccc88g.o. Owg8ccc,,,,......
.......................,,,,,,,,c8888M@j, ,wMM8cccc,,.....
.....................,,,,,,cccMQOPjjPrgg, OrwrwMMMjjc,....
..................,,,,cccccc88MaP @ ,pGa.g8c,...
...............,,cccccccc888MjQp. o@8cc,..
..........,,,,c8jjMMMMMMMMM@@w. aj8c,,.
.....,,,,,,ccc88@QEJwr.wPjjjwG w8c,,.
..,,,,,,,cccccMMjwQ EpQ .8c,,,
.,,,,,,cc888MrajwJ MMcc,,,
.cc88jMMM@@jaG. oM8cc,,,
.cc88jMMM@@jaG. oM8cc,,,
.,,,,,,cc888MrajwJ MMcc,,,
..,,,,,,,cccccMMjwQ EpQ .8c,,,
.....,,,,,,ccc88@QEJwr.wPjjjwG w8c,,.
..........,,,,c8jjMMMMMMMMM@@w. aj8c,,.
...............,,cccccccc888MjQp. o@8cc,..
..................,,,,cccccc88MaP @ ,pGa.g8c,...
.....................,,,,,,cccMQOEjjPrgg, OrwrwMMMjjc,....
.......................,,,,,,,,c8888M@j, ,wMM8cccc,,.....
..........................,,,,,,,ccc88g.o. Owg8ccc,,,,......
............................,,,,,,,ccccM@aQaM8c,,,,,........
...............................,,,,,,cc8M @Mjc,,,,..........
```
Try changing the various parameters above to produce different resolution grids, or explore different areas of the complex number space.
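For example, zooming in on a smaller window of the complex plane only requires changing the constants (the values below are illustrative, not from the example):

```mojo
# Hypothetical zoomed-in window; a higher iteration cap reveals more
# detail near the boundary of the set.
alias MIN_X: Scalar[float_dtype] = -0.80
alias MAX_X: Scalar[float_dtype] = -0.70
alias MIN_Y: Scalar[float_dtype] = 0.05
alias MAX_Y: Scalar[float_dtype] = 0.15
alias MAX_ITERATIONS = 500
```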
- Learn GPU programming by solving increasingly challenging GPU puzzles.