Introduction

Morello is a synthesis-based compiler for neural network forward passes, primarily targeting CPUs. It provides programmers with a Halide-like language for manual optimization, but it can also automatically synthesize implementations that minimize a cost model (typically, maximizing throughput).

This book introduces the key ideas behind Morello by walking through manual optimization of matrix multiplication, then by synthesizing matrix multiplications automatically.

Status

Morello is under active development. The language and compiler internals are regularly changing, and we do not yet ship versioned releases.

If you’re interested in stability, pin a particular commit from the main branch. If using Morello as a library, specify the dependency in Cargo.toml such that it is fixed to a SHA-1 hash:

morello = { git = "https://github.com/samkaufman/morello.git", rev = "..." }

Hierarchical IR

Morello is built around a hierarchical IR¹ where the program and every sub-program (i.e., every program node) have a specification called a Spec.

The key idea is that programs over tensors can be built by incrementally substituting goal Specs with partial programs containing increasingly small and manageable goal Specs. Eventually, this process terminates in a complete C program with no nested Specs.

You can also understand this as the dual of a bottom-up process where instructions are composed via loops and other control structures to form larger and larger programs until the final program implements the initial goal.

If you’ve optimized GEMM, this idea is already familiar to you. It appears, for example, in tiling GEMM to minimize the total cost of cache misses: each tiling of the tensor dimensions can be seen as a rewrite into a loop whose body contains a goal Spec with smaller dimensions.
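For instance, using the partial-program notation introduced below (with hypothetical 128x128x128 dimensions), tiling the k dimension rewrites an accumulating matrix multiplication Spec into a loop around a smaller one:

// MatmulAccum(128x128x128), tiled over k by a factor of 32:
for (unsigned ko = 0; ko < 128; ko += 32) {
  // MatmulAccum(128x32x128)
}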

¹

IR is short for intermediate representation, which is a program representation intended primarily for internal, automated manipulation by a compiler. This is in contrast to a programming language, which is a representation meant for reading and writing by a programmer.

Schedule a Dot Product

Let’s make this more concrete by implementing a dot product of two 32-value vectors. We’ll begin with a goal Spec, then introduce an accumulating scalar, tile, and choose an innermost instruction.

Begin by writing out our simplified² goal Spec, Matmul(1x32x1), as the complete C program we’d like Morello to produce:

#include <stdio.h>

int main() {
  int output;
  int lhs[32], rhs[32];
  output = 0;
  for (unsigned i = 0; i < 32; i++) {
    output += lhs[i] * rhs[i];
  }
  return 0;
}

With Morello, a programmer begins with a partial program containing only the goal Spec:

// Matmul(1x32x1)

The first step is to rewrite that trivially nested Spec into a partial program which initializes an accumulator and contains a new, accumulating Spec.

output = 0;
// MatmulAccum(1x32x1)

Next, that nested Spec is replaced by a loop whose body contains a version of the accumulating Spec tiled over the k dimension:

output = 0;
for (unsigned i = 0; i < 32; i++) {
  // MatmulAccum(1x1x1)
}

Finally, that nested Spec is replaced with a C multiply-accumulate statement:

output = 0;
for (unsigned i = 0; i < 32; i++) {
  output += lhs[i] * rhs[i];
}

At this point, no nested Specs remain and the program is complete.

²

Morello’s actual Specs are more complex, describing much more than just the tensor dimensions, but this will do for now.

Manual Scheduling

First Schedule

TODO: Address or hide canonicalization. (This is a footgun.)

Install

To get our feet wet with Morello, we’ll write a program which generates the simple dot product implementation we described in the previous section.

First, initialize a fresh binary crate and add Morello to your Cargo dependencies:

cargo init --bin ./dotproduct
cd dotproduct
cargo add --git https://github.com/samkaufman/morello morello

Define the Goal Spec

In src/main.rs, begin by constructing the dot product Spec from the last section. We’ll use the spec! macro. While, in the last section, our Spec described only whether the matrix multiplication should accumulate into its output and the sizes of its dimensions, we’ll now also include the data type, memory level, and layout of its input and output tensors, a flag indicating that the implementation should run on a single thread (serial), and the maximum usable memory at each level.

use morello::layout::row_major;
use morello::lspec;
use morello::spec::Spec;
use morello::target::{
    Avx2Target,
    CpuMemoryLevel::{self, RF},
    Target,
};

    let mut spec: Spec<Avx2Target> = spec!(Matmul(
        [1, 1, 32, 1],
        (u32, RF, row_major), // `RF` = tensor is in register file
        (u32, RF, row_major),
        (u32, RF, row_major),
        serial
    ));
    spec.canonicalize().unwrap();

Notice that the Spec is parameterized by Avx2Target. Morello targets are types which define a set of target-specific instructions, a set of available Spec rewrites, and basic cost and memory models. (TODO: What else? TODO: Describe re-targeting.) As you might have guessed, this example will target X86, though we don’t yet use any X86-specific intrinsics.

Schedule an Implementation

Next, we construct an implementation by applying scheduling operators to the Spec. This is called "scheduling." We’ll apply three operators, corresponding to the three rewrites described in the previous section:

  1. to_accum to introduce an accumulator followed by a MatmulAccum,
  2. split(1) to introduce a loop over the k dimension, and
  3. select(CpuKernel::MultAdd) to replace the body with the C multiply-accumulate.

Pull in the SchedulingSugar trait, which extends Specs with these operators, as well as CpuKernel.

use morello::scheduling_sugar::SchedulingSugar;
use morello::target::CpuKernel;

Then apply:

    let implementation = spec.to_accum().split(1).select(CpuKernel::MultAdd);

With emit_stdout, we can print the resulting implementation to stdout.

    implementation.emit_stdout().unwrap();

This will print the source for a complete C executable. Inside the kernel function, you’ll find the dot product implementation:

/* (Zero((1×1, u32, RF), serial), [64, 1024, 0, 0])(_) */
assert(false);  // missing imp(n002[_])
for (int n003 = 0; n003 < 32; n003++) {
  n002[(0)] += n000[(n003)] * n001[(n003)];  /* MultAdd */
}

But what’s this assert(false)? This program will abort before computing anything!

Sub-Scheduling the Zero

TODO: Fill in.

Memory Movement

Specification Language

Primitive Specs

A Spec is the application of a function to tensor specifications, together with a flag indicating whether the Spec is guaranteed to run on a single thread (serial) and limits on the amount of memory that can be used at each memory level.

For example, the following constructs a Spec targeting X86:

Spec::<Avx2Target>(
    lspec!(Matmul(
        [M, K, N],
        (bf16, GL, row_major),
        (bf16, GL, col_major),
        (f32, GL, row_major),
        serial
    )),
    Avx2Target::max_mem(),
)

Tensor specifications describe each of the parameters of the function: the tensor’s data type, the memory level where the tensor’s data is stored, the layout of the data, whether the data is "aligned" (the buffer address is some target-specific multiple), a layout-specific description of how contiguous the data is, and, in the case of data stored in a vector register file, the size of the vector registers (e.g., 128- or 256-bit on AVX2).

Available functions are listed as variants of the PrimitiveSpecType enum.

Composition

TODO: Fill in.

Data Layouts

In Morello, tensors have associated memory layouts which describe how the data of that tensor are arranged in the underlying memory buffer. Layouts are buffer indexing expressions—maps from logical coordinates to buffer offsets—paired with some simple analyses.

Layouts in Morello are expressive. Dimensions can be arbitrarily reordered, yielding common layouts such as column-major for matrices and NCHW and NHWC for images. Dimensions can be tiled to produce layouts like NCHWc, or odd- and even-indexed values can be interleaved to avoid lane-crossing SIMD instructions. (Layouts do not, however, allow aliasing; every logical coordinate corresponds to a single buffer offset.)

Dimension Reordering

Let’s look at row-major layouts, which are typically the default memory layout in other deep learning frameworks. In the case of matrices (2-dimensional tensors), a row-major layout is one where each value is at offset \( Nd_0 + d_1 \), where \(N\), \(d_0\), and \(d_1\) are the number of columns in the matrix, the row of the value, and the column of the value, respectively. A 6x8 matrix with a row-major layout looks like this:

[Figure: a 6×8 matrix in row-major layout; integers in cells are offsets into the matrix’s buffer.]

This is called a row-major layout because the row dimension (dimension 0) varies the slowest as you sequentially scan values in memory.
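As a quick sanity check of that indexing expression, here is a small standalone Rust function (not part of Morello’s API) evaluating \(Nd_0 + d_1\) for the 6x8 example:

fn row_major_offset(n_cols: usize, d0: usize, d1: usize) -> usize {
    n_cols * d0 + d1 // N*d0 + d1, per the formula above
}

fn main() {
    // Offsets within the 6x8 example matrix:
    assert_eq!(row_major_offset(8, 0, 0), 0);  // first value
    assert_eq!(row_major_offset(8, 2, 3), 19); // row 2, column 3
    assert_eq!(row_major_offset(8, 5, 7), 47); // last value
}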

In Morello, a row-major layout is constructed with row_major(shp). Morello abuses the term "row-major" to refer to any layout where logical dimensions are laid out sequentially, even if the tensor has more or fewer than 2 dimensions. As long as all dimensions have size greater than 1, row_major is sugar for:

Layout::new(vec![
  (0, PhysDim::Dynamic),
  (1, PhysDim::Dynamic),
  /* ... */
  (n - 1, PhysDim::Dynamic),
])

The Layout::new constructor takes physical dimensions in order from slowest- to fastest-varying. In this example, PhysDim::Dynamic simply means that the whole logical dimension should be mapped to a single physical dimension. (We’ll discuss this in more detail later.)

Knowing this, you can see that column-major order is written:

Layout::new(vec![
    (1, PhysDim::Dynamic),
    (0, PhysDim::Dynamic)
])
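For an \(M \times N\) matrix, this yields the indexing expression \(d_0 + M d_1\): the column dimension (dimension 1) now varies the slowest.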

Packing

Layouts can also express packing, sometimes called layout tiling (e.g., in XLA). Here’s an example of a packed layout:

Layout::new(vec![
  (1, PhysDim::Dynamic),
  (0, PhysDim::Dynamic),
  (1, PhysDim::Packed(nz!(4u32))),
])

A 6x8 matrix with a packed layout looks like this:

[Figure: the same 6×8 matrix with the packed layout; integers in cells are offsets into the matrix’s buffer.]

Notice we’ve introduced PhysDim::Packed, which is simply a physical dimension with a fixed size. This differs from PhysDim::Dynamic, which has a size inferred from the tensor shape. (A layout can contain only one Dynamic per logical dimension, otherwise the layout would be ambiguous.)

Notice that, unlike in the previous example, dimension 1 is mentioned twice in the above example. What does that mean?

Layouts are built compositionally from PhysDims, back to front, until the entire tensor shape is covered. Each PhysDim::Dynamic or PhysDim::Packed associated with a logical dimension \(i\) adds \(\alpha_i \left( \lfloor \frac{d_i}{v_i} \rfloor \bmod p_i \right)\) to the buffer indexing expression, where \(\alpha_i\) is the total volume covered so far, \(v_i\) is the volume covered so far in logical dimension \(i\), and \(p_i\) is the physical dimension’s size.

Let’s derive the buffer indexing expression for the above example. First, the Packed dimension over dimension 1 yields \(b_\alpha = d_1 \bmod 4\). The Dynamic for dimension 0 adds a term: \(b_\beta = b_\alpha + 4(d_0 \bmod 6)\). Dimension 0 is known to be in \([0, 5]\), so we simplify: \(b_\beta = b_\alpha + 4d_0\). The final Dynamic for dimension 1 adds another: \(b = b_\beta + 24 \lfloor \frac{d_1}{4} \rfloor\).
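To make the derivation concrete, here is a small standalone Rust function (not Morello’s API) that evaluates the final expression \(b\) for the 6x8 example:

fn packed_offset(d0: usize, d1: usize) -> usize {
    (d1 % 4)            // b_alpha: the Packed(4) physical dimension over dimension 1
        + 4 * d0        // the Dynamic over dimension 0 (alpha = 4)
        + 24 * (d1 / 4) // the Dynamic over dimension 1 (alpha = 24)
}

fn main() {
    // Spot-check against the derived expression:
    assert_eq!(packed_offset(0, 0), 0);
    assert_eq!(packed_offset(2, 5), 33); // 1 + 8 + 24
    assert_eq!(packed_offset(5, 7), 47); // 3 + 20 + 24
}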

Odd-Even Layouts

TODO: Describe PhysDim::OddEven.

Automatic Synthesis

TODO: Write this chapter.

To synthesize, construct an in-memory database and call synthesize on a Spec:

let db = FilesDatabase::new(None, true, 1, 128, 1, None);
let implementation = spec.synthesize(&db, None);

Re-Targeting

TODO: Fill in. Retargeting involves implementing the Target trait and, likely, its associated types.

Adding a New Kernel