Introduction
Morello is a synthesis-based compiler of neural network forward passes, primarily for CPUs. It provides programmers with a Halide-like language for manual optimization, but it can also automatically synthesize implementations that minimize a cost model (typically, one that maximizes throughput).
This book introduces the key ideas behind Morello by walking through manual optimization of matrix multiplication, then by synthesizing matrix multiplications automatically.
Status
Morello is under active development. The language and compiler internals are regularly changing, and we do not yet ship versioned releases.
If you’re interested in stability, you should target a particular commit from
the main branch. If using Morello as a library, pin the dependency in Cargo.toml
to a specific commit (SHA-1 hash):
morello = { git = "https://github.com/samkaufman/morello.git", rev = "..." }
Hierarchical IR
Morello is built around a hierarchical IR¹ where the program and every sub-program (i.e., program node) has a specification called a Spec.
The key idea is that programs over tensors can be built by incrementally substituting goal Specs with partial programs containing increasingly small and manageable goal Specs. Eventually, this process terminates in a complete C program with no nested Specs.
You can also understand this as the dual of a bottom-up process where instructions are composed via loops and other control structures to form larger and larger programs until the final program implements the initial goal.
If you’ve optimized GEMM, then this idea is already familiar to you. It appears, for example, in tiling GEMM to minimize the total cost of cache misses. Each tiling of the tensor dimensions can be seen as a rewrite into a loop where the body contains a goal Spec with smaller dimensions.
¹ IR is short for intermediate representation, which is a program representation intended primarily for internal, automated manipulation by a compiler. This is in contrast to a programming language, which is a representation meant for reading and writing by a programmer.
Schedule a Dot Product
Let’s make this more concrete by implementing a dot product of two 32-value vectors. We’ll begin with a goal Spec, then introduce an accumulating scalar, tile, and choose an innermost instruction.
Begin by writing out a simplified² goal Spec, Matmul(1x32x1), which specifies the following computation:
int main() {
    int output;
    int lhs[32], rhs[32]; /* assume inputs are initialized elsewhere */
    output = 0;
    for (unsigned i = 0; i < 32; i++) {
        output += lhs[i] * rhs[i];
    }
    return 0;
}
With Morello, a programmer begins with a partial program containing only the goal Spec:
// Matmul(1x32x1)
The first step is to rewrite that trivially nested Spec into a partial program which initializes an accumulator and contains a new, accumulating Spec.
output = 0;
// MatmulAccum(1x32x1)
Next, that nested Spec is replaced by a loop whose body contains a version of the original Spec tiled over the k dimension:
output = 0;
for (unsigned i = 0; i < 32; i++) {
    // MatmulAccum(1x1x1)
}
Finally, that nested Spec is replaced with a C multiply-accumulate statement:
output = 0;
for (unsigned i = 0; i < 32; i++) {
    output += lhs[i] * rhs[i];
}
At this point, no nested Specs remain and the program is complete.
Morello’s actual Specs are more complex, describing much more than just the tensor dimensions, but this will do for now.
Manual Scheduling
First Schedule
TODO: Address or hide canonicalization. (This is a footgun.)
Install
To get our feet wet with Morello, we’ll write a program which generates the simple dot product implementation we described in the previous section.
First, initialize a fresh binary crate and add Morello to your Cargo dependencies:
cargo init --bin ./dotproduct
cd dotproduct
cargo add --git https://github.com/samkaufman/morello morello
Define the Goal Spec
In src/main.rs, begin by constructing the dot product Spec from the last section. We’ll use the spec! macro. While, in the last section, our Spec only described whether or not a matrix multiplication should accumulate into the output and the sizes of its three dimensions, we’ll now include information about the data type, memory level, and layout of its input and output tensors, as well as a flag indicating that the implementation should run on a single thread (serial) and the maximum usable memory at each level.
use morello::layout::row_major;
use morello::lspec;
use morello::spec::Spec;
use morello::target::{
    Avx2Target,
    CpuMemoryLevel::{self, RF},
    Target,
};

let mut spec: Spec<Avx2Target> = spec!(Matmul(
    [1, 1, 32, 1],
    (u32, RF, row_major), // `RF` = tensor is in register file
    (u32, RF, row_major),
    (u32, RF, row_major),
    serial
));
spec.canonicalize().unwrap();
Notice that the Spec is parameterized by Avx2Target. Morello targets are types which define a set of target-specific instructions, a set of available Spec rewrites, and basic cost and memory models. (TODO: What else? TODO: Describe re-targeting.) As you might have guessed, this example will target x86, though we don’t yet use any x86-specific intrinsics.
Schedule an Implementation
Next, we construct an implementation by applying operators to the Spec. This is called “scheduling.” We’ll apply three operators, corresponding to the three rewrites described in the previous section:
1. to_accum to introduce an accumulator followed by a MatmulAccum,
2. split(1) to introduce a loop over the k dimension, and
3. select(CpuKernel::MultAdd) to replace the body with the C multiply-accumulate.
Pull in the SchedulingSugar trait, which extends Specs with these operators, as well as CpuKernel.
use morello::scheduling_sugar::SchedulingSugar;
use morello::target::CpuKernel;
Then apply:
let implementation = spec.to_accum().split(1).select(CpuKernel::MultAdd);
With emit_stdout, we can print the resulting implementation to stdout.
implementation.emit_stdout().unwrap();
This will print the source for a complete C executable. Inside the kernel function, you’ll find the dot product implementation:
/* (Zero((1×1, u32, RF), serial), [64, 1024, 0, 0])(_) */
assert(false); // missing imp(n002[_])
for (int n003 = 0; n003 < 32; n003++) {
    n002[(0)] += n000[(n003)] * n001[(n003)]; /* MultAdd */
}
But what’s this assert(false)? Our program won’t even run: the nested Zero Spec, which zero-initializes the output, was never scheduled, so Morello emitted a failing assertion in its place.
Sub-Scheduling the Zero
TODO: Fill in.
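One way forward, sketched hypothetically here (the subschedule operator, its path arguments, and CpuKernel::MemsetZero are assumptions, not confirmed API): schedule each child introduced by to_accum, the Zero and then the MatmulAccum, before emitting.

// Hypothetical sketch: schedule the nested Zero, then the MatmulAccum.
// `subschedule` and `CpuKernel::MemsetZero` are assumed names.
let implementation = spec
    .to_accum()
    .subschedule(&[0], &|zero| zero.select(CpuKernel::MemsetZero))
    .subschedule(&[1], &|accum| accum.split(1).select(CpuKernel::MultAdd));
implementation.emit_stdout().unwrap();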
Memory Movement
Specification Language
Primitive Specs
A Spec is the application of a function to tensor specifications, together with a flag indicating whether or not the Spec is guaranteed to run on a single thread and limits on the amount of memory that can be used at each memory level.
For example, the following constructs a Spec targeting X86:
Spec::<Avx2Target>(
    lspec!(Matmul(
        [M, K, N],
        (bf16, GL, row_major),
        (bf16, GL, col_major),
        (f32, GL, row_major),
        serial
    )),
    Avx2Target::max_mem(),
)
Tensor specifications describe each of the parameters of the function: the tensor’s data type, the memory level where that tensor’s data is stored, the layout of the data, whether that data is “aligned” (the buffer address is some target-specific multiple), a layout-specific description of how contiguous the data is, and, in the case of data stored in a vector register file, the size of the vector registers (e.g., 128- or 256-bit on AVX2).
Available functions are listed as variants of the PrimitiveSpecType enum (among them Matmul and Zero, which we’ve already seen).
Composition
TODO: Fill in.
Data Layouts
In Morello, tensors have associated memory layouts which describe how the data of that tensor are arranged in the underlying memory buffer. Layouts are buffer indexing expressions (maps from logical coordinates to buffer offsets) paired with some simple analyses.
Layouts in Morello are expressive. Dimensions can be arbitrarily reordered, yielding common layouts such as column-major for matrices and NCHW and NHWC for images. Dimensions can also be tiled to produce layouts like NCHWc, or even interleaved between odd- and even-indexed values to avoid lane-crossing SIMD instructions. (Layouts do not, however, allow aliasing; every logical coordinate corresponds to a single buffer offset.)
Dimension Reordering
Let’s look at row-major layouts, which are usually the default memory layout in other deep learning frameworks. In the case of matrices (2-dimensional tensors), a row-major layout is one where each value is at offset \( Nd_0 + d_1 \), where \(N\), \(d_0\), and \(d_1\) are the number of columns in the matrix, the row of the value, and the column of the value, respectively. A 6x8 matrix with a row-major layout looks like this:
[Figure: a 6x8 row-major matrix. Integers in cells are offsets into the matrix’s buffer.]
This is called a row-major layout because the row dimension (dimension 0) varies the slowest as you sequentially scan values in memory.
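To make the arithmetic concrete, here is a small, self-contained sketch (plain Rust, not the Morello API) of the row-major indexing expression, checked against the 6x8 example above:

/// Buffer offset of logical coordinate (d0, d1) in a row-major matrix
/// with `n_cols` columns: N*d0 + d1.
fn row_major_offset(d0: usize, d1: usize, n_cols: usize) -> usize {
    n_cols * d0 + d1
}

// In the 6x8 matrix, the value at row 2, column 3 is at offset 8*2 + 3 = 19.
assert_eq!(row_major_offset(2, 3, 8), 19);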
In Morello, a row-major layout is constructed with row_major(shp). Morello abuses the term ‘row-major’ to refer to any layout where logical dimensions are laid out sequentially, even if the tensor has more or fewer than 2 dimensions. As long as all dimensions are greater than size 1, row_major is sugar for:
Layout::new(vec![
    (0, PhysDim::Dynamic),
    (1, PhysDim::Dynamic),
    // ...
    (n - 1, PhysDim::Dynamic),
])
The Layout::new constructor takes physical dimensions in order from slowest- to fastest-varying. In this example, PhysDim::Dynamic simply means that the whole logical dimension should be mapped to a single physical dimension. (We’ll discuss this in more detail later.)
Knowing this, you can see that column-major order is written:
Layout::new(vec![
    (1, PhysDim::Dynamic),
    (0, PhysDim::Dynamic),
])
Packing
Layouts can also express packing, sometimes called layout tiling (e.g., in XLA). Here’s an example of a packed layout:
Layout::new(vec![
    (1, PhysDim::Dynamic),
    (0, PhysDim::Dynamic),
    (1, PhysDim::Packed(nz!(4u32))),
])
A 6x8 matrix with a packed layout looks like this:
[Figure: a 6x8 matrix in the packed layout. Integers in cells are offsets into the matrix’s buffer.]
Notice we’ve introduced PhysDim::Packed, which is simply a physical dimension with a fixed size. This differs from PhysDim::Dynamic, which has a size inferred from the tensor shape. (A layout can contain only one Dynamic per logical dimension; otherwise the layout would be ambiguous.)
Notice that, unlike in the previous example, dimension 1 is mentioned twice in the above example. What does that mean?
Layouts are built compositionally from PhysDims, back to front, until the entire tensor shape is covered. For each PhysDim::Dynamic or PhysDim::Packed associated with a logical dimension \(i\), the term
\(\alpha_i \left( \lfloor \frac{d_i}{v_i} \rfloor \pmod{p_i} \right)\)
is added to the buffer indexing expression, where \(\alpha_i\) is the total volume covered so far, \(v_i\) is the volume covered so far in that logical dimension, and \(p_i\) is the physical dimension size.
Let’s derive the buffer indexing expression for the above example. First, the Packed dimension over dimension 1 yields \(b_\alpha = d_1 \pmod{4}\). The Dynamic for dimension 0 adds a term: \(b_\beta = b_\alpha + 4(d_0 \pmod{6})\). Dimension zero is known to be in \([0, 5]\), so we simplify: \(b_\beta = b_\alpha + 4d_0\). The final Dynamic for dimension 1 adds another: \(b = b_\beta + 24 \lfloor \frac{d_1}{4} \rfloor\).
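As a check on the derivation, here is a small, self-contained sketch (plain Rust, not the Morello API) of the final expression \(b = (d_1 \pmod{4}) + 4d_0 + 24 \lfloor \frac{d_1}{4} \rfloor\), evaluated at a few coordinates of the 6x8 example:

// Packed-layout offset for the 6x8 example derived above.
fn packed_offset(d0: usize, d1: usize) -> usize {
    (d1 % 4) + 4 * d0 + 24 * (d1 / 4)
}

assert_eq!(packed_offset(0, 0), 0);
assert_eq!(packed_offset(0, 3), 3);  // last column of the first 4-wide block
assert_eq!(packed_offset(1, 0), 4);  // rows advance by 4 within a block
assert_eq!(packed_offset(0, 4), 24); // second block starts after 24 values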
Odd-Even Layouts
TODO: Describe PhysDim::OddEven.
Automatic Synthesis
TODO: Write this chapter.
To synthesize, construct an in-memory database and call synthesize on a Spec:
let db = FilesDatabase::new(None, true, 1, 128, 1, None);
let implementation = spec.synthesize(&db, None);
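Assuming synthesize returns an implementation exposing the same emit_stdout method used in the manual-scheduling chapter (an assumption, not confirmed here), the result can be printed as C source just as before:

implementation.emit_stdout().unwrap();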
Re-Targeting
TODO: Fill in. Retargeting involves implementing the Target trait and, likely, its associated types.
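To illustrate the kind of interface involved, here is a self-contained, hypothetical sketch (not Morello’s actual Target trait; every item below is an assumption for illustration) of what a target supplies, following the description earlier in this book: target-specific instructions, memory levels and limits, and a basic cost model.

// Hypothetical illustration only; Morello's real `Target` trait will
// differ in its items and signatures.
trait HypotheticalTarget {
    type Level;  // memory hierarchy levels (e.g., registers, L1, main memory)
    type Kernel; // target-specific instructions

    /// Maximum usable memory at each level (cf. `Avx2Target::max_mem()`).
    fn max_mem() -> Vec<(Self::Level, u64)>;

    /// Basic cost model: estimated cost of executing a kernel.
    fn kernel_cost(kernel: &Self::Kernel) -> u64;
}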