A Learned Performance Model for Tensor Processing Units

MLSys 2021, March 2021
Used in production at Google.
BibTeX
@inproceedings{KaufmanMLSys2021,
  title     = {A Learned Performance Model for Tensor Processing Units},
  author    = {Samuel J. Kaufman and Phitchaya Mangpo Phothilimthana and
               Yanqi Zhou and Charith Mendis and Sudip Roy and
               Amit Sabne and Mike Burrows},
  booktitle = {Proceedings of Machine Learning and Systems},
  volume    = {3},
  pages     = {387--400},
  year      = {2021}
}

Abstract

Accurate hardware performance models are critical to efficient code generation. They can be used by compilers to make heuristic decisions, by superoptimizers as a minimization objective, or by autotuners to find an optimal configuration for a specific program. However, they are difficult to develop because contemporary processors are complex, and the recent proliferation of deep learning accelerators has increased the development burden. We demonstrate a method of learning performance models from a corpus of tensor computation graph programs for Tensor Processing Unit (TPU) instances. We show that our learned model outperforms a heavily optimized analytical performance model on two tasks (tile-size selection and operator fusion) and that it helps an autotuner discover faster programs in a setting where access to TPUs is limited or expensive.
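To make the autotuning use case concrete, here is a minimal sketch of how a learned cost model can stand in for hardware measurement inside an autotuner loop. This is not the paper's model: the paper trains a neural network over whole tensor computation graphs, whereas this toy fits a linear regressor over hand-made features of a single tile-size parameter. All names here (extract_features, corpus, candidates) are hypothetical illustrations.

import numpy as np

def extract_features(tile_size):
    """Hypothetical featurization of one kernel configuration.

    The paper featurizes entire tensor computation graphs (per-node
    opcode and shape features plus graph structure); this toy uses a
    three-feature vector derived from the tile size alone.
    """
    t = float(tile_size)
    return np.array([t, t * t, 1.0 / t])

# Toy training corpus: (tile_size, measured_runtime_ms) pairs that an
# autotuner would have collected on real hardware.
corpus = [(8, 5.1), (16, 3.2), (32, 2.0), (64, 2.4), (128, 4.8)]
X = np.stack([extract_features(t) for t, _ in corpus])
y = np.array([r for _, r in corpus])

# Fit by least squares with a bias column; the paper instead trains a
# graph neural network, but the model's role in the loop is the same.
w, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)

def predicted_runtime_ms(tile_size):
    f = extract_features(tile_size)
    return float(np.append(f, 1.0) @ w)

# Autotuning with the learned model as the minimization objective:
# rank candidate configurations by predicted cost instead of running
# each one on a (scarce or expensive) TPU.
candidates = [8, 16, 24, 32, 48, 64, 96, 128]
best = min(candidates, key=predicted_runtime_ms)
print(f"best predicted tile size: {best}")

The design point this illustrates is the one the abstract makes: once a model predicts runtime well enough to rank configurations, the expensive step (running each candidate on a TPU) drops out of the search loop and is only needed to validate the final choice.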