A Learned Performance Model for Tensor Processing Units

MLSys 2021
Used in production at Google.
BibTeX
@inproceedings{KaufmanMLSys2021,
  title = {A Learned Performance Model for Tensor Processing Units},
  author = {
    Samuel J. Kaufman and
    Phitchaya Mangpo Phothilimthana and
    Yanqi Zhou and
    Charith Mendis and
    Sudip Roy and
    Amit Sabne and
    Mike Burrows
  },
  booktitle = {Proceedings of Machine Learning and Systems},
  volume = {3},
  pages = {387--400},
  year = {2021}
}

Abstract

Accurate hardware performance models are critical to efficient code generation. They can be used by compilers to make heuristic decisions, by superoptimizers as a minimization objective, or by autotuners to find an optimal configuration for a specific program. However, they are difficult to develop because contemporary processors are complex, and the recent proliferation of deep learning accelerators has increased the development burden. We demonstrate a method of learning performance models from a corpus of tensor computation graph programs for Tensor Processing Unit (TPU) instances. We show that our learned model outperforms a heavily-optimized analytical performance model on two tasks—tile-size selection and operator fusion—and that it helps an autotuner discover faster programs in a setting where access to TPUs is limited or expensive.