Exploring Tinygrad

Nick Doiron
3 min read · Nov 9, 2022

The machine learning space is dominated by two Python frameworks — PyTorch and TensorFlow (with Keras and JAX notable but not as popular). As these have developed and ossified, it’s difficult to imagine a newcomer to ML reading through and developing a model from first principles. This learning problem led Tesla’s Andrej Karpathy to write micrograd scripts in April 2020, which in turn inspired George Hotz (geohot) to start tinygrad six months later. There’s been some maintenance and development since, but it is trending again now that geohot is changing his role at Comma.ai.

Though Tinygrad is complete enough to work as a neural network framework, it will never be above 1,000 lines of code. (This pledge raises some questions about formatting and about counting lines of ‘core’ code; for example, GPU accelerator code and sample code are kept in other folders, and the repo does not use typical Python formatters.)

I’m not a “first principles” person: I learned web dev by example, and I sat through math classes on Fourier transforms and eigenvalues and the like without really getting a handle on them. But recently I’ve become increasingly interested in optimizers and language model decoders, and I haven’t had a good idea of how to write that kind of core code. I don’t want to try to wedge something into the (already full-featured) main frameworks, so it could be better to put it into Tinygrad.

Running Tinygrad examples

I started out with an MNIST GAN example, but its metrics came back NaN. I wasn’t able to work through the yolo or efficientnet examples either, with the errors coming from Tinygrad code (i.e. not a dependency issue).

I might be running in an unusual space because I’m on CPU, but the first few examples were hit-or-miss.

Adapting PyTorch Optimizers

Tinygrad ships with three optimizers: Adam, SGD, and RMSprop. I chose to investigate two PyTorch-supported variants, Adamax and ASGD. (See github.com/mapmeld/tinygrad )

  • Adamax is “Adam based on infinity norm”. That sounds like one small change, though I don’t fully understand the math. After comparing files and parameters in PyTorch, I made Adamax a subclass of Adam, with a pass-through constructor (except for a different default learning rate).

Counter to the ‘first principles’ philosophy, it was peculiar that the optimizer keeps its state in tersely named variables: self.t, self.m, and self.v. Still, the Tinygrad code translates directly to how PyTorch behaves for a single differentiable tensor, for example:

# PyTorch
exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
# Tinygrad
self.m[i] = self.b1 * self.m[i] + (1.0 - self.b1) * t.grad
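To spell out what those state names do, here is a stdlib-only sketch of one Adam step for a single scalar parameter, in the same naming convention: t is the step count, m is PyTorch’s exp_avg, v is exp_avg_sq. The function adam_step and its signature are my own illustration, not Tinygrad or PyTorch code:

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    t += 1
    # first moment: exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    m = b1 * m + (1.0 - b1) * grad
    # second moment: exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    v = b2 * v + (1.0 - b2) * grad * grad
    # bias-correct both moments, then take the step
    m_hat = m / (1.0 - b1 ** t)
    v_hat = v / (1.0 - b2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v, t
```

On the very first step with a gradient of 1.0, both bias-corrected moments come out to 1.0, so the parameter moves by almost exactly the learning rate.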

I never found direct Tinygrad equivalents for the copy and amax operations needed by this optimizer.
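For reference, the “infinity norm” twist is small on paper: the second moment v is replaced by a running max u, and the denominator needs no bias correction. This sketch follows my reading of PyTorch’s Adamax algorithm box (adamax_step is a hypothetical scalar illustration; the max call stands in for the amax/maximum op I couldn’t find in Tinygrad):

```python
def adamax_step(param, grad, m, u, t, lr=0.002, b1=0.9, b2=0.999, eps=1e-8):
    t += 1
    # first moment is the same as Adam
    m = b1 * m + (1.0 - b1) * grad
    # infinity norm replaces the squared-gradient average
    u = max(b2 * u, abs(grad) + eps)
    # only the numerator gets bias-corrected
    param = param - (lr / (1.0 - b1 ** t)) * m / u
    return param, m, u, t
```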

  • ASGD is ‘Averaged Stochastic Gradient Descent’.

This is closer to the SGD optimizer than Adamax was to Adam. I hit a wall here: the PyTorch parent class appears to share and update some values which are not accessible in Tinygrad. ASGD also has values such as weight decay, which exist in the original Adam/SGD but are not considered in the Tinygrad world.
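My rough mental model of ASGD, as a plain-Python sketch: ordinary SGD steps, plus a running average of the parameter (the ax buffer in PyTorch) that kicks in after t0 steps, with weight decay folded into the gradient. The function asgd_step is hypothetical and simplified; it leaves out PyTorch’s lambda/eta decay scheduling, which is part of what I couldn’t map onto Tinygrad:

```python
def asgd_step(param, avg, grad, t, lr=0.01, weight_decay=0.0, t0=1e6):
    t += 1
    grad = grad + weight_decay * param     # L2 penalty folded into the gradient
    param = param - lr * grad              # a plain SGD step
    if t > t0:
        avg = avg + (param - avg) / (t - t0)   # running average of the iterates
    else:
        avg = param                        # before t0, just track the parameter
    return param, avg, t
```

The averaged value avg, not param, is what ASGD reports as the final weights.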

In both cases I could not get high accuracy, so I am not sure whether I implemented valid optimizers.

An Optimizer from Research: Adan

After a few attempts with other optimizers, the code in adan.py looked easier to parse through.
The problem for me as a developer (before I would consider submitting a PR, line count or not) is that I couldn’t easily prove whether my interpretation of the code matched the research optimizer, or whether my adaptation sanded off some of the important distinctions from the standard Adam optimizer. Again, scheduling the learning rate or decaying the weights could be an important factor.
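For what it’s worth, here is the update as I read it from the Adan paper: three moments (of the gradient, of the gradient difference, and of a combined squared term), followed by a decoupled weight-decay division. The function adan_step, its defaults, and the scalar simplification are all my own guesses, and I may well be sanding off details (bias corrections, the restart condition) that matter:

```python
import math

def adan_step(param, grad, prev_grad, m, v, n, lr=1e-3,
              b1=0.02, b2=0.08, b3=0.01, eps=1e-8, weight_decay=0.0):
    diff = grad - prev_grad
    m = (1 - b1) * m + b1 * grad                       # momentum of gradients
    v = (1 - b2) * v + b2 * diff                       # momentum of gradient differences
    n = (1 - b3) * n + b3 * (grad + (1 - b2) * diff) ** 2
    step = lr * (m + (1 - b2) * v) / (math.sqrt(n) + eps)
    param = (param - step) / (1 + lr * weight_decay)   # decoupled weight decay
    return param, m, v, n
```

This is exactly the kind of sketch I couldn’t validate end-to-end: it runs, but proving it matches the research optimizer is the hard part.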

Future work

  • I think it would be possible to make a word2vec example.
  • Make a consistent API for scheduling learning rate / weight decay.
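On that last point, one shape such an API could take: a small scheduler object that works against any optimizer exposing an lr attribute. The CosineLR class below is hypothetical, not part of Tinygrad or PyTorch; it just shows how little code a shared scheduling hook would need:

```python
import math

class CosineLR:
    """Cosine-decay the optimizer's lr from base_lr down to 0 over total_steps."""
    def __init__(self, opt, base_lr, total_steps):
        self.opt, self.base_lr, self.total, self.t = opt, base_lr, total_steps, 0

    def step(self):
        self.t += 1
        # standard cosine annealing: base_lr at t=0, 0 at t=total_steps
        self.opt.lr = 0.5 * self.base_lr * (1 + math.cos(math.pi * self.t / self.total))
```

Weight decay could get the same treatment: a scheduler that writes a shared wd attribute, so each optimizer only has to read it.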