Apple M1 Max GPU: Ferrari or Fiat 600?


So I got a question from a fellow graphics dev who is waiting to get his hands on an M1 Max to see how far he can push it. His main question was: should I treat it like a mobile GPU or a desktop GPU?

So what does that even mean?

Even though I don’t own an M1 model, its GPU is based on the iPhone GPU but on steroids; hence I’m quite familiar with it.

It’s powerful

The first thing we need to get straight: on AnandTech’s synthetic benchmarks it’s an iGPU that goes toe to toe with the latest GeForce 3080L and Radeon 6800M at a fraction of the energy consumption. That’s an incredible feat and everyone involved at Apple deserves applause. They also significantly raised the bar for what the standard should be (in terms of both performance and energy consumption).

From these benchmarks alone we could just conclude “treat it like a desktop GPU powerhouse” and finish it there.

But…

It’s also TBDR

If we look at the ‘gaming performance’ numbers from the same AnandTech article linked above, it’s a bit disappointing that it’s suddenly at 33%–50% of the performance of the GeForce 3080L and Radeon 6800M in both Tomb Raider and Borderlands 3.

Don’t get me wrong: it’s still incredible that such a low-power iGPU is in the same graph as these two monsters.

But why such a gap? The article offers a few explanations:

For Tomb Raider:

the M1 Max in particular is CPU limited at 1080p; the x86-to-Arm translation via Rosetta is not free, and even though Apple’s CPU cores are quite powerful, they’re hitting CPU limitations here. We have to go to 4K just to help the M1 Max fully stretch its legs. Even then the 16-inch MacBook Pro is well off the 6800M. Though we’re definitely GPU-bound at this point, as reported by both the game itself, and demonstrated by the 2x performance scaling from the M1 Pro to the M1 Max.

For Borderlands:

The game seems to be GPU-bound at 4K, so it’s not a case of an obvious CPU bottleneck. And truthfully, I don’t know enough about the porting work that went into the Mac version to say whether it’s even a good port to begin with. So I’m hesitant to lay this all on the GPU, especially when the M1 Max trails the RTX 3080 by over 50%.

Respectfully, I disagree. Once the game is GPU-bottlenecked at 4K, there isn’t much room to screw up. This is rarely a “bad port” problem (unless you’re sending the framebuffer to the CPU every frame; I’m looking at you, Horizon Zero Dawn). A bad port will use the CPU unnecessarily, or leave alpha blending enabled with alpha = 0 on lots of triangles, which affects all GPUs equally.
What we’re seeing here is raw GPU power and the ability of Metal’s shader compiler to optimize.

I doubt it can be fixed with a driver update either. The “it’s running under Rosetta” argument holds some ground, especially at 1080p, but I’m not convinced it can entirely explain a 50–66% gap. This gap can probably be shortened, but I suspect there will always be some gap (unless the game is optimized explicitly for this GPU).

So why is there (likely) such a difference?

AFAIK the M1 Max GPU is a Tile-Based Deferred Renderer (TBDR), with all its pros and cons:

Pros

  • Massive energy efficiency
  • Programmable blending support
  • Pixel shaders only run once for opaque pixels
  • Deferred shading and postprocessing techniques that don’t read neighbouring pixels can be implemented more efficiently (but the code needs to be written specifically to take advantage of that; see the Metal sketch further below)

Cons

  • It has trouble scaling with large vertex counts. Immediate-mode GPUs have trouble too, but there the problem can often be mitigated with proper use of mesh shaders.
  • It has trouble with alpha testing (discarding fragments defeats the deferred hidden surface removal)
  • It has trouble with depth output from the pixel shader (depth export)
  • It has trouble with modifying MSAA coverage from the pixel shader

If you’re a mobile GPU dev, probably none of these are news to you.
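
To make that deferred shading point concrete, here’s a minimal Metal sketch in Swift. This is my own illustration, not anything from the article: the helper name, G-buffer layout and pixel formats are made up. The idea is that on a TBDR GPU the G-buffer can be declared memoryless, so it only ever lives in on-chip tile memory, and the lighting pass reads it within the same render pass via programmable blending / framebuffer fetch.

```swift
import Metal

// Hypothetical helper: creates G-buffer attachments that are never allocated in
// system memory. They are only valid inside a single render pass, which is
// exactly what a TBDR-friendly deferred renderer wants: the lighting stage
// reads them back as framebuffer-fetch inputs (e.g. [[color(n)]] in MSL).
func makeGBufferTextures(device: MTLDevice, width: Int, height: Int) -> [MTLTexture] {
    // Example layout: albedo + normals. Your real G-buffer will differ.
    let formats: [MTLPixelFormat] = [.rgba8Unorm, .rgba16Float]
    return formats.compactMap { format in
        let desc = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: format,
                                                            width: width,
                                                            height: height,
                                                            mipmapped: false)
        desc.usage = [.renderTarget]   // never sampled outside the pass
        desc.storageMode = .memoryless // lives only in tile memory
        return device.makeTexture(descriptor: desc)
    }
}
```

The key design choice is doing the G-buffer fill and the lighting in one render pass; the moment you split them into separate passes, those attachments have to be stored to memory and you’re back to paying the bandwidth an immediate-mode GPU pays.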

Now, Apple has assured multiple times that their TBDR implementation will have no problem with the vertex counts games usually push. Indeed, as I said, what they’re achieving is incredible, but I can’t help retaining some degree of skepticism, since some games can push a lot of vertices.

And the fact that both Tomb Raider and Borderlands 3 show the same symptoms where the synthetic benchmarks don’t makes me suspect TBDR limitations are showing up.

Ok so it’s a powerhouse and TBDR. What do I do?

My recommendation to this dev was: treat it like a desktop GPU powerhouse (i.e. it’s a Ferrari!) but with the following caveats:

  • Be ready to be more aggressive with geometry LODs. If your game pushes millions and millions of vertices every frame, this will hurt a lot more on the M1.
  • Do more aggressive (vertex) culling, probably via compute (i.e. à la GeometryFX); these chips clearly have the raw compute power for it. See the culling sketch after this list.
  • Beware of pushing too many unnecessary vertices during the shadow mapping passes
  • Experiment with alpha blending over alpha testing. AAA games use alpha testing a lot, particularly hashed alpha testing to mimic alpha blending. It has a lot of convenient properties: it’s order-independent, consumes less bandwidth, and can write to depth. But TBDRs don’t like it. It may be better to use regular alpha blending, or to take advantage of the GPU’s programmable blending interface (see the blending sketch after this list).
  • Render your alpha-tested geometry last. Don’t draw it first or interleaved with the opaque geometry.
  • Fading LODs in the distance using hashed alpha testing is extremely common. Try profiling what happens when you toggle that off
  • If you’re doing esoteric stuff like depth output or MSAA coverage alteration, try to avoid it, or leave it for last.
  • Coalesce your render passes to minimize memory traffic (i.e. make the best use of load, store, dont_care and resolve actions; see the render pass sketch after this list). This will improve performance and, more importantly, minimize battery consumption. You should already be doing this not only for TBDR, but also for NVIDIA and AMD, as they’ve been hybrid tilers since the GeForce 2080 and Vega.
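
For the compute culling bullet, this is roughly the shape such a system takes in Metal. It’s a rough host-side sketch under heavy assumptions: `cullPipeline` would be built from your own culling kernel (not shown), `drawArgsBuffer` holds one MTLDrawIndexedPrimitivesIndirectArguments per mesh, and the kernel writes indexCount = 0 for meshes it rejects, so their draws become no-ops. The point is that rejected geometry never reaches the tiler.

```swift
import Metal

// Hypothetical GPU-driven culling plumbing (in the spirit of GeometryFX):
// a compute pass fills per-mesh indirect draw arguments, then the render pass
// consumes them via indirect draws without the CPU ever seeing the results.
func encodeCulledDraws(commandBuffer: MTLCommandBuffer,
                       cullPipeline: MTLComputePipelineState,
                       meshData: MTLBuffer,
                       drawArgsBuffer: MTLBuffer,
                       renderPass: MTLRenderPassDescriptor,
                       renderPipeline: MTLRenderPipelineState,
                       indexBuffer: MTLBuffer,
                       meshCount: Int) {
    // 1. Compute pass: test each mesh/cluster and write its indirect arguments.
    if let compute = commandBuffer.makeComputeCommandEncoder() {
        compute.setComputePipelineState(cullPipeline)
        compute.setBuffer(meshData, offset: 0, index: 0)
        compute.setBuffer(drawArgsBuffer, offset: 0, index: 1)
        let threadsPerGroup = MTLSize(width: 64, height: 1, depth: 1)
        let groups = MTLSize(width: (meshCount + 63) / 64, height: 1, depth: 1)
        compute.dispatchThreadgroups(groups, threadsPerThreadgroup: threadsPerGroup)
        compute.endEncoding()
    }

    // 2. Render pass: one indirect draw per mesh; culled meshes have indexCount == 0.
    if let render = commandBuffer.makeRenderCommandEncoder(descriptor: renderPass) {
        render.setRenderPipelineState(renderPipeline)
        let stride = MemoryLayout<MTLDrawIndexedPrimitivesIndirectArguments>.stride
        for mesh in 0..<meshCount {
            render.drawIndexedPrimitives(type: .triangle,
                                         indexType: .uint32,
                                         indexBuffer: indexBuffer,
                                         indexBufferOffset: 0,
                                         indirectBuffer: drawArgsBuffer,
                                         indirectBufferOffset: mesh * stride)
        }
        render.endEncoding()
    }
}
```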
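
For the alpha blending vs alpha testing bullet, switching a material from discard-in-the-shader to regular “over” blending is mostly a pipeline state change on the Metal side. A small sketch, assuming `pipelineDesc` is the MTLRenderPipelineDescriptor you already build for that material:

```swift
import Metal

// Enables classic "over" alpha blending on colour attachment 0:
//   result.rgb = src.rgb * src.a + dst.rgb * (1 - src.a)
// This is an alternative to alpha testing (discarding fragments in the shader),
// which TBDR GPUs handle poorly. Remember blended geometry should be drawn
// after the opaque geometry, back to front.
func enableAlphaBlending(on pipelineDesc: MTLRenderPipelineDescriptor) {
    pipelineDesc.colorAttachments[0].isBlendingEnabled = true
    pipelineDesc.colorAttachments[0].rgbBlendOperation = .add
    pipelineDesc.colorAttachments[0].sourceRGBBlendFactor = .sourceAlpha
    pipelineDesc.colorAttachments[0].destinationRGBBlendFactor = .oneMinusSourceAlpha
    pipelineDesc.colorAttachments[0].alphaBlendOperation = .add
    pipelineDesc.colorAttachments[0].sourceAlphaBlendFactor = .one
    pipelineDesc.colorAttachments[0].destinationAlphaBlendFactor = .oneMinusSourceAlpha
}
```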
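
And for the last bullet, here’s a hedged example of what “make the best use of load/store actions” looks like in Metal: clear instead of load, resolve MSAA in place, and never store anything you won’t read later. The texture names are placeholders for whatever your renderer already owns; on Apple GPUs the MSAA colour and depth textures can even be created with .memoryless storage, since with these actions they never touch system memory.

```swift
import Metal

// Hypothetical main pass setup that keeps the tiler's intermediate data on-chip:
// the MSAA target is cleared (not loaded), resolved straight into the drawable,
// and its samples are discarded instead of stored; depth is never written back.
func makeMainPassDescriptor(drawableTexture: MTLTexture,
                            msaaColorTexture: MTLTexture,
                            depthTexture: MTLTexture) -> MTLRenderPassDescriptor {
    let pass = MTLRenderPassDescriptor()

    pass.colorAttachments[0].texture = msaaColorTexture
    pass.colorAttachments[0].resolveTexture = drawableTexture
    pass.colorAttachments[0].loadAction = .clear                // don't reload stale contents
    pass.colorAttachments[0].storeAction = .multisampleResolve  // resolve, don't store samples

    pass.depthAttachment.texture = depthTexture
    pass.depthAttachment.loadAction = .clear
    pass.depthAttachment.clearDepth = 1.0
    pass.depthAttachment.storeAction = .dontCare                // depth never leaves the tile

    return pass
}
```

Every .load you can turn into .clear, and every .store you can turn into .dontCare or a resolve, is bandwidth (and battery) you get back.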

Btw, all of this applies to the iPhone too. The iPhone GPU is also quite powerful, way above what can be found in its Android counterparts.