ashen-earth/posts/2023-09-05_wasm-game-of-lif...

---
title: "Pure Wasm Life 2: Optimizing Webassembly and Canvas"
subtitle: You know I couldn't leave it alone
resources:
- wasm-life-2/controller.js
- wasm-life-2/game.wat
- wasm-life-2/vertex.glsl
- wasm-life-2/fragment.glsl
---

> ***Note:*** This post is part 2 of my Webassembly Game of Life series, if you haven't already
> you can read part 1 [here](/wasm-game-of-life-1).

So when I posted my last Webassembly Game of Life post I had a few
improvements I wanted to make to it.  Some that I mentioned in that post
(like bitpacking the board so it took less memory), and some that I mentioned
to friends who read it (like switching from Canvas2D to WebGL or WebGPU so the
drawing code stays performant with more cells).  This post will not address
all of those, but I have done some significant work to make my implementation
faster.

I'll get into the details in a bit, but first let's show off the new board!
This version is drawing at one pixel per game cell (whereas the old version the
game cells were 3 pixels wide), so it's simulating and drawing **nine times as many cells**
without a noticeable difference to performance:

<noscript>[If you enable Javascript, you'll see a game board here]</noscript>
<canvas
  id="game"
  width="800"
  height="600"
  data-pixelsize="1"
  style="width:100%; aspect-ratio: 4/3; image-rendering: pixelated;"
/>
<p style="display: flex; align-items: flex-start; margin-top: 4px; height: 64px">
  <span style="flex: 1" id="frameTimes"/>
  <button id="reset">Reset</button>
</p>
<ScriptLoader src="/wasm-life-2/controller.js"
  wasm="/wasm-life-2/game.wat"
  vertexSrc="/wasm-life-2/vertex.glsl"
  fragmentSrc="/wasm-life-2/fragment.glsl"
  canvas="#game"
></ScriptLoader>

My performance target for this was to still run at 60 frames per second on
my 2019 Macbook Pro.  The previous version averaged around 10-11ms per frame
(well within the target 16.6ms for keeping that framerate), and I'm happy to
report that this version takes about 11-12ms per frame on that same machine!

Let's break down what I changed to get here.

## Overview

> ***A note on performance testing:***  Javascript is (in most browsers these days) a JIT-executed language,
with dynamic optimized re-compilation.  This can make performance testing of it
*very hard*, and the performance testing I do for this post is not particularly
rigorous.
>
> With that said, all benchmarks were done on the same hardware, in the
same power state, with the same browser, and the same applications open.  While
not an entirely controlled environment, this is *good enough* to get the overall
picture of how much my various attempts at optimization impacted things.

To get an idea of where things were started, I ran the previous post's code unmodified,
but with the larger 800x600 board size I wanted to target for this post.  I then
undertook a few different passes at optimization: first minor Webassembly changes
to the game update code, then rewriting the drawing code with WebGL, and then
lastly some more drastic Webassembly rewrites.

You can see the average time each of these versions took on my Macbook
in the table below:

> | Version | Time taken for game tick | Time taken to draw board | Computed framerate |
> |---|---|---|---|---|
> | Original | 46ms | 70ms | 8.6 fps |
> | Minor WASM Changes | 45.8ms | 70ms | 8.6 fps |
> | WebGL Drawing | 44.8ms | 0.4ms | 22.1 fps |
> | Major WASM Changes | 11.8ms | 0.4ms | 81.9\* fps |
>
> _\* This computed framerate is greater than my monitor can display, so the browser caps is to 60 fps_

Right off the bat, there were no simulation / game tick changes between the second and
third rows, but we do see a change in the *measured* tick time - this indicates that my measurement method has **at least** a millisecond
or two of inaccuracy, so we can conclude pretty quickly that my initial Webassembly
changes were not particularly useful as optimizations.

With that said though, the WebGL drawing was *very* significant, and my additional
Webassembly changes following those were definitely also a huge part of how I
reached my target framerate, so let's talk about those in turn.

### Drawing Game of Life with WebGL

Taking a look at my old drawing code, I do want to start out and say that
I didn't do anything *obviously* wrong.  Iterating through every cell on the
board is an `O(n)` algorithm, but for small board sizes it worked great.  However,
major browsers have had support for WebGL since about mid-2013 (even earlier if
you didn't care about Internet Explorer!) so it's safe to say that I could stand
to use the GPU for this.

I'm not going to go through all the WebGL initialization code, but let's briefly
look at what Javascript code is run on every frame:

```javascript
function drawBoard() {
  const { gameExports, width, height, gl } = gameState
  const wasmBuffer = gameExports.shared_memory.buffer
  const boardPointer = gameExports.getBoardPointer()
  const boardLength = gameExports.getBoardLength()
  const boardData = (new Uint8Array(wasmBuffer)).slice(boardPointer, boardPointer + boardLength)

  gl.texImage2D(gl.TEXTURE_2D, 0, gl.ALPHA, width, height, 0, gl.ALPHA, gl.UNSIGNED_BYTE, boardData)
  gl.drawArrays(gl.TRIANGLES, 0, 6)
}
```

And that's it!

No loops, no iterating over the board, it's *basically nothing*.  I just ask the
Webassembly part of my code for a pointer, construct an array view into the
memory buffer, and pass it directly to WebGL as a texture.

This simplicity comes from the fact that I'm storing my board as one byte per cell,
which means I can use the `gl.ALPHA` and `gl.UNSIGNED_BYTE` flags to tell WebGL
to expect a texture of **exactly** that format, without any preliminary data
transformation.

Then the WebGL fragment shader just needs to read that data from the texture's
alpha channel like so:

```glsl
precision highp float;
varying vec2 texCoords;
uniform sampler2D textureSampler;
uniform vec3 drawColor;
uniform vec3 backgroundColor;

void main() {
  vec4 textureColor = texture2D(textureSampler, texCoords);
  float alphaValue = textureColor.a;

  vec3 resultColor = mix(backgroundColor, drawColor, alphaValue * 255.0);

  gl_FragColor = vec4(resultColor, 1);
}
```

There's a correction factor here (multiplying the value by 255), but that's because
my board data had only 0s and 1s in the bytes for indicating an alive or dead
cell, and the shader expects that texture value to have a range between 0 and 255 (the range of a byte).

The shader also takes two `uniform` parameters for what colors to draw with, so
I will show you where those get set:

```javascript
export async function onThemeChange() {
  const {gl, shader} = gameState

  const drawColorString = getComputedStyle(gameState.canvas).getPropertyValue('color')
  const drawColor = parseColorString(drawColorString)

  const backgroundColorString = getComputedStyle(gameState.canvas).getPropertyValue('--background')
  const backgroundColor = parseColorString(backgroundColorString)

  const drawColorLocation = gl.getUniformLocation(shader, "drawColor")
  gl.uniform3fv(drawColorLocation, drawColor)

  const backgroundColorLocation = gl.getUniformLocation(shader, "backgroundColor")
  gl.uniform3fv(backgroundColorLocation, backgroundColor)

  drawBoard()
}
```

This is again pretty simple, just pulling some values out of the CSS properties,
parsing them, and then updating WebGL's parameters.  This gets run once on page
initialization, and then once each time the "Appearance" button in the page navigation
is clicked and the page switches between light and dark theme.

That's pretty much it for the drawing, so let's talk about how I made the actual
game update tick faster now.

### Optimizing the actual Webassembly code

So I don't know that much about how Webassembly execution is implemented into
browsers, but I know enough about actual CPU architectures to make some reasonable
inferences.

As I started thinking about more significant algorithm changes
I could make, I suspected that the largest contributor to time taken in my old
algorithm was probably from it taking longer to access the Webassembly linear memory
than local or global variables.  At an implementation level this could be because
of how the wasm memory relates to the CPU cache, but I didn't really care about that
 \- I was just curious to see if reducing the number of memory operations my algorithm
took could provide me more of a speed-up.

I knew I was already pretty inefficient in this area - my previous algorithm had made ***10 memory calls per cell***
(8 to check neighbor states, 1 to check the current cell's state, and 1 to store
the updated cell's state), so I suspected I could get that much lower.

#### Reducing memory accesses

I started by dividing up my `$getNewValueAtPosition` function into a `$getNeighborCount`
and `$getOutcomeFromCountAndCurrent`, so that I could have different logic
for counting the number of alive neighbors on border cells and internal cells while
still reusing the logic for actually figuring out what happens to a cell.

Then the `$getNeighborCount` function became `$getNeighborCountBoundary` and
`$getNeighborCountCenter` - because for optimizing memory access, I didn't want
to have to worry about reading something partially aligned outside the board.
I decided the old algorithm would work well enough for boundary cells for that
reason.

So my updated `$getNewValueAtPosition` function looks like this:

```wasm
(func $getNewValueAtPosition (param $row i32) (param $column i32) (result i32)
  (local $count i32)
  (local $current i32)
  local.get $row
  i32.const 1
  i32.lt_u

  local.get $row
  global.get $boardHeight
  i32.const 1
  i32.sub
  i32.ge_u

  local.get $column
  i32.const 1
  i32.lt_u

  local.get $column
  global.get $boardWidth
  i32.const 1
  i32.sub
  i32.ge_u

  ;; if any of the boundary conditions are true
  i32.or
  i32.or
  i32.or

  if (result i32 i32)
    local.get $row
    local.get $column
    call $getNeighborCountBoundary
  else
    local.get $row
    local.get $column
    call $getNeighborCountCenter
  end

  call $getOutcomeFromCountAndCurrent
)
```

It first checks to see if the cell is along any of the four boundaries, and if
so calls `$getNeighborCountBoundary`.  If not it calls `$getNeighborCountCenter`,
and either of these are expected to return ***two*** values, which are passed into
`$getOutcomeFromCountAndCurrent`.

You might be wondering why a neighbor count function would return two values, and
that's because it turns out it's more efficient to retrieve the current cell's state
along with neighbor states (as we'll see in a bit) - so the funciton names are admittedly
a bit of a misnomer, as they get the count of alive neighbors ***and*** whether the
current cell is alive as well.

So the `$getNeighborCountBoundary` function is the exact same as what I used for
every cell last time, but let's look at that new `$getNeighborCountCenter` function:

```wasm
(func $getNeighborCountCenter (param $row i32) (param $column i32) (result i32 i32) ;; neighborCount, currentValue
  (local $origIndex i32)
  (local $rowAbove i32)
  (local $rowCenter i32)
  (local $rowBelow i32)

  ;; store current cell's memory position
  call $getBoardPtr
  local.get $row
  local.get $column
  call $getIndexForPosition
  i32.add
  local.tee $origIndex

  ;; load the three rows
  i32.const 1
  i32.sub
  i32.load
  local.set $rowCenter

  local.get $origIndex
  i32.const 1
  i32.sub
  global.get $boardWidth
  i32.sub
  i32.load
  local.set $rowAbove

  local.get $origIndex
  i32.const 1
  i32.sub
  global.get $boardWidth
  i32.add
  i32.load
  local.set $rowBelow

  ;; count the number of alive neighbors in each row
  local.get $rowAbove
  i32.const 0x00_01_01_01
  i32.and
  i32.popcnt

  local.get $rowCenter
  i32.const 0x00_01_00_01
  i32.and
  i32.popcnt


  local.get $rowBelow
  i32.const 0x00_01_01_01
  i32.and
  i32.popcnt

  ;; sum neighbor count
  i32.add
  i32.add

  ;; get current
  local.get $rowCenter
  i32.const 0x00_00_01_00
  i32.and
  i32.popcnt
)
```

The first major change to notice is that I'm using the regular `i32.load`
instead of the `i32.load8_u` function I was using before.  This function is
pulling back 32 bytes at a time, and since the cells are in consecutive bytes
for each row, that means I just need one load instruction per row.

Each row is loaded after computing its memory offset using the `$boardWidth`
variable, and when loaded the cells I'm looking for should be in the three
least significant bytes (as multi-byte reads in WASM are little-endian).
Following the load calls I can use some bitwise operations to actually isolate and
count the live cell bits, and then I use another bitwise and operation to take
the already retrieved value for the center row and isolate the current cell's state.

So overall, three memory reads per cell (and there will be one more to write
it back, so four total per cell).

#### One last micro-optimization

Lastly let's take a quick look at that new `$getOutcomeFromCountAndCurrent`
function.  Of all the smaller changes I made, this is the one that actually
seemed to make the most difference, and it relies on reducing the amount of
branching there is in the cell determination.

First, here's what this block looked like in the last post for comparison:

```wasm
;; Exactly 3 neighbors
local.tee $count
i32.const 3
i32.eq

if
  ;; becomes or stays alive
  i32.const 1
  return
end

;; If currently dead
local.get $row
local.get $column
call $getValueAtPosition
i32.eqz
if
  ;; Stay dead
  i32.const 0
  return
end

;; 2 neighbors
local.get $count
i32.const 2
i32.eq
if
  i32.const 1
  return
end

;; otherwise dead
i32.const 0
```

It's certainly not *bad* - I mean there's three different `if` branches,
and an extra memory access here to look up the current state even though
we could have already gotten that before... but it's readable right?

Now let's look at the new function which replaced this block:

```wasm
(func $getOutcomeFromCountAndCurrent (param $neighborCount i32) (param $current i32) (result i32)
  ;; get sentinal value for current count
  i32.const 1
  local.get $neighborCount
  i32.shl

  ;; get value mask for current state
  i32.const 0xC ;;alive mask (1100, 3 or 2 neighbors)
  i32.const 0x8 ;;dead mask  (1000, exactly three neighbors)
  local.get $current
  select

  ;; mask out to see if we still have a bit
  i32.and
  i32.popcnt
)
```

Yeah okay no matter how much I think my first implementation was okay - I
can't really defend it when the alternate version is half as many instructions,
several of them are constant declarations, and it has *literally no branching*.

I wish I had known about the `i32.popcnt` and `select` instructions before,
they're *incredibly convenient* for stuff like this.  This change wasn't nearly
as significant as the memory access reduction before, but contributed about a 5ms
reduction to my frame times overall.

## What's next?

I will probably be continuing with this project again soon.  Maybe not as the
immediate next post I write again, but I am planning on trying a few more
significant optimizations.

The most notable one is I'd still like to bit-pack the board, and update my main
cell loop to update several at once.  Off the top of my head it seems like I
could use this to extend my 3-read/1-write update strategy to do up to 16 cells
at once (since I need one more cell to either side, and memory addresses must be
byte-aligned).

I also want to experiment with tracking the active regions of the board, as an
area which didn't update in the last frame will again not update in the current
frame.  Where I'm starting with an entirely random board I don't know how much
performance that will get me, but it's worth a try.

Anyways, that's all for now - I hope you enjoyed this continued exploration of
by-hand Webassembly programming.