---
title: "Pure Wasm Life 2: Optimizing Webassembly and Canvas"
subtitle: You know I couldn't leave it alone
resources:
- wasm-life-2/controller.js
- wasm-life-2/game.wat
- wasm-life-2/vertex.glsl
- wasm-life-2/fragment.glsl
---
> ***Note:*** This post is part 2 of my Webassembly Game of Life series. If you haven't already,
> you can read part 1 [here](/wasm-game-of-life-1).

So when I posted my last Webassembly Game of Life post I had a few
improvements I wanted to make to it. Some that I mentioned in that post
(like bitpacking the board so it took less memory), and some that I mentioned
to friends who read it (like switching from Canvas2D to WebGL or WebGPU so the
drawing code stays performant with more cells). This post will not address
all of those, but I have done some significant work to make my implementation
faster.
I'll get into the details in a bit, but first let's show off the new board!
This version draws at one pixel per game cell (whereas in the old version each
game cell was 3 pixels wide), so it's simulating and drawing **nine times as many cells**
without a noticeable difference in performance:
<noscript>[If you enable Javascript, you'll see a game board here]</noscript>
<canvas
id="game"
width="800"
height="600"
data-pixelsize="1"
style="width:100%; aspect-ratio: 4/3; image-rendering: pixelated;"
/>
<p style="display: flex; align-items: flex-start; margin-top: 4px; height: 64px">
<span style="flex: 1" id="frameTimes"/>
<button id="reset">Reset</button>
</p>
<ScriptLoader src="/wasm-life-2/controller.js"
wasm="/wasm-life-2/game.wat"
vertexSrc="/wasm-life-2/vertex.glsl"
fragmentSrc="/wasm-life-2/fragment.glsl"
canvas="#game"
></ScriptLoader>
My performance target for this was to still run at 60 frames per second on
my 2019 Macbook Pro. The previous version averaged around 10-11ms per frame
(well within the target 16.6ms for keeping that framerate), and I'm happy to
report that this version takes about 11-12ms per frame on that same machine!
Let's break down what I changed to get here.
## Overview
> ***A note on performance testing:*** Javascript is (in most browsers these days) a JIT-executed language,
> with dynamic optimized re-compilation. This can make performance testing of it
> *very hard*, and the performance testing I do for this post is not particularly
> rigorous.
>
> With that said, all benchmarks were done on the same hardware, in the
> same power state, with the same browser, and the same applications open. While
> not an entirely controlled environment, this is *good enough* to get the overall
> picture of how much my various attempts at optimization impacted things.

To get an idea of where things started, I ran the previous post's code unmodified,
but with the larger 800x600 board size I wanted to target for this post. I then
undertook a few different passes at optimization: first minor Webassembly changes
to the game update code, then rewriting the drawing code with WebGL, and then
lastly some more drastic Webassembly rewrites.
You can see the average time each of these versions took on my Macbook
in the table below:
> | Version | Time taken for game tick | Time taken to draw board | Computed framerate |
> |---|---|---|---|
> | Original | 46ms | 70ms | 8.6 fps |
> | Minor WASM Changes | 45.8ms | 70ms | 8.6 fps |
> | WebGL Drawing | 44.8ms | 0.4ms | 22.1 fps |
> | Major WASM Changes | 11.8ms | 0.4ms | 81.9\* fps |
>
> _\* This computed framerate is greater than my monitor can display, so the browser caps it to 60 fps_

Right off the bat: there were no simulation / game tick changes between the second
and third rows, yet we still see a change in the *measured* tick time. This indicates
that my measurement method has **at least** a millisecond or two of inaccuracy, so we
can conclude pretty quickly that my initial Webassembly changes were not particularly
useful as optimizations.
With that said though, the WebGL drawing was *very* significant, and my additional
Webassembly changes following those were definitely also a huge part of how I
reached my target framerate, so let's talk about those in turn.
### Drawing Game of Life with WebGL
Taking a look at my old drawing code, I do want to start out and say that
I didn't do anything *obviously* wrong. Iterating through every cell on the
board is an `O(n)` algorithm, but for small board sizes it worked great. However,
major browsers have had support for WebGL since about mid-2013 (even earlier if
you didn't care about Internet Explorer!) so it's safe to say that I could stand
to use the GPU for this.
I'm not going to go through all the WebGL initialization code, but let's briefly
look at what Javascript code is run on every frame:
```javascript
function drawBoard() {
const { gameExports, width, height, gl } = gameState
const wasmBuffer = gameExports.shared_memory.buffer
const boardPointer = gameExports.getBoardPointer()
const boardLength = gameExports.getBoardLength()
const boardData = (new Uint8Array(wasmBuffer)).slice(boardPointer, boardPointer + boardLength)
gl.texImage2D(gl.TEXTURE_2D, 0, gl.ALPHA, width, height, 0, gl.ALPHA, gl.UNSIGNED_BYTE, boardData)
gl.drawArrays(gl.TRIANGLES, 0, 6)
}
```
And that's it!
No loops, no iterating over the board, it's *basically nothing*. I just ask the
Webassembly part of my code for a pointer, construct an array view into the
memory buffer, and pass it directly to WebGL as a texture.
This simplicity comes from the fact that I'm storing my board as one byte per cell,
which means I can use the `gl.ALPHA` and `gl.UNSIGNED_BYTE` flags to tell WebGL
to expect a texture of **exactly** that format, without any preliminary data
transformation.
Then the WebGL fragment shader just needs to read that data from the texture's
alpha channel like so:
```glsl
precision highp float;
varying vec2 texCoords;
uniform sampler2D textureSampler;
uniform vec3 drawColor;
uniform vec3 backgroundColor;
void main() {
vec4 textureColor = texture2D(textureSampler, texCoords);
float alphaValue = textureColor.a;
vec3 resultColor = mix(backgroundColor, drawColor, alphaValue * 255.0);
gl_FragColor = vec4(resultColor, 1);
}
```
There's a correction factor here (multiplying the value by 255). My board data
uses only 0s and 1s in its bytes to indicate a dead or alive cell, but WebGL
normalizes `UNSIGNED_BYTE` texture values to the range 0 to 1, so a stored 1
samples back as 1/255 - multiplying by 255 restores a full-strength value.
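
Since that normalization is easy to trip over, here's the round trip modeled in plain JavaScript (just an illustration of the arithmetic, not part of the actual shader):

```javascript
// WebGL normalizes UNSIGNED_BYTE texture data: texture2D returns the byte
// value divided by 255, so a stored 1 samples as 1/255. The shader's
// `* 255.0` maps that back onto a clean 0-or-1 mix factor.
function sampledAlpha(cellByte) {
  return cellByte / 255 // what textureColor.a holds for this texel
}

function mixFactor(cellByte) {
  return sampledAlpha(cellByte) * 255 // the shader's correction factor
}
```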
The shader also takes two `uniform` parameters for what colors to draw with, so
I will show you where those get set:
```javascript
export async function onThemeChange() {
const {gl, shader} = gameState
const drawColorString = getComputedStyle(gameState.canvas).getPropertyValue('color')
const drawColor = parseColorString(drawColorString)
const backgroundColorString = getComputedStyle(gameState.canvas).getPropertyValue('--background')
const backgroundColor = parseColorString(backgroundColorString)
const drawColorLocation = gl.getUniformLocation(shader, "drawColor")
gl.uniform3fv(drawColorLocation, drawColor)
const backgroundColorLocation = gl.getUniformLocation(shader, "backgroundColor")
gl.uniform3fv(backgroundColorLocation, backgroundColor)
drawBoard()
}
```
This is again pretty simple, just pulling some values out of the CSS properties,
parsing them, and then updating WebGL's parameters. This gets run once on page
initialization, and then once each time the "Appearance" button in the page navigation
is clicked and the page switches between light and dark theme.
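
The post doesn't show `parseColorString` itself, so here's a minimal sketch of what such a helper might look like, assuming `getComputedStyle` hands back colors as `rgb(r, g, b)` strings (the real implementation may differ):

```javascript
// Hypothetical sketch of the parseColorString helper referenced above.
// gl.uniform3fv expects normalized floats in the 0 to 1 range, so each
// channel gets divided by 255.
function parseColorString(colorString) {
  const match = colorString.match(/rgba?\(\s*(\d+)[,\s]+(\d+)[,\s]+(\d+)/)
  if (!match) throw new Error(`Unrecognized color: ${colorString}`)
  return [match[1], match[2], match[3]].map(channel => Number(channel) / 255)
}
```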
That's pretty much it for the drawing, so let's talk about how I made the actual
game update tick faster now.
### Optimizing the actual Webassembly code
So I don't know that much about how Webassembly execution is implemented in
browsers, but I know enough about actual CPU architectures to make some reasonable
inferences.
As I started thinking about more significant algorithm changes
I could make, I suspected that the largest contributor to time taken in my old
algorithm was probably from it taking longer to access the Webassembly linear memory
than local or global variables. At an implementation level this could be because
of how the wasm memory relates to the CPU cache, but I didn't really care about that
\- I was just curious to see if reducing the number of memory operations my algorithm
took could provide me more of a speed-up.
I knew I was already pretty inefficient in this area - my previous algorithm made ***10 memory accesses per cell***
(8 to check neighbor states, 1 to check the current cell's state, and 1 to store
the updated cell's state), so I suspected I could get that much lower.
#### Reducing memory accesses
I started by dividing up my `$getNewValueAtPosition` function into a `$getNeighborCount`
and `$getOutcomeFromCountAndCurrent`, so that I could have different logic
for counting the number of alive neighbors on border cells and internal cells while
still reusing the logic for actually figuring out what happens to a cell.
Then the `$getNeighborCount` function became `$getNeighborCountBoundary` and
`$getNeighborCountCenter` - because for optimizing memory access, I didn't want
to have to worry about reading something partially aligned outside the board.
I decided the old algorithm would work well enough for boundary cells for that
reason.
So my updated `$getNewValueAtPosition` function looks like this:
```wasm
(func $getNewValueAtPosition (param $row i32) (param $column i32) (result i32)
(local $count i32)
(local $current i32)
local.get $row
i32.const 1
i32.lt_u
local.get $row
global.get $boardHeight
i32.const 1
i32.sub
i32.ge_u
local.get $column
i32.const 1
i32.lt_u
local.get $column
global.get $boardWidth
i32.const 1
i32.sub
i32.ge_u
;; if any of the boundary conditions are true
i32.or
i32.or
i32.or
if (result i32 i32)
local.get $row
local.get $column
call $getNeighborCountBoundary
else
local.get $row
local.get $column
call $getNeighborCountCenter
end
call $getOutcomeFromCountAndCurrent
)
```
It first checks to see if the cell is along any of the four boundaries, and if
so calls `$getNeighborCountBoundary`. If not it calls `$getNeighborCountCenter`,
and either of these are expected to return ***two*** values, which are passed into
`$getOutcomeFromCountAndCurrent`.
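
In more familiar terms, the stacked unsigned comparisons and the chain of `i32.or` instructions amount to something like this JavaScript (an illustration only; `boardWidth` / `boardHeight` correspond to the WAT globals):

```javascript
// Equivalent of the boundary check in $getNewValueAtPosition: the
// i32.lt_u / i32.ge_u comparisons test each edge, and the i32.or chain
// combines them into one condition.
function isBoundary(row, column, boardWidth, boardHeight) {
  return (
    row < 1 || row >= boardHeight - 1 ||
    column < 1 || column >= boardWidth - 1
  )
}
```

Any cell touching one of the four edges takes the slower boundary path; everything else gets the optimized center path.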
You might be wondering why a neighbor count function would return two values.
That's because it turns out to be more efficient to retrieve the current cell's state
along with the neighbor states (as we'll see in a bit) - so the function names are admittedly
a bit of a misnomer, as they get the count of alive neighbors ***and*** whether the
current cell is alive as well.
So the `$getNeighborCountBoundary` function is the exact same as what I used for
every cell last time, but let's look at that new `$getNeighborCountCenter` function:
```wasm
(func $getNeighborCountCenter (param $row i32) (param $column i32) (result i32 i32) ;; neighborCount, currentValue
(local $origIndex i32)
(local $rowAbove i32)
(local $rowCenter i32)
(local $rowBelow i32)
;; store current cell's memory position
call $getBoardPtr
local.get $row
local.get $column
call $getIndexForPosition
i32.add
local.tee $origIndex
;; load the three rows
i32.const 1
i32.sub
i32.load
local.set $rowCenter
local.get $origIndex
i32.const 1
i32.sub
global.get $boardWidth
i32.sub
i32.load
local.set $rowAbove
local.get $origIndex
i32.const 1
i32.sub
global.get $boardWidth
i32.add
i32.load
local.set $rowBelow
;; count the number of alive neighbors in each row
local.get $rowAbove
i32.const 0x00_01_01_01
i32.and
i32.popcnt
local.get $rowCenter
i32.const 0x00_01_00_01
i32.and
i32.popcnt
local.get $rowBelow
i32.const 0x00_01_01_01
i32.and
i32.popcnt
;; sum neighbor count
i32.add
i32.add
;; get current
local.get $rowCenter
i32.const 0x00_00_01_00
i32.and
i32.popcnt
)
```
The first major change to notice is that I'm using the regular `i32.load`
instead of the `i32.load8_u` instruction I was using before. This instruction
pulls back 32 bits (four bytes) at a time, and since the cells are in consecutive
bytes within each row, that means I just need one load instruction per row.
Each row is loaded after computing its memory offset using the `$boardWidth`
variable, and once loaded, the cells I'm looking for should be in the three
least significant bytes (as multi-byte reads in WASM are little-endian).
Following the load calls I can use some bitwise operations to actually isolate and
count the live cell bits, and then I use another bitwise and operation to take
the already retrieved value for the center row and isolate the current cell's state.
So overall, three memory reads per cell (and there will be one more to write
it back, so four total per cell).
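
As a sanity check, the masking and counting can be modeled in JavaScript (an illustration only; the row values stand in for the 32-bit words the WAT loads, and popcount is written out by hand since plain JS has no `popcnt`):

```javascript
// Counts the set bits in a 32-bit value, like WASM's i32.popcnt
function popcnt(x) {
  let count = 0
  while (x !== 0) {
    count += x & 1
    x >>>= 1
  }
  return count
}

// rowAbove / rowCenter / rowBelow stand in for the three words loaded at
// (cell address - 1): the three relevant cells sit in the three least
// significant bytes because multi-byte reads are little-endian.
function neighborCountCenter(rowAbove, rowCenter, rowBelow) {
  const count =
    popcnt(rowAbove & 0x00010101) +  // three neighbors above
    popcnt(rowCenter & 0x00010001) + // left and right neighbors
    popcnt(rowBelow & 0x00010101)    // three neighbors below
  const current = popcnt(rowCenter & 0x00000100) // the cell itself
  return [count, current]
}
```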
#### One last micro-optimization
Lastly let's take a quick look at that new `$getOutcomeFromCountAndCurrent`
function. Of all the smaller changes I made, this is the one that actually
seemed to make the most difference, and it relies on reducing the amount of
branching there is in the cell determination.
First, here's what this block looked like in the last post for comparison:
```wasm
;; Exactly 3 neighbors
local.tee $count
i32.const 3
i32.eq
if
;; becomes or stays alive
i32.const 1
return
end
;; If currently dead
local.get $row
local.get $column
call $getValueAtPosition
i32.eqz
if
;; Stay dead
i32.const 0
return
end
;; 2 neighbors
local.get $count
i32.const 2
i32.eq
if
i32.const 1
return
end
;; otherwise dead
i32.const 0
```
It's certainly not *bad* - I mean, there are three different `if` branches,
and an extra memory access to look up the current state even though
we could have already gotten that earlier... but it's readable, right?
Now let's look at the new function which replaced this block:
```wasm
(func $getOutcomeFromCountAndCurrent (param $neighborCount i32) (param $current i32) (result i32)
;; get sentinel value for current count
i32.const 1
local.get $neighborCount
i32.shl
;; get value mask for current state
i32.const 0xC ;;alive mask (1100, 3 or 2 neighbors)
i32.const 0x8 ;;dead mask (1000, exactly three neighbors)
local.get $current
select
;; mask out to see if we still have a bit
i32.and
i32.popcnt
)
```
Yeah, okay - no matter how much I think my first implementation was fine, I
can't really defend it when the alternate version is half as many instructions,
several of them are constant declarations, and it has *literally no branching*.
I wish I had known about the `i32.popcnt` and `select` instructions before;
they're *incredibly convenient* for stuff like this. This change wasn't nearly
as significant as the memory access reduction before it, but it still contributed
about a 5ms reduction to my frame times overall.
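
For readers who find the stack-machine form hard to follow, here's the same branchless logic modeled in JavaScript (an illustration, not code from the project):

```javascript
// Counts the set bits in a 32-bit value, like WASM's i32.popcnt
function popcnt(x) {
  let count = 0
  while (x !== 0) {
    count += x & 1
    x >>>= 1
  }
  return count
}

function outcome(neighborCount, current) {
  const sentinel = 1 << neighborCount // a single bit at position neighborCount
  const mask = current ? 0xC : 0x8    // alive: 2 or 3 neighbors; dead: exactly 3
  return popcnt(sentinel & mask)      // 1 if the sentinel bit survives the mask
}
```

The `select` in the WAT picks between the two masks without branching, and `popcnt` collapses the surviving bit (if any) into a 0-or-1 result.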
## What's next?
I will probably be continuing with this project again soon. Maybe not as the
immediate next post I write again, but I am planning on trying a few more
significant optimizations.
The most notable one is that I'd still like to bit-pack the board, and update my main
cell loop to update several cells at once. Off the top of my head it seems like I
could use this to extend my 3-read/1-write update strategy to handle up to 16 cells
at once (since I need one more cell on either side, and memory addresses must be
byte-aligned).
I also want to experiment with tracking the active regions of the board, since an
area which didn't update in the last frame will again not update in the current
frame. Because I'm starting with an entirely random board I don't know how much
performance that will get me, but it's worth a try.
Anyways, that's all for now - I hope you enjoyed this continued exploration of
by-hand Webassembly programming.