You cannot select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
452 lines
16 KiB
Markdown
452 lines
16 KiB
Markdown
---
|
|
title: "Pure Wasm Life 2: Optimizing Webassembly and Canvas"
|
|
subtitle: You know I couldn't leave it alone
|
|
resources:
|
|
- wasm-life-2/controller.js
|
|
- wasm-life-2/game.wat
|
|
- wasm-life-2/vertex.glsl
|
|
- wasm-life-2/fragment.glsl
|
|
---
|
|
|
|
> ***Note:*** This post is part 2 of my Webassembly Game of Life series, if you haven't already
|
|
> you can read part 1 [here](/wasm-game-of-life-1).
|
|
|
|
So when I posted my last Webassembly Game of Life post I had a few
|
|
improvements I wanted to make to it. Some that I mentioned in that post
|
|
(like bitpacking the board so it took less memory), and some that I mentioned
|
|
to friends who read it (like switching from Canvas2D to WebGL or WebGPU so the
|
|
drawing code stays performant with more cells). This post will not address
|
|
all of those, but I have done some significant work to make my implementation
|
|
faster.
|
|
|
|
I'll get into the details in a bit, but first let's show off the new board!
|
|
This version is drawing at one pixel per game cell (whereas the old version the
|
|
game cells were 3 pixels wide), so it's simulating and drawing **nine times as many cells**
|
|
without a noticeable difference to performance:
|
|
|
|
<noscript>[If you enable Javascript, you'll see a game board here]</noscript>
|
|
<canvas
|
|
id="game"
|
|
width="800"
|
|
height="600"
|
|
data-pixelsize="1"
|
|
style="width:100%; aspect-ratio: 4/3; image-rendering: pixelated;"
|
|
/>
|
|
<p style="display: flex; align-items: flex-start; margin-top: 4px; height: 64px">
|
|
<span style="flex: 1" id="frameTimes"/>
|
|
<button id="reset">Reset</button>
|
|
</p>
|
|
<ScriptLoader src="/wasm-life-2/controller.js"
|
|
wasm="/wasm-life-2/game.wat"
|
|
vertexSrc="/wasm-life-2/vertex.glsl"
|
|
fragmentSrc="/wasm-life-2/fragment.glsl"
|
|
canvas="#game"
|
|
></ScriptLoader>
|
|
|
|
My performance target for this was to still run at 60 frames per second on
|
|
my 2019 Macbook Pro. The previous version averaged around 10-11ms per frame
|
|
(well within the target 16.6ms for keeping that framerate), and I'm happy to
|
|
report that this version takes about 11-12ms per frame on that same machine!
|
|
|
|
Let's break down what I changed to get here.
|
|
|
|
## Overview
|
|
|
|
> ***A note on performance testing:*** Javascript is (in most browsers these days) a JIT-executed language,
|
|
with dynamic optimized re-compilation. This can make performance testing of it
|
|
*very hard*, and the performance testing I do for this post is not particularly
|
|
rigorous.
|
|
>
|
|
> With that said, all benchmarks were done on the same hardware, in the
|
|
same power state, with the same browser, and the same applications open. While
|
|
not an entirely controlled environment, this is *good enough* to get the overall
|
|
picture of how much my various attempts at optimization impacted things.
|
|
|
|
To get an idea of where things were started, I ran the previous post's code unmodified,
|
|
but with the larger 800x600 board size I wanted to target for this post. I then
|
|
undertook a few different passes at optimization: first minor Webassembly changes
|
|
to the game update code, then rewriting the drawing code with WebGL, and then
|
|
lastly some more drastic Webassembly rewrites.
|
|
|
|
You can see the average time each of these versions took on my Macbook
|
|
in the table below:
|
|
|
|
> | Version | Time taken for game tick | Time taken to draw board | Computed framerate |
|
|
> |---|---|---|---|---|
|
|
> | Original | 46ms | 70ms | 8.6 fps |
|
|
> | Minor WASM Changes | 45.8ms | 70ms | 8.6 fps |
|
|
> | WebGL Drawing | 44.8ms | 0.4ms | 22.1 fps |
|
|
> | Major WASM Changes | 11.8ms | 0.4ms | 81.9\* fps |
|
|
>
|
|
> _\* This computed framerate is greater than my monitor can display, so the browser caps is to 60 fps_
|
|
|
|
Right off the bat, there were no simulation / game tick changes between the second and
|
|
third rows, but we do see a change in the *measured* tick time - this indicates that my measurement method has **at least** a millisecond
|
|
or two of inaccuracy, so we can conclude pretty quickly that my initial Webassembly
|
|
changes were not particularly useful as optimizations.
|
|
|
|
With that said though, the WebGL drawing was *very* significant, and my additional
|
|
Webassembly changes following those were definitely also a huge part of how I
|
|
reached my target framerate, so let's talk about those in turn.
|
|
|
|
### Drawing Game of Life with WebGL
|
|
|
|
Taking a look at my old drawing code, I do want to start out and say that
|
|
I didn't do anything *obviously* wrong. Iterating through every cell on the
|
|
board is an `O(n)` algorithm, but for small board sizes it worked great. However,
|
|
major browsers have had support for WebGL since about mid-2013 (even earlier if
|
|
you didn't care about Internet Explorer!) so it's safe to say that I could stand
|
|
to use the GPU for this.
|
|
|
|
I'm not going to go through all the WebGL initialization code, but let's briefly
|
|
look at what Javascript code is run on every frame:
|
|
|
|
```javascript
|
|
function drawBoard() {
|
|
const { gameExports, width, height, gl } = gameState
|
|
const wasmBuffer = gameExports.shared_memory.buffer
|
|
const boardPointer = gameExports.getBoardPointer()
|
|
const boardLength = gameExports.getBoardLength()
|
|
const boardData = (new Uint8Array(wasmBuffer)).slice(boardPointer, boardPointer + boardLength)
|
|
|
|
gl.texImage2D(gl.TEXTURE_2D, 0, gl.ALPHA, width, height, 0, gl.ALPHA, gl.UNSIGNED_BYTE, boardData)
|
|
gl.drawArrays(gl.TRIANGLES, 0, 6)
|
|
}
|
|
```
|
|
|
|
And that's it!
|
|
|
|
No loops, no iterating over the board, it's *basically nothing*. I just ask the
|
|
Webassembly part of my code for a pointer, construct an array view into the
|
|
memory buffer, and pass it directly to WebGL as a texture.
|
|
|
|
This simplicity comes from the fact that I'm storing my board as one byte per cell,
|
|
which means I can use the `gl.ALPHA` and `gl.UNSIGNED_BYTE` flags to tell WebGL
|
|
to expect a texture of **exactly** that format, without any preliminary data
|
|
transformation.
|
|
|
|
Then the WebGL fragment shader just needs to read that data from the texture's
|
|
alpha channel like so:
|
|
|
|
```glsl
|
|
precision highp float;
|
|
varying vec2 texCoords;
|
|
uniform sampler2D textureSampler;
|
|
uniform vec3 drawColor;
|
|
uniform vec3 backgroundColor;
|
|
|
|
void main() {
|
|
vec4 textureColor = texture2D(textureSampler, texCoords);
|
|
float alphaValue = textureColor.a;
|
|
|
|
vec3 resultColor = mix(backgroundColor, drawColor, alphaValue * 255.0);
|
|
|
|
gl_FragColor = vec4(resultColor, 1);
|
|
}
|
|
```
|
|
|
|
There's a correction factor here (multiplying the value by 255), but that's because
|
|
my board data had only 0s and 1s in the bytes for indicating an alive or dead
|
|
cell, and the shader expects that texture value to have a range between 0 and 255 (the range of a byte).
|
|
|
|
The shader also takes two `uniform` parameters for what colors to draw with, so
|
|
I will show you where those get set:
|
|
|
|
```javascript
|
|
export async function onThemeChange() {
|
|
const {gl, shader} = gameState
|
|
|
|
const drawColorString = getComputedStyle(gameState.canvas).getPropertyValue('color')
|
|
const drawColor = parseColorString(drawColorString)
|
|
|
|
const backgroundColorString = getComputedStyle(gameState.canvas).getPropertyValue('--background')
|
|
const backgroundColor = parseColorString(backgroundColorString)
|
|
|
|
const drawColorLocation = gl.getUniformLocation(shader, "drawColor")
|
|
gl.uniform3fv(drawColorLocation, drawColor)
|
|
|
|
const backgroundColorLocation = gl.getUniformLocation(shader, "backgroundColor")
|
|
gl.uniform3fv(backgroundColorLocation, backgroundColor)
|
|
|
|
drawBoard()
|
|
}
|
|
```
|
|
|
|
This is again pretty simple, just pulling some values out of the CSS properties,
|
|
parsing them, and then updating WebGL's parameters. This gets run once on page
|
|
initialization, and then once each time the "Appearance" button in the page navigation
|
|
is clicked and the page switches between light and dark theme.
|
|
|
|
That's pretty much it for the drawing, so let's talk about how I made the actual
|
|
game update tick faster now.
|
|
|
|
### Optimizing the actual Webassembly code
|
|
|
|
So I don't know that much about how Webassembly execution is implemented into
|
|
browsers, but I know enough about actual CPU architectures to make some reasonable
|
|
inferences.
|
|
|
|
As I started thinking about more significant algorithm changes
|
|
I could make, I suspected that the largest contributor to time taken in my old
|
|
algorithm was probably from it taking longer to access the Webassembly linear memory
|
|
than local or global variables. At an implementation level this could be because
|
|
of how the wasm memory relates to the CPU cache, but I didn't really care about that
|
|
\- I was just curious to see if reducing the number of memory operations my algorithm
|
|
took could provide me more of a speed-up.
|
|
|
|
I knew I was already pretty inefficient in this area - my previous algorithm had made ***10 memory calls per cell***
|
|
(8 to check neighbor states, 1 to check the current cell's state, and 1 to store
|
|
the updated cell's state), so I suspected I could get that much lower.
|
|
|
|
#### Reducing memory accesses
|
|
|
|
I started by dividing up my `$getNewValueAtPosition` function into a `$getNeighborCount`
|
|
and `$getOutcomeFromCountAndCurrent`, so that I could have different logic
|
|
for counting the number of alive neighbors on border cells and internal cells while
|
|
still reusing the logic for actually figuring out what happens to a cell.
|
|
|
|
Then the `$getNeighborCount` function became `$getNeighborCountBoundary` and
|
|
`$getNeighborCountCenter` - because for optimizing memory access, I didn't want
|
|
to have to worry about reading something partially aligned outside the board.
|
|
I decided the old algorithm would work well enough for boundary cells for that
|
|
reason.
|
|
|
|
So my updated `$getNewValueAtPosition` function looks like this:
|
|
|
|
```wasm
|
|
(func $getNewValueAtPosition (param $row i32) (param $column i32) (result i32)
|
|
(local $count i32)
|
|
(local $current i32)
|
|
local.get $row
|
|
i32.const 1
|
|
i32.lt_u
|
|
|
|
local.get $row
|
|
global.get $boardHeight
|
|
i32.const 1
|
|
i32.sub
|
|
i32.ge_u
|
|
|
|
local.get $column
|
|
i32.const 1
|
|
i32.lt_u
|
|
|
|
local.get $column
|
|
global.get $boardWidth
|
|
i32.const 1
|
|
i32.sub
|
|
i32.ge_u
|
|
|
|
;; if any of the boundary conditions are true
|
|
i32.or
|
|
i32.or
|
|
i32.or
|
|
|
|
if (result i32 i32)
|
|
local.get $row
|
|
local.get $column
|
|
call $getNeighborCountBoundary
|
|
else
|
|
local.get $row
|
|
local.get $column
|
|
call $getNeighborCountCenter
|
|
end
|
|
|
|
call $getOutcomeFromCountAndCurrent
|
|
)
|
|
```
|
|
|
|
It first checks to see if the cell is along any of the four boundaries, and if
|
|
so calls `$getNeighborCountBoundary`. If not it calls `$getNeighborCountCenter`,
|
|
and either of these are expected to return ***two*** values, which are passed into
|
|
`$getOutcomeFromCountAndCurrent`.
|
|
|
|
You might be wondering why a neighbor count function would return two values, and
|
|
that's because it turns out it's more efficient to retrieve the current cell's state
|
|
along with neighbor states (as we'll see in a bit) - so the funciton names are admittedly
|
|
a bit of a misnomer, as they get the count of alive neighbors ***and*** whether the
|
|
current cell is alive as well.
|
|
|
|
So the `$getNeighborCountBoundary` function is the exact same as what I used for
|
|
every cell last time, but let's look at that new `$getNeighborCountCenter` function:
|
|
|
|
```wasm
|
|
(func $getNeighborCountCenter (param $row i32) (param $column i32) (result i32 i32) ;; neighborCount, currentValue
|
|
(local $origIndex i32)
|
|
(local $rowAbove i32)
|
|
(local $rowCenter i32)
|
|
(local $rowBelow i32)
|
|
|
|
;; store current cell's memory position
|
|
call $getBoardPtr
|
|
local.get $row
|
|
local.get $column
|
|
call $getIndexForPosition
|
|
i32.add
|
|
local.tee $origIndex
|
|
|
|
;; load the three rows
|
|
i32.const 1
|
|
i32.sub
|
|
i32.load
|
|
local.set $rowCenter
|
|
|
|
local.get $origIndex
|
|
i32.const 1
|
|
i32.sub
|
|
global.get $boardWidth
|
|
i32.sub
|
|
i32.load
|
|
local.set $rowAbove
|
|
|
|
local.get $origIndex
|
|
i32.const 1
|
|
i32.sub
|
|
global.get $boardWidth
|
|
i32.add
|
|
i32.load
|
|
local.set $rowBelow
|
|
|
|
;; count the number of alive neighbors in each row
|
|
local.get $rowAbove
|
|
i32.const 0x00_01_01_01
|
|
i32.and
|
|
i32.popcnt
|
|
|
|
local.get $rowCenter
|
|
i32.const 0x00_01_00_01
|
|
i32.and
|
|
i32.popcnt
|
|
|
|
|
|
local.get $rowBelow
|
|
i32.const 0x00_01_01_01
|
|
i32.and
|
|
i32.popcnt
|
|
|
|
;; sum neighbor count
|
|
i32.add
|
|
i32.add
|
|
|
|
;; get current
|
|
local.get $rowCenter
|
|
i32.const 0x00_00_01_00
|
|
i32.and
|
|
i32.popcnt
|
|
)
|
|
```
|
|
|
|
The first major change to notice is that I'm using the regular `i32.load`
|
|
instead of the `i32.load8_u` function I was using before. This function is
|
|
pulling back 32 bytes at a time, and since the cells are in consecutive bytes
|
|
for each row, that means I just need one load instruction per row.
|
|
|
|
Each row is loaded after computing its memory offset using the `$boardWidth`
|
|
variable, and when loaded the cells I'm looking for should be in the three
|
|
least significant bytes (as multi-byte reads in WASM are little-endian).
|
|
Following the load calls I can use some bitwise operations to actually isolate and
|
|
count the live cell bits, and then I use another bitwise and operation to take
|
|
the already retrieved value for the center row and isolate the current cell's state.
|
|
|
|
So overall, three memory reads per cell (and there will be one more to write
|
|
it back, so four total per cell).
|
|
|
|
#### One last micro-optimization
|
|
|
|
Lastly let's take a quick look at that new `$getOutcomeFromCountAndCurrent`
|
|
function. Of all the smaller changes I made, this is the one that actually
|
|
seemed to make the most difference, and it relies on reducing the amount of
|
|
branching there is in the cell determination.
|
|
|
|
First, here's what this block looked like in the last post for comparison:
|
|
|
|
```wasm
|
|
;; Exactly 3 neighbors
|
|
local.tee $count
|
|
i32.const 3
|
|
i32.eq
|
|
|
|
if
|
|
;; becomes or stays alive
|
|
i32.const 1
|
|
return
|
|
end
|
|
|
|
;; If currently dead
|
|
local.get $row
|
|
local.get $column
|
|
call $getValueAtPosition
|
|
i32.eqz
|
|
if
|
|
;; Stay dead
|
|
i32.const 0
|
|
return
|
|
end
|
|
|
|
;; 2 neighbors
|
|
local.get $count
|
|
i32.const 2
|
|
i32.eq
|
|
if
|
|
i32.const 1
|
|
return
|
|
end
|
|
|
|
;; otherwise dead
|
|
i32.const 0
|
|
```
|
|
|
|
It's certainly not *bad* - I mean there's three different `if` branches,
|
|
and an extra memory access here to look up the current state even though
|
|
we could have already gotten that before... but it's readable right?
|
|
|
|
Now let's look at the new function which replaced this block:
|
|
|
|
```wasm
|
|
(func $getOutcomeFromCountAndCurrent (param $neighborCount i32) (param $current i32) (result i32)
|
|
;; get sentinal value for current count
|
|
i32.const 1
|
|
local.get $neighborCount
|
|
i32.shl
|
|
|
|
;; get value mask for current state
|
|
i32.const 0xC ;;alive mask (1100, 3 or 2 neighbors)
|
|
i32.const 0x8 ;;dead mask (1000, exactly three neighbors)
|
|
local.get $current
|
|
select
|
|
|
|
;; mask out to see if we still have a bit
|
|
i32.and
|
|
i32.popcnt
|
|
)
|
|
```
|
|
|
|
Yeah okay no matter how much I think my first implementation was okay - I
|
|
can't really defend it when the alternate version is half as many instructions,
|
|
several of them are constant declarations, and it has *literally no branching*.
|
|
|
|
I wish I had known about the `i32.popcnt` and `select` instructions before,
|
|
they're *incredibly convenient* for stuff like this. This change wasn't nearly
|
|
as significant as the memory access reduction before, but contributed about a 5ms
|
|
reduction to my frame times overall.
|
|
|
|
## What's next?
|
|
|
|
I will probably be continuing with this project again soon. Maybe not as the
|
|
immediate next post I write again, but I am planning on trying a few more
|
|
significant optimizations.
|
|
|
|
The most notable one is I'd still like to bit-pack the board, and update my main
|
|
cell loop to update several at once. Off the top of my head it seems like I
|
|
could use this to extend my 3-read/1-write update strategy to do up to 16 cells
|
|
at once (since I need one more cell to either side, and memory addresses must be
|
|
byte-aligned).
|
|
|
|
I also want to experiment with tracking the active regions of the board, as an
|
|
area which didn't update in the last frame will again not update in the current
|
|
frame. Where I'm starting with an entirely random board I don't know how much
|
|
performance that will get me, but it's worth a try.
|
|
|
|
Anyways, that's all for now - I hope you enjoyed this continued exploration of
|
|
by-hand Webassembly programming.
|