--- title: "Pure Wasm Life 2: Optimizing Webassembly and Canvas" subtitle: You know I couldn't leave it alone resources: - wasm-life-2/controller.js - wasm-life-2/game.wat - wasm-life-2/vertex.glsl - wasm-life-2/fragment.glsl --- > ***Note:*** This post is part 2 of my Webassembly Game of Life series, if you haven't already > you can read part 1 [here](/wasm-game-of-life-1). So when I posted my last Webassembly Game of Life post I had a few improvements I wanted to make to it. Some that I mentioned in that post (like bitpacking the board so it took less memory), and some that I mentioned to friends who read it (like switching from Canvas2D to WebGL or WebGPU so the drawing code stays performant with more cells). This post will not address all of those, but I have done some significant work to make my implementation faster. I'll get into the details in a bit, but first let's show off the new board! This version is drawing at one pixel per game cell (whereas the old version the game cells were 3 pixels wide), so it's simulating and drawing **nine times as many cells** without a noticeable difference to performance:

My performance target for this was to still run at 60 frames per second on my 2019 MacBook Pro. The previous version averaged around 10-11ms per frame (well within the target 16.6ms for keeping that framerate), and I'm happy to report that this version takes about 11-12ms per frame on that same machine! Let's break down what I changed to get here.

## Overview

> ***A note on performance testing:*** Javascript is (in most browsers these days) a JIT-executed language, with dynamic optimizing re-compilation. This can make performance testing of it *very hard*, and the performance testing I do for this post is not particularly rigorous.
>
> With that said, all benchmarks were done on the same hardware, in the same power state, with the same browser, and the same applications open. While not an entirely controlled environment, this is *good enough* to get the overall picture of how much my various attempts at optimization impacted things.

To get an idea of where things started, I ran the previous post's code unmodified, but with the larger 800x600 board size I wanted to target for this post. I then undertook a few different passes at optimization: first minor Webassembly changes to the game update code, then rewriting the drawing code with WebGL, and lastly some more drastic Webassembly rewrites.
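The post doesn't show its measurement harness, but the averaging described above can be sketched as something like the following. This is a hypothetical helper of my own, not code from the project, and real browser JIT warm-up means you'd want to discard the first several iterations for anything rigorous:

```javascript
// Rough timing harness: runs a function repeatedly and reports the average
// wall-clock time per call. `performance.now()` is available in browsers and
// modern Node alike.
function timeSection(label, fn, iterations = 100) {
  const start = performance.now()
  for (let i = 0; i < iterations; i++) fn()
  const average = (performance.now() - start) / iterations
  console.log(`${label}: ${average.toFixed(1)}ms`)
  return average
}
```

You would then wrap the game tick and the draw call separately, which is how the two middle columns in the table below are distinguished.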
You can see the average time each of these versions took on my MacBook in the table below:

> | Version | Time taken for game tick | Time taken to draw board | Computed framerate |
> |---|---|---|---|
> | Original | 46ms | 70ms | 8.6 fps |
> | Minor WASM Changes | 45.8ms | 70ms | 8.6 fps |
> | WebGL Drawing | 44.8ms | 0.4ms | 22.1 fps |
> | Major WASM Changes | 11.8ms | 0.4ms | 81.9\* fps |
>
> _\* This computed framerate is greater than my monitor can display, so the browser caps it to 60 fps_

Right off the bat: there were no simulation / game tick changes between the second and third rows, but we do see a change in the *measured* tick time. This indicates that my measurement method has **at least** a millisecond or two of inaccuracy, so we can conclude pretty quickly that my initial Webassembly changes were not particularly useful as optimizations. With that said, the WebGL drawing change was *very* significant, and the additional Webassembly changes following it were definitely also a huge part of how I reached my target framerate, so let's talk about those in turn.

### Drawing Game of Life with WebGL

Taking a look at my old drawing code, I want to start out by saying that I didn't do anything *obviously* wrong. Iterating through every cell on the board is an `O(n)` algorithm, and for small board sizes it worked great. However, major browsers have had support for WebGL since about mid-2013 (even earlier if you didn't care about Internet Explorer!), so it's safe to say that I could stand to use the GPU for this.
I'm not going to go through all the WebGL initialization code, but let's briefly look at what Javascript code is run on every frame:

```javascript
function drawBoard() {
  const { gameExports, width, height, gl } = gameState
  const wasmBuffer = gameExports.shared_memory.buffer
  const boardPointer = gameExports.getBoardPointer()
  const boardLength = gameExports.getBoardLength()
  const boardData = (new Uint8Array(wasmBuffer)).slice(boardPointer, boardPointer + boardLength)

  gl.texImage2D(gl.TEXTURE_2D, 0, gl.ALPHA, width, height, 0, gl.ALPHA, gl.UNSIGNED_BYTE, boardData)
  gl.drawArrays(gl.TRIANGLES, 0, 6)
}
```

And that's it! No loops, no iterating over the board; it's *basically nothing*. I just ask the Webassembly part of my code for a pointer, construct an array view into the memory buffer, and pass it directly to WebGL as a texture. This simplicity comes from the fact that I'm storing my board as one byte per cell, which means I can use the `gl.ALPHA` and `gl.UNSIGNED_BYTE` flags to tell WebGL to expect a texture of **exactly** that format, without any preliminary data transformation. Then the WebGL fragment shader just needs to read that data from the texture's alpha channel like so:

```glsl
precision highp float;

varying vec2 texCoords;

uniform sampler2D textureSampler;
uniform vec3 drawColor;
uniform vec3 backgroundColor;

void main() {
  vec4 textureColor = texture2D(textureSampler, texCoords);
  float alphaValue = textureColor.a;
  vec3 resultColor = mix(backgroundColor, drawColor, alphaValue * 255.0);
  gl_FragColor = vec4(resultColor, 1);
}
```

There's a correction factor here (multiplying the value by 255), but that's because my board data has only 0s and 1s in its bytes to indicate an alive or dead cell. (And on the GPU this kind of data transformation is *very* fast, which is why I didn't first correct for it in Javascript.)
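To make that correction factor concrete: WebGL normalizes `UNSIGNED_BYTE` texture data by dividing by 255, so a live cell's byte value of 1 arrives in the shader as 1/255. A tiny sketch of the round trip, with plain Javascript standing in for the shader math (the `mix` reimplementation here is mine, for illustration):

```javascript
// GLSL's mix(a, b, t) is a + (b - a) * t; reimplemented here so the
// shader arithmetic can be followed outside of GLSL
function mix(a, b, t) {
  return a + (b - a) * t
}

// A live cell is stored as the byte 1; texture sampling normalizes that
// to 1/255, and multiplying by 255.0 in the shader recovers a clean
// 0-or-1 mix factor between backgroundColor and drawColor
function cellMixFactor(cellByte) {
  const sampledAlpha = cellByte / 255 // what texture2D(...).a returns
  return sampledAlpha * 255.0         // the shader's correction factor
}
```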
The shader also takes two `uniform` parameters for what colors to draw with, so I'll show you where those get set:

```javascript
export async function onThemeChange() {
  const { gl, shader } = gameState

  const drawColorString = getComputedStyle(gameState.canvas).getPropertyValue('color')
  const drawColor = parseColorString(drawColorString)
  const backgroundColorString = getComputedStyle(gameState.canvas).getPropertyValue('--background')
  const backgroundColor = parseColorString(backgroundColorString)

  const drawColorLocation = gl.getUniformLocation(shader, "drawColor")
  gl.uniform3fv(drawColorLocation, drawColor)
  const backgroundColorLocation = gl.getUniformLocation(shader, "backgroundColor")
  gl.uniform3fv(backgroundColorLocation, backgroundColor)

  drawBoard()
}
```

This is again pretty simple: just pulling some values out of the CSS properties, parsing them, and then updating WebGL's parameters. This gets run once on page initialization, and then once each time the "Appearance" button in the page navigation is clicked and the page switches between light and dark theme. That's pretty much it for the drawing, so now let's talk about how I made the actual game update tick faster.

### Optimizing the actual Webassembly code

So, I don't know that much about how Webassembly execution is implemented in browsers, but I know enough about actual CPU architectures to make some reasonable inferences. With that in mind, as I started thinking about more significant algorithm changes I could make, I suspected that the largest contributor to time taken in my old algorithm was probably that accessing the Webassembly linear memory takes longer than accessing local or global variables. At an implementation level this could be because of how the wasm memory relates to the CPU cache, but I didn't really care about that - I was just curious to see if reducing the number of memory operations my algorithm took could provide more of a speed-up.
I knew I was already pretty inefficient in this area - my previous algorithm made ***10 memory calls per cell*** (8 to check neighbor states, 1 to check the current cell's state, and 1 to store the updated cell's state), so I suspected I could get that much lower.

#### Reducing memory accesses

I started by dividing my `$getNewValueAtPosition` function into a `$getNeighborCount` and a `$getOutcomeFromCountAndCurrent`, so that I could have different logic for counting the number of alive neighbors on border cells and internal cells while still reusing the logic for actually figuring out what happens to a cell. Then the `$getNeighborCount` function became `$getNeighborCountBoundary` and `$getNeighborCountCenter`, because for optimizing memory access I didn't want to have to worry about a read straddling the edge of the board; I decided the old algorithm would work well enough for boundary cells. So my updated `$getNewValueAtPosition` function looks like this:

```wasm
(func $getNewValueAtPosition (param $row i32) (param $column i32) (result i32)
  (local $count i32)
  (local $current i32)

  local.get $row
  i32.const 1
  i32.lt_u

  local.get $row
  global.get $boardHeight
  i32.const 1
  i32.sub
  i32.ge_u

  local.get $column
  i32.const 1
  i32.lt_u

  local.get $column
  global.get $boardWidth
  i32.const 1
  i32.sub
  i32.ge_u

  ;; if any of the boundary conditions are true
  i32.or
  i32.or
  i32.or
  if (result i32 i32)
    local.get $row
    local.get $column
    call $getNeighborCountBoundary
  else
    local.get $row
    local.get $column
    call $getNeighborCountCenter
  end

  call $getOutcomeFromCountAndCurrent
)
```

It first checks whether the cell is along any of the four boundaries, and if so calls `$getNeighborCountBoundary`; if not, it calls `$getNeighborCountCenter`. Either of these is expected to return ***two*** values, which are passed into `$getOutcomeFromCountAndCurrent`.
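For anyone who finds the stack-machine style hard to scan, the boundary test above is equivalent to this Javascript (a sketch; the parameter names are mine):

```javascript
// True when the cell touches any edge of the board, meaning its 3x3
// neighborhood can't be read with the fast aligned-row loads and the
// slower per-byte boundary path must be used instead
function isBoundaryCell(row, column, boardWidth, boardHeight) {
  return row < 1 || row >= boardHeight - 1 ||
         column < 1 || column >= boardWidth - 1
}
```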
You might be wondering why a neighbor count function would return two values, and that's because it turns out it's more efficient to retrieve the current cell's state along with the neighbor states (as we'll see in a bit) - so the function names are admittedly a bit of a misnomer, as they get the count of alive neighbors ***and*** whether the current cell is alive. The `$getNeighborCountBoundary` function is exactly what I used for every cell last time, so let's look at the new `$getNeighborCountCenter` function:

```wasm
(func $getNeighborCountCenter (param $row i32) (param $column i32) (result i32 i32) ;; neighborCount, currentValue
  (local $origIndex i32)
  (local $rowAbove i32)
  (local $rowCenter i32)
  (local $rowBelow i32)

  ;; store current cell's memory position
  call $getBoardPtr
  local.get $row
  local.get $column
  call $getIndexForPosition
  i32.add
  local.tee $origIndex

  ;; load the three rows
  i32.const 1
  i32.sub
  i32.load
  local.set $rowCenter

  local.get $origIndex
  i32.const 1
  i32.sub
  global.get $boardWidth
  i32.sub
  i32.load
  local.set $rowAbove

  local.get $origIndex
  i32.const 1
  i32.sub
  global.get $boardWidth
  i32.add
  i32.load
  local.set $rowBelow

  ;; count the number of alive neighbors in each row
  local.get $rowAbove
  i32.const 0x00_01_01_01
  i32.and
  i32.popcnt

  local.get $rowCenter
  i32.const 0x00_01_00_01
  i32.and
  i32.popcnt

  local.get $rowBelow
  i32.const 0x00_01_01_01
  i32.and
  i32.popcnt

  ;; sum neighbor count
  i32.add
  i32.add

  ;; get current
  local.get $rowCenter
  i32.const 0x00_00_01_00
  i32.and
  i32.popcnt
)
```

The first major change to notice is that I'm using the regular `i32.load` instead of the `i32.load8_u` instruction I was using before. This pulls back 32 bits (4 bytes) at a time, and since the cells in each row are in consecutive bytes, that means I just need one load instruction per row.
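Since the mask-and-popcount trick is the heart of this function, here's the same logic restated in Javascript (a sketch with my own names; `popcount32` stands in for the `i32.popcnt` instruction):

```javascript
// Software population count for a 32-bit integer (what i32.popcnt does
// in hardware): counts the number of set bits
function popcount32(x) {
  x = x - ((x >>> 1) & 0x55555555)
  x = (x & 0x33333333) + ((x >>> 2) & 0x33333333)
  return (((x + (x >>> 4)) & 0x0F0F0F0F) * 0x01010101) >>> 24
}

// Each row argument is a 32-bit load starting one byte to the left of the
// current cell, so (little-endian) byte 0 = left, byte 1 = center, byte 2 = right
function countCenterNeighbors(rowAbove, rowCenter, rowBelow) {
  const neighborCount =
    popcount32(rowAbove & 0x00010101) +  // left, center, right of row above
    popcount32(rowCenter & 0x00010001) + // left and right of the current row
    popcount32(rowBelow & 0x00010101)    // left, center, right of row below
  const current = popcount32(rowCenter & 0x00000100) // the current cell's own byte
  return [neighborCount, current]
}
```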
Each row is loaded after computing its memory offset using the `$boardWidth` variable, and once loaded, the cells I'm looking for are in the three least significant bytes (as multi-byte reads in WASM are little-endian). Following the load calls I can use some bitwise operations to isolate and count the live-cell bits, and then I use one more bitwise `and` to take the already-retrieved value for the center row and isolate the current cell's state. So overall: three memory reads per cell (plus one more to write it back, so four total per cell).

#### One last micro-optimization

Lastly, let's take a quick look at that new `$getOutcomeFromCountAndCurrent` function. Of all the smaller changes I made, this is the one that actually seemed to make the most difference, and it relies on reducing the amount of branching in the cell determination. First, here's what this block looked like in the last post for comparison:

```wasm
;; Exactly 3 neighbors
local.tee $count
i32.const 3
i32.eq
if
  ;; becomes or stays alive
  i32.const 1
  return
end

;; If currently dead
local.get $row
local.get $column
call $getValueAtPosition
i32.eqz
if
  ;; Stay dead
  i32.const 0
  return
end

;; 2 neighbors
local.get $count
i32.const 2
i32.eq
if
  i32.const 1
  return
end

;; otherwise dead
i32.const 0
```

It's certainly not *bad* - I mean, there are three different `if` branches, and an extra memory access to look up the current state even though we could have already gotten that earlier... but it's readable, right?
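In Javascript terms, that branchy block boils down to something like this (my own restatement for readability, not code from the project):

```javascript
// The classic Game of Life rules, written with the same three early-return
// branches as the old WASM block above
function outcomeBranchy(neighborCount, currentlyAlive) {
  if (neighborCount === 3) return 1 // becomes or stays alive
  if (!currentlyAlive) return 0     // dead cells need exactly 3 neighbors to revive
  if (neighborCount === 2) return 1 // a live cell with 2 neighbors survives
  return 0                          // otherwise the cell dies
}
```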
Now let's look at the new function which replaced this block:

```wasm
(func $getOutcomeFromCountAndCurrent (param $neighborCount i32) (param $current i32) (result i32)
  ;; get sentinel value for current count
  i32.const 1
  local.get $neighborCount
  i32.shl

  ;; get value mask for current state
  i32.const 0xC ;; alive mask (1100, 3 or 2 neighbors)
  i32.const 0x8 ;; dead mask (1000, exactly three neighbors)
  local.get $current
  select

  ;; mask out to see if we still have a bit
  i32.and
  i32.popcnt
)
```

Yeah, okay. No matter how much I think my first implementation was fine, I can't really defend it when the alternate version is half as many instructions, several of them are constant declarations, and it has *literally no branching*. I wish I had known about the `i32.popcnt` and `select` instructions before; they're *incredibly convenient* for stuff like this. This change wasn't nearly as significant as the memory access reduction, but it contributed about a 5ms reduction to my frame times overall.

## What's next?

I will probably be continuing with this project again soon. Maybe not as the immediate next post I write, but I am planning on trying a few more significant optimizations. The most notable one is that I'd still like to bit-pack the board, and update my main cell loop to update several cells at once. Off the top of my head it seems like I could use this to extend my 3-read/1-write update strategy to handle up to 16 cells at once (since I need one more cell to either side, and memory addresses must be byte-aligned). I also want to experiment with tracking the active regions of the board, since an area which didn't change in the last frame won't change in the current frame unless its neighborhood did. Since I'm starting with an entirely random board I don't know how much performance that will gain me, but it's worth a try.

Anyways, that's all for now - I hope you enjoyed this continued exploration of by-hand Webassembly programming.