You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
ashen-earth/posts/2022-04-05_debugging-write-...

6.3 KiB

title subtitle unlisted
Sample Debugging Write-up true

This is a sample write-up of how I investigated an issue for work.

I'm putting this up with permisison from my former job, but it is unlisted and I would appreciate if you do not share it around.

Additionally, code screenshots originally included when sending this write-up to my co-workers have been removed so as to not leak confidential information. Where applicable the document has been lightly re-worded to still make sense.

Initial Issue

To start off with, let's talk about the actual bug. We observed that after deploying a small change (which added a few static files and a URL redirect), our A/B test between versions of the site started to act weird. Several of the pages that had been implemented in the new design would briefly display the old UI, then flash abruptly to the new version.

This issue was easily reproducible, and from our analytics it appeared to be happening to all visitors assigned to the old UI. The weirdest part though was that none of those affected pages were at all related to the changes we had eployed!

At first I didn't really know what was going on, so I started poking around in the sources for the various pages that manifested this issue. I didn't expect there to be anything there, particularly since none of them had been changed in the bad deploy, but it was a good place to start.

It's important to note here that our A/B testing for this application was done at least in part at the HTTP routing level. Multiple different Next.js page paths were being remapped by our reverse proxy to the same external URL, and so from the beginning I knew something was rather weird here, as the old design wasn't even importing any of the components of the new design. Just from that we knew this couldn't be a component-level issue, and this had something to do with Next's page routing.

Weird behavior in webpack

After quickly double checking there was nothing obvious in the pages themselves, I opened up the browser debugger and checked to see which component files were being loaded. I thought perhaps there could be an issue where the static cached version of the page was the old design but then on rendering in the client it would reset to the new one.

While I was there, I happened to see something interesting: both versions of the page were having their JS bundles loaded, and both still had sourcemaps cached in the browser developer tools. This seemed weird to me. If the browser were (for some reason) choosing to reload the tab, it would have unloaded all of the old layout's resources. Seeing both loaded at once indicated that Next was doing a client side navigation from one to the other.

So after noting down that I needed to remind the team to disable sourcemaps in production, I started digging into the debugger and dropping breakpoints.

Debugging

I added a breakpoint to Next's router.push() and router.replace() functions, just to see if they were being called at all.

As luck would have it, I was right - the replace() function was being called immediately after the page's first render. Firefox's debugger even helpfully had a call stack for me, with where it was being called from!

As I clicked through the call stack I found there was a check somewhere in one of Next's internal components - as part of its componentDidMount() function there were a few conditions where it would force-reload the page.

Because of how it's built, Next's sourcemaps were incomplete in the view I had, but after comparing some of the constants in the file I was able to track it down to the file next/client/index.js in the Next.js source (line 181).

This section definitely looked like it could be causing the issue - the way Next was figuring out what URL to pass to router.replace() was relying on the URL in the browser's location history, which means that it would entirely bypass our reverse proxy A/B routing. Unfortunately I didn't yet know why this was being called, or if it was the only reason the page was being reinitialized.

I spent a while walking through the conditions in this line one by one, using my debugger to check the value of each variable in the faulty production pages, and work out what exactly was happening. It skips the checks for 404 and error pages, since that's not the kind of page we were on, isFallback is false so it moves on to the other conditions combined with an || there, data.nextExport is undefined (as we don't use static export for our site), but hydrateProps and hydrateProps.__N_SSG were both set!

So because hydrateProps and hydrateProps.__N_SSG both exist, it reloads the page if either of the following conditions are true:

  • The page has a query string (location.search)
  • The environment variable __NEXT_HAS_REWRITES is set

Obviously the pages we first observed the bug on didn't have query strings, but the rewrite condition had been optimized by webpack to just true so that seemed promising. With a bit more searching I tracked that environment variable to Next's webpack config, which seems to set the variable if Next's config file has a rewrite property configured.

Confirming the hunch

At this point it seemed I had found the issue, but I wanted to confirm it without any doubt. So I reverted my local repo to the last commit we had deployed before seeing the issue, and quickly navigated back to that same line in the generated app bundle. This time the process.env.__NEXT_HAS_REWRITES condition had been entirely removed, as webpack saw that environment variable was false for the build and could never affect the condition.

Lastly I spun up my full local environment, reverse proxy and all, just to confirm that in as close to a production-like environment as I could. I removed the rewrite rule from the Next.js application and made an equivalent redirect in our reverse proxy - just to make sure I could preserve the intended application behavior without the bug.

And sure enough - with the redirect behavior moved out of Next.js and into the reverse proxy, the bug entirely disappeared! So I reverted and restored my changes a few times just to double check, and then alerted the other senior developers I had a solution.