You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
ashen-earth/posts/2022-04-05_debugging-write-...

121 lines
6.3 KiB
Markdown

---
title: Sample Debugging Write-up
subtitle: ''
unlisted: true
---
> This is a sample write-up of how I investigated an issue for work.
>
> I'm putting this up with permisison from my former job, but it is ***unlisted***
and I would appreciate if you do not share it around.
>
> Additionally, code screenshots originally included when sending this write-up
to my co-workers have been removed so as to not leak confidential information.
Where applicable the document has been lightly re-worded to still make sense.
## Initial Issue
To start off with, let's talk about the actual bug. We observed that after
deploying a small change (which added a few static files and a URL redirect), our A/B
test between versions of the site started to act weird. Several of the pages
that had been implemented in the new design would briefly display the old UI,
then flash abruptly to the new version.
This issue was easily reproducible, and from our analytics it appeared
to be happening to **all** visitors assigned to the old UI. The weirdest part
though was that none of those affected pages were at all related to the changes
we had eployed!
At first I didn't really know what was going on, so I started poking around
in the sources for the various pages that manifested this issue. I didn't
expect there to be anything there, particularly since none of them had been
changed in the bad deploy, but it was a good place to start.
It's important to note here that our A/B testing for this application was
done at least in part at the HTTP routing level. Multiple different Next.js page
paths were being remapped by our reverse proxy to the same external URL, and
so from the beginning I knew something was rather weird here, as the old design
wasn't even importing any of the components of the new design. Just from that
we knew this couldn't be a component-level issue, and this had something to
do with Next's page routing.
## Weird behavior in webpack
After quickly double checking there was nothing obvious in the pages themselves,
I opened up the browser debugger and checked to see
which component files were being loaded. I thought perhaps there
could be an issue where the static cached version of the page was the old design
but then on rendering in the client it would reset to the new one.
While I was there, I happened to see something interesting: both versions of the page
were having their JS bundles loaded, and both still had sourcemaps cached in the browser developer tools. This seemed
weird to me. If the browser were (for some reason) choosing to reload the tab, it would
have unloaded all of the old layout's resources. Seeing both loaded at once
indicated that Next was doing a *client side* navigation from one to the other.
So after noting down that I needed to remind the team to disable sourcemaps in production, I started
digging into the debugger and dropping breakpoints.
## Debugging
I added a breakpoint to Next's `router.push()` and `router.replace()` functions,
just to see if they were being called at all.
As luck would have it, I was right - the `replace()` function *was* being called
immediately after the page's first render.
Firefox's debugger even helpfully had a call stack for me, with where it was being called from!
As I clicked through the call stack I found there was a check somewhere in one
of Next's internal components - as part of its `componentDidMount()` function
there were a few conditions where it would force-reload the page.
Because of how it's built, Next's sourcemaps were incomplete in the view I had,
but after comparing some of the constants in the file I was able to track it down
to the file [next/client/index.js](https://github.com/vercel/next.js/blob/v11.1.0/packages/next/client/index.tsx#L181)
in the Next.js source (line 181).
This section definitely looked like it could be causing the issue - the way Next
was figuring out what URL to pass to `router.replace()` was relying on the URL
in the browser's location history, which means that it would entirely bypass
our reverse proxy A/B routing. Unfortunately I didn't yet know why this was
being called, or if it was the only reason the page was being reinitialized.
I spent a while walking through the conditions in this line one by one, using
my debugger to check the value of each variable in the faulty production pages,
and work out what exactly was happening. It skips the checks for 404 and error pages,
since that's not the kind of page we were on, `isFallback` is false so it moves
on to the other conditions combined with an `||` there, `data.nextExport` is
undefined (as we don't use static export for our site), but `hydrateProps` and
`hydrateProps.__N_SSG` were both set!
So because `hydrateProps` and `hydrateProps.__N_SSG` both exist, it reloads the
page if either of the following conditions are true:
- The page has a query string (`location.search`)
- The environment variable `__NEXT_HAS_REWRITES` is set
Obviously the pages we first observed the bug on didn't have query strings, but
the rewrite condition had been optimized by webpack to just `true` so that seemed promising. With a bit more searching I tracked that
environment variable to Next's [webpack config](https://github.com/vercel/next.js/blob/v11.1.0/packages/next/build/webpack-config.ts),
which seems to set the variable if Next's config file has a rewrite property configured.
## Confirming the hunch
At this point it seemed I had found the issue, but I wanted to confirm it without
any doubt. So I reverted my local repo to the last commit we had deployed before
seeing the issue, and quickly navigated back to that same line in the generated
app bundle. This time the `process.env.__NEXT_HAS_REWRITES` condition had been
entirely removed, as webpack saw that environment variable was `false` for the build
and could never affect the condition.
Lastly I spun up my full local environment, reverse proxy and all, just to confirm
that in as close to a production-like environment as I could. I removed the rewrite
rule from the Next.js application and made an equivalent redirect in our reverse
proxy - just to make sure I could preserve the intended application behavior without
the bug.
And sure enough - with the redirect behavior moved out of Next.js and into the reverse
proxy, the bug entirely disappeared! So I reverted and restored my changes a few times
just to double check, and then alerted the other senior developers I had a solution.