You cannot select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
121 lines
6.3 KiB
Markdown
121 lines
6.3 KiB
Markdown
---
|
|
title: Sample Debugging Write-up
|
|
subtitle: ''
|
|
unlisted: true
|
|
---
|
|
|
|
> This is a sample write-up of how I investigated an issue for work.
|
|
>
|
|
> I'm putting this up with permisison from my former job, but it is ***unlisted***
|
|
and I would appreciate if you do not share it around.
|
|
>
|
|
> Additionally, code screenshots originally included when sending this write-up
|
|
to my co-workers have been removed so as to not leak confidential information.
|
|
Where applicable the document has been lightly re-worded to still make sense.
|
|
|
|
## Initial Issue
|
|
|
|
To start off with, let's talk about the actual bug. We observed that after
|
|
deploying a small change (which added a few static files and a URL redirect), our A/B
|
|
test between versions of the site started to act weird. Several of the pages
|
|
that had been implemented in the new design would briefly display the old UI,
|
|
then flash abruptly to the new version.
|
|
|
|
This issue was easily reproducible, and from our analytics it appeared
|
|
to be happening to **all** visitors assigned to the old UI. The weirdest part
|
|
though was that none of those affected pages were at all related to the changes
|
|
we had eployed!
|
|
|
|
At first I didn't really know what was going on, so I started poking around
|
|
in the sources for the various pages that manifested this issue. I didn't
|
|
expect there to be anything there, particularly since none of them had been
|
|
changed in the bad deploy, but it was a good place to start.
|
|
|
|
It's important to note here that our A/B testing for this application was
|
|
done at least in part at the HTTP routing level. Multiple different Next.js page
|
|
paths were being remapped by our reverse proxy to the same external URL, and
|
|
so from the beginning I knew something was rather weird here, as the old design
|
|
wasn't even importing any of the components of the new design. Just from that
|
|
we knew this couldn't be a component-level issue, and this had something to
|
|
do with Next's page routing.
|
|
|
|
## Weird behavior in webpack
|
|
|
|
After quickly double checking there was nothing obvious in the pages themselves,
|
|
I opened up the browser debugger and checked to see
|
|
which component files were being loaded. I thought perhaps there
|
|
could be an issue where the static cached version of the page was the old design
|
|
but then on rendering in the client it would reset to the new one.
|
|
|
|
While I was there, I happened to see something interesting: both versions of the page
|
|
were having their JS bundles loaded, and both still had sourcemaps cached in the browser developer tools. This seemed
|
|
weird to me. If the browser were (for some reason) choosing to reload the tab, it would
|
|
have unloaded all of the old layout's resources. Seeing both loaded at once
|
|
indicated that Next was doing a *client side* navigation from one to the other.
|
|
|
|
So after noting down that I needed to remind the team to disable sourcemaps in production, I started
|
|
digging into the debugger and dropping breakpoints.
|
|
|
|
## Debugging
|
|
|
|
I added a breakpoint to Next's `router.push()` and `router.replace()` functions,
|
|
just to see if they were being called at all.
|
|
|
|
As luck would have it, I was right - the `replace()` function *was* being called
|
|
immediately after the page's first render.
|
|
Firefox's debugger even helpfully had a call stack for me, with where it was being called from!
|
|
|
|
As I clicked through the call stack I found there was a check somewhere in one
|
|
of Next's internal components - as part of its `componentDidMount()` function
|
|
there were a few conditions where it would force-reload the page.
|
|
|
|
Because of how it's built, Next's sourcemaps were incomplete in the view I had,
|
|
but after comparing some of the constants in the file I was able to track it down
|
|
to the file [next/client/index.js](https://github.com/vercel/next.js/blob/v11.1.0/packages/next/client/index.tsx#L181)
|
|
in the Next.js source (line 181).
|
|
|
|
This section definitely looked like it could be causing the issue - the way Next
|
|
was figuring out what URL to pass to `router.replace()` was relying on the URL
|
|
in the browser's location history, which means that it would entirely bypass
|
|
our reverse proxy A/B routing. Unfortunately I didn't yet know why this was
|
|
being called, or if it was the only reason the page was being reinitialized.
|
|
|
|
I spent a while walking through the conditions in this line one by one, using
|
|
my debugger to check the value of each variable in the faulty production pages,
|
|
and work out what exactly was happening. It skips the checks for 404 and error pages,
|
|
since that's not the kind of page we were on, `isFallback` is false so it moves
|
|
on to the other conditions combined with an `||` there, `data.nextExport` is
|
|
undefined (as we don't use static export for our site), but `hydrateProps` and
|
|
`hydrateProps.__N_SSG` were both set!
|
|
|
|
So because `hydrateProps` and `hydrateProps.__N_SSG` both exist, it reloads the
|
|
page if either of the following conditions are true:
|
|
|
|
- The page has a query string (`location.search`)
|
|
- The environment variable `__NEXT_HAS_REWRITES` is set
|
|
|
|
Obviously the pages we first observed the bug on didn't have query strings, but
|
|
the rewrite condition had been optimized by webpack to just `true` so that seemed promising. With a bit more searching I tracked that
|
|
environment variable to Next's [webpack config](https://github.com/vercel/next.js/blob/v11.1.0/packages/next/build/webpack-config.ts),
|
|
which seems to set the variable if Next's config file has a rewrite property configured.
|
|
|
|
## Confirming the hunch
|
|
|
|
At this point it seemed I had found the issue, but I wanted to confirm it without
|
|
any doubt. So I reverted my local repo to the last commit we had deployed before
|
|
seeing the issue, and quickly navigated back to that same line in the generated
|
|
app bundle. This time the `process.env.__NEXT_HAS_REWRITES` condition had been
|
|
entirely removed, as webpack saw that environment variable was `false` for the build
|
|
and could never affect the condition.
|
|
|
|
Lastly I spun up my full local environment, reverse proxy and all, just to confirm
|
|
that in as close to a production-like environment as I could. I removed the rewrite
|
|
rule from the Next.js application and made an equivalent redirect in our reverse
|
|
proxy - just to make sure I could preserve the intended application behavior without
|
|
the bug.
|
|
|
|
And sure enough - with the redirect behavior moved out of Next.js and into the reverse
|
|
proxy, the bug entirely disappeared! So I reverted and restored my changes a few times
|
|
just to double check, and then alerted the other senior developers I had a solution.
|
|
|