Antoine Beaupré: Major outage with Oricom uplink
The server that normally serves this page, all my email, and many more
services was unavailable for about 24 hours. This post explains how and
why.
What happened?
Starting February 2nd, I started seeing intermittent packet loss on
the network. Every hour or so, the link would go down for one or two
minutes, then come back up.
At first, I didn't think much of it because I was away and could blame
the crappy wifi or the uplink I was using. But when I came into the office
on Monday, the service was indeed seriously degraded. I could barely
do videoconferencing calls as they would cut out after about half an
hour.
I opened a ticket with my uplink, Oricom. They replied that it
was an issue they couldn't fix on their end and would need someone on
site to fix.
So, the next day (Tuesday, at around 10EST) I called Oricom again, and
they made me do a full modem reset, which involves plugging a pin in a
hole for 15 seconds on the Technicolor TC4400 cable modem. Then
the link went down, and it didn't come back up at all.
Boom.
Oricom then escalated this to their upstream (Oricom is a reseller of
Videotron, who has basically the monopoly on cable in Québec) which
dispatched a tech. This tech, in turn, arrived some time after lunch
and said the link worked fine and it was a hardware issue.
At this point, Oricom put a new modem in the mail and I started
mitigation.
Mitigation
Website
The first thing I did, weirdly, was trying to rebuild this blog. I
figured it should be pretty simple: install ikiwiki and hit rebuild. I
knew I had some patches on ikiwiki to deploy, but
surely those are not a deal breaker, right?
Nope. Turns out I wrote many plugins and those still don't ship with
ikiwiki, despite having been sent upstream some years ago.
So I deployed the plugins inside the .ikiwiki directory of the site
in the hope of making things a little more
"standalone". Unfortunately, that didn't work either because the
theme must be shipped in the system-wide location: I couldn't figure
out how to bundle it with the main repository. At that
point I mostly gave up because I had spent too much time on this and I
had to do something about email otherwise it would start to bounce.
Email
So I made a new VM at Linode (thanks 2.5admins for the credits)
to build a new mail server.
This wasn't the best idea, in retrospect, because it was really
overkill: I started rebuilding the whole mail server from scratch.
Ideally, this would be in Puppet and I would just deploy the right
profile and the server would be rebuilt. Unfortunately, that part of
my infrastructure is not Puppetized and even if it were, well the
Puppet server was also down so I would have had to bring that up
first.
At first, I figured I would just make a secondary mail exchanger (MX),
to spool mail for longer so that I wouldn't lose it. But I decided
against that: I thought it was too hard to make a "proper" MX as it
needs to also filter mail while avoiding backscatter. Might as well
just build a whole new server! I had a copy of my full mail spool on
my laptop, so I figured that was possible.
I mostly got this right: added a DKIM key, installed Postfix, Dovecot,
OpenDKIM, OpenDMARC, glued it all together, and voilà, I had a mail
server. Oh, and spampd. Oh, and I need the training data, oh, and this
and... I wasn't done and it was time to sleep.
The mail server went online this morning, and started accepting
mail. I tried syncing my laptop mail spool against it, but that failed
because Dovecot generated new UIDs for the emails, and isync correctly
failed to sync. I tried to copy the UIDs from the server in the office
(which I still had access to locally), but that somehow didn't work
either.
But at least the mail was getting delivered and stored properly. I
even had the Sieve rules set up so it would get sorted properly
too. Unfortunately, I didn't hook that up properly, so those didn't
actually get sorted. Thankfully, Dovecot can re-filter emails with the
sieve-filter command, so that was fixed later.
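For reference, a re-filtering pass could look roughly like the sketch below. The `refilter_inbox` helper, the user name, and the script path are hypothetical placeholders, not the ones used on this server. By default sieve-filter only simulates what it would do; the `-e` (execute) and `-W` (write access to the source mailbox) flags are needed before anything actually moves.

```shell
# Sketch only: refilter_inbox, "alice" and the script path are made-up names.
refilter_inbox() {
    user="$1"
    script="$2"
    shift 2
    # Remaining arguments are passed through as sieve-filter options.
    # With no extra options, sieve-filter runs in simulation mode.
    sieve-filter "$@" -u "$user" "$script" INBOX
}

# Simulate first and read the output:
#   refilter_inbox alice /home/alice/.dovecot.sieve
# Then apply for real:
#   refilter_inbox alice /home/alice/.dovecot.sieve -e -W
```

The two invocations are left commented out so the snippet can be sourced on a machine without Dovecot installed.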
At this point, I started looking for other things to fix.
Web, again
I figured I was almost done with the website, might as well publish
it. So I installed the Nginx Debian package, got a cert with
certbot, and added the certs to the default configuration. I
rsync'd my build in /var/www/html and boom, I had a website. The
Goatcounter analytics were timing out, but that was easy to turn
off.
Resolution
Almost at that exact moment, a bang on the door told me mail was here
and I had the modem. I plugged it in and a few minutes later, marcos
was back online.
So this was a lot (a lot!) of work for basically nothing. I could
have just taken the day off and waited for the package to be
delivered. It would definitely have been better to make a simpler mail
exchanger to spool the mail to avoid losing it. And in fact, that's
what I eventually ended up doing: I converted the Linode server into a
mail relay to continue accepting mail while DNS propagates, but without
having to sort the mail out of there...
Right now I have about 200 mails in a mailbox that I need to move back
into marcos. Normally, this would just be a simple rsync, but because
both servers have accepted mail simultaneously, it's going to be
simpler to just move those exact mails over. Because Dovecot
helpfully names delivered files with the hostname it's running on,
it's easy to find those files and transfer them, basically:
rsync -v -n --files-from=<(ssh colette.anarc.at find Maildir -name '*colette*' ) colette.anarc.at: colette/
rsync -v -n --files-from=<(ssh colette.anarc.at find Maildir -name '*colette*' ) colette/ marcos.anarc.at:
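Note that `-n` makes both of those rsync runs a dry run, so nothing is copied until it is removed. The selection step itself can be tried out locally first. Here is a small sketch, assuming a scratch Maildir with made-up message filenames in the usual Dovecot `timestamp.unique.hostname,S=...` shape; the `list_host_mails` helper is hypothetical, not part of any tool:

```shell
# Sketch only: the scratch directory and the two filenames are invented.
list_host_mails() {
    # $1: Maildir root, $2: hostname to look for in the filename
    (cd "$1" && find . -type f -name "*$2*" | sort)
}

demo=$(mktemp -d)
mkdir -p "$demo/Maildir/new" "$demo/Maildir/cur"
# One message delivered on colette, one delivered on marcos.
touch "$demo/Maildir/new/1675800000.M1P1.colette,S=100,W=102"
touch "$demo/Maildir/cur/1675700000.M2P2.marcos,S=200,W=204:2,S"

list_host_mails "$demo/Maildir" colette
# prints: ./new/1675800000.M1P1.colette,S=100,W=102
```

Only the message delivered on colette is listed, which is the kind of list `--files-from` consumes above.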
Overall, the outage lasted about 24 hours, from 11:00EST (16:00UTC) on
2023-02-07 to the same time today.
Future work
I'll probably keep a mail relay to make those situations more
manageable in the future. At first I thought that mail filtering would
be a problem, but that happens post queue anyways and I don't bounce
mail based on SpamAssassin, so backscatter shouldn't be an issue.
I basically need Postfix, OpenDMARC, and Postgrey. I'm not even sure I
need OpenDKIM as the server won't process outgoing mail, so it doesn't
need to sign anything, just check incoming signatures, which
OpenDMARC can (probably?) do.
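As a rough idea of the Postfix side of such a relay, here is a minimal sketch of a main.cf fragment. Everything in it is an assumption rather than the configuration actually deployed: the domain, the file paths, and the ten-day queue lifetime are placeholders, and the OpenDMARC and Postgrey hookups are left out.

```
# /etc/postfix/main.cf (fragment, sketch only)
# Accept mail for the domain without being its final destination.
relay_domains = example.com
# Where to forward it: the transport file would hold a line such as
#   example.com smtp:[primary.example.com]:25
transport_maps = hash:/etc/postfix/transport
# Reject unknown recipients at SMTP time instead of bouncing later,
# which is what avoids backscatter.
relay_recipient_maps = hash:/etc/postfix/relay_recipients
# Keep queued mail longer than the default 5 days while the primary is down.
maximal_queue_lifetime = 10d
```

The hash maps need to be built with postmap before Postfix can read them.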
Thanks to everyone who supported me through this ordeal, you know who
you are (and I'm happy to give credit here if you want to be
deanonymized)!