An offline PKI enhances security by physically isolating the certificate
authority from network threats. A YubiKey is a low-cost device to store the
root CA key and certificate. You also need an air-gapped environment to operate
the root CA.
Offline PKI backed up by 3 YubiKeys
This post describes an offline PKI system using the following components:
2 YubiKeys for the root CA (with a 20-year validity), and
1 YubiKey for the intermediate CA (with a 5-year validity).
It is possible to add more YubiKeys as a backup of the root CA if needed. This
is not needed for the intermediate CA as you can generate a new one if
the current one gets destroyed.
The software part
offline-pki is a small Python application to manage an offline PKI.
It relies on yubikey-manager to manage YubiKeys and cryptography for
cryptographic operations not executed on the YubiKeys. The application has some
opinionated design choices. Notably, the cryptography is hard-coded to use the
NIST P-384 elliptic curve.
The first step is to reset all your YubiKeys:
$ offline-pki yubikey reset
This will reset the connected YubiKey. Are you sure? [y/N]: y
New PIN code:
Repeat for confirmation:
New PUK code:
Repeat for confirmation:
New management key ('.' to generate a random one):
WARNING[pki-yubikey] Using random management key: e8ffdce07a4e3bd5c0d803aa3948a9c36cfb86ed5a2d5cf533e97b088ae9e629
INFO[pki-yubikey] 0: Yubico YubiKey OTP+FIDO+CCID 00 00
INFO[pki-yubikey] SN: 23854514
INFO[yubikit.management] Device config written
INFO[yubikit.piv] PIV application data reset performed
INFO[yubikit.piv] Management key set
INFO[yubikit.piv] New PUK set
INFO[yubikit.piv] New PIN set
INFO[pki-yubikey] YubiKey reset successful!
Then, generate the root CA and create as many copies as you want:
$ offline-pki certificate root --permitted example.com
Management key for Root X:
Plug YubiKey "Root X"...
INFO[pki-yubikey] 0: Yubico YubiKey CCID 00 00
INFO[pki-yubikey] SN: 23854514
INFO[yubikit.piv] Data written to object slot 0x5fc10a
INFO[yubikit.piv] Certificate written to slot 9C (SIGNATURE), compression=True
INFO[yubikit.piv] Private key imported in slot 9C (SIGNATURE) of type ECCP384
Copy root certificate to another YubiKey? [y/N]: y
Plug YubiKey "Root X"...
INFO[pki-yubikey] 0: Yubico YubiKey CCID 00 00
INFO[pki-yubikey] SN: 23854514
INFO[yubikit.piv] Data written to object slot 0x5fc10a
INFO[yubikit.piv] Certificate written to slot 9C (SIGNATURE), compression=True
INFO[yubikit.piv] Private key imported in slot 9C (SIGNATURE) of type ECCP384
Copy root certificate to another YubiKey? [y/N]: n
Then, you can create an intermediate certificate with offline-pki yubikey
intermediate and use it to sign certificates by providing a CSR to offline-pki
certificate sign. Be careful and inspect the CSR before signing it, as only the
subject name can be overridden. Check the documentation for more details.
Get the available options using the --help flag.
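For example, you can dump and verify a CSR with OpenSSL before signing it (the file name is illustrative; openssl is an external tool, not part of offline-pki):
$ openssl req -in request.csr -noout -text -verify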
The hardware part
To ensure the operations on the root and intermediate CAs are air-gapped,
a cost-efficient solution is to use an ARM64 single board computer. The Libre
Computer Sweet Potato SBC is a more open alternative to the well-known
Raspberry Pi.1
Libre Computer Sweet Potato SBC, powered by the AML-S905X SOC
I interact with it through a USB-to-TTL UART converter:
$ tio /dev/ttyUSB0
[16:40:44.546] tio v3.7
[16:40:44.546] Press ctrl-t q to quit
[16:40:44.555] Connected to /dev/ttyUSB0
GXL:BL1:9ac50e:bb16dc;FEAT:ADFC318C:0;POC:1;RCY:0;SPI:0;0.0;CHK:0;
TE: 36574
BL2 Built : 15:21:18, Aug 28 2019. gxl g1bf2b53 - luan.yuan@droid15-sz
set vcck to 1120 mv
set vddee to 1000 mv
Board ID = 4
CPU clk: 1200MHz
[…]
The Nix glue
To bring everything together, I am using Nix with a Flake providing:
a package for the offline-pki application, with shell completion,
a development shell, including an editable version of the offline-pki application,
a NixOS module to set up the offline PKI, resetting the system at each boot,
a QEMU image for testing, and
an SD card image to be used on the Sweet Potato or another ARM64 SBC.
# Execute the application locally
nix run github:vincentbernat/offline-pki -- --help
# Run the application inside a QEMU VM
nix run github:vincentbernat/offline-pki\#qemu
# Build an SD card for the Sweet Potato or for the Raspberry Pi
nix build --system aarch64-linux github:vincentbernat/offline-pki\#sdcard.potato
nix build --system aarch64-linux github:vincentbernat/offline-pki\#sdcard.generic
# Get a development shell with the application
nix develop github:vincentbernat/offline-pki
The key for the root CA is not generated by the YubiKey, which makes using an
air-gapped computer all the more important. Put it in a safe with the
YubiKeys when done!
To avoid needless typing, the fish shell features command abbreviations to
expand some words after pressing space. We can emulate such a feature with
Zsh:
# Definition of abbrev-alias for auto-expanding aliases
typeset -ga _vbe_abbrevations
abbrev-alias() {
    alias $1
    _vbe_abbrevations+=(${1%%\=*})
}
_vbe_zle-autoexpand() {
    local -a words; words=(${(z)LBUFFER})
    if (( ${#_vbe_abbrevations[(r)${words[-1]}]} )); then
        zle _expand_alias
    fi
    zle magic-space
}
zle -N _vbe_zle-autoexpand
bindkey -M emacs " " _vbe_zle-autoexpand
bindkey -M isearch " " magic-space

# Correct common typos
(( $+commands[git] ))  && abbrev-alias gti=git
(( $+commands[grep] )) && abbrev-alias grpe=grep
(( $+commands[sudo] )) && abbrev-alias suod=sudo
(( $+commands[ssh] ))  && abbrev-alias shs=ssh

# Save a few keystrokes
(( $+commands[git] )) && abbrev-alias gls="git ls-files"
(( $+commands[ip] ))  && abbrev-alias ip6='ip -6'
abbrev-alias ipb='ip -brief'

# Hard to remember options
(( $+commands[mtr] )) && abbrev-alias mtrr='mtr -wzbe'
Here is a demo where gls is expanded to git ls-files after pressing space:
Auto-expanding gls to git ls-files
I don't auto-expand all aliases. I keep using regular aliases when slightly
modifying the behavior of a command or for well-known abbreviations:
Caddy is an open-source web server written in Go. It handles TLS
certificates automatically and comes with a simple configuration syntax. Users
can extend its functionality through plugins1 to add features like
rate limiting, caching, and Docker integration.
While Caddy is available in Nixpkgs, adding extra plugins is not
simple.2 The compilation process needs Internet access, which Nix
denies during build to ensure reproducibility. Trying to build a Caddy
derivation that invokes xcaddy, a tool for building Caddy with plugins,
fails with this error: dial tcp: lookup proxy.golang.org on [::1]:53:
connection refused.
Fixed-output derivations are an exception to this rule and get network access
during build. They need to specify their output hash. For example, the
fetchurl function produces a fixed-output derivation:
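For instance, a call like the one below (URL and hash are illustrative) gets network access because Nix can verify the result against the declared hash:

pkgs.fetchurl {
  url = "https://www.example.com/source-1.0.tar.gz";
  hash = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
}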
To create a fixed-output derivation, you need to set the outputHash
attribute. The example below shows how to output Caddy's source
code, with some plugins enabled, as a fixed-output derivation using xcaddy and
go mod vendor.
pkgs.stdenvNoCC.mkDerivation rec {
  pname = "caddy-src-with-xcaddy";
  version = "2.8.4";
  nativeBuildInputs = with pkgs; [ go xcaddy cacert ];
  unpackPhase = "true";
  buildPhase = ''
    export GOCACHE=$TMPDIR/go-cache
    export GOPATH="$TMPDIR/go"
    XCADDY_SKIP_BUILD=1 TMPDIR="$PWD" \
      xcaddy build v${version} --with github.com/caddy-dns/powerdns@v1.0.1
    (cd buildenv* && go mod vendor)
  '';
  installPhase = ''
    mv buildenv* $out
  '';
  outputHash = "sha256-F/jqR4iEsklJFycTjSaW8B/V3iTGqqGOzwYBUXxRKrc=";
  outputHashAlgo = "sha256";
  outputHashMode = "recursive";
}
With a fixed-output derivation, it is up to us to ensure the output is always
the same:
we ask xcaddy not to compile the program and to keep the source code,3
we pin the version of Caddy we want to build, and
we pin the version of each requested plugin.
You can use this derivation to override the src attribute in pkgs.caddy:
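A possible way to wire it, assuming the derivation above is bound to a caddySrc variable; this is a sketch, not the exact code from the repository:

pkgs.caddy.overrideAttrs (old: {
  src = caddySrc;
  # Dependencies are already vendored in the source above; overriding this
  # attribute is what requires a recent enough Nixpkgs (see the update below).
  vendorHash = null;
})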
Update (2024-11)
This flake won't work with Nixpkgs 24.05 or older because
it relies on this
commit
to properly override the vendorHash attribute.
This article uses the term plugins, though Caddy documentation
also refers to them as modules since they are implemented as Go modules.
This has been a feature request for quite some time. A proposed
solution has been rejected. The one described in this article is a bit
different and I have proposed it in another pull request.
This is not perfect: if the source code produced by xcaddy changes,
the hash would change and the build would fail.
In 2020, Google introduced Core Web Vitals metrics to measure some aspects
of real-world user experience on the web. This blog has consistently achieved
good scores for two of these metrics: Largest Contentful Paint and
Interaction to Next Paint. However, optimizing the third metric, Cumulative
Layout Shift, which measures unexpected layout changes, has been more
challenging. Let's face it: optimizing for this metric is not really useful for
a site like this one. But getting a better score is always a good distraction.
To prevent the flash of invisible text when using web fonts, developers should
set the font-display property to swap in @font-face rules. This method
allows browsers to initially render text using a fallback font, then replace it
with the web font after loading. While this improves the LCP score, it causes
content reflow and layout shifts if the fallback and web fonts are not
metrically compatible. These shifts negatively affect the CLS score. CSS
provides properties to address this issue by overriding font metrics when using
fallback fonts: size-adjust,
ascent-override, descent-override,
and line-gap-override.
Two comprehensive articles explain each property and their computation methods
in detail: Creating Perfect Font Fallbacks in CSS and Improved font
fallbacks.
Interactive tuning tool
Instead of computing each property from font average metrics, I put together a
tool for interactively tuning fallback fonts.1
Instructions
Load your custom font.
Select a fallback font to tune.
Adjust the size-adjust property to match the width of your custom font with
the fallback font. With a proportional font, it is not possible to achieve a
perfect match.
Fine-tune the ascent-override property. Aim to align the final dot of the
last paragraph while monitoring the font's baseline. For more precise
adjustment, disable the option.
Modify the descent-override property. The goal is to make the two boxes
match. You may need to alternate between this and the previous property for
optimal results.
If necessary, adjust the line-gap-override property. This step is typically
not required.
The process needs to be repeated for each fallback font. Some platforms may not
include certain fonts. Notably, Android lacks most fonts found in other
operating systems. It replaces Georgia with Noto Serif, which is not
metrically compatible.
Tool
This tool is not available from the Atom feed.
Results
For the body text of this blog, I get the following CSS definition:
@font-face {
  font-family: Merriweather;
  font-style: normal;
  font-weight: 400;
  src: url("../fonts/merriweather.woff2") format("woff2");
  font-display: swap;
}
@font-face {
  font-family: "Fallback for Merriweather";
  src: local("Noto Serif"), local("Droid Serif");
  size-adjust: 98.3%;
  ascent-override: 99%;
  descent-override: 27%;
}
@font-face {
  font-family: "Fallback for Merriweather";
  src: local("Georgia");
  size-adjust: 106%;
  ascent-override: 90.4%;
  descent-override: 27.3%;
}

font-family: Merriweather, "Fallback for Merriweather", serif;
After a month, the CLS metric improved to 0:
Recent Core Web Vitals scores for vincent.bernat.ch
About custom fonts
Using safe web fonts or a modern font stack is often simpler. However, I
prefer custom web fonts. Merriweather and Iosevka, which are used in
this blog, enhance the reading experience. An alternative approach could be to
use Georgia as a serif option. Unfortunately, most default monospace fonts are
ugly.
Furthermore, paragraphs that combine proportional and monospace fonts can create
visual disruption. This occurs due to mismatched vertical metrics or weights. To
address this issue, I adjust Iosevka's metrics and weight to align with
Merriweather's characteristics.
Similar tools already exist, like the Fallback Font Generator,
but they were missing a few features, such as the ability to load the
fallback font or to have decimals for the CSS properties. Also, no source
code was available.
Combining BGP confederations and AS override can potentially
create a BGP routing loop, resulting in an indefinitely expanding AS path.
BGP confederation is a technique used to reduce the number of iBGP sessions
and improve scalability in large autonomous systems (AS). It divides an AS into
sub-ASes. Most eBGP rules apply between sub-ASes, except that next-hop, MED, and
local preferences remain unchanged. The AS path length ignores contributions
from confederation sub-ASes. BGP confederation is rarely used and BGP route
reflection is typically preferred for scaling.
AS override is a feature that allows a router to replace the ASN of a
neighbor in the AS path of outgoing BGP routes with its own. It's useful when
two distinct autonomous systems share the same ASN. However, it interferes with
BGP s loop prevention mechanism and should be used cautiously. A safer
alternative is the allowas-in directive.1
In the example below, we have four routers in a single confederation, each in
its own sub-AS. R0 originates the 2001:db8::1/128 prefix. R1, R2, and
R3 forward this prefix to the next router in the loop.
BGP routing loop using a confederation
The router configurations are available in a Git repository. They are
running Cisco IOS XR. R2 uses the following configuration for BGP:
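The exact configurations are in the repository; the sketch below only shows the parts that matter for R2 (the confederation identifier and the neighbor addresses are made up):

router bgp 64502
 bgp confederation peers
  64500
  64501
  64503
 !
 bgp confederation identifier 64496
 address-family ipv6 unicast
 !
 neighbor 2001:db8::2:0
  remote-as 64501
  address-family ipv6 unicast
  !
 !
 neighbor 2001:db8::3:1
  remote-as 64503
  address-family ipv6 unicast
   as-override
   next-hop-self
  !
 !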
The session with R3 uses both as-override and next-hop-self directives.
The latter is only necessary to make the announced prefix valid, as there is no
IGP in this example.2
Here's the sequence of events leading to an infinite AS path:
R1 selects it as the best path, forwarding it to R2 with AS
path (64501 64500).
R2 selects it as the best path, forwarding it to R3 with AS
path (64500 64501 64502).
R3 selects it as the best path. It would forward it to R1 with AS path
(64503 64502 64501 64500), but due to AS override, it substitutes R1 s
ASN with its own, forwarding it with AS path (64503 64502 64503 64500).
R1 accepts the prefix, as its own ASN is not in the AS path. It compares
this new prefix with the one from R0. Both (64500) and (64503 64502
64503 64500) have the same length because confederation sub-ASes don t
contribute to AS path length. The first tie-breaker is the router ID.
R0 s router ID (1.0.0.4) is higher than R3 s (1.0.0.3). The new
prefix becomes the best path and is forwarded to R2 with AS path (64501
64503 64501 64503 64500).
R2 receives the new prefix, replacing the old one. It selects it as the
best path and forwards it to R3 with AS path (64502 64501 64502 64501
64502 64500).
R3 receives the new prefix, replacing the old one. It selects it as the
best path and forwards it to R0 with AS path (64503 64502 64503 64502
64503 64502 64500).
R1 receives the new prefix, replacing the old one. Again, it competes with
the prefix from R0, and again the new prefix wins due to the lower router
ID. The prefix is forwarded to R2 with AS path (64501 64503 64501 64503
64501 64503 64501 64500).
A few iterations later, R1 views the looping prefix as follows:4
RP/0/RP0/CPU0:R1# show bgp ipv6 u 2001:db8::1/128 bestpath-compare
BGP routing table entry for 2001:db8::1/128
Last Modified: Jul 28 10:23:05.560 for 00:00:00
Paths: (2 available, best #2)
  Path #1: Received by speaker 0
  Not advertised to any peer
  (64500)
    2001:db8::1:0 from 2001:db8::1:0 (1.0.0.4), if-handle 0x00000000
      Origin IGP, metric 0, localpref 100, valid, confed-external
      Received Path ID 0, Local Path ID 0, version 0
      Higher router ID than best path (path #2)
  Path #2: Received by speaker 0
  Advertised IPv6 Unicast paths to peers (in unique update groups):
    2001:db8::2:1
  (64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64503 64502 64500)
    2001:db8::4:0 from 2001:db8::4:0 (1.0.0.3), if-handle 0x00000000
      Origin IGP, metric 0, localpref 100, valid, confed-external, best, group-best
      Received Path ID 0, Local Path ID 1, version 37
      best of AS 64503, Overall best
There's no upper bound for an AS path, but BGP messages have size limits (4096
bytes per RFC 4271 or 65535 bytes per RFC 8654). At some
point, BGP updates can't be generated. On Cisco IOS XR, the BGP process crashes
well before reaching this limit.5
The main lessons from this tale are:
never use BGP confederations under any circumstances, and
be cautious of features that weaken BGP routing loop detection.
When using BGP confederations with Cisco IOS XR, use
allowconfedas-in instead. It's available since IOS XR 7.11.
Using BGP confederations is already inadvisable. If you don't use the
same IGP for all sub-ASes, you're inviting trouble! However, the scenario
described here is also possible with an IGP.
When an AS path segment is composed of ASNs from a confederation, it
is displayed between parentheses.
By default, IOS XR paces eBGP updates. This is controlled by the
advertisement-interval directive. Its default value is 30 seconds for eBGP
peers (even in the same confederation). R1 and R2 set this value to 0,
while R3 sets it to 2 seconds. This gives some time to watch the AS path
grow.
This is CSCwk15887. It only happens when using as-override on an
AS path with an overly long AS_CONFED_SEQUENCE. This should be fixed around
24.3.1.
IPv4 is an expensive resource. However, many content providers are still
IPv4-only. The most common reason is that IPv4 is here to stay and IPv6 is an
additional complexity.1 This mindset may seem selfish, but there are
compelling reasons for a content provider to enable IPv6, even when they have
enough IPv4 addresses available for their needs.
Disclaimer
This article sat in my drafts for a while. I
started it when I was working at Shadow, a content
provider; I now work for Free, an internet service
provider.
Why do ISPs need IPv6?
Providing a public IPv4 address to each customer is quite costly when each IP
address costs US$40 on the market. For fixed access, some consumer ISPs are
still providing one IPv4 address per customer.2 Other ISPs provide, by
default, an IPv4 address shared among several customers. For mobile access, most
ISPs distribute a shared IPv4 address.
There are several methods to share an IPv4 address:3
NAT44
The customer device is given a private IPv4 address, which is translated to a public
one by a service provider device. This device needs to maintain a state for each
translation.
464XLAT and DS-Lite
The customer device translates the private IPv4 address to an IPv6 address or
encapsulates IPv4 traffic in IPv6 packets. The provider device then translates
the IPv6 address to a public IPv4 address. It still needs to maintain a state
for the NAT64 translation.
Lightweight IPv4 over IPv6, MAP-E, and MAP-T
The customer device encapsulates IPv4 in IPv6 packets or performs a stateless
NAT46 translation. The provider device uses a binding table or an algorithmic
rule to map IPv6 tunnels to IPv4 addresses and ports. It does not need to
maintain a state.
Solutions to share an IPv4 address across several customers. Some of them require the ISP to keep state, some don't.
All these solutions require a translation device in the ISP s network. This
device represents a non-negligible cost in terms of money and reliability. As
half of the top 1000 websites support IPv6 and the biggest players can
deliver most of their traffic using IPv6,4 ISPs have a clear path to
reduce the cost of translation devices: provide IPv6 by default to their
customers.
Why do content providers need IPv6?
Content providers should expose their services over IPv6 primarily to avoid going
through the ISP s translation devices. This doesn t help users who
don t have IPv6 or users with a non-shared IPv4 address, but it provides a
better service for all the others.
Why would the service be better delivered over IPv6 than over IPv4 when a
translation device is in the path? There are two main reasons for that:5
Translation devices introduce additional latency due to their geographical
placement inside the network: it is easier and cheaper to only install these
devices at a few points in the network instead of putting them close to the
users.
Translation devices are an additional point of failure in the path between
the user and the content. They can become overloaded or malfunction.
Moreover, as they are not used for the five most visited websites, which
serve their traffic over IPv6, the ISPs may not be incentivized to ensure
they perform as well as the native IPv6 path.
Looking at Google statistics, half of the users reach Google over IPv6.
Moreover, their latency is lower.6 In the US, all the nationwide mobile
providers have IPv6 enabled.
For France, we can refer to the annual ARCEP report: in 2022, 72% of fixed
users and 60% of mobile users had IPv6 enabled, with projections of 94% and 88%
for 2025. Starting from this projection, since all mobile users go through a
network translation device, content providers can deliver a better service for
88% of them by exposing their services over IPv6. If we exclude Orange, which
has 40% of the market share on consumer fixed access, enabling IPv6 should
positively impact more than 55% of fixed access users.
In conclusion, content providers aiming for the best user experience should
expose their services over IPv6. By avoiding translation devices, they can
ensure fast and reliable content delivery. This is crucial for latency-sensitive
applications, like live streaming, but also for websites in competitive markets,
where even slight delays can lead to user disengagement.
A way to limit this complexity is to build IPv6 services and only
provide IPv4 through reverse proxies at the edge.
In France, this includes non-profit ISPs, like FDN and
Milkywan. Additionally, Orange, the previously state-owned telecom
provider, supplies non-shared IPv4 addresses. Free also provides a dedicated
IPv4 address for customers connected to the point-to-point FTTH access.
I use the term NAT instead of the more correct term NAPT. Feel free to
do a mental substitution. If you are curious, check RFC 2663. For a
survey of the IPv6 transition technologies enumerated here, have a look at
RFC 9313.
For AS 12322, Google, Netflix, and Meta are delivering 85% of
their traffic over IPv6. Also, more than half of our traffic is delivered
over IPv6.
An additional reason is fighting abuse: blacklisting an IPv4
address may impact unrelated users who share the same IPv4 as the culprits.
IPv6 may not be the sole reason the latency is lower: users with IPv6
generally have a better connection.
SSH offers several forms of authentication, such as passwords and
public keys. The latter are considered more secure. However, password
authentication remains prevalent, particularly with network equipment.1
A classic solution to avoid typing a password for each connection is
sshpass, or its more correct variant passh. Here is a wrapper for Zsh,
getting the password from pass, a simple password manager:2
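A minimal sketch of such a wrapper (the pass entry name is an example):

pssh() {
    passh -p $(pass show network/ssh/password | head -1) ssh "$@"
}
compdef pssh=ssh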
This approach is a bit brittle as it requires parsing the output of the ssh
command to look for a password prompt. Moreover, if no password is required, the
password manager is still invoked. Since OpenSSH 8.4, we can use
SSH_ASKPASS and SSH_ASKPASS_REQUIRE instead:
ssh() {
    set -o localoptions -o localtraps
    local passname=network/ssh/password
    local helper=$(mktemp)
    trap "command rm -f $helper" EXIT INT
    > $helper <<EOF
#!$SHELL
pass show $passname | head -1
EOF
    chmod u+x $helper
    SSH_ASKPASS=$helper SSH_ASKPASS_REQUIRE=force command ssh "$@"
}
If the password is incorrect, we can display a prompt on the second
attempt:
ssh() {
    set -o localoptions -o localtraps
    local passname=network/ssh/password
    local helper=$(mktemp)
    trap "command rm -f $helper" EXIT INT
    > $helper <<EOF
#!$SHELL
if [ -k $helper ]; then
    {
        oldtty=\$(stty -g)
        trap 'stty \$oldtty < /dev/tty 2> /dev/null' EXIT INT TERM HUP
        stty -echo
        print "\rpassword: "
        read password
        printf "\n"
    } > /dev/tty < /dev/tty
    printf "%s" "\$password"
else
    pass show $passname | head -1
    chmod +t $helper
fi
EOF
    chmod u+x $helper
    SSH_ASKPASS=$helper SSH_ASKPASS_REQUIRE=force command ssh "$@"
}
A possible improvement is to use a different password entry depending on the
remote host:3
ssh() {
    # Grab login information
    local -A details
    details=(${=${(M)${(@f)"$(command ssh -G "$@" 2> /dev/null)"}:#(host|hostname|user) *}})
    local remote=${details[host]:-${details[hostname]}}
    local login=${details[user]}@${remote}

    # Get password name
    local passname
    case "$login" in
        admin@*.example.net)  passname=company1/ssh/admin ;;
        bernat@*.example.net) passname=company1/ssh/bernat ;;
        backup@*.example.net) passname=company1/ssh/backup ;;
    esac

    # No password name? Just use regular SSH
    [[ -z $passname ]] && {
        command ssh "$@"
        return $?
    }

    # Invoke SSH with the helper for SSH_ASKPASS
    # […]
}
It is also possible to make scp invoke our custom ssh function:
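A possible sketch: scp accepts a -S option to specify the program used for the connection, so we can point it at a temporary helper that loads our ssh() function. Sourcing ~/.zshrc here is an assumption; adapt it to wherever the function is defined:

scp() {
    set -o localoptions -o localtraps
    local helper=$(mktemp)
    trap "command rm -f $helper" EXIT INT
    > $helper <<EOF
#!$SHELL
source ~/.zshrc
ssh "\$@"
EOF
    chmod u+x $helper
    command scp -S $helper "$@"
}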
For the complete code, have a look at my zshrc. As an alternative, you can
put the ssh() function body into its own script file and replace command ssh
with /usr/bin/ssh to avoid an unwanted recursive call. In this case, the
scp() function is not needed anymore.
First, some vendors make it difficult to associate an SSH key with a
user. Then, many vendors do not support certificate-based
authentication, making it difficult to scale. Finally, interactions between
public-key authentication and finer-grained authorization methods like
TACACS+ and Radius are still uncharted territory.
The clear-text password never appears on the command line, in the
environment, or on the disk, making it difficult for a third party without
elevated privileges to capture it. On Linux, Zsh provides the password
through a file descriptor.
To decipher the fourth line, you may get help from print -l and the
zshexpn(1) manual page. details is an associative array defined
from an array alternating keys and values.
Akvorado collects sFlow and IPFIX flows, stores them in a
ClickHouse database, and presents them in a web console. Although it lacks
built-in DDoS detection, it's possible to create one by crafting custom
ClickHouse queries.
DDoS detection
Let's assume we want to detect DDoS attacks targeting our customers. As an example, we
consider a DDoS attack as a collection of flows over one minute targeting a
single customer IP address, from a single source port and matching one
of these conditions:
an average bandwidth of 1 Gbps,
an average bandwidth of 200 Mbps when the protocol is UDP,
more than 20 source IP addresses and an average bandwidth of 100 Mbps, or
more than 10 source countries and an average bandwidth of 100 Mbps.
Here is the SQL query to detect such attacks over the last 5 minutes:
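A possible formulation is sketched below. The column names (Bytes, Packets, SamplingRate, SrcAddr, DstAddr, SrcCountry, Proto, SrcPort) and the protocols dictionary are assumptions modeled on Akvorado's schema; adapt them to your setup:

-- One row per (minute, destination, protocol, source port) matching a criterion
SELECT
  toStartOfMinute(TimeReceived) AS TimeReceived,
  DstAddr,
  dictGetOrDefault('protocols', 'name', toUInt64(Proto), '???') AS ProtoName,
  SrcPort,
  sum(((Bytes * SamplingRate) * 8) / 60) / 1e9 AS Gbps,
  sum((Packets * SamplingRate) / 60) / 1e6 AS Mpps,
  uniq(SrcAddr) AS sources,
  uniq(SrcCountry) AS countries,
  quantiles(0.1, 0.9)(Bytes/Packets) AS size
FROM flows
WHERE TimeReceived > now() - INTERVAL 5 MINUTE
GROUP BY TimeReceived, DstAddr, Proto, SrcPort
HAVING Gbps > 1
   OR (ProtoName = 'UDP' AND Gbps > 0.2)
   OR (sources > 20 AND Gbps > 0.1)
   OR (countries > 10 AND Gbps > 0.1)
ORDER BY TimeReceived DESC, Gbps DESC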
DDoS remediation
Once detected, there are at least two ways to stop the attack at the network
level:
blackhole the traffic to the targeted user (RTBH), or
selectively drop packets matching the attack patterns (Flowspec).
Traffic blackhole
The easiest method is to sacrifice the attacked user. While this helps the
attacker, this protects your network. It is a method supported by all routers.
You can also offload this protection to many transit providers. This is useful
if the attack volume exceeds your internet capacity.
This works by advertising with BGP a route to the attacked user with a specific
community. The border router modifies the next hop address of these routes to a
specific IP address configured to forward the traffic to a null interface. RFC 7999 defines 65535:666 for this purpose. This is known as a
remote-triggered blackhole (RTBH) and is explained in more detail in RFC 3882.
It is also possible to blackhole the source of the attacks by leveraging
unicast Reverse Path Forwarding (uRPF) from RFC 3704, as explained in RFC 5635. However, uRPF can be a serious tax on your router resources. See
NCS5500 uRPF: Configuration and Impact on Scale for an example of the kind
of restrictions you have to expect when enabling uRPF.
On the advertising side, we can use BIRD. Here is a complete configuration
file to allow any router to collect them:
log stderr all;
router id 192.0.2.1;

protocol device {
  scan time 10;
}
protocol bgp exporter {
  ipv4 {
    import none;
    export where proto = "blackhole4";
  };
  ipv6 {
    import none;
    export where proto = "blackhole6";
  };
  local as 64666;
  neighbor range 192.0.2.0/24 external;
  multihop;
  dynamic name "exporter";
  dynamic name digits 2;
  graceful restart yes;
  graceful restart time 0;
  long lived graceful restart yes;
  long lived stale time 3600;  # keep routes for 1 hour!
}
protocol static blackhole4 {
  ipv4;
  route 203.0.113.206/32 blackhole {
    bgp_community.add((65535, 666));
  };
  route 203.0.113.68/32 blackhole {
    bgp_community.add((65535, 666));
  };
}
protocol static blackhole6 {
  ipv6;
}
We use BGP long-lived graceful restart to ensure routes are kept for
one hour, even if the BGP connection goes down, notably during maintenance.
On the receiver side, if you have a Cisco router running IOS XR, you can use the
following configuration to blackhole traffic received on the BGP session. As the
BGP session is dedicated to this usage, the community is not used, but you can
also forward these routes to your transit providers.
router static
 vrf public
  address-family ipv4 unicast
   192.0.2.1/32 Null0 description "BGP blackhole"
  !
  address-family ipv6 unicast
   2001:db8::1/128 Null0 description "BGP blackhole"
  !
 !
!
route-policy blackhole_ipv4_in_public
  if destination in (0.0.0.0/0 le 31) then
    drop
  endif
  set next-hop 192.0.2.1
  done
end-policy
!
route-policy blackhole_ipv6_in_public
  if destination in (::/0 le 127) then
    drop
  endif
  set next-hop 2001:db8::1
  done
end-policy
!
router bgp 12322
 neighbor-group BLACKHOLE_IPV4_PUBLIC
  remote-as 64666
  ebgp-multihop 255
  update-source Loopback10
  address-family ipv4 unicast
   maximum-prefix 100 90
   route-policy blackhole_ipv4_in_public in
   route-policy drop out
   long-lived-graceful-restart stale-time send 86400 accept 86400
  !
  address-family ipv6 unicast
   maximum-prefix 100 90
   route-policy blackhole_ipv6_in_public in
   route-policy drop out
   long-lived-graceful-restart stale-time send 86400 accept 86400
  !
 !
 vrf public
  neighbor 192.0.2.1
   use neighbor-group BLACKHOLE_IPV4_PUBLIC
   description akvorado-1
When the traffic is blackholed, it is still reported by IPFIX and sFlow.
In Akvorado, use ForwardingStatus >= 128 as a filter.
While this method is compatible with all routers, it makes the attack successful
as the target is completely unreachable. If your router supports it, Flowspec
can selectively filter flows to stop the attack without impacting the
customer.
Flowspec
Flowspec is defined in RFC 8955 and enables the transmission of flow
specifications in BGP sessions. A flow specification is a set of matching
criteria to apply to IP traffic. These criteria include the source and
destination prefix, the IP protocol, the source and destination port, and the
packet length. Each flow specification is associated with an action, encoded as an
extended community: traffic shaping, traffic marking, or redirection.
To announce flow specifications with BIRD, we extend our configuration. The
extended community used shapes the matching traffic to 0 bytes per second.
flow4 table flowtab4;
flow6 table flowtab6;

protocol bgp exporter {
  flow4 {
    import none;
    export where proto = "flowspec4";
  };
  flow6 {
    import none;
    export where proto = "flowspec6";
  };
  # […]
}
protocol static flowspec4 {
  flow4;
  route flow4 {
    dst 203.0.113.68/32;
    sport = 53;
    length >= 1476 && <= 1500;
    proto = 17;
  } {
    bgp_ext_community.add((generic, 0x80060000, 0x00000000));
  };
  route flow4 {
    dst 203.0.113.206/32;
    sport = 123;
    length = 468;
    proto = 17;
  } {
    bgp_ext_community.add((generic, 0x80060000, 0x00000000));
  };
}
protocol static flowspec6 {
  flow6;
}
If you have a Cisco router running IOS XR, the configuration may look like
this:
vrf public
 address-family ipv4 flowspec
 address-family ipv6 flowspec
!
router bgp 12322
 address-family vpnv4 flowspec
 address-family vpnv6 flowspec
 neighbor-group FLOWSPEC_IPV4_PUBLIC
  remote-as 64666
  ebgp-multihop 255
  update-source Loopback10
  address-family ipv4 flowspec
   long-lived-graceful-restart stale-time send 86400 accept 86400
   route-policy accept in
   route-policy drop out
   maximum-prefix 100 90
   validation disable
  !
  address-family ipv6 flowspec
   long-lived-graceful-restart stale-time send 86400 accept 86400
   route-policy accept in
   route-policy drop out
   maximum-prefix 100 90
   validation disable
  !
 !
 vrf public
  address-family ipv4 flowspec
  address-family ipv6 flowspec
  neighbor 192.0.2.1
   use neighbor-group FLOWSPEC_IPV4_PUBLIC
   description akvorado-1
Then, you need to enable Flowspec on all interfaces with:
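On IOS XR, that could be something along these lines (double-check the syntax for your release; here the interfaces live in the public VRF):

flowspec
 vrf public
  address-family ipv4
   local-install interface-all
  !
  address-family ipv6
   local-install interface-all
  !
 !
!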
As with the RTBH setup, you can filter dropped flows with ForwardingStatus >=
128.
DDoS detection (continued)
In the example using Flowspec, the flows were also filtered on the length of the packet:
route flow4 {
  dst 203.0.113.68/32;
  sport = 53;
  length >= 1476 && <= 1500;
  proto = 17;
} {
  bgp_ext_community.add((generic, 0x80060000, 0x00000000));
};
This is an important addition: legitimate DNS requests are smaller than this and
therefore not filtered.2 With ClickHouse, you can get the 10th
and 90th percentiles of the packet sizes with quantiles(0.1,
0.9)(Bytes/Packets).
The last issue we need to tackle is how to optimize the request: it may take
several seconds to collect the data and it is likely to consume substantial
resources from your ClickHouse database. One solution is to create a
materialized view to pre-aggregate results:
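A possible shape for it, reusing the assumed column names from the detection query sketch above:

-- Pre-aggregated table: rows sharing the same sorting key are merged.
CREATE TABLE ddos_logs (
  TimeReceived DateTime,
  DstAddr IPv6,
  Proto UInt32,
  SrcPort UInt16,
  Gbps SimpleAggregateFunction(sum, Float64),
  Mpps SimpleAggregateFunction(sum, Float64),
  sources AggregateFunction(uniqCombined, IPv6),
  countries AggregateFunction(uniqCombined, FixedString(2)),
  size AggregateFunction(quantiles(0.1, 0.9), UInt64)
) ENGINE = SummingMergeTree
ORDER BY (TimeReceived, DstAddr, Proto, SrcPort)
TTL TimeReceived + INTERVAL 1 HOUR;

-- Materialized view feeding it from the raw flows table.
CREATE MATERIALIZED VIEW ddos_logs_view TO ddos_logs AS
  SELECT
    toStartOfMinute(TimeReceived) AS TimeReceived,
    DstAddr,
    Proto,
    SrcPort,
    sum(((Bytes * SamplingRate) * 8) / 60) / 1e9 AS Gbps,
    sum((Packets * SamplingRate) / 60) / 1e6 AS Mpps,
    uniqCombinedState(SrcAddr) AS sources,
    uniqCombinedState(SrcCountry) AS countries,
    quantilesState(0.1, 0.9)(toUInt64(Bytes/Packets)) AS size
  FROM flows
  GROUP BY TimeReceived, DstAddr, Proto, SrcPort;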
The ddos_logs table is using the SummingMergeTree engine. When the table
receives new data, ClickHouse replaces all the rows with the same sorting key,
as defined by the ORDER BY directive, with one row which contains summarized
values using either the sum() function or the explicitly specified aggregate
function (uniqCombined and quantiles in our example).3
Finally, we can modify our initial query with the following one:
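A sketch of that modified query, with the same assumed names as before; it merges the pre-aggregated states instead of scanning raw flows:

SELECT
  TimeReceived,
  DstAddr,
  dictGetOrDefault('protocols', 'name', toUInt64(Proto), '???') AS ProtoName,
  SrcPort,
  sum(Gbps) AS Gbps,
  sum(Mpps) AS Mpps,
  uniqCombinedMerge(sources) AS sources,
  uniqCombinedMerge(countries) AS countries,
  quantilesMerge(0.1, 0.9)(size) AS size
FROM ddos_logs
WHERE TimeReceived > now() - INTERVAL 5 MINUTE
GROUP BY TimeReceived, DstAddr, Proto, SrcPort
HAVING Gbps > 1
   OR (ProtoName = 'UDP' AND Gbps > 0.2)
   OR (sources > 20 AND Gbps > 0.1)
   OR (countries > 10 AND Gbps > 0.1)
ORDER BY TimeReceived DESC, Gbps DESC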
Gluing everything together
To sum up, building an anti-DDoS system requires following these steps:
define a set of criteria to detect a DDoS attack,
translate these criteria into SQL requests,
pre-aggregate flows into SummingMergeTree tables,
query and transform the results to a BIRD configuration file, and
configure your routers to pull the routes from BIRD.
A Python script like the following one can handle the fourth step. For each
attacked target, it generates both a Flowspec rule and a blackhole route.
import logging
import os
import socket
import subprocess
import types
from clickhouse_driver import Client as CHClient

logger = logging.getLogger("ddos-mitigation")

# Put your SQL query here!
SQL_QUERY = "…"

# How many anti-DDoS rules we want at the same time?
MAX_DDOS_RULES = 20

def empty_ruleset():
    ruleset = types.SimpleNamespace()
    ruleset.flowspec = types.SimpleNamespace()
    ruleset.blackhole = types.SimpleNamespace()
    ruleset.flowspec.v4 = []
    ruleset.flowspec.v6 = []
    ruleset.blackhole.v4 = []
    ruleset.blackhole.v6 = []
    return ruleset

def samefile(path1, path2):
    """Compare the content of two configuration files."""
    try:
        with open(path1) as f1, open(path2) as f2:
            return f1.read() == f2.read()
    except OSError:
        return False

current_ruleset = empty_ruleset()

client = CHClient(host="clickhouse.akvorado.net")
while True:
    results = client.execute(SQL_QUERY)
    seen = {}
    new_ruleset = empty_ruleset()
    for (t, addr, proto, port, gbps, mpps, flows, sources, countries, size) in results:
        if (addr, proto, port) in seen:
            continue
        seen[(addr, proto, port)] = True

        # Flowspec
        if addr.ipv4_mapped:
            address = addr.ipv4_mapped
            rules = new_ruleset.flowspec.v4
            table = "flow4"
            mask = 32
            nh = "proto"
        else:
            address = addr
            rules = new_ruleset.flowspec.v6
            table = "flow6"
            mask = 128
            nh = "next header"
        if size[0] == size[1]:
            length = f"length = {int(size[0])}"
        else:
            length = f"length >= {int(size[0])} && <= {int(size[1])}"
        header = f"""# Time: {t}
# Source: {address}, protocol: {proto}, port: {port}
# Gbps/Mpps: {gbps:.3}/{mpps:.3}, packet size: {int(size[0])}<=X<={int(size[1])}
# Flows: {flows}, sources: {sources}, countries: {countries}
"""
        rules.append(
            f"""{header}route {table} {{
  dst {address}/{mask};
  sport = {port};
  {length};
  {nh} = {socket.getprotobyname(proto)};
}} {{
  bgp_ext_community.add((generic, 0x80060000, 0x00000000));
}};
"""
        )

        # Blackhole
        if addr.ipv4_mapped:
            rules = new_ruleset.blackhole.v4
        else:
            rules = new_ruleset.blackhole.v6
        rules.append(
            f"""{header}route {address}/{mask} blackhole {{
  bgp_community.add((65535, 666));
}};
"""
        )

    new_ruleset.flowspec.v4 = list(set(new_ruleset.flowspec.v4[:MAX_DDOS_RULES]))
    new_ruleset.flowspec.v6 = list(set(new_ruleset.flowspec.v6[:MAX_DDOS_RULES]))

    # TODO: advertise changes by mail, chat, …

    current_ruleset = new_ruleset
    changes = False
    for rules, path in (
        (current_ruleset.flowspec.v4, "v4-flowspec"),
        (current_ruleset.flowspec.v6, "v6-flowspec"),
        (current_ruleset.blackhole.v4, "v4-blackhole"),
        (current_ruleset.blackhole.v6, "v6-blackhole"),
    ):
        path = os.path.join("/etc/bird/", f"{path}.conf")
        with open(f"{path}.tmp", "w") as f:
            for r in rules:
                f.write(r)
        changes = (
            changes
            or not os.path.exists(path)
            or not samefile(path, f"{path}.tmp")
        )
        os.rename(f"{path}.tmp", path)

    if not changes:
        continue

    proc = subprocess.Popen(
        ["birdc", "configure"],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    stdout, stderr = proc.communicate(None)
    stdout = stdout.decode("utf-8", "replace")
    stderr = stderr.decode("utf-8", "replace")
    if proc.returncode != 0:
        logger.error(
            "{} error:\n{}\n{}".format(
                "birdc reconfigure",
                "\n".join([" O: {}".format(line) for line in stdout.rstrip().split("\n")]),
                "\n".join([" E: {}".format(line) for line in stderr.rstrip().split("\n")]),
            )
        )
Until Akvorado integrates DDoS detection and mitigation, the ideas presented
in this blog post provide a solid foundation to get started with your own
anti-DDoS system.
ClickHouse can export results using Markdown format when
appending FORMAT Markdown to the query.
While most DNS clients should retry with TCP on failures, this is not
always the case: until recently, musl libc did not implement this.
The materialized view also aggregates the data at hand, both
for efficiency and to ensure we work with the right data types.
Akvorado collects network flows using IPFIX or sFlow. It stores them
in a ClickHouse database. A web console allows a user to query the data and
plot some graphs. A nice aspect of this console is how we can filter flows with
a SQL-like language:
Filter editor in Akvorado console
Often, web interfaces expose a query builder to build such filters. I think combining a
SQL-like language with an editor supporting completion, syntax
highlighting, and linting is a better approach.1
The language parser is built with pigeon (Go) from a parsing expression
grammar or PEG. The editor component is CodeMirror (TypeScript).
Language parser
PEG grammars are relatively recent2 and are an alternative to
context-free grammars. They are easier to write and they can generate better
error messages. Python switched from an LL(1)-based parser to a PEG-based
parser in Python 3.9.
pigeon generates a parser for Go. A grammar is a set of rules. Each rule is
an identifier, with an optional user-friendly label for error messages, an
expression, and an action in Go to be executed on match. You can find the
complete grammar in parser.peg. Here is a simplified rule:
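Roughly, the simplified rule looks like the sketch below (transcribed from memory; see parser.peg for the authoritative version):

ConditionIPExpr "condition on IP" ←
  column:("ExporterAddress"i { return "ExporterAddress", nil }
        / "SrcAddr"i { return "SrcAddr", nil }
        / "DstAddr"i { return "DstAddr", nil }) _
  operator:("=" / "!=") _
  ip:IP {
    return fmt.Sprintf("%s %s IPv6StringToNum('%s')", column, operator, ip), nil
  }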
The rule identifier is ConditionIPExpr. It case-insensitively matches
ExporterAddress, SrcAddr, or DstAddr. The action for each case returns the
proper case for the column name. That's what is stored in the column variable.
Then, it matches one of the possible operators. As there is no code block, it
stores the matched string directly in the operator variable. Then, it tries to
match the IP rule, which is defined elsewhere in the grammar. If it succeeds,
it stores the result of the match in the ip variable and executes the final
action. The action turns the column, operator, and IP into a proper expression
for ClickHouse. For example, if we have ExporterAddress = 203.0.113.15, we
get ExporterAddress = IPv6StringToNum('203.0.113.15').
The IP rule uses a rudimentary regular expression but checks if the matched
address is correct in the action block, thanks to netip.ParseAddr():
IP "IP address" ← [0-9A-Fa-f:.]+ {
  ip, err := netip.ParseAddr(string(c.text))
  if err != nil {
    return "", errors.New("expecting an IP address")
  }
  return ip.String(), nil
}
Our parser safely turns the filter into a WHERE clause accepted by
ClickHouse:3
Integration in CodeMirror
CodeMirror is a versatile code editor that can be easily integrated into
JavaScript projects. In Akvorado, the Vue.js component,
InputFilter, uses CodeMirror as its foundation and
leverages features such as syntax highlighting, linting, and completion. The
source code for these capabilities can be found in the
codemirror/lang-filter/ directory.
Syntax highlighting
The PEG grammar for Go cannot be used directly4 and the requirements
for parsers for editors are distinct: they should be error-tolerant and operate
incrementally, as code is typically updated character by character. CodeMirror
offers a solution through its own parser generator, Lezer.
We don't need this additional parser to fully understand the filter language.
Only the basic structure is needed: column names, comparison and logic
operators, quoted and unquoted values. The grammar is therefore quite short
and does not need to be updated often:
@top Filter {
  expression
}
expression {
  Not expression |
  "(" expression ")" |
  "(" expression ")" And expression |
  "(" expression ")" Or expression |
  comparisonExpression And expression |
  comparisonExpression Or expression |
  comparisonExpression
}
comparisonExpression {
  Column Operator Value
}
Value {
  String | Literal | ValueLParen ListOfValues ValueRParen
}
ListOfValues {
  ListOfValues ValueComma (String | Literal) |
  String | Literal
}

// […]

@tokens {
  // […]
  Column { std.asciiLetter (std.asciiLetter | std.digit)* }
  Operator { $[a-zA-Z!=><]+ }
  String {
    '"' (![\\\n"] | "\\" _)* '"'? |
    "'" (![\\\n'] | "\\" _)* "'"?
  }
  Literal { (std.digit | std.asciiLetter | $[.:/])+ }
  // […]
}
The expression SrcAS = 12322 AND (DstAS = 1299 OR SrcAS = 29447) is parsed to:
Linting
We offload linting to the original parser in Go. The
/api/v0/console/filter/validate endpoint accepts a filter and returns a JSON
structure with the errors that were found:
{
  "message": "at line 1, position 12: string literal not terminated",
  "errors": [
    {
      "line": 1,
      "column": 12,
      "offset": 11,
      "message": "string literal not terminated"
    }
  ]
}
Completion
The completion system takes a hybrid approach. It splits the work between the
frontend and the backend to offer useful suggestions for completing filters.
The frontend uses the parser built with Lezer to determine the context of
the completion: do we complete a column name, an operator, or a value? It also
extracts the column name if we are completing something else. It forwards the
result to the backend through the /api/v0/console/filter/complete endpoint.
Walking the syntax tree was not as easy as I thought, but unit tests helped
a lot.
The backend uses the parser generated by pigeon to complete a column name
or a comparison operator. For values, the completions are either static or
extracted from the ClickHouse database. A user can complete an AS number from
an organization name thanks to the following snippet:
results := []struct {
	Label  string `ch:"label"`
	Detail string `ch:"detail"`
}{}
columnName := "DstAS"
sqlQuery := fmt.Sprintf(`
 SELECT concat('AS', toString(%s)) AS label, dictGet('asns', 'name', %s) AS detail
 FROM flows
 WHERE TimeReceived > date_sub(minute, 1, now())
 AND detail != ''
 AND positionCaseInsensitive(detail, $1) >= 1
 GROUP BY label, detail
 ORDER BY COUNT(*) DESC
 LIMIT 20`, columnName, columnName)
if err := conn.Select(ctx, &results, sqlQuery, input.Prefix); err != nil {
	c.r.Err(err).Msg("unable to query database")
	break
}
for _, result := range results {
	completions = append(completions, filterCompletion{
		Label:  result.Label,
		Detail: result.Detail,
		Quoted: false,
	})
}
In my opinion, the completion system is a major factor in making the field
editor an efficient way to select flows. While a query builder may have been
more beginner-friendly, the completion system's ease of use and functionality
make it more enjoyable to use once you become familiar with it.
Moreover, building a query builder did not seem like a fun task for me.
My toilet is equipped with a Geberit Sigma 70 flush plate. The sales pitch
for this hydraulic-assisted device praises the
ingenious mount that acts like a rocker switch. In practice, the flush is very
capricious and has a very high failure rate. Avoid this type of
mechanism! Prefer a fully mechanical version like the Geberit Sigma 20.
After several plumbers, exchanges with Geberit's technical department, and the
expensive replacement of the entire mechanism, I was still getting a failure rate
of over 50% for the small flush. I finally managed to decrease this rate to 5%
by applying two 8 mm silicone bumpers on the back of the plate. Their
locations are indicated by red circles on the picture below:
Geberit Sigma 70 flush plate. Above: the mechanism installed on the wall. Below, the back of the glass plate. In red, the two places where to apply the silicone bumpers.
Expect to pay about €5 and to spend as many minutes on this operation.
Protocol Buffers are a popular choice for serializing structured data
due to their compact size, fast processing speed, language independence, and
compatibility. There exist other alternatives, including Cap'n Proto,
CBOR, and Avro.
Usually, data structures are described in a proto definition file
(.proto). The protoc compiler and a language-specific plugin convert it into
code:
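For example, with a small made-up message definition:

syntax = "proto3";

message FlowMessage {
  uint64 time_received = 1;
  uint32 sampling_rate = 2;
  bytes exporter_address = 3;
  uint32 src_as = 4;
  uint32 dst_as = 5;
}

Running protoc --go_out=. flow.proto with the protoc-gen-go plugin turns this into a Go package containing a FlowMessage struct and its marshaling code.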
Akvorado collects network flows using IPFIX or sFlow, decodes them
with GoFlow2, encodes them to Protocol Buffers, and sends them to
Kafka to be stored in a ClickHouse database. Collecting a new field,
such as source and destination MAC addresses, requires modifications in multiple
places, including the proto definition file and the ClickHouse migration code.
Moreover, the cost is paid by all users.1 It would be nice to have an
application-wide schema and let users enable or disable the fields they
need.
While the main goal is flexibility, we do not want to sacrifice performance. On
this front, this is quite a success: when upgrading from 1.6.4 to 1.7.1, the
decoding and encoding performance almost doubled!
Faster Protocol Buffers encoding
I use the following code to benchmark both the decoding and
encoding process. Initially, the Decode() method is a thin layer above
GoFlow2 producer and stores the decoded data into the in-memory structure
generated by protoc. Later, some of the data will be encoded directly during
flow decoding. This is why we measure both the decoding and the
encoding.2
The canonical Go implementation for Protocol Buffers,
google.golang.org/protobuf is not the most
efficient one. For a long time, people were relying on gogoprotobuf.
However, the project is now deprecated. A good replacement is
vtprotobuf.3
Dynamic Protocol Buffers encoding
We have our baseline. Let s see how to encode our Protocol Buffers without a
.proto file. The wire format is simple and relies heavily on variable-width
integers.
Variable-width integers, or varints, are an efficient way of encoding unsigned
integers using a variable number of bytes, from one to ten, with small values
using fewer bytes. They work by splitting integers into 7-bit payloads and using
the 8th bit as a continuation indicator, set to 1 for all payloads
except the last.
Variable-width integers encoding in Protocol Buffers
For our usage, we only need two types: variable-width
integers and byte sequences. A byte sequence is encoded by prefixing it by its
length as a varint. When a message is encoded, each key-value pair is turned
into a record consisting of a field number, a wire type, and a payload. The
field number and the wire type are encoded as a single variable-width integer
called a tag.
Message encoded with Protocol Buffers
We use the following low-level functions to build the output buffer:
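These are thin wrappers around google.golang.org/protobuf/encoding/protowire; a sketch of what they could look like (the names are assumptions):

// ProtobufAppendVarint appends a tag followed by a varint-encoded value.
func ProtobufAppendVarint(payload []byte, num protowire.Number, value uint64) []byte {
	payload = protowire.AppendTag(payload, num, protowire.VarintType)
	return protowire.AppendVarint(payload, value)
}

// ProtobufAppendBytes appends a tag followed by a length-prefixed byte sequence.
func ProtobufAppendBytes(payload []byte, num protowire.Number, value []byte) []byte {
	payload = protowire.AppendTag(payload, num, protowire.BytesType)
	return protowire.AppendBytes(payload, value)
}

For example, protowire.AppendVarint(nil, 300) yields the two bytes 0xac 0x02.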
Our schema abstraction contains the appropriate information to encode a message
(ProtobufIndex) and to generate a proto definition file (fields starting with
Protobuf):
type Column struct {
	Key      ColumnKey
	Name     string
	Disabled bool
	// […]
	// For protobuf.
	ProtobufIndex    protowire.Number
	ProtobufType     protoreflect.Kind // Uint64Kind, Uint32Kind, …
	ProtobufEnum     map[int]string
	ProtobufEnumName string
	ProtobufRepeated bool
}
We have a few helper methods around the protowire functions to directly
encode the fields while decoding the flows. They skip disabled fields or
non-repeated fields already encoded. Here is an excerpt of the sFlow
decoder:
For fields that are required later in the pipeline, like source and destination
addresses, they are stored unencoded in a separate structure:
type FlowMessage struct {
	TimeReceived uint64
	SamplingRate uint32

	// For exporter classifier
	ExporterAddress netip.Addr

	// For interface classifier
	InIf  uint32
	OutIf uint32

	// For geolocation or BMP
	SrcAddr netip.Addr
	DstAddr netip.Addr
	NextHop netip.Addr

	// Core component may override them
	SrcAS     uint32
	DstAS     uint32
	GotASPath bool

	// protobuf is the protobuf representation for the information not contained above.
	protobuf    []byte
	protobufSet bitset.BitSet
}
The protobuf slice holds encoded data. It is initialized with a capacity of
500 bytes to avoid resizing during encoding. There is also some reserved room at
the beginning to be able to encode the total size as a variable-width integer.
Upon finalizing encoding, the remaining fields are added and the message length
is prefixed:
func (schema *Schema) ProtobufMarshal(bf *FlowMessage) []byte {
	schema.ProtobufAppendVarint(bf, ColumnTimeReceived, bf.TimeReceived)
	schema.ProtobufAppendVarint(bf, ColumnSamplingRate, uint64(bf.SamplingRate))
	schema.ProtobufAppendIP(bf, ColumnExporterAddress, bf.ExporterAddress)
	schema.ProtobufAppendVarint(bf, ColumnSrcAS, uint64(bf.SrcAS))
	schema.ProtobufAppendVarint(bf, ColumnDstAS, uint64(bf.DstAS))
	schema.ProtobufAppendIP(bf, ColumnSrcAddr, bf.SrcAddr)
	schema.ProtobufAppendIP(bf, ColumnDstAddr, bf.DstAddr)

	// Add length and move it as a prefix
	end := len(bf.protobuf)
	payloadLen := end - maxSizeVarint
	bf.protobuf = protowire.AppendVarint(bf.protobuf, uint64(payloadLen))
	sizeLen := len(bf.protobuf) - end
	result := bf.protobuf[maxSizeVarint-sizeLen : end]
	copy(result, bf.protobuf[end:end+sizeLen])

	return result
}
Minimizing allocations is critical for maintaining encoding performance. The
benchmark tests should be run with the -benchmem flag to monitor allocation
numbers. Each allocation incurs an additional cost to the garbage collector. The
Go profiler is a valuable tool for identifying areas of code that can be
optimized:
$ go test -run=__nothing__ -bench=Netflow/with_encoding \
>         -benchmem -cpuprofile profile.out \
>         akvorado/inlet/flow
goos: linux
goarch: amd64
pkg: akvorado/inlet/flow
cpu: AMD Ryzen 5 5600X 6-Core Processor
Netflow/with_encoding-12    143953    7955 ns/op    8256 B/op    134 allocs/op
PASS
ok      akvorado/inlet/flow     1.418s
$ go tool pprof profile.out
File: flow.test
Type: cpu
Time: Feb 4, 2023 at 8:12pm (CET)
Duration: 1.41s, Total samples = 2.08s (147.96%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) web
After using the internal schema instead of code generated from the
proto definition file, the performance improved. However, this comparison is not
entirely fair as less information is being decoded and previously GoFlow2 was
decoding to its own structure, which was then copied to our own version.
As for testing, we use github.com/jhump/protoreflect: the
protoparse package parses the proto definition file we generate and the
dynamic package decodes the messages. Check the ProtobufDecode()
method for more details.4
To get the final figures, I have also optimized the decoding in GoFlow2. It
was relying heavily on binary.Read(). This function may use
reflection in certain cases and each call allocates a byte array to read data.
Replacing it with a more efficient version provides the following
improvement:
It is now easier to collect new data and the inlet component is faster!
Notice
Some paragraphs were editorialized by ChatGPT, using
editorialize and keep it short as a prompt. The result was proofread by a
human for correctness. The main idea is that ChatGPT should be better at
English than me.
While empty fields are not serialized to Protocol Buffers, empty
columns in ClickHouse take some space, even if they compress well.
Moreover, unused fields are still decoded and they may clutter the
interface.
There is a similar function using NetFlow. NetFlow and IPFIX
protocols are less complex to decode than sFlow as they use a simpler
TLV structure.
vtprotobuf generates more optimized Go code by removing an
abstraction layer. It directly generates the code encoding each field to
bytes:
A few years ago, I downsized my personal infrastructure. Until 2018, there were
a dozen containers running on a single Hetzner server.1 I migrated
my emails to Fastmail and my DNS zones to Gandi. It left me with only my
blog to self-host. As of today, my low-scale infrastructure is composed of 4
virtual machines running NixOS on Hetzner Cloud and Vultr, a handful
of DNS zones on Gandi and Route 53, and a couple of CloudFront
distributions. It is managed by CDK for Terraform (CDKTF), while NixOS
deployments are handled by NixOps.
In this article, I provide a brief introduction to Terraform, CDKTF, and the
Nix ecosystem. I also explain how to use Nix to access these tools within
your shell, so you can quickly start using them.
CDKTF: infrastructure as code
Terraform is an infrastructure-as-code tool. You can define your
infrastructure by declaring resources with the HCL language. This language
has some additional features like loops to declare several resources from a
list, built-in functions you can call in expressions, and string templates.
Terraform relies on a large set of providers to manage resources.
Managing servers
Here is a short example using the Hetzner Cloud provider to spawn a virtual
machine:
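Something along these lines (server type, image, and datacenter are illustrative):

resource "hcloud_server" "web03" {
  name        = "web03"
  server_type = "cpx11"
  image       = "debian-12"
  datacenter  = "nbg1-dc3"
}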
HCL expressiveness is quite limited and I find a general-purpose language more
convenient to describe all the resources. This is where CDK for Terraform
comes in: you can manage your infrastructure using your preferred programming
language, including TypeScript, Go, and Python. Here is the previous example
using CDKTF and TypeScript:
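A sketch of the equivalent stack; the bindings under .gen/ are generated by cdktf get, so the exact import paths depend on your setup:

import { Construct } from "constructs";
import { App, TerraformStack } from "cdktf";
import { HcloudProvider } from "./.gen/providers/hcloud/provider";
import { Server } from "./.gen/providers/hcloud/server";

class HetznerStack extends TerraformStack {
  constructor(scope: Construct, name: string) {
    super(scope, name);
    // Token is taken from the HCLOUD_TOKEN environment variable
    new HcloudProvider(this, "hcloud", {});
    new Server(this, "web03", {
      name: "web03",
      serverType: "cpx11",
      image: "debian-12",
      datacenter: "nbg1-dc3",
    });
  }
}

const app = new App();
new HetznerStack(app, "cdktf-take1");
app.synth();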
Running cdktf synth generates a configuration file for Terraform, terraform
plan previews the changes, and terraform apply applies them. Now that you
have a general-purpose language, you can use functions.
Managing DNS records
While using CDKTF for 4 web servers may seem a tad overkill, this is quite
different when it comes to managing a few DNS zones. With DNSControl, which
is using JavaScript as a domain-specific language, I was able to define the
bernat.ch zone with this snippet of code:
All the magic is in the code that I did not show you. You can check the
dns.ts file in the cdktf-take1 repository to see how it works. Here is a
quick explanation:
Route53Zone() creates a new zone hosted by Route 53,
sign() signs the zone with the provided master key,
registrar() registers the zone to the registrar of the domain and sets up DNSSEC,
www() creates A and AAAA records for the provided name pointing to the web servers,
fastmailMX() creates the MX records and other support records to direct emails to Fastmail.
Here is the content of the fastmailMX() function. It generates a few records
and returns the current zone for chaining:
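A rough sketch of how it could be written with the CDKTF AWS provider; the helper signature and record names are assumptions, only the MX targets are Fastmail's documented servers:

// Inside the stack class
fastmailMX(zone: Route53Zone, subdomains: string[] = ["@"]): Route53Zone {
  for (const subdomain of subdomains) {
    new Route53Record(this, `MX-${subdomain}-${zone.name}`, {
      zoneId: zone.zoneId,
      name: subdomain === "@" ? zone.name : `${subdomain}.${zone.name}`,
      type: "MX",
      ttl: 60 * 60 * 6,
      records: [
        "10 in1-smtp.messagingengine.com.",
        "20 in2-smtp.messagingengine.com.",
      ],
    });
  }
  return zone;
}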
I encourage you to browse the repository if you need more
information.
About Pulumi
My first tentative around Terraform was to use Pulumi. You can find this
attempt on GitHub. This is quite similar to what I currently do
with CDKTF. The main difference is that I am using Python instead of TypeScript
because I was not familiar with TypeScript at the time.2
Pulumi predates CDKTF and it uses a slightly different approach. CDKTF
generates a Terraform configuration (in JSON format instead of HCL), delegating
planning, state management, and deployment to Terraform. It is therefore bound
to the limitations of what can be expressed by Terraform, notably when you
need to transform data obtained from one resource to another.3
Pulumi needs specific providers for each resource. Many Pulumi providers are
thin wrappers encapsulating Terraform providers.
While Pulumi provides a good user experience, I switched to CDKTF because
writing providers for Pulumi is a chore. CDKTF does not require such a step.
Outside the big players (AWS, Azure and Google Cloud), the existence, quality,
and freshness of the Pulumi providers are inconsistent. Most providers rely on
a Terraform provider and they may lag a few versions behind, miss a few
resources, or have a few bugs of their own.
When a provider does not exist, you can write one with the help of the
pulumi-terraform-bridge library. The Pulumi project provides a
boilerplate for this purpose. I had a bad experience with it when writing
providers for Gandi and Vultr: the
Makefile automatically installs Pulumi using a curl | sh
pattern and does not work with /bin/sh. There is a lack of
interest in community-based contributions4 or even in providers for
smaller players.
NixOS & NixOps
Nix is a purely functional programming language.
Nix is also the name of the package manager that is built on top of
the Nix language. It allows users to declaratively install packages.
nixpkgs is a repository of packages. You can install Nix on
top of a regular Linux distribution. If you want more details, a good resource
is the official website, and notably the learn section. There is a steep learning curve, but the reward is tremendous.
NixOS: declarative Linux distribution
NixOS is a Linux distribution built on top of the Nix package manager.
Here is a configuration snippet to add some packages:
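For example, something like this (not my actual package list):
environment.systemPackages = with pkgs; [
  bat
  htop
  mtr
  tcpdump
  tmux
];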
It is possible to alter an existing derivation5 to use a different version, enable a
specific feature, or apply a patch. Here is how I enable and configure Nginx
to disable the stream module, add the Brotli compression module, and add the
IP address anonymizer module. Moreover, instead of using OpenSSL 3, I keep
using OpenSSL 1.1.6
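In nixpkgs, this kind of customization goes through overriding the arguments of the Nginx derivation. Here is a sketch of what it can look like, using the attribute names found in current nixpkgs, which may differ from the exact expression I use:
services.nginx.package = pkgs.nginxStable.override {
  # Disable the stream module.
  withStream = false;
  # Add the Brotli compression and IP address anonymizer modules.
  modules = with pkgs.nginxModules; [
    brotli
    ipscrub
  ];
  # Keep using OpenSSL 1.1 instead of OpenSSL 3.
  openssl = pkgs.openssl_1_1;
};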
If you need to add some patches, it is also possible. Here are the patches I
added in 2019 to circumvent the DoS vulnerabilities in Nginx
until they were fixed in NixOS:7
services.nginx.package = pkgs.nginxStable.overrideAttrs (old: {
  patches = old.patches ++ [
    # HTTP/2: reject zero length headers with PROTOCOL_ERROR.
    (pkgs.fetchpatch {
      url = "https://github.com/nginx/nginx/commit/dbdd[…].patch";
      sha256 = "a48190[…]";
    })
    # HTTP/2: limited number of DATA frames.
    (pkgs.fetchpatch {
      url = "https://github.com/nginx/nginx/commit/94c5[…].patch";
      sha256 = "af591a[…]";
    })
    # HTTP/2: limited number of PRIORITY frames.
    (pkgs.fetchpatch {
      url = "https://github.com/nginx/nginx/commit/39bb[…].patch";
      sha256 = "1ad8fe[…]";
    })
  ];
});
If you are interested, have a look at my relatively small configuration:
common.nix contains the configuration to be applied to any host
(SSH, users, common software packages), web.nix contains the
configuration for the web servers, and isso.nix runs Isso in a
systemd container.
NixOps: NixOS deployment tool
On a single node, NixOS configuration is in the /etc/nixos/configuration.nix
file. After modifying it, you have to run nixos-rebuild switch. Nix fetches
all possible dependencies from the binary cache and builds the remaining
packages. It creates a new entry in the boot loader menu and activates the new
configuration.
To manage several nodes, there exist several options, including NixOps,
deploy-rs, Colmena, and morph. I do not know all of them, but from
my point of view, the differences are not that important. It is also possible to
build such a tool yourself as Nix provides the most important building blocks:
nix build and nix copy. NixOps is one of the first tools available but I
encourage you to explore the alternatives.
NixOps configuration is written in Nix. Here is a simplified configuration
to deploy znc01.luffy.cx, web01.luffy.cx, and web02.luffy.cx, with the
help of the server and web functions:
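The snippet below is a minimal sketch of what such a configuration could look like; the server and web helpers shown here are simplified stand-ins for the real ones:
let
  # Hypothetical helpers: build a NixOS module for a machine.
  server = name: extra: {
    deployment.targetHost = "${name}.luffy.cx";
    networking.hostName = name;
    imports = [ ./common.nix ] ++ extra;
  };
  web = name: server name [ ./web.nix ];
in {
  network.description = "Luffy infrastructure";
  znc01 = server "znc01" [ ];
  web01 = web "web01";
  web02 = web "web02";
}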
Tying everything together with Nix
The Nix ecosystem is a unified solution to the various problems around
software and configuration management. A very interesting feature is the
declarative and reproducible developer environments. This is similar to
Python virtual environments, except it is not language-specific.
Brief introduction to Nix flakes
I am using flakes, a new Nix feature improving reproducibility by pinning
all dependencies and making the build hermetic. While the feature is marked as
experimental,8 it is widely used and you may see flake.nix and
flake.lock at the root of some repositories.
As a short example, here is the flake.nix content shipped with Snimpy, an
interactive SNMP tool for Python relying on libsmi, a C library:
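The real file also builds Snimpy itself; the heavily simplified sketch below only shows the general structure of a flake providing a development shell with libsmi and a Python interpreter:
{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
    flake-utils.url = "github:numtide/flake-utils";
  };
  outputs = { self, nixpkgs, flake-utils }:
    flake-utils.lib.eachDefaultSystem (system:
      let pkgs = nixpkgs.legacyPackages.${system};
      in {
        # Development shell with the C library and a Python interpreter.
        devShells.default = pkgs.mkShell {
          buildInputs = [
            pkgs.libsmi
            (pkgs.python3.withPackages (ps: [ ps.cffi ]))
          ];
        };
      });
}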
Nix and CDKTF
At the root of the repository I use for CDKTF, there is a
flake.nix file to set up a shell with Terraform and
CDKTF installed and with the appropriate environment variables to automate my
infrastructure.
Terraform is already packaged in nixpkgs. However, I need to apply a patch
on top of the Gandi provider. Not a problem with Nix!
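Here is a sketch of how such an override can be written with withPlugins; the fork, branch, and hash below are placeholders, not the actual ones I use:
terraform = pkgs.terraform.withPlugins (p: [
  p.hcloud
  p.vultr
  # Patched Gandi provider, built from a fork.
  (p.gandi.overrideAttrs (old: {
    src = pkgs.fetchFromGitHub {
      owner = "some-user";                  # hypothetical fork
      repo = "terraform-provider-gandi";
      rev = "some-branch-with-the-patch";   # placeholder
      hash = pkgs.lib.fakeHash;             # Nix will report the hash to use
    };
  }))
]);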
CDKTF is written in TypeScript. I have a
package.json file with all the dependencies
needed, including the ones to use TypeScript as the language to define
infrastructure:
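Stripped down, it looks like this (package names only; the versions are omitted on purpose as they would be outdated anyway):
{
  "name": "cdktf-take1",
  "private": true,
  "dependencies": {
    "@types/node": "*",
    "cdktf": "*",
    "cdktf-cli": "*",
    "constructs": "*",
    "typescript": "*"
  }
}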
The next step is to generate the CDKTF providers from the Terraform providers
and turn them into a derivation:
cdktfProviders = pkgs.stdenvNoCC.mkDerivation {
  name = "cdktf-providers";
  nativeBuildInputs = [
    pkgs.nodejs
    terraform
  ];
  src = nix-filter {
    root = ./.;
    include = [ ./cdktf.json ./tsconfig.json ];
  };
  buildPhase = ''
    export HOME=$(mktemp -d)
    export CHECKPOINT_DISABLE=1
    export DISABLE_VERSION_CHECK=1
    export PATH=${nodeEnv}/node_modules/.bin:$PATH
    ln -nsf ${nodeEnv}/node_modules node_modules

    # Build all providers we have in Terraform
    for provider in $(cd ${terraform}/libexec/terraform-providers; echo */*/*/*); do
      version=''${provider##*/}
      provider=''${provider%/*}
      echo "Build $provider@$version"
      cdktf provider add --force-local $provider@$version | cat
    done
    echo "Compile TS → JS"
    tsc
  '';
  installPhase = ''
    mv .gen $out
    ln -nsf ${nodeEnv}/node_modules $out/node_modules
  '';
};
Finally, we can define the development environment:
devShells.default = pkgs.mkShell {
  name = "cdktf-take1";
  buildInputs = [
    pkgs.nodejs
    pkgs.yarn
    terraform
  ];
  shellHook = ''
    # No telemetry
    export CHECKPOINT_DISABLE=1
    # No autoinstall of plugins
    export CDKTF_DISABLE_PLUGIN_CACHE_ENV=1
    # Do not check version
    export DISABLE_VERSION_CHECK=1
    # Access to node modules
    export PATH=$PWD/node_modules/.bin:$PATH
    ln -nsf ${nodeEnv}/node_modules node_modules
    ln -nsf ${cdktfProviders} .gen

    # Credentials
    for p in \
      njf.nznmba.pbz/Nqzvavfgengbe \
      urgmare.pbz/ivaprag@oreang.pu \
      ihyge.pbz/ihyge@ivaprag.oreang.pu; do
        eval $(pass show $(echo $p | tr 'A-Za-z' 'N-ZA-Mn-za-m') | grep '^export')
    done
    eval $(pass show personal/cdktf/secrets | grep '^export')
    export TF_VAR_hcloudToken="$HCLOUD_TOKEN"
    export TF_VAR_vultrApiKey="$VULTR_API_KEY"
    unset VULTR_API_KEY HCLOUD_TOKEN
  '';
};
The derivations listed in buildInputs are available in the provided shell.
The content of shellHook is sourced when starting the shell. It sets up some
symbolic links to make the JavaScript environment built at an earlier step
available, as well as the generated CDKTF providers. It also exports all the
credentials.10
I am also using direnv with an .envrc to
automatically load the development environment. This also enables the
environment to be available from inside Emacs, notably when using lsp-mode
to get TypeScript completions. Without direnv, nix develop . can activate
the environment.
I use the following commands to deploy the infrastructure:11
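They look roughly like this; the stack directory under cdktf.out depends on the stack name:
$ cdktf synth
$ cd cdktf.out/stacks/cdktf-take1
$ terraform plan --out=plan
$ terraform apply plan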
As for CDKTF, at the root of the repository I use for NixOps,
there is a flake.nix file to set up a shell with
NixOps configured. Because NixOps does not support rollouts, I usually use the
following commands to deploy on a single server:12
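For example, to deploy only web04 (the --include flag restricts the deployment to the listed machines):
$ nixops deploy --include=web04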
nixops deploy rolls out all servers in parallel and therefore could cause a
short outage where all Nginx instances are down at the same time.
This post has been a work-in-progress for the past three years, with the content
being updated and refined as I experimented with different solutions. There is
still much to explore13 but I feel there is enough content to publish now.
It was an AMD Athlon 64 X2 5600+ with 2 GB of RAM and two 400 GB disks
with software RAID. I was paying around €59 per month for it.
While it was a good deal in 2008, by 2018 it was no longer cost-effective.
It was running on Debian Wheezy with Linux-VServer for isolation, both
of which were outdated in 2018.
I also did not use Python because Poetry support in Nix was a bit
broken around the time I started hacking around CDKTF.
Pulumi can apply arbitrary functions with the apply()
method on an output. It makes it easy to transform data that are not known
during the planning stage. Terraform has functions to
serve a similar purpose, but they are more limited.
The two mentioned pull requests are not merged yet. The second one is
superseded by PR #61, submitted two months later, which
enforces the use of /bin/bash. I also submitted PR #56,
which was merged 4 months later and quickly reverted without
an explanation.
You may consider packages and derivations to be synonyms in the
Nix ecosystem.
NixOS can be a bit slow to integrate patches since they need to
rebuild parts of the binary cache before releasing the fixes. In this
specific case, they were fast: the vulnerability and patches were released
on August 13th 2019 and available in NixOS on August 15th. As a
comparison, Debian only released the fixed version on August 22nd, which is unusually late.
Because flakes are experimental, much of the documentation does not
use them, and they are an additional aspect to learn.
It is possible to replace . with github:vincentbernat/snimpy,
like in the other commands, but having Snimpy dependencies without
Snimpy source code is less interesting.
I am using pass as a password manager. The password names are
only obfuscated to avoid spam.
The cdktf command can wrap the terraform commands, but I
prefer to use them directly as they are more flexible.
If the change is risky, I disable the server with CDKTF. This
removes it from the web service DNS records.
I would like to replace NixOps with an alternative handling
progressive rollouts and checks. I am also considering switching to
Nomad or Kubernetes to deploy workloads.
Earlier this year, we released Akvorado, a flow collector, enricher, and
visualizer. It receives network flows from your routers using either NetFlow
v9, IPFIX, or sFlow. Several pieces of information are added, like
GeoIP and interface names. The flows are exported to Apache Kafka, a
distributed queue, then stored inside ClickHouse, a column-oriented
database. A web frontend is provided to run queries. A live version is
available for you to play with.
Akvorado's web frontend
Several alternatives exist:
Akvorado differentiates itself from these solutions because:
it is open source (licensed under the AGPLv3 license), and
it bundles flow collection, storage, and a web interface into a single
product.
The proposed deployment solution relies on Docker Compose to set up
Akvorado, Zookeeper, Kafka, and ClickHouse. I hope it is enough
for anyone to get started quickly. Akvorado is performant enough to handle
100 000 flows per second with 64 GB of RAM and 24 vCPU. With 2 TB of disk, you
should expect to keep data for a few years.
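Assuming you have retrieved the Docker Compose files shipped with a release, starting the whole stack should boil down to something like:
$ docker compose up -d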
I spent some time writing a fairly complete documentation. It
seems redundant to repeat its content in this blog post. There is also a section
about its internal design if you are interested in how it is built. I also
did a FRnOG presentation earlier this year, and a ClickHouse meetup
presentation, which focuses more on how ClickHouse is used. I plan to write
more detailed articles on specific aspects of Akvorado. Stay tuned!
While the collector could write directly to the database, the queue
buffers flows if the database is unavailable. It also enables you to process
flows with another piece of software (like an anti-DDoS system).
TL;DR
Never trust show commit changes diff on Cisco IOS XR.
Cisco IOS XR is the operating system running on the Cisco ASR, NCS, and
8000 routers. Compared to Cisco IOS, it features a candidate
configuration and a running configuration. In configuration mode, you can
modify the first one and issue the commit command to apply it to the running
configuration.1 This is a common concept for many NOS.
Before committing the candidate configuration to the running configuration, you
may want to check the changes that have accumulated so far. That's where the
show commit changes diff command2 comes in. Its goal is to show the
difference between the running configuration (show running-config) and
the candidate configuration (show configuration merge). How hard can it be?
Let's put an interface down on IOS XR 7.6.2 (released in August 2022):
The + sign before interface HundredGigE0/1/0/1 makes it look like you did
create a new interface. Maybe there was a typo? No, the diff is just broken. If
you look at the candidate configuration, everything is as you expect:
RP/0/RP0/CPU0:router(config)#show configuration merge int Hu0/1/0/1
Wed Nov 23 11:08:43.360 CET
interface HundredGigE0/1/0/1
 description PNI: (some description)
 bundle id 4000 mode active
 lldp
  receive disable
  transmit disable
 !
 shutdown
 load-interval 30
Here is a more problematic example on IOS XR 7.2.2 (released in January 2021).
We want to unconfigure three interfaces:
RP/0/RP0/CPU0:router(config)#no int GigabitEthernet 0/0/0/5
RP/0/RP0/CPU0:router(config)#int TenGigE 0/0/0/5 shut
RP/0/RP0/CPU0:router(config)#no int TenGigE 0/0/0/28
RP/0/RP0/CPU0:router(config)#int TenGigE 0/0/0/28 shut
RP/0/RP0/CPU0:router(config)#no int TenGigE 0/0/0/29
RP/0/RP0/CPU0:router(config)#int TenGigE 0/0/0/29 shut
RP/0/RP0/CPU0:router(config)#show commit changes diff
Mon Nov 7 15:07:22.990 CET
Building configuration...
!! IOS XR Configuration 7.2.2
-  interface GigabitEthernet0/0/0/5
-   shutdown
   !
+  interface TenGigE0/0/0/5
+   shutdown
   !
   interface TenGigE0/0/0/28
-   description Trunk (some description)
-   bundle id 2 mode active
   !
end
The first two commands are correctly represented by the first two chunks of the
diff: we remove GigabitEthernet0/0/0/5 and create TenGigE0/0/0/5. The next
two commands are also correctly represented by the last chunk of the diff.
TenGigE0/0/0/28 was already shut down, so it is expected that only
description and bundle id are removed. However, the diff command forgets
about the modifications for TenGigE0/0/0/29. The diff should include a chunk
similar to the last one.
RP/0/RP0/CPU0:router(config)#show run int TenGigE 0/0/0/29
Mon Nov 7 15:07:43.571 CET
interface TenGigE0/0/0/29
 description Trunk to other router
 bundle id 2 mode active
 shutdown
!
RP/0/RP0/CPU0:router(config)#show configuration merge int TenGigE 0/0/0/29
Mon Nov 7 15:07:53.584 CET
interface TenGigE0/0/0/29
 shutdown
!
How can the diff be correct for TenGigE0/0/0/28 but incorrect for
TenGigE0/0/0/29 while they have the same configuration? How can you trust the
diff command if it forgets part of the configuration?
Do you remember the last time you ran an Ansible playbook and discovered the
whole router ospf block disappeared without a warning? If you use automation
tools, you should check how the diff is assembled. Automation tools should build
it from the result of show running-config and show configuration merge. This
is what NAPALM does. This is not what the cisco.iosxr
collection for Ansible does.
The problem is not limited to the interface directives. You can get similar
issues for other parts of the configuration. For example, here is what we get
when removing inactive BGP neighbors on IOS XR 7.2.2:
The only correct chunk is for neighbor 217.29.66.112. All the others are missing
some of the removed lines. 217.29.67.15 is even missing all of them. How bad is
the code providing such a diff?
I could go on all day with examples such as these. Cisco TAC is happy to open a
case in DDTS, their bug tracker, to fix specific occurrences of this
bug.3 However, I fail to understand why the XR team is not just
providing the diff between show run and show configuration merge. The output
would always be correct!
IOS XR has several limitations. The most inconvenient one is the
inability to change the AS number in the router bgp directive. Such a
limitation is a great pain for both operations and automation.
This command could have been just show commit, as show commit
changes diff is the only valid command you can execute from this point.
Starting from IOS XR 7.5.1, show commit changes diff precise is also a
valid command. However, I have failed to find any documentation about it and
it seems to provide the same output as show commit changes diff. That's
how clunky IOS XR can be.
See CSCwa26251 as an example of a fix for something I reported
earlier this year. You need a valid Cisco support contract to be able to see
its content.
Here are the slides I presented for FRnOG #36 in September 2022.
They are about Akvorado, a tool to collect network flows and
visualize them. It was developed by Free. I haven't had time to
publish a blog post yet, but it should happen soon!
The meetup was recorded and is available on YouTube. Here is the part
relevant to my presentation, with subtitles:1
I got a few questions about how to get information from the higher
layers, like HTTP. As my use case for Akvorado was at the network
edge, my answers were mostly negative. However, as sFlow is
extensible, when collecting flows from Linux servers instead, you
could embed additional data and they could be exported as well.
I also got a question about doing aggregation in a single table.
ClickHouse can automatically aggregate data using TTL. The answer I gave
for not doing that was only partial. There is another reason: the retention
periods of the various tables may overlap. For example, the main table
keeps data for 15 days, but even within these 15 days, if I run a query on
a 12-hour window, it is faster to use the flows_1m0s aggregated
table, unless I request something about ports and IP addresses.
To generate the subtitles, I have used Amazon
Transcribe, the speech-to-text solution from Amazon AWS.
Unfortunately, there is no en-FR language available, which would
have been useful for my terrible accent. While the
subtitles were 100% accurate when the host, Robert Hodge from
Altinity, was speaking, the success rate on my talk was noticeably
lower. I had to rewrite almost all sentences. However, using
speech-to-text is still useful to get the timings, as it is also
something requiring a lot of work to do manually.
i3lock is a popular X11 screen lock utility. As far as
customization goes, it only allows one to set a background from a PNG
file. This limitation is part of the design of i3lock: its primary
goal is to keep the screen locked, something difficult enough with
X11. Each additional feature would increase the attack surface and
move away from this goal.1 Many are frustrated with these
limitations and extend i3lock through simple wrapper scripts or by
forking it.2 The first solution is usually safe, but the
second goes against the spirit of i3lock.
XSecureLock is a less-known alternative to i3lock. One of the
most attractive features of this locker is to delegate the screen
saver feature to another process. This process can be anything as long
as it can attach to an existing window provided by XSecureLock, which
won't pass any input to it. It will also put a black window below it
to ensure the screen stays locked in case of a crash.
XSecureLock is shipped with a few screen savers, notably one using
mpv to display photos or videos, like the Apple TV aerial
videos. I have written my own saver using Python and GTK.3
It shows a background image, a clock, and the current
weather.4
Custom screen saver for XSecureLock
I added two patches on top of XSecureLock:
Sleep before mapping screen saver window. This patch prevents a
flash of black when starting XSecureLock by waiting a bit for the
screen saver to be ready before displaying it. As I am also using a
custom dimmer fading to the expected background before locking,
the flash of black was quite annoying for me. I have good hope this
patch will be accepted upstream.
Do not mess with DPMS/blanking. This patch prevents
XSecureLock from blanking the screen. I think this is solely the
role of the X11 DPMS extension. This makes the code simpler. I am
unsure if this patch would be accepted by upstream.
XSecureLock also delegates the authentication window to another
process, but I was less comfortable providing a custom one as it is a
bit more security-sensitive. While basic, the shipped authentication
application is fine by me.
I think people should avoid modifying i3lock code and use
XSecureLock instead. I hope this post will help a bit.
This Reddit post enumerates many of these alternatives.
Using GTK makes it a bit difficult to use some low-level
features, like embedding an application into an existing window.
However, the high-level features are easier, notably drawing an
image and a text with a shadow.
Weather is retrieved by another script running on a
timer and written to a file. The screen saver watches this file
for updates.
If your workstation is using full-disk encryption, you may want to
jump directly to your desktop environment after entering the
passphrase to decrypt the disk. Many display managers like GDM and
LightDM have an autologin feature. However, only GDM can run
Xorg with standard user privileges.
Here is an alternative using startx and a systemd service:
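The unit below is a sketch reconstructed from the directives discussed next; the description, install section, helper paths, and the exact Xorg arguments are assumptions and may need adjusting to your setup:
[Unit]
Description=X11 session for bernat
After=graphical.target systemd-user-sessions.service

[Service]
User=bernat
WorkingDirectory=~

PAMName=login
Environment=XDG_SESSION_TYPE=x11
TTYPath=/dev/tty8
StandardInput=tty
UnsetEnvironment=TERM

UtmpIdentifier=tty8
UtmpMode=user

StandardOutput=journal
# Switch to the allocated TTY so logind grants access to local devices.
ExecStartPre=/usr/bin/chvt 8
ExecStart=/usr/bin/startx -- vt8 -keeptty -verbose 3 -logfile /dev/null
Restart=no

[Install]
WantedBy=graphical.target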
The unit starts after systemd-user-sessions.service, which enables
user logins after boot by removing the /run/nologin file.
With User=bernat, the unit is started with the identity of the
specified user. This implies that Xorg does not run with elevated
privileges.
With PAMName=login, the executed process is registered as a PAM
session for the login service, which includes pam_systemd.
This module registers the session to the systemd login manager.
To be effective, we also need to allocate a TTY with
TTYPath=/dev/tty8. When the TTY is active, the user is granted
additional access to local devices, notably display, sound, keyboard,
and mouse. These additional rights are needed to get Xorg working
rootless.1 The TERM environment variable is unset because
it would be set to linux by systemd as a result of attaching the
standard input to the TTY. Moreover, we inform pam_systemd we want
an X11 session with Environment=XDG_SESSION_TYPE=x11. Otherwise,
logind considers the session idle unless it receives input on
the TTY. Software relying on the idle hint from logind would be
ineffective.2
The UtmpIdentifier=tty8 and UtmpMode=user directives are just
a nice addition to register the session in /var/run/utmp.
The last step is to execute Xorg through startx. For logind to
allow Xorg to take control of the local devices, chvt 8 switches
to the allocated TTY.3 StandardOutput=journal, combined
with the -verbose 3 -logfile /dev/null flags for Xorg, puts the
logs from the X server into the journal instead of using a file.
While equal to the default value, the Restart=no directive
highlights we do not want this unit to be restarted. This ensures
the loginless session is only available on boot. By default,
startx runs xinitrc. If you want to run Kodi instead, add
/usr/bin/kodi-standalone between startx and --.
Drop this unit in /etc/systemd/system/x11-autologin.service and
enable it with systemctl enable x11-autologin.service. Xorg is now
running rootless and logging into the journal. After using it for a
few months, I didn't notice any regression compared to LightDM with
autologin.
For more information on how logind provides access to
devices, see this blog post. The method names do not match the
current implementation, but the concepts are still correct. Xorg
takes control of the session when the TTY is active.
Xorg could change the type of the session itself after
taking control of it, but it does not.
There is some code in Xorg to do that, but it is
executed too late and fails with: xf86OpenConsole: VT_ACTIVATE
failed: Operation not permitted.
The first step when automating a network is to build the source of
truth. A source of truth is a repository of data that provides the
intended state: the list of devices, the IP addresses, the network
protocols settings, the time servers, etc. A popular choice is
NetBox. Its documentation highlights its usage as a source of
truth:
NetBox intends to represent the desired state of a network versus
its operational state. As such, automated import of live network
state is strongly discouraged. All data created in NetBox should
first be vetted by a human to ensure its integrity. NetBox can
then be used to populate monitoring and provisioning systems with a
high degree of confidence.
When introducing Jerikan, a common piece of feedback we got was: "you
should use NetBox for this." Indeed, Jerikan's source of truth is
a bunch of YAML files versioned with Git.
Why Git?
If we look at how things are done with servers and services, in a
datacenter or in the cloud, we are likely to find users of
Terraform, a tool turning declarative configuration files into
infrastructure. Declarative configuration management tools like
Salt, Puppet,1 or Ansible take care of server
configuration. NixOS is an alternative: it combines package
management and configuration management with a functional language to
build virtual machines and containers. When using a Kubernetes
cluster, people use Kustomize or Helm, two other declarative
configuration management tools. Taken together, these tools implement
the infrastructure as code paradigm.
Infrastructure as code is an approach to infrastructure automation
based on practices from software development. It emphasizes
consistent, repeatable routines for provisioning and changing
systems and their configuration. You make changes to code, then use
automation to test and apply those changes to your systems.
Kief Morris, Infrastructure as Code, O'Reilly.
A version control system is a central tool for infrastructure as code.
The usual candidate is Git with a source code management system like
GitLab or GitHub. You get:
Traceability and visibility
Git keeps a log of all changes: what, who, why, and when. With a
bit of discipline, each change is explained and self-contained. It
becomes part of the infrastructure documentation. When the support
team complains about a degraded experience for some customers over
the last two months or so, you quickly discover this may be related
to a change to an incoming policy in New York.
Rolling back
If a change is defective, it can be reverted quickly, safely, and
without much effort, even if other changes happened in the meantime.
The policy change at the origin of the problem spanned over three
routers. Reverting this specific change and deploying the
configuration let you solve the situation until you find a better
fix.
Branching, reviewing, merging
When working on a new feature or refactoring some part of the
infrastructure, a team member creates a branch and works on their
change without interfering with the work of other members. Once the
branch is ready, a pull request is created and the change is ready
to be reviewed by the other team members before merging. You
discover the issue was related to diverting traffic through an IX
where one ISP was connected without enough capacity. You propose and
discuss a fix that includes a change of the schema and the templates
used to declare policies to be able to handle this case.
Continuous integration
For each change, automated tests are triggered. They can detect
problems and give more details on the effect of a change. Branches
can be deployed to a test infrastructure where regression tests are
executed. The results can be synthesized as a comment in the pull
request to help the review. You check your proposed change does not
modify the other existing policies.
Why not NetBox?
NetBox does not share these features. It is a database with a REST
and a GraphQL API. Traceability is limited: changes are not grouped
into a transaction and they are not documented. You cannot fork the
database. Usually, there is one staging database to test modifications
before applying them to the production database. It does not scale
well and reviews are difficult. Applying the same change to the
production database can be hazardous. Rolling back a change is
non-trivial.
Update (2021-11)
Nautobot, a fork of NetBox, will soon
address this point by using Dolt, an SQL database engine allowing
you to clone, branch, and merge, like a Git repository. Dolt is
compatible with MySQL clients. See Nautobots, Roll Back! for a
preview of this feature.
Moreover, NetBox is not usually the single source of truth. It
contains your hardware inventory, the IP addresses, and some topology
information. However, this is not the place you put authorized SSH
keys, syslog servers, or the BGP configuration. If you also use
Ansible, this information ends in its inventory. The source of
truth is therefore fragmented between several tools with different
workflows. Since NetBox 2.7, you can append additional data with
configuration contexts. This mitigates this point. The data is
arranged hierarchically but the hierarchy cannot be
customized.2 Nautobot can manage configuration contexts in
a Git repository, while still allowing the use of the API to fetch
them. You get some additional perks, thanks to Git, but the
remaining data is still in a database with a different lifecycle.
Lastly, the schema used by NetBox may not fit your needs and you
cannot tweak it. For example, you may have a rule to compute the IPv6
address from the IPv4 address for dual-stack interfaces. Such a
relationship cannot be easily expressed and enforced in NetBox. When
changing the IPv4 address, you may forget the IPv6 address. The source
of truth should only contain the IPv4 address but you also want the
IPv6 address in NetBox because this is your IPAM and you need it to
update your DNS entries.
Why not Git?
There are some limitations when putting your source of truth in Git:
If you want to expose a web interface to allow an external team to
request a change, it is more difficult to do it with Git than
with a database. Out-of-the-box, NetBox provides a nice web
interface and a permission system. You can also write your own web
interface and interact with NetBox through its API.
YAML files are more difficult to query in different ways. For
example, looking for a free IP address is complex if they are
scattered in multiple places.
In my opinion, in most cases, you are better off putting the source of
truth in Git instead of NetBox. You get a lot of perks by doing
that and you can still use NetBox as a read-only view, usable by
other tools. We do that with an Ansible module. In the remaining
cases, Git could still fit the bill. Read-only access control can be
done through submodules. Pull requests can restrict write access: a
bot can check the changes only modify allowed files before
auto-merging. This still requires some Git knowledge, but many teams
are now comfortable using Git, thanks to its ubiquity.
Wikimedia manages its infrastructure with Puppet. They
publish everything on GitHub. Creative Commons uses
Salt. They also publish everything on GitHub.
Thanks to them for doing that! I wish I could provide more
real-life examples.
Being able to customize the hierarchy is key to avoiding
repetition in the data. For example, if switches are paired
together, some data should be attached to them as a group and not
duplicated on each of them. Tags can be used to partially work
around this issue but you lose the hierarchical aspect.
scp -3 can copy files between two remote hosts through localhost.
This comes in handy when the two servers cannot communicate
directly or if they are unable to authenticate one to the
other.1 Unfortunately, rsync does not support such a feature.
Here is a trick to emulate the behavior of scp -3 with SSH tunnels.
When syncing with a remote host, rsync invokes ssh to spawn a
remote rsync --server process. It interacts with it through its
standard input and output. The idea is to recreate the same setup
using SSH tunnels and socat, a versatile tool to establish
bidirectional data transfers.
The first step is to connect to the source server and ask rsync the
command-line to spawn the remote rsync --server process. The -e
flag overrides the command to use to get a remote shell: instead of
ssh, we use echo.
$ ssh web04
$ rsync -e 'sh -c ">&2 echo $@" echo' -aLv /data/. web05:/data/.
web05 rsync --server -vlogDtpre.iLsfxCIvu . /data/.
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(228) [sender=3.2.3]
The second step is to connect to the destination server with local
port forwarding. When connecting to the local port 5000, the TCP
connection is forwarded through SSH to the remote port 5000 and
handled by socat. When receiving the connection, socat spawns the
rsync --server command we got at the previous step and connects its
standard input and output to the incoming TCP socket.
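Something along these lines, with 5000 being an arbitrary port and the rsync --server arguments being the ones captured at the previous step:
$ ssh -L 127.0.0.1:5000:127.0.0.1:5000 web05 \
    'socat tcp-listen:5000,bind=127.0.0.1,reuseaddr exec:"rsync --server -vlogDtpre.iLsfxCIvu . /data/."'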
The last step is to connect to the source with remote port
forwarding. socat is used in place of a regular SSH connection and
connects its standard input and output to a TCP socket connected to the
remote port 5000. Thanks to the remote port forwarding, SSH forwards
the data to the local port 5000. From there, it is relayed back to the
destination, as described in the previous step.
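From another terminal, this could look like the following; the destination host name in the rsync invocation is only cosmetic here, since the -e option replaces the transport with socat:
$ ssh -R 127.0.0.1:5000:127.0.0.1:5000 web04
$ rsync -e 'sh -c "socat stdio tcp-connect:127.0.0.1:5000"' -aLv /data/. web05:/data/.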
This little diagram may help understand how everything fits together:
How each process is connected together. Arrows labeled stdio are implemented as two pipes connecting the process to the left to the standard input and output of the process to the right. Don't be fooled by the apparent symmetry!
The rsync manual page prohibits the use of --server. Use this
hack at your own risk!
The options --server and --sender are used internally by rsync,
and should never be typed by a user under normal circumstances. Some
awareness of these options may be needed in certain scenarios, such
as when setting up a login that can only run an rsync command. For
instance, the support directory of the rsync distribution has an
example script named rrsync (for restricted rsync) that can be
used with a restricted ssh login.
Addendum
I was hoping to get something similar with a one-liner. But this does
not work!
$ socat \
>  exec:"ssh web04 rsync --server --sender -vlLogDtpre.iLsfxCIvu . /data/." \
>  exec:"ssh web05 rsync --server -vlogDtpre.iLsfxCIvu /data/. /data/."
over-long vstring received (511 > 255)
over-long vstring received (511 > 255)
rsync error: requested action not supported (code 4) at compat.c(387) [sender=3.2.3]
rsync error: requested action not supported (code 4) at compat.c(387) [Receiver=3.2.3]
socat[878291] E waitpid(): child 878292 exited with status 4
socat[878291] E waitpid(): child 878293 exited with status 4