This is my report from the Netfilter Workshop 2022. The event was held on 2022-10-20/2022-10-21 in Seville, and the venue
was the offices of
Zevenet. We started on Thursday with Pablo Neira (head of the project) giving a short
welcome / opening speech. The previous iteration of this event was in
virtual fashion in 2020, two years ago.
In the year 2021 we were unable to meet either in person or online.
This year, the number of participants was just eight people, and this allowed the setup to be a bit more informal.
We had kind of an un-conference style meeting, in which whoever had something prepared just went ahead and opened
a topic for debate.
In the opening speech, Pablo did a quick recap on the legal problems the Netfilter project had a few years ago, a
topic that was
settled for good some months ago, in January 2022. There were no news in this front,
which was definitely a good thing.
Moving into the technical topics, the workshop proper, Pablo started to comment on the recent developments to
instrument a way to perform inner matching for tunnel protocols. The current implementation supports VXLAN, IPIP,
GRE and GENEVE. Using nftables you can match packet headers that are encapsulated inside these protocols.
He mentioned the design and the goals, that was to have a kernel space setup that allows adding more protocols by just
patching userspace. In that sense, more tunnel protocols will be supported soon, such as IP6IP, UDP, and ESP.
Pablo requested our opinion on whether if nftables should generate the matching dependencies. For example,
if a given tunnel is UDP-based, a dependency match should be there otherwise the rule won t work as expected. The
agreement was to assist the user in the setup when possible and if not, print clear error messages.
By the way, this inner thing is pure stateless packet filtering. Doing inner-conntracking is an open topic
that will be worked on in the future.
Pablo continued with the next topic: nftables automatic ruleset optimizations. The times of linear ruleset evaluation
are over, but some people have a hard time understanding / creating rulesets that leverage maps, sets, and
concatenations. This is where the ruleset optimizations kick in: it can transform a given ruleset to be more optimal
by using such advanced data structures. This is purely about optimizing the ruleset, not about validating
the usefulness of it, which could be another interesting project.
There were a couple of problems mentioned, however. The ruleset optimizer can be slow, O(n!) in worst case. And the
user needs to use nested syntax. More improvements to come in the future.
Next was Stefano Brivio s turn (Red Hat engineer). He had been involved lately in a couple of migrations to
nftables, in particular libvirt and KubeVirt. We were pointed to
https://libvirt.org/firewall.html, and Stefano walked us
through the 3 or 4 different virtual networks that libvirt can create. He evaluated some options to generate efficient
rulesets in nftables to instrument such networks, and commented on a couple of ideas: having a null
matcher in nftables set expression. Or perhaps having kind of subsets, something similar to a view in a SQL
database. The room spent quite a bit of time debating how the nft_lookup API could be extended to support such new
search operations.
We also discussed if having intermediate facilities such as
firewalld could provide the abstraction levels that
could make developers more comfortable. Using firewalld also may have the advantage that coordination between different
system components writing ruleset to nftables is handled by firewalld itself and developers are freed of the
responsibility of doing it right.
Next was Fernando F. Mancera (Red Hat engineer). He wanted to improve error reporting when deleting table/chain/rules
with nftables. In general, there are some inconsistencies on how tables can be deleted (or flushed). And there seems
to be no correct way to make a single table go away with all its content in a single command.
The room agreed in that the commands
destroy table
and
delete table
should be defined consistently, with
the following meanings:
- destroy: nuke the table, don t fail if it doesn t exist
- delete: delete the table, but the command will fail if it doesn t exist
This topic diverted into another: how to reload/replace a ruleset but keep stateful information (such as counters).
Next was Phil Sutter (Netfilter coreteam member and Red Hat engineer). He was interested in discussing options to
make iptables-nft backward compatible. The use case he brought was simple: What happens if a container running
iptables 1.8.7 creates a ruleset with features not supported by 1.8.6. A later container running 1.8.6 may fail to
operate.
Phil s first approach was to attach additional metadata into rules to assist older iptables-nft in decoding and
printing the ruleset. But in general, there are no obvious or easy solutions to this problem. Some people are
mixing different tooling version, and there is no way all cases can be predicted/covered. iptables-nft already
refuses to work in some of the most basic failure scenarios.
An other way to approach the issue could be to introduce some kind of support to print raw expressions in
iptables-nft, like
-m nft xyz
. Which feels ugly, but may work. We also explored playing with the semantics of
release version numbers. And another idea: store strings in the nft rule userdata area with the equivalent
matching information for older iptables-nft.
In fact, what Phil may have been looking for is not backwards but forward compatibility. Phil was undecided which path
to follow, but perhaps the most common-sense approach is to fall back to a major release version bump (2.x.y)
and declaring compatibility breakage with older iptables 1.x.y.
That was pretty much it for the first day. We had dinner together and went to sleep for the next day.
The second day was opened by Florian Westphal (Netfilter coreteam member and Red Hat engineer). Florian has been
trying to improve nftables performance in kernels with RETPOLINE mitigations enabled. He commented that several
workarounds have been collected over the years to avoid the performance penalty of such mitigations.
The basic strategy is to avoid function indirect calls in the kernel.
Florian also described how BPF programs work around this more effectively. And actually, Florian tried translating
nf_hook_slow()
to BPF. Some preliminary benchmarks results were showed, with about 2% performance improvement in
MB/s and PPS. The
flowtable infrastructure is specially benefited from this approach. The software
flowtable infrastructure already offers a 5x performance improvement with regards the classic forwarding path, and the
change being researched by Florian would be an addition on top of that.
We then moved into discussing
the meeting Florian had with Alexei in Zurich. My personal opinion was that
Netfilter offers interesting user-facing interfaces and semantics that BPF does not. Whereas BPF may be more performant
in certain scenarios. The idea of both things going hand in hand may feel natural for some people. Others also
shared my view, but no particular agreement was reached in this topic. Florian will probably continue exploring options
on that front.
The next topic was opened by Fernando. He wanted to discuss Netfilter involvement in Google Summer of Code and Outreachy.
Pablo had some personal stuff going on last year that prevented him from engaging in such projects. After all, GSoC is
not fundamental or a priority for Netfilter. Also, Pablo mentioned the lack of support from others in the project for
mentoring activities. There was no particular decision made here. Netfilter may be present again in such initiatives
in the future, perhaps under the umbrella of other organizations.
Again, Fernando proposed the next topic: nftables JSON support. Fernando shared his plan of going over all features
and introduce programmatic tests from them. He also mentioned that the nftables wiki was incomplete and couldn t be
used as a reference for missing tests. Phil suggested running the nftables python test-suite in JSON mode, which
should complain about missing features. The py test suite should cover pretty much all statements and variations on
how the nftables expression are invoked.
Next, Phil commented on nftables xtables support. This is, supporting legacy xtables extensions in nftables.
The most prominent problem was that some translations had some corner cases that resulted in a listed ruleset that
couldn t be fed back into the kernel. Also, iptables-to-nftables translations can be sloppy, and the resulting
rule won t work in some cases. In general,
nft list ruleset nft -f
may fail in rulesets created by iptables-nft
and there is no trivial way to solve it.
Phil also commented on potential iptables-tests.py speed-ups. Running the test suite may take very long time
depending on the hardware. Phil will try to re-architect it, so it runs faster. Some alternatives had been
explored, including collecting all rules into a single iptables-restore run, instead of hundreds of individual
iptables calls.
Next topic was about documentation on the
nftables wiki. Phil is interested in having all nftables
code-flows documented, and presented some improvements in that front. We are trying to organize all
developer-oriented docs on a mediawiki portal, but the extension was not active yet. Since I worked at the
Wikimedia Foundation, all the room stared at me, so at the end I kind of committed to exploring and enabling the
mediawiki portal extension. Note to self: is this perhaps
https://www.mediawiki.org/wiki/Portals ?
Next presentation was by Pablo. He had a list of assorted topics for quick review and comment.
- We discussed nftables accept/drop semantics. People that gets two or more rulesets from different software are
requesting additional semantics here. A typical case is fail2ban integration. One option is quick accept (no further
evaluation if accepted) and the other is lazy drop (don t actually drop the packet, but delay decision until the
whole ruleset has been evaluated). There was no clear way to move forward with this.
- A debate on nft userspace memory usage followed. Some people are running nftables on low end devices with very
little memory (such as 128 MB). Pablo was exploring a potential solution: introducing
struct constant_expr
, which
can reduce 12.5% mem usage.
- Next we talked about repository licensing (or better, relicensing to GPLv2+). Pablo went over a list of files in the
nftables tree which had diverging licenses. All people in the room agreed on this relicensing effort. A mention to
the libreadline situation was made.
- Another quick topic: a bogus EEXIST in nft_rbtree. Pablo & Stefano to work in a patch.
- Next one was conntrack early drop in flowtable. Pablo is studying use cases for some legitimate UDP unidirectional
flows (like RTP traffic).
- Pablo and Stefano discussed pipapo not being atomic on updates. Stefano already looked into it, and one of the ideas
was to introduce a new commit API for sets.
- The last of the quick topics was an idea to have a global table in nftables. Or some global items, like sets. Folk
in the community keep asking for this. Some ideas were discussed, like perhaps adding a family agnostic family. But
then there would be a challenge: nftables would need to generate byte code that works in any of the hooks.
There was no immediate way of addressing this. The idea of having templated tables/sets circulated again as a way
of reusing data across namespaces/families.
Following this, a new topic was introduced by Stefano. He wanted to talk about nft_set_pipapo, documentation, what
to do next, etc. He did a nice explanation of how the pipapo algorithm works for element inserts, lookups, and
deletion. The source code
is pretty well documented, by the way. He showed performance measurements of
different data types being stored in the structure. After some lengthly debate on how to introduce changes without
breaking usage for users, he declared some action items: writing more docs, addressing problems with non-atomic
set reloads and a potential rework of nft_rbtree.
After that, the next topic was kubernetes & netfilter , also by Stefano. Actually, this topic was very similar
to what we already discussed regarding libvirt. Developers want to reduce packet matching effort, but also often
don t leverage nftables most performant features, like sets, maps or concatenations.
Some Red Hat developers are already working on replacing everything with native nftables & firewalld integrations.
But some rules generators are very bad. Kubernetes (kube-proxy) is a known case. Developers simply won t learn how
to code better ruleset generators. There was a good question floating around: What are people missing on first
encounter with nftables?
The Netfilter project doesn t have a training or marketing department or something like that. We cannot
force-educate developers on how to use nftables in the right way. Perhaps we need to create a set of dedicated
guidelines, or best practices, in the wiki for app developers that rely on nftables. Jozsef Kadlecsik
(Netfilter coreteam) supported this idea, and suggested going beyond: such documents should be written exclusively
from the nftables point of view: stop approaching the docs as a comparison to the old iptables semantics.
Related to that last topic, next was Laura Garc a (Zevenet engineer, and venue host). She shared the same information
as she presented in the Kubernetes network SIG in August 2020. She walked us through
nftlb and
kube-nftlb, a proof-of-concept replacement for kube-proxy based on nftlb that can outperform it.
For whatever reason, kube-nftlb wasn t adopted by the upstream kubernetes community.
She also covered latest changes to nftlb and some missing features, such as integration with nftables egress.
nftlb is being extended to be a full proxy service and a more robust overall solution for service abstractions.
In a nutshell, nftlb uses a templated ruleset and only adds elements to sets, which is exactly the right usage
of the nftables framework. Some other projects should follow its example. The performance numbers are impressive,
and from the early days it was clear that it was
outperforming classical LVS-DSR by 10x.
I used this opportunity to bring a topic that I wanted to discuss. I ve seen some SRE coworkers talking about
katran as a replacement for traditional LVS setups. This software is a XDP/BPF based solution for load
balancing. I was puzzled about what this software had to offer versus, for example, nftlb or any other
nftables-based solutions. I commented on the highlighs of katran, and we discussed the nftables equivalents.
nftlb is a simple daemon which does everything using a JSON-enabled REST API. It is already
packaged into Debian, ready to use, whereas katran feels more like a collection of steps that you
need to run in a certain order to get it working. All the hashing, caching, HA without state sharing, and backend
weight selection features of katran are already present in nftlb.
To work on a pure L3/ToR datacenter network setting, katran uses IPIP encapsulation. They can t just mangle the
MAC address as in traditional DSR because the backend server is on a different L3 domain. It turns out nftables
has a
nft_tunnel
expression that can do this encapsulation for complete feature parity. It is only available in
the kernel, but it can be made available easily on the userspace utility too.
Also, we discussed some limitations of katran, for example, inability to handle IP fragmentation, IP options, and
potentially others not documented anywhere. This seems to be common with XDP/BPF programs, because handling all
possible network scenarios would over-complicate the BPF programs, and at that point you are probably better off by
using the normal Linux network stack and nftables.
In summary, we agreed that nftlb can pretty much offer the same as katran, in a more flexible way.
Finally, after many interesting debates over two days, the workshop ended. We all agreed on the need for extending
it to 3 days next time, since 2 days feel too intense and too short for all the topics worth discussing.
That s all on my side! I really enjoyed this Netfilter workshop round.