Antoine Beaupré: New approaches to network fast paths
With the speed of network hardware now reaching 100 Gbps and distributed
denial-of-service (DDoS) attacks reaching the Tbps range, Linux kernel
developers are scrambling to optimize key network paths in the kernel to
keep up. Many efforts are actually geared toward getting traffic out
of the costly Linux TCP stack. We have already
covered the XDP (eXpress Data Path)
patch set, but two new ideas surfaced during the Netconf and Netdev
conferences held in Toronto and Montreal in early April 2017. One is a
patch set called af_packet, which aims at extracting raw packets from
the kernel as fast as possible; the other is the idea of implementing
in-kernel layer-7 proxying. There are also user-space network stacks
like Netmap,
DPDK, or Snabb (which we previously
covered).
This article aims to clarify what all those components do and to
provide a short status update for the tools we have already covered. We
will focus on in-kernel solutions, because user-space tools have
a fundamental limitation: if they need to re-inject packets onto the
network, they must pay the expensive cost of crossing the kernel
barrier again, which effectively bounds their performance. We will
start from the lowest part of the stack, the af_packet patch set, and
work our way up to layer-7 in-kernel proxying.
af_packet v4
John Fastabend presented a new version of a patch set that was first
published in
January
regarding the af_packet protocol family, which is currently used by
tcpdump
to extract packets from network interfaces. The goal of this
change is to allow zero-copy transfers between user-space applications
and the NIC (network interface card) transmit and receive ring buffers.
Such optimizations are useful for telecommunications companies, which
may use it for deep packet
inspection or
running exotic protocols in user space. Another use case is running a
high-performance intrusion detection
system that
needs to watch large traffic streams in real time to catch certain types
of attacks.
Fastabend presented his work during the Netdev network-performance
workshop, but also brought the patch set up for discussion during
Netconf. There, he said he could achieve line-rate extraction (and
injection) of packets, with packet rates as high as 30 Mpps. This
performance gain is possible because user-space pages are directly
DMA-mapped to the NIC, which is also a security concern. The other
downside of this approach is that a complete pair of ring buffers needs
to be dedicated for this purpose; whereas before packets were copied to
user space, now they are memory-mapped, so the user-space side needs to
process those packets quickly, otherwise they are simply dropped.
Furthermore, it's an "all or nothing" approach; while NIC-level
classifiers could be used to steer part of the traffic to a specific
queue, once traffic hits that queue, it is only accessible through the
af_packet interface and not the rest of the regular stack. If done
correctly, however, this could actually improve the way user-space
stacks access those packets, providing projects like DPDK a safer way to
share pages with the NIC, because it is well defined and
kernel-controlled. According to Jesper Dangaard Brouer (during review of
this article):
This proposal will be a safer way to share raw packet data between
user space and kernel space than what DPDK is doing, [by providing]
a cleaner separation as we keep driver code in the kernel where it
belongs.
During the Netdev network-performance workshop, Fastabend asked if there
was a better data structure to use for such a purpose. The goal here is
to provide a consistent interface to user space regardless of the driver
or hardware used to extract packets from the wire. af_packet currently
defines its own packet format that abstracts away the NIC-specific
details, but there are other possible formats. For example, someone in
the audience proposed the virtio packet format. Alexei Starovoitov
rejected this idea because af_packet is a kernel-specific facility
while virtio has its own separate specification
with its own requirements.
The next step for af_packet is the posting of the new "v4" patch set,
although David Miller warned that this wouldn't get merged until proper XDP
support lands in the Intel drivers. The concern, of course, is that the
kernel would have multiple incomplete bypass solutions available at
once. Hopefully, Fastabend will present the (by then) merged patch set
at the next Netdev conference in November.
XDP updates
Higher up in the networking stack sits XDP. The af_packet feature
differs from XDP in that it does not perform any sort of analysis or
mangling of packets; its objective is purely to get the data into and
out of the kernel as fast as possible, completely bypassing the regular
kernel networking stack. XDP also sits before the networking stack
except that, according to Brouer, it is "focused on cooperating with
the existing network stack infrastructure, and on use-cases where the
packet doesn't necessarily need to leave kernel space (like routing and
bridging, or skipping complex code-paths)."
XDP has evolved quite a bit since we last covered it in LWN. It seems
that most of the controversy surrounding the introduction of XDP in the
Linux kernel has died down in public discussions, under the leadership
of David Miller, who heralded XDP as the right solution for a long-term
architecture in the kernel. He presented XDP as a fast, flexible, and
safe solution.
Indeed, one of the controversies surrounding XDP was the question of the
inherent security challenges with introducing user-provided programs
directly into the Linux kernel to mangle packets at such a low level.
Miller argued that whatever protections are expected for user-space
programs also apply to XDP programs, comparing the virtual memory
protections to the eBPF (extended BPF) verifier applied to XDP programs.
Those programs are actually eBPF programs with an interesting set of
restrictions:
- they have a limited size
- they cannot jump backward (and thus cannot loop), so they execute in
predictable time
- they do only static allocation, so they are also limited in memory
XDP is not a one-size-fits-all solution: netfilter, the TC traffic
shaper, and other normal Linux utilities still have their place. There
is, however, a clear use case for a solution like XDP in the kernel.
For example, Facebook and Cloudflare have both started testing XDP and,
in Facebook's case, deploying XDP in production. Martin Kafai Lau, from
Facebook, presented the tool set the company is using to construct a
DDoS-resilience solution and a level-4 load balancer (L4LB), which got a
ten-times performance improvement over the previous
IPVS-based
solution. Facebook rolled out its own user-space solution called
"Droplet" to detect hostile traffic and deploy blocking rules in the
form of eBPF programs loaded in XDP. Lau demonstrated the way Facebook
deploys a three-part chained eBPF program: the first part allows
debugging and dumping of packets, the second is Droplet itself, which
drops undesirable traffic, and the last segment is the load balancer,
which mangles the packets to tweak their destination according to
internal rules. Droplet can drop DDoS attacks at line rate while keeping
the architecture flexible, which were two key design requirements.
Gilberto Bertin, from Cloudflare, presented a similar approach:
Cloudflare has a tool that processes sFlow data generated from
iptables in order to generate cBPF (classic BPF) mitigation rules that
are then deployed on edge routers. Those rules are created with a tool
called bpfgen, part of Cloudflare's BSD-licensed bpftools suite. For
example, it could create a cBPF bytecode blob that would match DNS
queries to any example.com domain with something like:
bpfgen dns *.example.com
Originally, Cloudflare would deploy those rules to plain iptables
firewalls with the xt_bpf
module, but this led to performance issues.
It then deployed a proprietary user-space solution based on
Solarflare hardware, but this has the performance limitations of
user-space applications: getting packets back onto the wire involves
the cost of re-injecting them into the kernel. This is why Cloudflare
is experimenting with XDP, which was partly developed in response to
the company's problems, to deploy those BPF programs.
A concern that Bertin identified was the lack of visibility into dropped
packets. Cloudflare currently samples some of the dropped traffic to
analyze attacks; this is not currently possible with XDP unless you pass
the packets down the stack, which is expensive. Miller agreed that the
lack of monitoring for XDP programs is a large issue that needs to be
resolved, and suggested creating a way to mark packets for extraction to
allow analysis. Cloudflare is currently in a testing phase with XDP and
it is unclear if its whole XDP tool chain will be publicly available.
While those two companies are starting to use XDP as-is, there is more
work needed to complete the XDP project. As mentioned above and in our
previous coverage, massive statistics
extraction is still limited in the Linux kernel and introspection is
difficult. Furthermore, while the existing actions (XDP_DROP and
XDP_TX; see the documentation for more information) are well
implemented and used, another action may be introduced, called
XDP_REDIRECT, which would allow redirecting packets to different
network interfaces. Such an action could also be
used to accelerate bridges as packets could be "switched" based on the
MAC address table. XDP also requires network driver support, which is
currently limited. For example, the Intel drivers still do not support
XDP, although that should come pretty soon.
Miller, in his Netdev keynote, focused on XDP and presented it as the
standard solution that is safe, fast, and usable. He identified the next
steps of XDP development to be the addition of debugging mechanisms,
better sampling tools for statistics and analysis, and user-space
consistency. Miller foresees a future for XDP similar to the
popularization of the Arduino chips: a simple set of tools that anyone,
not just developers, can use. He gave the example of an Arduino tutorial
that he followed where he could just look up a part number and get
easy-to-use instructions on how to program it. Similar components should
be available for XDP. For this purpose, the conference saw the creation
of a new mailing list called xdp-newbies, where
people can learn how to create XDP build environments and how to write
XDP programs.
In-kernel layer-7 proxying
The third approach that struck me as innovative is the idea of doing
layer-7 (application) proxying directly in the kernel. This comes from
the idea that, traditionally, we build firewalls to segregate traffic
and apply controls, but as most services move to HTTP, those policies
become ineffective.
Thomas Graf presented this idea during Netconf using a Star Wars
allegory: what if the Death Star were a server with an API? You would
have endpoints like /dock or /comms that would allow you to dock a
ship or communicate with the Death Star. Those API endpoints should
obviously be public, but then there is this /exhaust-port endpoint
that should never be publicly available. In order for a firewall to
protect such a system, it must be able to inspect traffic at a higher
level than the traditional address-port pairs. Graf presented a design
where the kernel would create an in-kernel socket that would negotiate
TCP connections on behalf of user space and then be able to apply
arbitrary eBPF rules in the kernel.
In this scenario, instead of doing the traditional transfer from
Netfilter's TPROXY to user space, the kernel directly decapsulates the
HTTP traffic and passes it to BPF rules that can make decisions without
doing expensive context switches or memory copies in the case of simply
wanting to refuse traffic (e.g. issue an HTTP 403 error). This, of
course, requires the inclusion of kTLS to process HTTPS connections.
HTTP/2 support may also prove problematic, as it multiplexes
connections and is harder to decapsulate. This design was described as
a "pure pre-accept() hook". Starovoitov also compared the design to
the kernel connection multiplexer (KCM).
Tom Herbert, KCM's author, agreed that it could be extended to support
this, but would require some extensions in user space to provide an
interface between regular socket-based applications and the KCM layer.
In any case, if the application does TLS (and lots of them do), kTLS
gets tricky because it breaks the end-to-end nature of TLS, in effect
becoming a man in the middle between the client and the application.
Eric Dumazet argued that HAProxy already does things like this: it
uses splice() to avoid copying too much data around, but it still does
a context switch to hand over processing to user space, something that
could be fixed in the general case.
Another similar project that was
presented at
Netdev is the Tempesta
firewall and reverse-proxy. The speaker, Alex Krizhanovsky, explained
that the Tempesta developers took one person-month to port the mbed
TLS stack to the Linux kernel to allow an in-kernel TLS handshake.
Tempesta also implements rate limiting,
cookies, and JavaScript challenges to mitigate DDoS attacks. The
argument behind the project is that "it's easier to move TLS to the
kernel than it is to move the TCP/IP stack to user space". Graf
explained that he is familiar with Krizhanovsky's work and he is hoping
to collaborate. In effect, the design Graf is working on would serve as
a foundation for Krizhanovsky's in-kernel HTTP server (kHTTP). In a
private email, Graf explained that:
The main differences in the implementation are currently that we
foresee to use BPF for protocol parsing to avoid having to implement
every single application protocol natively in the kernel. Tempesta
likely sees this less of an issue as they are probably only targeting
HTTP/1.1 and HTTP/2 and to some [extent] JavaScript.
Neither project is really ready for production yet. There didn't seem to
be any significant pushback from key network developers against the
idea, which surprised some people, so it is likely we will see more and
more layer-7 intelligence move into the kernel sooner rather than later.
Conclusion
All of this work aims at replacing a rag-tag bunch of proprietary
solutions that recently came up to bypass the Linux kernel TCP/IP stack
and improve performance for firewalls, proxies, and other key edge
network elements. The idea is that, unless the kernel improves its
performance, or at least provides a way to bypass its more complex code
paths, people will work around it. With this set of solutions in place,
engineers will now be able to use standard APIs to hook high-performance
systems into the Linux kernel.
The author would like to thank the Netdev and Netconf organizers for
travel assistance, Thomas Graf for a review of the in-kernel
proxying section of this article, and Jesper Dangaard Brouer for
review of the af_packet and XDP sections.
Note: this article first appeared in
the Linux Weekly News.