Search Results: "tina"

9 October 2017

Gunnar Wolf: Achievement unlocked - Made with Creative Commons translated to Spanish! (Thanks, @xattack!)

I am very, very, very happy to report this And I cannot believe we have achieved this so fast: Back in June, I announced I'd start working on the translation of the Made with Creative Commons book into Spanish. Over the following few weeks, I worked out the most viable infrastructure, gathered input and commitments for help from a couple of friends, submitted my project for inclusion in the Hosted Weblate translations site (and got it approved!) Then, we quietly and slowly started working. Then, as it usually happens in late August, early September... The rush of the semester caught me in full, and I left this translation project for later For the next semester, perhaps... Today, I received a mail that surprised me. That stunned me. 99% of translated strings! Of course, it does not look as neat as "100%" would, but there are several strings not to be translated. So, yay for collaborative work! Oh, and FWIW Thanks to everybody who helped. And really, really, really, hats off to Luis Enrique Amaya, a friend whom I see way less than I should. A LIDSOL graduate, and a nice guy all around. Why to him specially? Well... This has several wrinkles to iron out, but, by number of translated lines: ...Need I say more? Luis, I hope you enjoyed reading the book :-] There is still a lot of work to do, and I'm asking the rest of the team some days so I can get my act together. From the mail I just sent, I need to:
  1. Review the Pandoc conversion process, to get the strings formatted again into a book; I had got this working somewhere in the process, but last I checked it broke. I expect this not to be too much of a hurdle, and it will help all other translations.
  2. Start the editorial process at my Institute. Once the book builds, I'll have to start again the stylistic correction process so the Institute agrees to print it out under its seal. This time, we have the hurdle that our correctors will probably hate us due to part of the work being done before we had actually agreed on some important Spanish language issues... which are different between Mexico, Argentina and Costa Rica (where translators are from). Anyway This sets the mood for a great start of the week. Yay!
AttachmentSize
Screenshot from 2017-10-08 20-55-30.png103.1 KB

8 October 2017

Daniel Pocock: A step change in managing your calendar, without social media

Have you been to an event recently involving free software or a related topic? How did you find it? Are you organizing an event and don't want to fall into the trap of using Facebook or Meetup or other services that compete for a share of your community's attention? Are you keen to find events in foreign destinations related to your interest areas to coincide with other travel intentions? Have you been concerned when your GSoC or Outreachy interns lost a week of their project going through the bureaucracy to get a visa for your community's event? Would you like to make it easier for them to find the best events in the countries that welcome and respect visitors? In many recent discussions about free software activism, people have struggled to break out of the illusion that social media is the way to cultivate new contacts. Wouldn't it be great to make more meaningful contacts by attending more a more diverse range of events rather than losing time on social media? Making it happen There are already a number of tools (for example, Drupal plugins and Wordpress plugins) for promoting your events on the web and in iCalendar format. There are also a number of sites like Agenda du Libre and GriCal who aggregate events from multiple communities where people can browse them. How can we take these concepts further and make a convenient, compelling and global solution? Can we harvest event data from a wide range of sources and compile it into a large database using something like PostgreSQL or a NoSQL solution or even a distributed solution like OpenDHT? Can we use big data techniques to mine these datasources and help match people to events without compromising on privacy? Why not build an automated iCalendar "to-do" list of deadlines for events you want to be reminded about, so you never miss the deadlines for travel sponsorship or submitting a talk proposal? I've started documenting an architecture for this on the Debian wiki and proposed it as an Outreachy project. It will also be offered as part of GSoC in 2018. Ways to get involved If you would like to help this project, please consider introducing yourself on the debian-outreach mailing list and helping to mentor or refer interns for the project. You can also help contribute ideas for the specification through the mailing list or wiki. Mini DebConf Prishtina 2017 This weekend I've been at the MiniDebConf in Prishtina, Kosovo. It has been hosted by the amazing Prishtina hackerspace community. Watch out for future events in Prishtina, the pizzas are huge, but that didn't stop them disappearing before we finished the photos:

13 September 2017

Vincent Bernat: Route-based IPsec VPN on Linux with strongSwan

A common way to establish an IPsec tunnel on Linux is to use an IKE daemon, like the one from the strongSwan project, with a minimal configuration1:
conn V2-1
  left        = 2001:db8:1::1
  leftsubnet  = 2001:db8:a1::/64
  right       = 2001:db8:2::1
  rightsubnet = 2001:db8:a2::/64
  authby      = psk
  auto        = route
The same configuration can be used on both sides. Each side will figure out if it is left or right . The IPsec site-to-site tunnel endpoints are 2001:db8: 1::1 and 2001:db8: 2::1. The protected subnets are 2001:db8: a1::/64 and 2001:db8: a2::/64. As a result, strongSwan configures the following policies in the kernel:
$ ip xfrm policy
src 2001:db8:a1::/64 dst 2001:db8:a2::/64
        dir out priority 399999 ptype main
        tmpl src 2001:db8:1::1 dst 2001:db8:2::1
                proto esp reqid 4 mode tunnel
src 2001:db8:a2::/64 dst 2001:db8:a1::/64
        dir fwd priority 399999 ptype main
        tmpl src 2001:db8:2::1 dst 2001:db8:1::1
                proto esp reqid 4 mode tunnel
src 2001:db8:a2::/64 dst 2001:db8:a1::/64
        dir in priority 399999 ptype main
        tmpl src 2001:db8:2::1 dst 2001:db8:1::1
                proto esp reqid 4 mode tunnel
[ ]
This kind of IPsec tunnel is a policy-based VPN: encapsulation and decapsulation are governed by these policies. Each of them contains the following elements: When a matching policy is found, the kernel will look for a corresponding security association (using reqid and the endpoint source and destination addresses):
$ ip xfrm state
src 2001:db8:1::1 dst 2001:db8:2::1
        proto esp spi 0xc1890b6e reqid 4 mode tunnel
        replay-window 0 flag af-unspec
        auth-trunc hmac(sha256) 0x5b68[ ]8ba2904 128
        enc cbc(aes) 0x8e0e377ad8fd91e8553648340ff0fa06
        anti-replay context: seq 0x0, oseq 0x0, bitmap 0x00000000
[ ]
If no security association is found, the packet is put on hold and the IKE daemon is asked to negotiate an appropriate one. Otherwise, the packet is encapsulated. The receiving end identifies the appropriate security association using the SPI in the header. Two security associations are needed to establish a bidirectionnal tunnel:
$ tcpdump -pni eth0 -c2 -s0 esp
13:07:30.871150 IP6 2001:db8:1::1 > 2001:db8:2::1: ESP(spi=0xc1890b6e,seq=0x222)
13:07:30.872297 IP6 2001:db8:2::1 > 2001:db8:1::1: ESP(spi=0xcf2426b6,seq=0x204)
All IPsec implementations are compatible with policy-based VPNs. However, some configurations are difficult to implement. For example, consider the following proposition for redundant site-to-site VPNs: Redundant VPNs between 3 sites A possible configuration between V1-1 and V2-1 could be:
conn V1-1-to-V2-1
  left        = 2001:db8:1::1
  leftsubnet  = 2001:db8:a1::/64,2001:db8:a6::cc:1/128,2001:db8:a6::cc:5/128
  right       = 2001:db8:2::1
  rightsubnet = 2001:db8:a2::/64,2001:db8:a6::/64,2001:db8:a8::/64
  authby      = psk
  keyexchange = ikev2
  auto        = route
Each time a subnet is modified on one site, the configurations need to be updated on all sites. Moreover, overlapping subnets (2001:db8: a6::/64 on one side and 2001:db8: a6::cc:1/128 at the other) can also be problematic. The alternative is to use route-based VPNs: any packet traversing a pseudo-interface will be encapsulated using a security policy bound to the interface. This brings two features:
  1. Routing daemons can be used to distribute routes to be protected by the VPN. This decreases the administrative burden when many subnets are present on each side.
  2. Encapsulation and decapsulation can be executed in a different routing instance or namespace. This enables a clean separation between a private routing instance (where VPN users are) and a public routing instance (where VPN endpoints are).

Route-based VPN on Juniper Before looking at how to achieve that on Linux, let s have a look at the way it works with a JunOS-based platform (like a Juniper vSRX). This platform as long-standing history of supporting route-based VPNs (a feature already present in the Netscreen ISG platform). Let s assume we want to configure the IPsec VPN from V3-2 to V1-1. First, we need to configure the tunnel interface and bind it to the private routing instance containing only internal routes (with IPv4, they would have been RFC 1918 routes):
interfaces  
    st0  
        unit 1  
            family inet6  
                address 2001:db8:ff::7/127;
             
         
     
 
routing-instances  
    private  
        instance-type virtual-router;
        interface st0.1;
     
 
The second step is to configure the VPN:
security  
    /* Phase 1 configuration */
    ike  
        proposal IKE-P1  
            authentication-method pre-shared-keys;
            dh-group group20;
            encryption-algorithm aes-256-gcm;
         
        policy IKE-V1-1  
            mode main;
            proposals IKE-P1;
            pre-shared-key ascii-text "d8bdRxaY22oH1j89Z2nATeYyrXfP9ga6xC5mi0RG1uc";
         
        gateway GW-V1-1  
            ike-policy IKE-V1-1;
            address 2001:db8:1::1;
            external-interface lo0.1;
            general-ikeid;
            version v2-only;
         
     
    /* Phase 2 configuration */
    ipsec  
        proposal ESP-P2  
            protocol esp;
            encryption-algorithm aes-256-gcm;
         
        policy IPSEC-V1-1  
            perfect-forward-secrecy keys group20;
            proposals ESP-P2;
         
        vpn VPN-V1-1  
            bind-interface st0.1;
            df-bit copy;
            ike  
                gateway GW-V1-1;
                ipsec-policy IPSEC-V1-1;
             
            establish-tunnels on-traffic;
         
     
 
We get a route-based VPN because we bind the st0.1 interface to the VPN-V1-1 VPN. Once the VPN is up, any packet entering st0.1 will be encapsulated and sent to the 2001:db8: 1::1 endpoint. The last step is to configure BGP in the private routing instance to exchange routes with the remote site:
routing-instances  
    private  
        routing-options  
            router-id 1.0.3.2;
            maximum-paths 16;
         
        protocols  
            bgp  
                preference 140;
                log-updown;
                group v4-VPN  
                    type external;
                    local-as 65003;
                    hold-time 6;
                    neighbor 2001:db8:ff::6 peer-as 65001;
                    multipath;
                    export [ NEXT-HOP-SELF OUR-ROUTES NOTHING ];
                 
             
         
     
 
The export filter OUR-ROUTES needs to select the routes to be advertised to the other peers. For example:
policy-options  
    policy-statement OUR-ROUTES  
        term 10  
            from  
                protocol ospf3;
                route-type internal;
             
            then  
                metric 0;
                accept;
             
         
     
 
The configuration needs to be repeated for the other peers. The complete version is available on GitHub. Once the BGP sessions are up, we start learning routes from the other sites. For example, here is the route for 2001:db8: a1::/64:
> show route 2001:db8:a1::/64 protocol bgp table private.inet6.0 best-path
private.inet6.0: 15 destinations, 19 routes (15 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
2001:db8:a1::/64   *[BGP/140] 01:12:32, localpref 100, from 2001:db8:ff::6
                      AS path: 65001 I, validation-state: unverified
                      to 2001:db8:ff::6 via st0.1
                    > to 2001:db8:ff::14 via st0.2
It was learnt both from V1-1 (through st0.1) and V1-2 (through st0.2). The route is part of the private routing instance but encapsulated packets are sent/received in the public routing instance. No route-leaking is needed for this configuration. The VPN cannot be used as a gateway from internal hosts to external hosts (or vice-versa). This could also have been done with JunOS security policies (stateful firewall rules) but doing the separation with routing instances also ensure routes from different domains are not mixed and a simple policy misconfiguration won t lead to a disaster.

Route-based VPN on Linux Starting from Linux 3.15, a similar configuration is possible with the help of a virtual tunnel interface3. First, we create the private namespace:
# ip netns add private
# ip netns exec private sysctl -qw net.ipv6.conf.all.forwarding=1
Any private interface needs to be moved to this namespace (no IP is configured as we can use IPv6 link-local addresses):
# ip link set netns private dev eth1
# ip link set netns private dev eth2
# ip netns exec private ip link set up dev eth1
# ip netns exec private ip link set up dev eth2
Then, we create vti6, a tunnel interface (similar to st0.1 in the JunOS example):
# ip tunnel add vti6 \
   mode vti6 \
   local 2001:db8:1::1 \
   remote 2001:db8:3::2 \
   key 6
# ip link set netns private dev vti6
# ip netns exec private ip addr add 2001:db8:ff::6/127 dev vti6
# ip netns exec private sysctl -qw net.ipv4.conf.vti6.disable_policy=1
# ip netns exec private sysctl -qw net.ipv4.conf.vti6.disable_xfrm=1
# ip netns exec private ip link set vti6 mtu 1500
# ip netns exec private ip link set vti6 up
The tunnel interface is created in the initial namespace and moved to the private one. It will remember its original namespace where it will process encapsulated packets. Any packet entering the interface will temporarily get a firewall mark of 6 that will be used only to match the appropriate IPsec policy4 below. The kernel sets a low MTU on the interface to handle any possible combination of ciphers and protocols. We set it to 1500 and let PMTUD do its work. We can then configure strongSwan5:
conn V3-2
  left        = 2001:db8:1::1
  leftsubnet  = ::/0
  right       = 2001:db8:3::2
  rightsubnet = ::/0
  authby      = psk
  mark        = 6
  auto        = route
  keyexchange = ikev2
  keyingtries = %forever
  ike         = aes256gcm16-prfsha384-ecp384!
  esp         = aes256gcm16-prfsha384-ecp384!
  mobike      = no
The IKE daemon configures the following policies in the kernel:
$ ip xfrm policy
src ::/0 dst ::/0
        dir out priority 399999 ptype main
        mark 0x6/0xffffffff
        tmpl src 2001:db8:1::1 dst 2001:db8:3::2
                proto esp reqid 1 mode tunnel
src ::/0 dst ::/0
        dir fwd priority 399999 ptype main
        mark 0x6/0xffffffff
        tmpl src 2001:db8:3::2 dst 2001:db8:1::1
                proto esp reqid 1 mode tunnel
src ::/0 dst ::/0
        dir in priority 399999 ptype main
        mark 0x6/0xffffffff
        tmpl src 2001:db8:3::2 dst 2001:db8:1::1
                proto esp reqid 1 mode tunnel
[ ]
Those policies are used for any source or destination as long as the firewall mark is equal to 6, which matches the mark configured for the tunnel interface. The last step is to configure BGP to exchange routes. We can use BIRD for this:
router id 1.0.1.1;
protocol device  
   scan time 10;
 
protocol kernel  
   persist;
   learn;
   import all;
   export all;
   merge paths yes;
 
protocol bgp IBGP_V3_2  
   local 2001:db8:ff::6 as 65001;
   neighbor 2001:db8:ff::7 as 65003;
   import all;
   export where ifname ~ "eth*";
   preference 160;
   hold time 6;
 
Once BIRD is started in the private namespace, we can check routes are learned correctly:
$ ip netns exec private ip -6 route show 2001:db8:a3::/64
2001:db8:a3::/64 proto bird metric 1024
        nexthop via 2001:db8:ff::5  dev vti5 weight 1
        nexthop via 2001:db8:ff::7  dev vti6 weight 1
The above route was learnt from both V3-1 (through vti5) and V3-2 (through vti6). Like for the JunOS version, there is no route-leaking between the private namespace and the initial one. The VPN cannot be used as a gateway between the two namespaces, only for encapsulation. This also prevent a misconfiguration (for example, IKE daemon not running) from allowing packets to leave the private network. As a bonus, unencrypted traffic can be observed with tcpdump on the tunnel interface:
$ ip netns exec private tcpdump -pni vti6 icmp6
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vti6, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
20:51:15.258708 IP6 2001:db8:a1::1 > 2001:db8:a3::1: ICMP6, echo request, seq 69
20:51:15.260874 IP6 2001:db8:a3::1 > 2001:db8:a1::1: ICMP6, echo reply, seq 69
You can find all the configuration files for this example on GitHub. The documentation of strongSwan also features a page about route-based VPNs.

  1. Everything in this post should work with Libreswan.
  2. fwd is for incoming packets on non-local addresses. It only makes sense in transport mode and is a Linux-only particularity.
  3. Virtual tunnel interfaces (VTI) were introduced in Linux 3.6 (for IPv4) and Linux 3.12 (for IPv6). Appropriate namespace support was added in 3.15. KLIPS, an alternative out-of-tree stack available since Linux 2.2, also features tunnel interfaces.
  4. The mark is set right before doing a policy lookup and restored after that. Consequently, it doesn t affect other possible uses (filtering, routing). However, as Netfilter can also set a mark, one should be careful for conflicts.
  5. The ciphers used here are the strongest ones currently possible while keeping compatibility with JunOS. The documentation for strongSwan contains a complete list of supported algorithms as well as security recommendations to choose them.

12 September 2017

Markus Koschany: My Free Software Activities in August 2017

Welcome to gambaru.de. Here is my monthly report that covers what I have been doing for Debian. If you re interested in Java, Games and LTS topics, this might be interesting for you. DebConf 17 in Montreal I traveled to DebConf 17 in Montreal/Canada. I arrived on 04. August and met a lot of different people which I only knew by name so far. I think this is definitely one of the best aspects of real life meetings, putting names to faces and getting to know someone better. I totally enjoyed my stay and I would like to thank all the people who were involved in organizing this event. You rock! I also gave a talk about the The past, present and future of Debian Games , listened to numerous other talks and got a nice sunburn which luckily turned into a more brownish color when I returned home on 12. August. The only negative experience I made was with my airline which was supposed to fly me home to Frankfurt again. They decided to cancel the flight one hour before check-in for unknown reasons and just gave me a telephone number to sort things out. No support whatsoever. Fortunately (probably not for him) another DebConf attendee suffered the same fate and together we could find another flight with Royal Air Maroc the same day. And so we made a short trip to Casablanca/Morocco and eventually arrived at our final destination in Frankfurt a few hours later. So which airline should you avoid at all costs (they still haven t responded to my refund claims) ? It s WoW-Air from Iceland. (just wow) Debian Games Debian Java Debian LTS This was my eighteenth month as a paid contributor and I have been paid to work 20,25 hours on Debian LTS, a project started by Rapha l Hertzog. In that time I did the following: Non-maintainer upload Thanks for reading and see you next time.

5 September 2017

Kees Cook: security things in Linux v4.13

Previously: v4.12. Here s a short summary of some of interesting security things in Sunday s v4.13 release of the Linux kernel: security documentation ReSTification
The kernel has been switching to formatting documentation with ReST, and I noticed that none of the Documentation/security/ tree had been converted yet. I took the opportunity to take a few passes at formatting the existing documentation and, at Jon Corbet s recommendation, split it up between end-user documentation (which is mainly how to use LSMs) and developer documentation (which is mainly how to use various internal APIs). A bunch of these docs need some updating, so maybe with the improved visibility, they ll get some extra attention. CONFIG_REFCOUNT_FULL
Since Peter Zijlstra implemented the refcount_t API in v4.11, Elena Reshetova (with Hans Liljestrand and David Windsor) has been systematically replacing atomic_t reference counters with refcount_t. As of v4.13, there are now close to 125 conversions with many more to come. However, there were concerns over the performance characteristics of the refcount_t implementation from the maintainers of the net, mm, and block subsystems. In order to assuage these concerns and help the conversion progress continue, I added an unchecked refcount_t implementation (identical to the earlier atomic_t implementation) as the default, with the fully checked implementation now available under CONFIG_REFCOUNT_FULL. The plan is that for v4.14 and beyond, the kernel can grow per-architecture implementations of refcount_t that have performance characteristics on par with atomic_t (as done in grsecurity s PAX_REFCOUNT). CONFIG_FORTIFY_SOURCE
Daniel Micay created a version of glibc s FORTIFY_SOURCE compile-time and run-time protection for finding overflows in the common string (e.g. strcpy, strcmp) and memory (e.g. memcpy, memcmp) functions. The idea is that since the compiler already knows the size of many of the buffer arguments used by these functions, it can already build in checks for buffer overflows. When all the sizes are known at compile time, this can actually allow the compiler to fail the build instead of continuing with a proven overflow. When only some of the sizes are known (e.g. destination size is known at compile-time, but source size is only known at run-time) run-time checks are added to catch any cases where an overflow might happen. Adding this found several places where minor leaks were happening, and Daniel and I chased down fixes for them. One interesting note about this protection is that is only examines the size of the whole object for its size (via __builtin_object_size(..., 0)). If you have a string within a structure, CONFIG_FORTIFY_SOURCE as currently implemented will make sure only that you can t copy beyond the structure (but therefore, you can still overflow the string within the structure). The next step in enhancing this protection is to switch from 0 (above) to 1, which will use the closest surrounding subobject (e.g. the string). However, there are a lot of cases where the kernel intentionally copies across multiple structure fields, which means more fixes before this higher level can be enabled. NULL-prefixed stack canary
Rik van Riel and Daniel Micay changed how the stack canary is defined on 64-bit systems to always make sure that the leading byte is zero. This provides a deterministic defense against overflowing string functions (e.g. strcpy), since they will either stop an overflowing read at the NULL byte, or be unable to write a NULL byte, thereby always triggering the canary check. This does reduce the entropy from 64 bits to 56 bits for overflow cases where NULL bytes can be written (e.g. memcpy), but the trade-off is worth it. (Besdies, x86_64 s canary was 32-bits until recently.) IPC refactoring
Partially in support of allowing IPC structure layouts to be randomized by the randstruct plugin, Manfred Spraul and I reorganized the internal layout of how IPC is tracked in the kernel. The resulting allocations are smaller and much easier to deal with, even if I initially missed a few needed container_of() uses. randstruct gcc plugin
I ported grsecurity s clever randstruct gcc plugin to upstream. This plugin allows structure layouts to be randomized on a per-build basis, providing a probabilistic defense against attacks that need to know the location of sensitive structure fields in kernel memory (which is most attacks). By moving things around in this fashion, attackers need to perform much more work to determine the resulting layout before they can mount a reliable attack. Unfortunately, due to the timing of the development cycle, only the manual mode of randstruct landed in upstream (i.e. marking structures with __randomize_layout). v4.14 will also have the automatic mode enabled, which randomizes all structures that contain only function pointers. A large number of fixes to support randstruct have been landing from v4.10 through v4.13, most of which were already identified and fixed by grsecurity, but many were novel, either in newly added drivers, as whitelisted cross-structure casts, refactorings (like IPC noted above), or in a corner case on ARM found during upstream testing. lower ELF_ET_DYN_BASE
One of the issues identified from the Stack Clash set of vulnerabilities was that it was possible to collide stack memory with the highest portion of a PIE program s text memory since the default ELF_ET_DYN_BASE (the lowest possible random position of a PIE executable in memory) was already so high in the memory layout (specifically, 2/3rds of the way through the address space). Fixing this required teaching the ELF loader how to load interpreters as shared objects in the mmap region instead of as a PIE executable (to avoid potentially colliding with the binary it was loading). As a result, the PIE default could be moved down to ET_EXEC (0x400000) on 32-bit, entirely avoiding the subset of Stack Clash attacks. 64-bit could be moved to just above the 32-bit address space (0x100000000), leaving the entire 32-bit region open for VMs to do 32-bit addressing, but late in the cycle it was discovered that Address Sanitizer couldn t handle it moving. With most of the Stack Clash risk only applicable to 32-bit, fixing 64-bit has been deferred until there is a way to teach Address Sanitizer how to load itself as a shared object instead of as a PIE binary. early device randomness
I noticed that early device randomness wasn t actually getting added to the kernel entropy pools, so I fixed that to improve the effectiveness of the latent_entropy gcc plugin. That s it for now; please let me know if I missed anything. As a side note, I was rather alarmed to discover that due to all my trivial ReSTification formatting, and tiny FORTIFY_SOURCE and randstruct fixes, I made it into the most active 4.13 developers list (by patch count) at LWN with 76 patches: a whopping 0.6% of the cycle s patches. ;) Anyway, the v4.14 merge window is open!

2017, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
Creative Commons License

1 September 2017

Russ Allbery: Review: Regeneration

Review: Regeneration, by Julie E. Czerneda
Series: Species Imperative #3
Publisher: DAW
Copyright: 2006
ISBN: 0-7564-0345-6
Format: Hardcover
Pages: 543
This is the third book of the Species Imperative trilogy, and this is the type of trilogy that's telling a single story in three books. You don't want to read this out of order, and I'll have to be cautious about aspects of the plot to not spoil the earlier books. Mac is still recovering from the effects of the first two books of the series, but she's primarily worried about a deeply injured friend. Worse, that friend is struggling to explain or process what's happened, and the gaps in her memory and her very ability to explain may point at frightening, lingering risks to humanity. As much as she wants to, Mac can't give her friend all of her focus, since she's also integral to the team trying to understand the broader implications of the events of Migration. Worse, some of the non-human species have their own contrary interpretations that, if acted on, Mac believes would be desperately risky for humanity and all the other species reachable through the transects. That set of competing priorities and motivations eventually sort themselves out into a tense and rewarding multi-species story, but they get off to an awkward start. The first 150 pages of Regeneration are long on worry, uncertainty, dread, and cryptic conversations, and short on enjoyable reading. Czerneda's recaps of the previous books are appreciated, but they weren't very smoothly integrated into the story. (I renew my occasional request for series authors to include a simple plot summary of the previous books as a prefix, without trying to weave it into the fiction.) I was looking forward to this book after the excellent previous volumes, but struggled to get into the story. That does change. It takes a bit too long, with a bit too much nameless dread, a bit too much of an irritating subplot between Fourteen and Oversight that I didn't think added anything to the book, and not enough of Mac barreling forward doing sensible things. But once Mac gets back into space, with a destination and a job and a collection of suspicious (or arrogant) humans and almost-incomprehensible aliens to juggle, Czerneda hits her stride. Czerneda doesn't entirely avoid Planet of the Hats problems with her aliens, but I think she does better than most of science fiction. Alien species in this series do tend to be a bit all of a type, and Mac does figure them out by drawing conclusions from biology, but those conclusions are unobvious and based on Mac's mix of biological and human social intuition. They refreshingly aren't as simple as biology completely shaping culture. (Czerneda's touch is more subtle than James White's Sector General, for example.) And Mac has a practical, determined, and selfless approach that's deeply likable and admirable. It's fun as a reader to watch her win people over by just being competent, thoughtful, observant, and unrelentingly ethical. But the best part of this book, by far, are the Sinzi. They first appeared in the second book, Migration, and seemed to follow the common SF trope of a wise elder alien race that can bring some order to the universe and that humanity can learn from. They, or more precisely the one Sinzi who appeared in Migration, was very good at that role. But Czerneda had something far more interesting planned, and in Regeneration they become truly alien in their own right, with their own nearly incomprehensible way of viewing the universe. There are so many ways that this twist can go wrong, and Czerneda avoids all of them. She doesn't undermine their gravitas, nor does she elevate them to the level of Arisians or other semi-angelic wise mentors of other series. Czerneda makes them different in profound ways that are both advantage and disadvantage, pulls that difference into the plot as a complicating element, and has Mac stumble yet again into a role that is accidentally far more influential than she intends. Mac is the perfect character to do that to: she has just the right mix of embarrassment, ethics, seat-of-the-pants blunt negotiation skills, and a strong moral compass. Given a lever and a place to stand, one can believe that Mac can move the world, and the Sinzi are an absolutely fascinating lever. There are also three separate, highly differentiated Sinzi in this story, with different goals, life experience, personalities, and levels of gravitas. Czerneda's aliens are good in general, but her focus is usually more on biology than individual differentiation. The Sinzi here combine the best of both types of character building. I think the ending of Regeneration didn't entirely work. After all the intense effort the characters put into understanding the complexity of the universe over the course of the series, the denouement has a mopping-up feel and a moral clarity that felt a bit too easy. But the climax has everything I was hoping for, there's a lot more of Mac being Mac, and I loved every moment of the Sinzi twist. Now I want a whole new series exploring the implications of the Sinzi's view of the universe on the whole history of galactic politics that sat underneath this story. But I'll settle for moments of revelation that sent shivers down my spine. This is a bit of an uneven book that falls short of its potential, but I'll remember it for a long time. Add it on to a deeply rewarding series, and I will recommend the whole package unreservedly. The Species Imperative is excellent science fiction that should be better-known than it is. I still think the romance subplot was unfortunate, and occasionally the aliens get too cartoony (Fourteen, in particular, goes a bit too far in that direction), but Czerneda never lingers too long on those elements. And the whole work is some of the best writing about working scientific research and small-group politics that I've read. Highly recommended, but read the whole series in order. Rating: 9 out of 10

20 August 2017

Vincent Bernat: IPv6 route lookup on Linux

TL;DR: With its implementation of IPv6 routing tables using radix trees, Linux offers subpar performance (450 ns for a full view 40,000 routes) compared to IPv4 (50 ns for a full view 500,000 routes) but fair memory usage (20 MiB for a full view).
In a previous article, we had a look at IPv4 route lookup on Linux. Let s see how different IPv6 is.

Lookup trie implementation Looking up a prefix in a routing table comes down to find the most specific entry matching the requested destination. A common structure for this task is the trie, a tree structure where each node has its parent as prefix. With IPv4, Linux uses a level-compressed trie (or LPC-trie), providing good performances with low memory usage. For IPv6, Linux uses a more classic radix tree (or Patricia trie). There are three reasons for not sharing:
  • The IPv6 implementation (introduced in Linux 2.1.8, 1996) predates the IPv4 implementation based on LPC-tries (in Linux 2.6.13, commit 19baf839ff4a).
  • The feature set is different. Notably, IPv6 supports source-specific routing1 (since Linux 2.1.120, 1998).
  • The IPv4 address space is denser than the IPv6 address space. Level-compression is therefore quite efficient with IPv4. This may not be the case with IPv6.
The trie in the below illustration encodes 6 prefixes: Radix tree For more in-depth explanation on the different ways to encode a routing table into a trie and a better understanding of radix trees, see the explanations for IPv4. The following figure shows the in-memory representation of the previous radix tree. Each node corresponds to a struct fib6_node. When a node has the RTN_RTINFO flag set, it embeds a pointer to a struct rt6_info containing information about the next-hop. Memory representation of a routing table The fib6_lookup_1() function walks the radix tree in two steps:
  1. walking down the tree to locate the potential candidate, and
  2. checking the candidate and, if needed, backtracking until a match.
Here is a slightly simplified version without source-specific routing:
static struct fib6_node *fib6_lookup_1(struct fib6_node *root,
                                       struct in6_addr  *addr)
 
    struct fib6_node *fn;
    __be32 dir;
    /* Step 1: locate potential candidate */
    fn = root;
    for (;;)  
        struct fib6_node *next;
        dir = addr_bit_set(addr, fn->fn_bit);
        next = dir ? fn->right : fn->left;
        if (next)  
            fn = next;
            continue;
         
        break;
     
    /* Step 2: check prefix and backtrack if needed */
    while (fn)  
        if (fn->fn_flags & RTN_RTINFO)  
            struct rt6key *key;
            key = fn->leaf->rt6i_dst;
            if (ipv6_prefix_equal(&key->addr, addr, key->plen))  
                if (fn->fn_flags & RTN_RTINFO)
                    return fn;
             
         
        if (fn->fn_flags & RTN_ROOT)
            break;
        fn = fn->parent;
     
    return NULL;
 

Caching While IPv4 lost its route cache in Linux 3.6 (commit 5e9965c15ba8), IPv6 still has a caching mechanism. However cache entries are directly put in the radix tree instead of a distinct structure. Since Linux 2.1.30 (1997) and until Linux 4.2 (commit 45e4fd26683c), almost any successful route lookup inserts a cache entry in the radix tree. For example, a router forwarding a ping between 2001:db8:1::1 and 2001:db8:3::1 would get those two cache entries:
$ ip -6 route show cache
2001:db8:1::1 dev r2-r1  metric 0
    cache
2001:db8:3::1 via 2001:db8:2::2 dev r2-r3  metric 0
    cache
These entries are cleaned up by the ip6_dst_gc() function controlled by the following parameters:
$ sysctl -a   grep -F net.ipv6.route
net.ipv6.route.gc_elasticity = 9
net.ipv6.route.gc_interval = 30
net.ipv6.route.gc_min_interval = 0
net.ipv6.route.gc_min_interval_ms = 500
net.ipv6.route.gc_thresh = 1024
net.ipv6.route.gc_timeout = 60
net.ipv6.route.max_size = 4096
net.ipv6.route.mtu_expires = 600
The garbage collector is triggered at most every 500 ms when there are more than 1024 entries or at least every 30 seconds. The garbage collection won t run for more than 60 seconds, except if there are more than 4096 routes. When running, it will first delete entries older than 30 seconds. If the number of cache entries is still greater than 4096, it will continue to delete more recent entries (but no more recent than 512 jiffies, which is the value of gc_elasticity) after a 500 ms pause. Starting from Linux 4.2 (commit 45e4fd26683c), only a PMTU exception would create a cache entry. A router doesn t have to handle those exceptions, so only hosts would get cache entries. And they should be pretty rare. Martin KaFai Lau explains:
Out of all IPv6 RTF_CACHE routes that are created, the percentage that has a different MTU is very small. In one of our end-user facing proxy server, only 1k out of 80k RTF_CACHE routes have a smaller MTU. For our DC traffic, there is no MTU exception.
Here is how a cache entry with a PMTU exception looks like:
$ ip -6 route show cache
2001:db8:1::50 via 2001:db8:1::13 dev out6  metric 0
    cache  expires 573sec mtu 1400 pref medium

Performance We consider three distinct scenarios:
Excerpt of an Internet full view
In this scenario, Linux acts as an edge router attached to the default-free zone. Currently, the size of such a routing table is a little bit above 40,000 routes.
/48 prefixes spread linearly with different densities
Linux acts as a core router inside a datacenter. Each customer or rack gets one or several /48 networks, which need to be routed around. With a density of 1, /48 subnets are contiguous.
/128 prefixes spread randomly in a fixed /108 subnet
Linux acts as a leaf router for a /64 subnet with hosts getting their IP using autoconfiguration. It is assumed all hosts share the same OUI and therefore, the first 40 bits are fixed. In this scenario, neighbor reachability information for the /64 subnet are converted into routes by some external process and redistributed among other routers sharing the same subnet2.

Route lookup performance With the help of a small kernel module, we can accurately benchmark3 the ip6_route_output_flags() function and correlate the results with the radix tree size: Maximum depth and lookup time Getting meaningful results is challenging due to the size of the address space. None of the scenarios have a fallback route and we only measure time for successful hits4. For the full view scenario, only the range from 2400::/16 to 2a06::/16 is scanned (it contains more than half of the routes). For the /128 scenario, the whole /108 subnet is scanned. For the /48 scenario, the range from the first /48 to the last one is scanned. For each range, 5000 addresses are picked semi-randomly. This operation is repeated until we get 5000 hits or until 1 million tests have been executed. The relation between the maximum depth and the lookup time is incomplete and I can t explain the difference of performance between the different densities of the /48 scenario. We can extract two important performance points:
  • With a full view, the lookup time is 450 ns. This is almost ten times the budget for forwarding at 10 Gbps which is about 50 ns.
  • With an almost empty routing table, the lookup time is 150 ns. This is still over the time budget for forwarding at 10 Gbps.
With IPv4, the lookup time for an almost empty table was 20 ns while the lookup time for a full view (500,000 routes) was a bit above 50 ns. How to explain such a difference? First, the maximum depth of the IPv4 LPC-trie with 500,000 routes was 6, while the maximum depth of the IPv6 radix tree for 40,000 routes is 40. Second, while both IPv4 s fib_lookup() and IPv6 s ip6_route_output_flags() functions have a fixed cost implied by the evaluation of routing rules, IPv4 has several optimizations when the rules are left unmodified5. Those optimizations are removed on the first modification. If we cancel those optimizations, the lookup time for IPv4 is impacted by about 30 ns. This still leaves a 100 ns difference with IPv6 to be explained. Let s compare how time is spent in each lookup function. Here is a CPU flamegraph for IPv4 s fib_lookup(): IPv4 route lookup flamegraph Only 50% of the time is spent in the actual route lookup. The remaining time is spent evaluating the routing rules (about 30 ns). This ratio is dependent on the number of routes we inserted (only 1000 in this example). It should be noted the fib_table_lookup() function is executed twice: once with the local routing table and once with the main routing table. The equivalent flamegraph for IPv6 s ip6_route_output_flags() is depicted below: IPv6 route lookup flamegraph Here is an approximate breakdown on the time spent:
  • 50% is spent in the route lookup in the main table,
  • 15% is spent in handling locking (IPv4 is using the more efficient RCU mechanism),
  • 5% is spent in the route lookup of the local table,
  • most of the remaining is spent in routing rule evaluation (about 100 ns)6.
Why does the evaluation of routing rules is less efficient with IPv6? Again, I don t have a definitive answer.

History The following graph shows the performance progression of route lookups through Linux history: IPv6 route lookup performance progression All kernels are compiled with GCC 4.9 (from Debian Jessie). This version is able to compile older kernels as well as current ones. The kernel configuration is the default one with CONFIG_SMP, CONFIG_IPV6, CONFIG_IPV6_MULTIPLE_TABLES and CONFIG_IPV6_SUBTREES options enabled. Some other unrelated options are enabled to be able to boot them in a virtual machine and run the benchmark. There are three notable performance changes:
  • In Linux 3.1, Eric Dumazet delays a bit the copy of route metrics to fix the undesirable sharing of route-specific metrics by all cache entries (commit 21efcfa0ff27). Each cache entry now gets its own metrics, which explains the performance hit for the non-/128 scenarios.
  • In Linux 3.9, Yoshifuji Hideaki removes the reference to the neighbor entry in struct rt6_info (commit 887c95cc1da5). This should have lead to a performance increase. The small regression may be due to cache-related issues.
  • In Linux 4.2, Martin KaFai Lau prevents the creation of cache entries for most route lookups. The most sensible performance improvement comes with commit 4b32b5ad31a6. The second one is from commit 45e4fd26683c, which effectively removes creation of cache entries, except for PMTU exceptions.

Insertion performance Another interesting performance-related metric is the insertion time. Linux is able to insert a full view in less than two seconds. For some reason, the insertion time is not linear above 50,000 routes and climbs very fast to 60 seconds for 500,000 routes. Insertion time Despite its more complex insertion logic, the IPv4 subsystem is able to insert 2 million routes in less than 10 seconds.

Memory usage Radix tree nodes (struct fib6_node) and routing information (struct rt6_info) are allocated with the slab allocator7. It is therefore possible to extract the information from /proc/slabinfo when the kernel is booted with the slab_nomerge flag:
# sed -ne 2p -e '/^ip6_dst/p' -e '/^fib6_nodes/p' /proc/slabinfo   cut -f1 -d:
   name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>
fib6_nodes         76101  76104     64   63    1
ip6_dst_cache      40090  40090    384   10    1
In the above example, the used memory is 76104 64+40090 384 bytes (about 20 MiB). The number of struct rt6_info matches the number of routes while the number of nodes is roughly twice the number of routes: Nodes The memory usage is therefore quite predictable and reasonable, as even a small single-board computer can support several full views (20 MiB for each): Memory usage The LPC-trie used for IPv4 is more efficient: when 512 MiB of memory is needed for IPv6 to store 1 million routes, only 128 MiB are needed for IPv4. The difference is mainly due to the size of struct rt6_info (336 bytes) compared to the size of IPv4 s struct fib_alias (48 bytes): IPv4 puts most information about next-hops in struct fib_info structures that are shared with many entries.

Conclusion The takeaways from this article are:
  • upgrade to Linux 4.2 or more recent to avoid excessive caching,
  • route lookups are noticeably slower compared to IPv4 (by an order of magnitude),
  • CONFIG_IPV6_MULTIPLE_TABLES option incurs a fixed penalty of 100 ns by lookup,
  • memory usage is fair (20 MiB for 40,000 routes).
Compared to IPv4, IPv6 in Linux doesn t foster the same interest, notably in term of optimizations. Hopefully, things are changing as its adoption and use at scale are increasing.

  1. For a given destination prefix, it s possible to attach source-specific prefixes:
    ip -6 route add 2001:db8:1::/64 \
      from 2001:db8:3::/64 \
      via fe80::1 \
      dev eth0
    
    Lookup is first done on the destination address, then on the source address.
  2. This is quite different of the classic scenario where Linux acts as a gateway for a /64 subnet. In this case, the neighbor subsystem stores the reachability information for each host and the routing table only contains a single /64 prefix.
  3. The measurements are done in a virtual machine with one vCPU and no neighbors. The host is an Intel Core i5-4670K running at 3.7 GHz during the experiment (CPU governor set to performance). The benchmark is single-threaded. Many lookups are performed and the result reported is the median value. Timings of individual runs are computed from the TSC.
  4. Most of the packets in the network are expected to be routed to a destination. However, this also means the backtracking code path is not used in the /128 and /48 scenarios. Having a fallback route gives far different results and make it difficult to ensure we explore the address space correctly.
  5. The exact same optimizations could be applied for IPv6. Nobody did it yet.
  6. Compiling out table support effectively removes those last 100 ns.
  7. There is also per-CPU pointers allocated directly (4 bytes per entry per CPU on a 64-bit architecture). We ignore this detail.

31 July 2017

Russell Coker: Running a Tor Relay

I previously wrote about running my SE Linux Play Machine over Tor [1] which involved configuring ssh to use Tor. Since then I have installed a Tor hidden service for ssh on many systems I run for clients. The reason is that it is fairly common for them to allow a server to get a new IP address by DHCP or accidentally set their firewall to deny inbound connections. Without some sort of VPN this results in difficult phone calls talking non-technical people through the process of setting up a tunnel or discovering an IP address. While I can run my own VPN for them I don t want their infrastructure tied to mine and they don t want to pay for a 3rd party VPN service. Tor provides a free VPN service and works really well for this purpose. As I believe in giving back to the community I decided to run my own Tor relay. I have no plans to ever run a Tor Exit Node because that involves more legal problems than I am willing or able to deal with. A good overview of how Tor works is the EFF page about it [2]. The main point of a Middle Relay (or just Relay ) is that it only sends and receives encrypted data from other systems. As the Relay software (and the sysadmin if they choose to examine traffic) only sees encrypted data without any knowledge of the source or final destination the legal risk is negligible. Running a Tor relay is quite easy to do. The Tor project has a document on running relays [3], which basically involves changing 4 lines in the torrc file and restarting Tor. If you are running on Debian you should install the package tor-geoipdb to allow Tor to determine where connections come from (and to not whinge in the log files). ORPort [IPV6ADDR]:9001 If you want to use IPv6 then you need a line like the above with IPV6ADDR replaced by the address you want to use. Currently Tor only supports IPv6 for connections between Tor servers and only for the data transfer not the directory services. Data Transfer I currently have 2 systems running as Tor relays, both of them are well connected in a European DC and they are each transferring about 10GB of data per day which isn t a lot by server standards. I don t know if there is a sufficient number of relays around the world that the share of the load is small or if there is some geographic dispersion algorithm which determined that there are too many relays in operation in that region.

30 July 2017

Niels Thykier: Introducing the debhelper buildlabel prototype for multi-building packages

For most packages, the dh short-hand rules (possibly with a few overrides) work great. It can often auto-detect the buildsystem and handle all the trivial parts. With one notably exception: What if you need to compile the upstream code twice (or more) with different flags? This is the case for all source packages building both regular debs and udebs. In that case, you would previously need to override about 5-6 helpers for this to work at all. The five dh_auto_* helpers and usually also dh_install (to call it with different sourcedir for different packages). This gets even more complex if you want to support Build-Profiles such as noudeb and nodoc . The best way to support nodoc in debhelper is to move documentation out of dh_install s config files and use dh_installman, dh_installdocs, and dh_installexamples instead (NB: wait for compat 11 before doing this). This in turn will mean more overrides with sourcedir and -p/-N. And then there is noudeb , which currently requires manual handling in debian/rules. Basically, you need to use make or shell if-statements to conditionally skip the udeb part of the builds. All of this is needlessly complex. Improving the situation In an attempt to make things better, I have made a new prototype feature in debhelper called buildlabels in experimental. The current prototype is designed to deal with part (but not all) of the above problems: However, it currently not solve the need for overriding the dh_auto_* tools and I am not sure when/if it will. The feature relies on being able to relate packages to a given series of calls to dh_auto_*. In the following example, I will use udebs for the secondary build. However, this feature is not tied to udebs in any way and can be used any source package that needs to do two or more upstream builds for different packages. Assume our example source builds the following binary packages: And in the rules file, we would have something like:
[...]
override_dh_auto_configure:
    dh_auto_configure -B build-deb -- --with-feature1 --with-feature2
    dh_auto_configure -B build-udeb -- --without-feature1 --without-feature2
[...]
What is somewhat obvious to a human is that the first configure line is related to the regular debs and the second configure line is for the udebs. However, debhelper does not know how to infer this and this is where buildlabels come in. With buildlabels, you can let debhelper know which packages and builds that belong together. How to use buildlabels To use buildlabels, you have to do three things:
  1. Pick a reasonable label name for the secondary build. In the example, I will use udeb .
  2. Add buildlabel=$LABEL to all dh_auto_* calls related to your secondary build.
  3. Tag all packages related to my-label with X-DH-Buildlabel: $LABEL in debian/control. (For udeb packages, you may want to add Build-Profiles: <!noudeb> while you are at it).
For the example package, we would change the debian/rules snippet to:
[...]
override_dh_auto_configure:
    dh_auto_configure -B build-deb -- --with-feature1 --with-feature2
    dh_auto_configure --buildlabel=udeb -B build-udeb -- --without-feature1 --without-feature2
[...]
(Remember to update *all* calls to dh_auto_* helpers; the above only lists dh_auto_configure to keep the example short.) And then add X-DH-Buildlabel: udeb in the stanzas for foo-udeb + libfoo1-udeb. With those two minor changes: Real example Thanks to Michael Biebl, I was able to make an branch in the systemd git repository to play with this feature. Therefore I have an real example to use as a show case. The gist of it is in the following three commits: Full branch can be seen at: https://anonscm.debian.org/git/pkg-systemd/systemd.git/log/?h=wip-dh-prototype-smarter-multi-builds Request for comments / call for testing This prototype is now in experimental (debhelper/10.7+exp.buildlabels) and you are very welcome to take it for a spin. Please let me know if you find the idea useful and feel free to file bugs or feature requests. If deemed useful, I will merge into master and include in a future release. If you have any questions or comments about the feature or need help with trying it out, you are also very welcome to mail the debhelper-devel mailing list. Known issues / the fine print:
Filed under: Debhelper, Debian

3 July 2017

Vincent Bernat: Performance progression of IPv4 route lookup on Linux

TL;DR: Each of Linux 2.6.39, 3.6 and 4.0 brings notable performance improvements for the IPv4 route lookup process.
In a previous article, I explained how Linux implements an IPv4 routing table with compressed tries to offer excellent lookup times. The following graph shows the performance progression of Linux through history: IPv4 route lookup performance Two scenarios are tested: All kernels are compiled with GCC 4.9 (from Debian Jessie). This version is able to compile older kernels1 as well as current ones. The kernel configuration used is the default one with CONFIG_SMP and CONFIG_IP_MULTIPLE_TABLES options enabled (however, no IP rules are used). Some other unrelated options are enabled to be able to boot them in a virtual machine and run the benchmark. The measurements are done in a virtual machine with one vCPU2. The host is an Intel Core i5-4670K and the CPU governor was set to performance . The benchmark is single-threaded. Implemented as a kernel module, it calls fib_lookup() with various destinations in 100,000 timed iterations and keeps the median. Timings of individual runs are computed from the TSC (and converted to nanoseconds by assuming a constant clock). The following kernel versions bring a notable performance improvement:

  1. Compiling old kernels with an updated userland may still require some small patches.
  2. The kernels are compiled with the CONFIG_SMP option to use the hierarchical RCU and activate more of the same code paths as actual routers. However, progress on parallelism are left unnoticed.

2 July 2017

Bits from Debian: New Debian Developers and Maintainers (May and June 2017)

The following contributors got their Debian Developer accounts in the last two months: The following contributors were added as Debian Maintainers in the last two months: Congratulations!

21 June 2017

Vincent Bernat: IPv4 route lookup on Linux

TL;DR: With its implementation of IPv4 routing tables using LPC-tries, Linux offers good lookup performance (50 ns for a full view) and low memory usage (64 MiB for a full view).
During the lifetime of an IPv4 datagram inside the Linux kernel, one important step is the route lookup for the destination address through the fib_lookup() function. From essential information about the datagram (source and destination IP addresses, interfaces, firewall mark, ), this function should quickly provide a decision. Some possible options are: Since 2.6.39, Linux stores routes into a compressed prefix tree (commit 3630b7c050d9). In the past, a route cache was maintained but it has been removed1 in Linux 3.6.

Route lookup in a trie Looking up a route in a routing table is to find the most specific prefix matching the requested destination. Let s assume the following routing table:
$ ip route show scope global table 100
default via 203.0.113.5 dev out2
192.0.2.0/25
        nexthop via 203.0.113.7  dev out3 weight 1
        nexthop via 203.0.113.9  dev out4 weight 1
192.0.2.47 via 203.0.113.3 dev out1
192.0.2.48 via 203.0.113.3 dev out1
192.0.2.49 via 203.0.113.3 dev out1
192.0.2.50 via 203.0.113.3 dev out1
Here are some examples of lookups and the associated results:
Destination IP Next hop
192.0.2.49 203.0.113.3 via out1
192.0.2.50 203.0.113.3 via out1
192.0.2.51 203.0.113.7 via out3 or 203.0.113.9 via out4 (ECMP)
192.0.2.200 203.0.113.5 via out2
A common structure for route lookup is the trie, a tree structure where each node has its parent as prefix.

Lookup with a simple trie The following trie encodes the previous routing table: Simple routing trie For each node, the prefix is known by its path from the root node and the prefix length is the current depth. A lookup in such a trie is quite simple: at each step, fetch the nth bit of the IP address, where n is the current depth. If it is 0, continue with the first child. Otherwise, continue with the second. If a child is missing, backtrack until a routing entry is found. For example, when looking for 192.0.2.50, we will find the result in the corresponding leaf (at depth 32). However for 192.0.2.51, we will reach 192.0.2.50/31 but there is no second child. Therefore, we backtrack until the 192.0.2.0/25 routing entry. Adding and removing routes is quite easy. From a performance point of view, the lookup is done in constant time relative to the number of routes (due to maximum depth being capped to 32). Quagga is an example of routing software still using this simple approach.

Lookup with a path-compressed trie In the previous example, most nodes only have one child. This leads to a lot of unneeded bitwise comparisons and memory is also wasted on many nodes. To overcome this problem, we can use path compression: each node with only one child is removed (except if it also contains a routing entry). Each remaining node gets a new property telling how many input bits should be skipped. Such a trie is also known as a Patricia trie or a radix tree. Here is the path-compressed version of the previous trie: Patricia trie Since some bits have been ignored, on a match, a final check is executed to ensure all bits from the found entry are matching the input IP address. If not, we must act as if the entry wasn t found (and backtrack to find a matching prefix). The following figure shows two IP addresses matching the same leaf: Lookup in a Patricia trie The reduction on the average depth of the tree compensates the necessity to handle those false positives. The insertion and deletion of a routing entry is still easy enough. Many routing systems are using Patricia trees:

Lookup with a level-compressed trie In addition to path compression, level compression2 detects parts of the trie that are densily populated and replace them with a single node and an associated vector of 2k children. This node will handle k input bits instead of just one. For example, here is a level-compressed version our previous trie: Level-compressed trie Such a trie is called LC-trie or LPC-trie and offers higher lookup performances compared to a radix tree. An heuristic is used to decide how many bits a node should handle. On Linux, if the ratio of non-empty children to all children would be above 50% when the node handles an additional bit, the node gets this additional bit. On the other hand, if the current ratio is below 25%, the node loses the responsibility of one bit. Those values are not tunable. Insertion and deletion becomes more complex but lookup times are also improved.

Implementation in Linux The implementation for IPv4 in Linux exists since 2.6.13 (commit 19baf839ff4a) and is enabled by default since 2.6.39 (commit 3630b7c050d9). Here is the representation of our example routing table in memory3: Memory representation of a trie There are several structures involved: The trie can be retrieved through /proc/net/fib_trie:
$ cat /proc/net/fib_trie
Id 100:
  +-- 0.0.0.0/0 2 0 2
      -- 0.0.0.0
        /0 universe UNICAST
     +-- 192.0.2.0/26 2 0 1
         -- 192.0.2.0
           /25 universe UNICAST
         -- 192.0.2.47
           /32 universe UNICAST
        +-- 192.0.2.48/30 2 0 1
            -- 192.0.2.48
              /32 universe UNICAST
            -- 192.0.2.49
              /32 universe UNICAST
            -- 192.0.2.50
              /32 universe UNICAST
[...]
For internal nodes, the numbers after the prefix are:
  1. the number of bits handled by the node,
  2. the number of full children (they only handle one bit),
  3. the number of empty children.
Moreover, if the kernel was compiled with CONFIG_IP_FIB_TRIE_STATS, some interesting statistics are available in /proc/net/fib_triestat4:
$ cat /proc/net/fib_triestat
Basic info: size of leaf: 48 bytes, size of tnode: 40 bytes.
Id 100:
        Aver depth:     2.33
        Max depth:      3
        Leaves:         6
        Prefixes:       6
        Internal nodes: 3
          2: 3
        Pointers: 12
Null ptrs: 4
Total size: 1  kB
[...]
When a routing table is very dense, a node can handle many bits. For example, a densily populated routing table with 1 million entries packed in a /12 can have one internal node handling 20 bits. In this case, route lookup is essentially reduced to a lookup in a vector. The following graph shows the number of internal nodes used relative to the number of routes for different scenarios (routes extracted from an Internet full view, /32 routes spreaded over 4 different subnets with various densities). When routes are densily packed, the number of internal nodes are quite limited. Internal nodes and null pointers

Performance So how performant is a route lookup? The maximum depth stays low (about 6 for a full view), so a lookup should be quite fast. With the help of a small kernel module, we can accurately benchmark5 the fib_lookup() function: Maximum depth and lookup time The lookup time is loosely tied to the maximum depth. When the routing table is densily populated, the maximum depth is low and the lookup times are fast. When forwarding at 10 Gbps, the time budget for a packet would be about 50 ns. Since this is also the time needed for the route lookup alone in some cases, we wouldn t be able to forward at line rate with only one core. Nonetheless, the results are pretty good and they are expected to scale linearly with the number of cores. The measurements are done with a Linux kernel 4.11 from Debian unstable. I have gathered performance metrics accross kernel versions in Performance progression of IPv4 route lookup on Linux . Another interesting figure is the time it takes to insert all those routes into the kernel. Linux is also quite efficient in this area since you can insert 2 million routes in less than 10 seconds: Insertion time

Memory usage The memory usage is available directly in /proc/net/fib_triestat. The statistic provided doesn t account for the fib_info structures, but you should only have a handful of them (one for each possible next-hop). As you can see on the graph below, the memory use is linear with the number of routes inserted, whatever the shape of the routes is. Memory usage The results are quite good. With only 256 MiB, about 2 million routes can be stored!

Routing rules Unless configured without CONFIG_IP_MULTIPLE_TABLES, Linux supports several routing tables and has a system of configurable rules to select the table to use. These rules can be configured with ip rule. By default, there are three of them:
$ ip rule show
0:      from all lookup local
32766:  from all lookup main
32767:  from all lookup default
Linux will first lookup for a match in the local table. If it doesn t find one, it will lookup in the main table and at last resort, the default table.

Builtin tables The local table contains routes for local delivery:
$ ip route show table local
broadcast 127.0.0.0 dev lo proto kernel scope link src 127.0.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
broadcast 127.255.255.255 dev lo proto kernel scope link src 127.0.0.1
broadcast 192.168.117.0 dev eno1 proto kernel scope link src 192.168.117.55
local 192.168.117.55 dev eno1 proto kernel scope host src 192.168.117.55
broadcast 192.168.117.63 dev eno1 proto kernel scope link src 192.168.117.55
This table is populated automatically by the kernel when addresses are configured. Let s look at the three last lines. When the IP address 192.168.117.55 was configured on the eno1 interface, the kernel automatically added the appropriate routes:
  • a route for 192.168.117.55 for local unicast delivery to the IP address,
  • a route for 192.168.117.255 for broadcast delivery to the broadcast address,
  • a route for 192.168.117.0 for broadcast delivery to the network address.
When 127.0.0.1 was configured on the loopback interface, the same kind of routes were added to the local table. However, a loopback address receives a special treatment and the kernel also adds the whole subnet to the local table. As a result, you can ping any IP in 127.0.0.0/8:
$ ping -c1 127.42.42.42
PING 127.42.42.42 (127.42.42.42) 56(84) bytes of data.
64 bytes from 127.42.42.42: icmp_seq=1 ttl=64 time=0.039 ms
--- 127.42.42.42 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.039/0.039/0.039/0.000 ms
The main table usually contains all the other routes:
$ ip route show table main
default via 192.168.117.1 dev eno1 proto static metric 100
192.168.117.0/26 dev eno1 proto kernel scope link src 192.168.117.55 metric 100
The default route has been configured by some DHCP daemon. The connected route (scope link) has been automatically added by the kernel (proto kernel) when configuring an IP address on the eno1 interface. The default table is empty and has little use. It has been kept when the current incarnation of advanced routing has been introduced in Linux 2.1.68 after a first tentative using classes in Linux 2.1.156.

Performance Since Linux 4.1 (commit 0ddcf43d5d4a), when the set of rules is left unmodified, the main and local tables are merged and the lookup is done with this single table (and the default table if not empty). Moreover, since Linux 3.0 (commit f4530fa574df), without specific rules, there is no performance hit when enabling the support for multiple routing tables. However, as soon as you add new rules, some CPU cycles will be spent for each datagram to evaluate them. Here is a couple of graphs demonstrating the impact of routing rules on lookup times: Routing rules impact on performance For some reason, the relation is linear when the number of rules is between 1 and 100 but the slope increases noticeably past this threshold. The second graph highlights the negative impact of the first rule (about 30 ns). A common use of rules is to create virtual routers: interfaces are segregated into domains and when a datagram enters through an interface from domain A, it should use routing table A:
# ip rule add iif vlan457 table 10
# ip rule add iif vlan457 blackhole
# ip rule add iif vlan458 table 20
# ip rule add iif vlan458 blackhole
The blackhole rules may be removed if you are sure there is a default route in each routing table. For example, we add a blackhole default with a high metric to not override a regular default route:
# ip route add blackhole default metric 9999 table 10
# ip route add blackhole default metric 9999 table 20
# ip rule add iif vlan457 table 10
# ip rule add iif vlan458 table 20
To reduce the impact on performance when many interface-specific rules are used, interfaces can be attached to VRF instances and a single rule can be used to select the appropriate table:
# ip link add vrf-A type vrf table 10
# ip link set dev vrf-A up
# ip link add vrf-B type vrf table 20
# ip link set dev vrf-B up
# ip link set dev vlan457 master vrf-A
# ip link set dev vlan458 master vrf-B
# ip rule show
0:      from all lookup local
1000:   from all lookup [l3mdev-table]
32766:  from all lookup main
32767:  from all lookup default
The special l3mdev-table rule was automatically added when configuring the first VRF interface. This rule will select the routing table associated to the VRF owning the input (or output) interface. VRF was introduced in Linux 4.3 (commit 193125dbd8eb), the performance was greatly enhanced in Linux 4.8 (commit 7889681f4a6c) and the special routing rule was also introduced in Linux 4.8 (commit 96c63fa7393d, commit 1aa6c4f6b8cd). You can find more details about it in the kernel documentation.

Conclusion The takeaways from this article are:
  • route lookup times hardly increase with the number of routes,
  • densily packed /32 routes lead to amazingly fast route lookups,
  • memory use is low (128 MiB par million routes),
  • no optimization is done on routing rules.

  1. The routing cache was subject to reasonably easy to launch denial of service attacks. It was also believed to not be efficient for high volume sites like Google but I have first-hand experience it was not the case for moderately high volume sites.
  2. IP-address lookup using LC-tries , IEEE Journal on Selected Areas in Communications, 17(6):1083-1092, June 1999.
  3. For internal nodes, the key_vector structure is embedded into a tnode structure. This structure contains information rarely used during lookup, notably the reference to the parent that is usually not needed for backtracking as Linux keeps the nearest candidate in a variable.
  4. One leaf can contain several routes (struct fib_alias is a list). The number of prefixes can therefore be greater than the number of leaves. The system also keeps statistics about the distribution of the internal nodes relative to the number of bits they handle. In our example, all the three internal nodes are handling 2 bits.
  5. The measurements are done in a virtual machine with one vCPU. The host is an Intel Core i5-4670K running at 3.7 GHz during the experiment (CPU governor was set to performance). The benchmark is single-threaded. It runs a warm-up phase, then executes about 100,000 timed iterations and keeps the median. Timings of individual runs are computed from the TSC.
  6. Fun fact: the documentation of this first tentative of more flexible routing is still available in today s kernel tree and explains the usage of the default class .

27 May 2017

Russ Allbery: On time management

Last December, the Guardian published a long essay by Oliver Burkeman entitled "Why time management is ruining our lives". Those who follow my book reviews know I read a lot of time management books, so of course I couldn't resist this. And, possibly surprisingly, not to disagree with it. It's an excellent essay, and well worth your time. Burkeman starts by talking about Inbox Zero:
If all this fervour seems extreme Inbox Zero was just a set of technical instructions for handling email, after all this was because email had become far more than a technical problem. It functioned as a kind of infinite to-do list, to which anyone on the planet could add anything at will.
This is, as Burkeman develops in the essay, an important critique of time management techniques in general, not just Inbox Zero: perhaps you can become moderately more efficient, but what are you becoming more efficient at doing, and why does it matter? If there were a finite amount of things that you had to accomplish, with leisure the reward at the end of the fixed task list, doing those things more efficiently makes perfect sense. But this is not the case in most modern life. Instead, we live in a world governed by Parkinson's Law: "Work expands to fill the time available for its completion." Worse, we live in a world where the typical employer takes Parkinson's Law, not as a statement on the nature of ever-expanding to-do lists, but a challenge to compress the time made available for a task to try to force the work to happen faster. Burkeman goes farther into the politics, pointing out that a cui bono analysis of time management suggests that we're all being played by capitalist employers. I wholeheartedly agree, but that's worth a separate discussion; for those who want to explore that angle, David Graeber's Debt and John Kenneth Galbraith's The Affluent Society are worth your time. What I want to write about here is why I still read (and recommend) time management literature, and how my thinking on it has changed. I started in the same place that most people probably do: I had a bunch of work to juggle, I felt I was making insufficient forward progress on it, and I felt my day contained a lot of slack that could be put to better use. The alluring promise of time management is that these problems can be resolved with more organization and some focus techniques. And there is a huge surge of energy that comes with adopting a new system and watching it work, since the good ones build psychological payoff into the tracking mechanism. Starting a new time management system is fun! Finishing things is fun! I then ran into the same problem that I think most people do: after that initial surge of enthusiasm, I had lists, systems, techniques, data on where my time was going, and a far more organized intake process. But I didn't feel more comfortable with how I was spending my time, I didn't have more leisure time, and I didn't feel happier. Often the opposite: time management systems will often force you to notice all the things you want to do and how slow your progress is towards accomplishing any of them. This is my fundamental disagreement with Getting Things Done (GTD): David Allen firmly believes that the act of recording everything that is nagging at you to be done relieves the brain of draining background processing loops and frees you to be more productive. He argues for this quite persuasively; as you can see from my review, I liked his book a great deal, and used his system for some time. But, at least for me, this does not work. Instead, having a complete list of goals towards which I am making slow or no progress is profoundly discouraging and depressing. The process of maintaining and dwelling on that list while watching it constantly grow was awful, quite a bit worse psychologically than having no time management system at all. Mark Forster is the time management author who speaks the best to me, and one of the points he makes is that time management is the wrong framing. You're not going to somehow generate more time, and you're usually not managing minutes and seconds. A better framing is task management, or commitment management: the goal of the system is to manage what you mentally commit to accomplishing, usually by restricting that list to something far shorter than you would come up with otherwise. How, in other words, to limit your focus to a small enough set of goals that you can make meaningful progress instead of thrashing. That, for me, is now the merit and appeal of time (or task) management systems: how do I sort through all the incoming noise, distractions, requests, desires, and compelling ideas that life throws at me and figure out which of them are worth investing time in? I also benefit from structuring that process for my peculiar psychology, in which backlogs I have to look at regularly are actively dangerous for my mental well-being. Left unchecked, I can turn even the most enjoyable hobby into an obligation and then into a source of guilt for not meeting the (entirely artificial) terms of the obligation I created, without even intending to. And here I think it has a purpose, but it's not the purpose that the time management industry is selling. If you think of time management as a way to get more things done and get more out of each moment, you're going to be disappointed (and you're probably also being taken advantage of by the people who benefit from unsustainable effort without real, unstructured leisure time). I practice Inbox Zero, but the point wasn't to be more efficient at processing my email. The point was to avoid the (for me) psychologically damaging backlog of messages while acting on the knowledge that 99% of email should go immediately into the trash with no further action. Email is an endless incoming stream of potential obligations or requests for my time (even just to read a longer message) that I should normallly reject. I also take the time to notice patterns of email that I never care about and then shut off the source or write filters to delete that email for me. I can then reserve my email time for moments of human connection, directly relevant information, or very interesting projects, and spend the time on those messages without guilt (or at least much less guilt) about ignoring everything else. Prioritization is extremely difficult, particularly once you realize that true prioritization is not about first and later, but about soon or never. The point of prioritization is not to choose what to do first, it's to choose the 5% of things that you going to do at all, convince yourself to be mentally okay with never doing the other 95% (and not lying to yourself about how there will be some future point when you'll magically have more time), and vigorously defend your focus and effort for that 5%. And, hopefully, wholeheartedly enjoy working on those things, without guilt or nagging that there's something else you should be doing instead. I still fail at this all the time. But I'm better than I used to be. For me, that mental shift was by far the hardest part. But once you've made that shift, I do think the time management world has a lot of tools and techniques to help you make more informed choices about the 5%, and to help you overcome procrastination and loss of focus on your real goals. Those real goals should include true unstructured leisure and "because I want to" projects. And hopefully, if you're in a financial position to do it, include working less on what other people want you to do and more on the things that delight you. Or at least making a well-informed strategic choice (for the sake of money or some other concrete and constantly re-evaluated reason) to sacrifice your personal goals for some temporary external ones.

24 May 2017

Steve Kemp: Getting ready for Stretch

I run about 17 servers. Of those about six are very personal and the rest are a small cluster which are used for a single website. (Partly because the code is old and in some ways a bit badly designed, partly because "clustering!", "high availability!", "learning!", "fun!" - seriously I had a lot of fun putting together a fault-tolerant deployment with haproxy, ucarp, etc, etc. If I were paying for it the site would be both retired and static!) I've started the process of upgrading to stretch by picking a bunch of hosts that do things I could live without for a few days - in case there were big problems, or I needed to restore from backups. So far I've upgraded: All upgrades were painless, with only one real surprise - the attic-backup software was removed from Debian. Although I do intend to retry using Larss' excellent obnum in the near future pragmatically I wanted to stick with what I'm familiar with. Borg backup is a fork of attic I've been aware of for a long time, but I never quite had a reason to try it out. Setting it up pretty much just meant editing my backup-script:
s/attic/borg/g
Once I did that, and created some new destinations all was good:
borg@rsync.io ~ $ borg init /backups/git.steve.org.uk.borg/
borg@rsync.io ~ $ borg init /backups/master.steve.org.uk.borg/
borg@rsync.io ~ $ ..
Upgrading other hosts, for example my website(s), and my email-box, will be more complex and fiddly. On that basis they will definitely wait for the formal stretch release. But having a couple of hosts running the frozen distribution is good for testing, and to let me see what is new.

22 May 2017

Gunnar Wolf: Open Source Symposium 2017

I travelled (for three days only!) to Argentina, to be a part of the Open Source Symposium 2017, a co-located event of the International Conference on Software Engineering.

This is, all in all, an interesting although small conference We are around 30 people in the room. This is a quite unusual conference for me, as this is among the first "formal" academic conference I am part of. Sessions have so far been quite interesting.
What am I linking to from this image? Of course, the proceedings! They managed to publish the proceedings via the "formal" academic channels (a nice hard-cover Springer volume) under an Open Access license (which is sadly not usual, and is unbelievably expensive). So, you can download the full proceedings, or article by article, in EPUB or in PDF...
...Which is very very nice :)
Previous editions of this symposium have also their respective proceedings available, but AFAICT they have not been downloadable.
So, get the book; it provides very interesant and original insights into our community seen from several quite novel angles!
AttachmentSize
oss2017_cover.png84.47 KB

3 May 2017

Vincent Bernat: VXLAN: BGP EVPN with Cumulus Quagga

VXLAN is an overlay network to encapsulate Ethernet traffic over an existing (highly available and scalable, possibly the Internet) IP network while accomodating a very large number of tenants. It is defined in RFC 7348. For an uncut introduction on its use with Linux, have a look at my VXLAN & Linux post. VXLAN deployment In the above example, we have hypervisors hosting a virtual machines from different tenants. Each virtual machine is given access to a tenant-specific virtual Ethernet segment. Users are expecting classic Ethernet segments: no MAC restrictions1, total control over the IP addressing scheme they use and availability of multicast. In a large VXLAN deployment, two aspects need attention:
  1. discovery of other endpoints (VTEPs) sharing the same VXLAN segments, and
  2. avoidance of BUM frames (broadcast, unknown unicast and multicast) as they have to be forwarded to all VTEPs.
A typical solution for the first point is using multicast. For the second point, this is source-address learning.

Introduction to BGP EVPN BGP EVPN (RFC 7432 and draft-ietf-bess-evpn-overlay for its application to VXLAN) is a standard control protocol to efficiently solves those two aspects without relying on multicast nor source-address learning. BGP EVPN relies on BGP (RFC 4271) and its MP-BGP extensions (RFC 4760). BGP is the routing protocol powering the Internet. It is highly scalable and interoperable. It is also extensible and one of its extension is MP-BGP. This extension can carry reachability information (NLRI) for multiple protocols (IPv4, IPv6, L3VPN and in our case EVPN). EVPN is a special family to advertise MAC addresses and the remote equipments they are attached to. There are basically two kinds of reachability information a VTEP sends through BGP EVPN:
  1. the VNIs they have interest in (type 3 routes), and
  2. for each VNI, the local MAC addresses (type 2 routes).
The protocol also covers other aspects of virtual Ethernet segments (L3 reachability information from ARP/ND caches, MAC mobility and multi-homing2) but we won t describe them here. To deploy BGP EVPN, a typical solution is to use several route reflectors (both for redundancy and scalability), like in the picture below. Each VTEP opens a BGP session to at least two route reflectors, sends its information (MACs and VNIs) and receives others . This reduces the number of BGP sessions to configure. VXLAN deployment with route reflectors Compared to other solutions to deploy VXLAN, BGP EVPN has three main advantages:
  • interoperability with other vendors (notably Juniper and Cisco),
  • proven scalability (a typical BGP routers handle several millions of routes), and
  • possibility to enforce fine-grained policies.
On Linux, Cumulus Quagga is a fairly complete implementation of BGP EVPN (type 3 routes for VTEP discovery, type 2 routes with MAC or IP addresses, MAC mobility when a host changes from one VTEP to another one) which requires very little configuration. This is a fork of Quagga and currently used in Cumulus Linux, a network operating system based on Debian powering switches from various brands. At some point, BGP EVPN support will be contributed back to FRR, a community-maintained fork of Quagga3. It should be noted the BGP EVPN implementation of Cumulus Quagga currently only supports IPv4.

Route reflector setup Before configuring each VTEP, we need to configure two or more route reflectors. There are many solutions. I will present three of them:
  • using Cumulus Quagga,
  • using GoBGP, an implementation of BGP in Go,
  • using Juniper JunOS.
For reliability purpose, it s possible (and easy) to use one implementation for some route reflectors and another implementation for the other ones. The proposed configurations are quite minimal. However, it is possible to centralize policies on the route reflectors (e.g. routes tagged with some community can only be readvertised to some group of VTEPs).

Using Quagga The configuration is pretty simple. We suppose the configured route reflector has 203.0.113.254 configured as a loopback IP.
router bgp 65000
  bgp router-id 203.0.113.254
  bgp cluster-id 203.0.113.254
  bgp log-neighbor-changes
  no bgp default ipv4-unicast
  neighbor fabric peer-group
  neighbor fabric remote-as 65000
  neighbor fabric capability extended-nexthop
  neighbor fabric update-source 203.0.113.254
  bgp listen range 203.0.113.0/24 peer-group fabric
  !
  address-family evpn
   neighbor fabric activate
   neighbor fabric route-reflector-client
  exit-address-family
  !
  exit
!
A peer group fabric is defined and we leverage the dynamic neighbor feature of Cumulus Quagga: we don t have to explicitely define each neighbor. Any client from 203.0.113.0/24 and presenting itself as part of AS 65000 can connect. All sent EVPN routes will be accepted and reflected to the other clients. You don t need to run Zebra, the route engine talking with the kernel. Instead, start bgpd with the --no_kernel flag.

Using GoBGP GoBGP is a clean implementation of BGP in Go4. It exposes an RPC API for configuration (but accepts a configuration file and comes with a command-line client). It doesn t support dynamic neighbors, so you ll have to use the API, the command-line client or some templating language to automate their declaration. A configuration with only one neighbor is like this:
global:
  config:
    as: 65000
    router-id: 203.0.113.254
    local-address-list:
      - 203.0.113.254
neighbors:
  - config:
      neighbor-address: 203.0.113.1
      peer-as: 65000
    afi-safis:
      - config:
          afi-safi-name: l2vpn-evpn
    route-reflector:
      config:
        route-reflector-client: true
        route-reflector-cluster-id: 203.0.113.254
More neighbors can be added from the command line:
$ gobgp neighbor add 203.0.113.2 as 65000 \
>         route-reflector-client 203.0.113.254 \
>         --address-family evpn
GoBGP won t try to interact with the kernel which is fine as a route reflector.

Using Juniper JunOS A variety of Juniper products can be a BGP route reflector, notably: The main factor is the CPU and the memory. The QFX5100 is low on memory and won t support large deployments without some additional policing. Here is a configuration similar to the Quagga one:
interfaces  
    lo0  
        unit 0  
            family inet  
                address 203.0.113.254/32;
             
         
     
 
protocols  
    bgp  
        group fabric  
            family evpn  
                signaling  
                    /* Do not try to install EVPN routes */
                    no-install;
                 
             
            type internal;
            cluster 203.0.113.254;
            local-address 203.0.113.254;
            allow 203.0.113.0/24;
         
     
 
routing-options  
    router-id 203.0.113.254;
    autonomous-system 65000;
 

VTEP setup The next step is to configure each VTEP/hypervisor. Each VXLAN is locally configured using a bridge for local virtual interfaces, like illustrated in the below schema. The bridge is taking care of the local MAC addresses (notably, using source-address learning) and the VXLAN interface takes care of the remote MAC addresses (received with BGP EVPN). Bridged VXLAN device VXLANs can be provisioned with the following script. Source-address learning is disabled as we will rely solely on BGP EVPN to synchronize FDBs between the hypervisors.
for vni in 100 200; do
    # Create VXLAN interface
    ip link add vxlan$ vni  type vxlan
        id $ vni  \
        dstport 4789 \
        local 203.0.113.2 \
        nolearning
    # Create companion bridge
    brctl addbr br$ vni 
    brctl addif br$ vni  vxlan$ vni 
    brctl stp br$ vni  off
    ip link set up dev br$ vni 
    ip link set up dev vxlan$ vni 
done
# Attach each VM to the appropriate segment
brctl addif br100 vnet10
brctl addif br100 vnet11
brctl addif br200 vnet12
The configuration of Cumulus Quagga is similar to the one used for a route reflector, except we use the advertise-all-vni directive to publish all local VNIs.
router bgp 65000
  bgp router-id 203.0.113.2
  no bgp default ipv4-unicast
  neighbor fabric peer-group
  neighbor fabric remote-as 65000
  neighbor fabric capability extended-nexthop
  neighbor fabric update-source dummy0
  ! BGP sessions with route reflectors
  neighbor 203.0.113.253 peer-group fabric
  neighbor 203.0.113.254 peer-group fabric
  !
  address-family evpn
   neighbor fabric activate
   advertise-all-vni
  exit-address-family
  !
  exit
!
If everything works as expected, the instances sharing the same VNI should be able to ping each other. If IPv6 is enabled on the VMs, the ping command shows if everything is in order:
$ ping -c10 -w1 -t1 ff02::1%eth0
PING ff02::1%eth0(ff02::1%eth0) 56 data bytes
64 bytes from fe80::5254:33ff:fe00:8%eth0: icmp_seq=1 ttl=64 time=0.016 ms
64 bytes from fe80::5254:33ff:fe00:b%eth0: icmp_seq=1 ttl=64 time=4.98 ms (DUP!)
64 bytes from fe80::5254:33ff:fe00:9%eth0: icmp_seq=1 ttl=64 time=4.99 ms (DUP!)
64 bytes from fe80::5254:33ff:fe00:a%eth0: icmp_seq=1 ttl=64 time=4.99 ms (DUP!)
--- ff02::1%eth0 ping statistics ---
1 packets transmitted, 1 received, +3 duplicates, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.016/3.745/4.991/2.152 ms

Verification Step by step, let s check how everything comes together.

Getting VXLAN information from the kernel On each VTEP, Quagga should be able to retrieve the information about configured VXLANs. This can be checked with vtysh:
# show interface vxlan100
Interface vxlan100 is up, line protocol is up
  Link ups:       1    last: 2017/04/29 20:01:33.43
  Link downs:     0    last: (never)
  PTM status: disabled
  vrf: Default-IP-Routing-Table
  index 11 metric 0 mtu 1500
  flags: <UP,BROADCAST,RUNNING,MULTICAST>
  Type: Ethernet
  HWaddr: 62:42:7a:86:44:01
  inet6 fe80::6042:7aff:fe86:4401/64
  Interface Type Vxlan
  VxLAN Id 100
  Access VLAN Id 1
  Master (bridge) ifindex 9 ifp 0x56536e3f3470
The important points are:
  • the VNI is 100, and
  • the bridge device was correctly detected.
Quagga should also be able to retrieve information about the local MAC addresses :
# show evpn mac vni 100
Number of MACs (local and remote) known for this VNI: 2
MAC               Type   Intf/Remote VTEP      VLAN
50:54:33:00:00:0a local  eth1.100
50:54:33:00:00:0b local  eth2.100

BGP sessions Each VTEP has to establish a BGP session to the route reflectors. On the VTEP, this can be checked by running vtysh:
# show bgp neighbors 203.0.113.254
BGP neighbor is 203.0.113.254, remote AS 65000, local AS 65000, internal link
 Member of peer-group fabric for session parameters
  BGP version 4, remote router ID 203.0.113.254
  BGP state = Established, up for 00:00:45
  Neighbor capabilities:
    4 Byte AS: advertised and received
    AddPath:
      L2VPN EVPN: RX advertised L2VPN EVPN
    Route refresh: advertised and received(new)
    Address family L2VPN EVPN: advertised and received
    Hostname Capability: advertised
    Graceful Restart Capabilty: advertised
[...]
 For address family: L2VPN EVPN
  fabric peer-group member
  Update group 1, subgroup 1
  Packet Queue length 0
  Community attribute sent to this neighbor(both)
  8 accepted prefixes

  Connections established 1; dropped 0
  Last reset never
Local host: 203.0.113.2, Local port: 37603
Foreign host: 203.0.113.254, Foreign port: 179
The output includes the following information:
  • the BGP state is Established,
  • the address family L2VPN EVPN is correctly advertised, and
  • 8 routes are received from this route reflector.
The state of the BGP sessions can also be checked from the route reflectors. With GoBGP, use the following command:
# gobgp neighbor 203.0.113.2
BGP neighbor is 203.0.113.2, remote AS 65000, route-reflector-client
  BGP version 4, remote router ID 203.0.113.2
  BGP state = established, up for 00:04:30
  BGP OutQ = 0, Flops = 0
  Hold time is 9, keepalive interval is 3 seconds
  Configured hold time is 90, keepalive interval is 30 seconds
  Neighbor capabilities:
    multiprotocol:
        l2vpn-evpn:     advertised and received
    route-refresh:      advertised and received
    graceful-restart:   received
    4-octet-as: advertised and received
    add-path:   received
    UnknownCapability(73):      received
    cisco-route-refresh:        received
[...]
  Route statistics:
    Advertised:             8
    Received:               5
    Accepted:               5
With JunOS, use the below command:
> show bgp neighbor 203.0.113.2
Peer: 203.0.113.2+38089 AS 65000 Local: 203.0.113.254+179 AS 65000
  Group: fabric                Routing-Instance: master
  Forwarding routing-instance: master
  Type: Internal    State: Established
  Last State: OpenConfirm   Last Event: RecvKeepAlive
  Last Error: None
  Options: <Preference LocalAddress Cluster AddressFamily Rib-group Refresh>
  Address families configured: evpn
  Local Address: 203.0.113.254 Holdtime: 90 Preference: 170
  NLRI evpn: NoInstallForwarding
  Number of flaps: 0
  Peer ID: 203.0.113.2     Local ID: 203.0.113.254     Active Holdtime: 9
  Keepalive Interval: 3          Group index: 0    Peer index: 2
  I/O Session Thread: bgpio-0 State: Enabled
  BFD: disabled, down
  NLRI for restart configured on peer: evpn
  NLRI advertised by peer: evpn
  NLRI for this session: evpn
  Peer supports Refresh capability (2)
  Stale routes from peer are kept for: 300
  Peer does not support Restarter functionality
  NLRI that restart is negotiated for: evpn
  NLRI of received end-of-rib markers: evpn
  NLRI of all end-of-rib markers sent: evpn
  Peer does not support LLGR Restarter or Receiver functionality
  Peer supports 4 byte AS extension (peer-as 65000)
  NLRI's for which peer can receive multiple paths: evpn
  Table bgp.evpn.0 Bit: 20000
    RIB State: BGP restart is complete
    RIB State: VPN restart is complete
    Send state: in sync
    Active prefixes:              5
    Received prefixes:            5
    Accepted prefixes:            5
    Suppressed due to damping:    0
    Advertised prefixes:          8
  Last traffic (seconds): Received 276  Sent 170  Checked 276
  Input messages:  Total 61     Updates 3       Refreshes 0     Octets 1470
  Output messages: Total 62     Updates 4       Refreshes 0     Octets 1775
  Output Queue[1]: 0            (bgp.evpn.0, evpn)
If a BGP session cannot be established, the logs of each BGP daemon should mention the cause.

Sent routes From each VTEP, Quagga needs to send:
  • one type 3 route for each local VNI, and
  • one type 2 route for each local MAC address.
The best place to check the received routes is on one of the route reflectors. If you are using JunOS, the following command will display the received routes from the provided VTEP:
> show route table bgp.evpn.0 receive-protocol bgp 203.0.113.2
bgp.evpn.0: 10 destinations, 10 routes (10 active, 0 holddown, 0 hidden)
  Prefix                  Nexthop              MED     Lclpref    AS path
  2:203.0.113.2:100::0::50:54:33:00:00:0a/304 MAC/IP
*                         203.0.113.2                  100        I
  2:203.0.113.2:100::0::50:54:33:00:00:0b/304 MAC/IP
*                         203.0.113.2                  100        I
  3:203.0.113.2:100::0::203.0.113.2/304 IM
*                         203.0.113.2                  100        I
  3:203.0.113.2:200::0::203.0.113.2/304 IM
*                         203.0.113.2                  100        I
There is one type 3 route for VNI 100 and another one for VNI 200. There are also two type 2 routes for two MAC addresses on VNI 100. To get more information, you can add the keyword extensive. Here is a type 3 route advertising 203.0.113.2 as a VTEP for VNI 1008:
> show route table bgp.evpn.0 receive-protocol bgp 203.0.113.2 extensive
bgp.evpn.0: 11 destinations, 11 routes (11 active, 0 holddown, 0 hidden)
* 3:203.0.113.2:100::0::203.0.113.2/304 IM (1 entry, 1 announced)
     Accepted
     Route Distinguisher: 203.0.113.2:100
     Nexthop: 203.0.113.2
     Localpref: 100
     AS path: I
     Communities: target:65000:268435556 encapsulation:vxlan(0x8)
[...]
Here is a type 2 route announcing the location of the 50:54:33:00:00:0a MAC address for VNI 100:
> show route table bgp.evpn.0 receive-protocol bgp 203.0.113.2 extensive
bgp.evpn.0: 11 destinations, 11 routes (11 active, 0 holddown, 0 hidden)
* 2:203.0.113.2:100::0::50:54:33:00:00:0a/304 MAC/IP (1 entry, 1 announced)
     Accepted
     Route Distinguisher: 203.0.113.2:100
     Route Label: 100
     ESI: 00:00:00:00:00:00:00:00:00:00
     Nexthop: 203.0.113.2
     Localpref: 100
     AS path: I
     Communities: target:65000:268435556 encapsulation:vxlan(0x8)
[...]
With Quagga, you can get a similar output with vtysh:
# show bgp evpn route
BGP table version is 0, local router ID is 203.0.113.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-2 prefix: [2]:[ESI]:[EthTag]:[MAClen]:[MAC]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
   Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 203.0.113.2:100
*>i[2]:[0]:[0]:[48]:[50:54:33:00:00:0a]
                    203.0.113.2                   100      0 i
*>i[2]:[0]:[0]:[48]:[50:54:33:00:00:0b]
                    203.0.113.2                   100      0 i
*>i[3]:[0]:[32]:[203.0.113.2]
                    203.0.113.2                   100      0 i
Route Distinguisher: 203.0.113.2:200
*>i[3]:[0]:[32]:[203.0.113.2]
                    203.0.113.2                   100      0 i
[...]
With GoBGP, use the following command:
# gobgp global rib -a evpn   grep rd:203.0.113.2:200
    Network  Next Hop             AS_PATH              Age        Attrs
*>  [type:macadv][rd:203.0.113.2:100][esi:single-homed][etag:0][mac:50:54:33:00:00:0a][ip:<nil>][labels:[100]]203.0.113.2                               00:00:17   [ Origin: i   LocalPref: 100   Extcomms: [VXLAN], [65000:268435556] ]
*>  [type:macadv][rd:203.0.113.2:100][esi:single-homed][etag:0][mac:50:54:33:00:00:0b][ip:<nil>][labels:[100]]203.0.113.2                               00:00:17   [ Origin: i   LocalPref: 100   Extcomms: [VXLAN], [65000:268435556] ]
*>  [type:macadv][rd:203.0.113.2:200][esi:single-homed][etag:0][mac:50:54:33:00:00:0a][ip:<nil>][labels:[200]]203.0.113.2                               00:00:17   [ Origin: i   LocalPref: 100   Extcomms: [VXLAN], [65000:268435656] ]
*>  [type:multicast][rd:203.0.113.2:100][etag:0][ip:203.0.113.2]203.0.113.2                               00:00:17   [ Origin: i   LocalPref: 100   Extcomms: [VXLAN], [65000:268435556] ]
*>  [type:multicast][rd:203.0.113.2:200][etag:0][ip:203.0.113.2]203.0.113.2                               00:00:17   [ Origin: i   LocalPref: 100   Extcomms: [VXLAN], [65000:268435656] ]

Received routes Each VTEP should have received the type 2 and type 3 routes from its fellow VTEPs, through the route reflectors. You can check with the show bgp evpn route command of vtysh. Does Quagga correctly understand the received routes? The type 3 routes are translated to an assocation between the remote VTEPs and the VNIs:
# show evpn vni
Number of VNIs: 2
VNI        VxLAN IF              VTEP IP         # MACs   # ARPs   Remote VTEPs
100        vxlan100              203.0.113.2     4        0        203.0.113.3
                                                                   203.0.113.1
200        vxlan200              203.0.113.2     3        0        203.0.113.3
                                                                   203.0.113.1
The type 2 routes are translated to an association between the remote MACs and the remote VTEPs:
# show evpn mac vni 100
Number of MACs (local and remote) known for this VNI: 4
MAC               Type   Intf/Remote VTEP      VLAN
50:54:33:00:00:09 remote 203.0.113.1
50:54:33:00:00:0a local  eth1.100
50:54:33:00:00:0b local  eth2.100
50:54:33:00:00:0c remote 203.0.113.3

FDB configuration The last step is to ensure Quagga has correctly provided the received information to the kernel. This can be checked with the bridge command:
# bridge fdb show dev vxlan100   grep dst
00:00:00:00:00:00 dst 203.0.113.1 self permanent
00:00:00:00:00:00 dst 203.0.113.3 self permanent
50:54:33:00:00:0c dst 203.0.113.3 self
50:54:33:00:00:09 dst 203.0.113.1 self
All good! The two first lines are the translation of the type 3 routes (any BUM frame will be sent to both 203.0.113.1 and 203.0.113.3) and the two last ones are the translation of the type 2 routes.

Interoperability One of the strength of BGP EVPN is the interoperability with other network vendors. To demonstrate it works as expected, we will configure a Juniper vMX to act as a VTEP. First, we need to configure the physical bridge9. This is similar to the use of ip link and brctl with Linux. We only configure one physical interface with two old-school VLANs paired with matching VNIs.
interfaces  
    ge-0/0/1  
        unit 0  
            family bridge  
                interface-mode trunk;
                vlan-id-list [ 100 200 ];
             
         
     
 
routing-instances  
    switch  
        instance-type virtual-switch;
        interface ge-0/0/1.0;
        bridge-domains  
            vlan100  
                domain-type bridge;
                vlan-id 100;
                vxlan  
                    vni 100;
                    ingress-node-replication;
                 
             
            vlan200  
                domain-type bridge;
                vlan-id 200;
                vxlan  
                    vni 200;
                    ingress-node-replication;
                 
             
         
     
 
Then, we configure BGP EVPN to advertise all known VNIs. The configuration is quite similar to the one we did with Quagga:
protocols  
    bgp  
        group fabric  
            type internal;
            multihop;
            family evpn signaling;
            local-address 203.0.113.3;
            neighbor 203.0.113.253;
            neighbor 203.0.113.254;
         
     
 
routing-instances  
    switch  
        vtep-source-interface lo0.0;
        route-distinguisher 203.0.113.3:1; #  
        vrf-import EVPN-VRF-VXLAN;
        vrf-target  
            target:65000:1;
            auto;
         
        protocols  
            evpn  
                encapsulation vxlan;
                extended-vni-list all;
                multicast-mode ingress-replication;
             
         
     
 
routing-options  
    router-id 203.0.113.3;
    autonomous-system 65000;
 
policy-options  
    policy-statement EVPN-VRF-VXLAN  
        then accept;
     
 
We also need a small compatibility patch for Cumulus Quagga10. The routes sent by this configuration are very similar to the routes sent by Quagga. The main differences are:
  • on JunOS, the route distinguisher is configured statically (in ), and
  • on JunOS, the VNI is also encoded as an Ethernet tag ID.
Here is a type 3 route, as sent by JunOS:
> show route table bgp.evpn.0 receive-protocol bgp 203.0.113.3 extensive
bgp.evpn.0: 13 destinations, 13 routes (13 active, 0 holddown, 0 hidden)
* 3:203.0.113.3:1::100::203.0.113.3/304 IM (1 entry, 1 announced)
     Accepted
     Route Distinguisher: 203.0.113.3:1
     Nexthop: 203.0.113.3
     Localpref: 100
     AS path: I
     Communities: target:65000:268435556 encapsulation:vxlan(0x8)
     PMSI: Flags 0x0: Label 6: Type INGRESS-REPLICATION 203.0.113.3
[...]
Here is a type 2 route:
> show route table bgp.evpn.0 receive-protocol bgp 203.0.113.3 extensive
bgp.evpn.0: 13 destinations, 13 routes (13 active, 0 holddown, 0 hidden)
* 2:203.0.113.3:1::200::50:54:33:00:00:0f/304 MAC/IP (1 entry, 1 announced)
     Accepted
     Route Distinguisher: 203.0.113.3:1
     Route Label: 200
     ESI: 00:00:00:00:00:00:00:00:00:00
     Nexthop: 203.0.113.3
     Localpref: 100
     AS path: I
     Communities: target:65000:268435656 encapsulation:vxlan(0x8)
[...]
We can check that the vMX is able to make sense of the routes it receives from its peers running Quagga:
> show evpn database l2-domain-id 100
Instance: switch
VLAN  DomainId  MAC address        Active source                  Timestamp        IP address
     100        50:54:33:00:00:0c  203.0.113.1                    Apr 30 12:46:20
     100        50:54:33:00:00:0d  203.0.113.2                    Apr 30 12:32:42
     100        50:54:33:00:00:0e  203.0.113.2                    Apr 30 12:46:20
     100        50:54:33:00:00:0f  ge-0/0/1.0                     Apr 30 12:45:55
On the other end, if we look at one of the Quagga-based VTEP, we can check the received routes are correctly understood:
# show evpn vni 100
VNI: 100
 VxLAN interface: vxlan100 ifIndex: 9 VTEP IP: 203.0.113.1
 Remote VTEPs for this VNI:
  203.0.113.3
  203.0.113.2
 Number of MACs (local and remote) known for this VNI: 4
 Number of ARPs (IPv4 and IPv6, local and remote) known for this VNI: 0
# show evpn mac vni 100
Number of MACs (local and remote) known for this VNI: 4
MAC               Type   Intf/Remote VTEP      VLAN
50:54:33:00:00:0c local  eth1.100
50:54:33:00:00:0d remote 203.0.113.2
50:54:33:00:00:0e remote 203.0.113.2
50:54:33:00:00:0f remote 203.0.113.3
Get in touch if you have some success with other vendors!

  1. For example, they may use bridges to connect containers together.
  2. Such a feature can replace proprietary implementations of MC-LAG allowing several VTEPs to act as a endpoint for a single link aggregation group. This is not needed on our scenario where hypervisors act as VTEPs.
  3. The development of Quagga is slow and closed . New features are often stalled. FRR is placed under the umbrella of the Linux Foundation, has a GitHub-centered development model and an election process. It already has several interesting enhancements (notably, BGP add-path, BGP unnumbered, MPLS and LDP).
  4. I am unenthusiastic about projects whose the sole purpose is to rewrite something in Go. However, while being quite young, GoBGP is quite valuable on its own (good architecture, good performance).
  5. The 48-port version is around $10,000 with the BGP license.
  6. An empty chassis with a dual routing engine (RE-S-1800X4-16G) is around $30,000.
  7. I don t know how pricey the vRR is. For evaluation purposes, it can be downloaded for free if you are a customer.
  8. The value 100 used in the route distinguishier (203.0.113.2:100) is not the one used to encode the VNI. The VNI is encoded in the route target (65000:268435556), in the 24 least signifiant bits (268435556 & 0xffffff equals 100). As long as VNIs are unique, we don t have to understand those details.
  9. For some reason, the use of a virtual switch is mandatory. This is specific to this platform: a QFX doesn t require this.
  10. The encoding of the VNI into the route target is being standardized in draft-ietf-bess-evpn-overlay. Juniper already implements this draft.

Vincent Bernat: VXLAN & Linux

VXLAN is an overlay network to carry Ethernet traffic over an existing (highly available and scalable) IP network while accommodating a very large number of tenants. It is defined in RFC 7348. Starting from Linux 3.12, the VXLAN implementation is quite complete as both multicast and unicast are supported as well as IPv6 and IPv4. Let s explore the various methods to configure it. VXLAN setup To illustrate our examples, we use the following setup: A VXLAN tunnel extends the individual Ethernet segments accross the three bridges, providing a unique (virtual) Ethernet segment. From one host (e.g. H1), we can reach directly all the other hosts in the virtual segment:
$ ping -c10 -w1 -t1 ff02::1%eth0
PING ff02::1%eth0(ff02::1%eth0) 56 data bytes
64 bytes from fe80::5254:33ff:fe00:8%eth0: icmp_seq=1 ttl=64 time=0.016 ms
64 bytes from fe80::5254:33ff:fe00:b%eth0: icmp_seq=1 ttl=64 time=4.98 ms (DUP!)
64 bytes from fe80::5254:33ff:fe00:9%eth0: icmp_seq=1 ttl=64 time=4.99 ms (DUP!)
64 bytes from fe80::5254:33ff:fe00:a%eth0: icmp_seq=1 ttl=64 time=4.99 ms (DUP!)
--- ff02::1%eth0 ping statistics ---
1 packets transmitted, 1 received, +3 duplicates, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.016/3.745/4.991/2.152 ms

Basic usage The reference deployment for VXLAN is to use an IP multicast group to join the other VTEPs:
# ip -6 link add vxlan100 type vxlan \
>   id 100 \
>   dstport 4789 \
>   local 2001:db8:1::1 \
>   group ff05::100 \
>   dev eth0 \
>   ttl 5
# brctl addbr br100
# brctl addif br100 vxlan100
# brctl addif br100 vnet22
# brctl addif br100 vnet25
# brctl stp br100 off
# ip link set up dev br100
# ip link set up dev vxlan100
The above commands create a new interface acting as a VXLAN tunnel endpoint, named vxlan100 and put it in a bridge with some regular interfaces1. Each VXLAN segment is associated to a 24-bit segment ID, the VXLAN Network Identifier (VNI). In our example, the default VNI is specified with id 100. When VXLAN was first implemented in Linux 3.7, the UDP port to use was not defined. Several vendors were using 8472 and Linux took the same value. To avoid breaking existing deployments, this is still the default value. Therefore, if you want to use the IANA-assigned port, you need to explicitely set it with dstport 4789. As we want to use multicast, we have to specify a multicast group to join (group ff05::100), as well as a physical device (dev eth0). With multicast, the default TTL is 1. If your multicast network leverages some routing, you ll have to increase the value a bit, like here with ttl 5. The vxlan100 device acts as a bridge device with remote VTEPs as virtual ports:
  • it sends broadcast, unknown unicast and multicast (BUM) frames to all VTEPs using the multicast group, and
  • it discovers the association from Ethernet MAC addresses to VTEP IP addresses using source-address learning.
The following figure summarizes the configuration, with the FDB of the Linux bridge (learning local MAC addresses) and the FDB of the VXLAN device (learning distant MAC addresses): Bridged VXLAN device The FDB of the VXLAN device can be observed with the bridge command. If the destination MAC is present, the frame is sent to the associated VTEP (unicast). The all-zero address is only used when a lookup for the destination MAC fails.
# bridge fdb show dev vxlan100   grep dst
00:00:00:00:00:00 dst ff05::100 via eth0 self permanent
50:54:33:00:00:0b dst 2001:db8:3::1 self
50:54:33:00:00:08 dst 2001:db8:1::1 self
If you are interested to get more details on how to setup a multicast network and build VXLAN segments on top of it, see my Network virtualization with VXLAN article.

Without multicast Using VXLAN over a multicast IP network has several benefits:
  • automatic discovery of other VTEPs sharing the same multicast group,
  • good bandwidth usage (packets are replicated as late as possible),
  • decentralized and controller-less design2.
However, multicast is not available everywhere and managing it at scale can be difficult. In Linux 3.8, the DOVE extensions have been added to the VXLAN implementation, removing the dependency on multicast.

Unicast with static flooding We can replace multicast by head-end replication of BUM frames to a statically configured lists of remote VTEPs3:
# ip -6 link add vxlan100 type vxlan \
>   id 100 \
>   dstport 4789 \
>   local 2001:db8:1::1
# bridge fdb append 00:00:00:00:00:00 dev vxlan100 dst 2001:db8:2::1
# bridge fdb append 00:00:00:00:00:00 dev vxlan100 dst 2001:db8:3::1
The VXLAN is defined without a remote multicast group. Instead, all the remote VTEPs are associated with the all-zero address: a BUM frame will be duplicated to all those destinations. The VXLAN device will still learn remote addresses automatically using source-address learning. It is a very simple solution. With a bit of automation, you can keep the default FDB entries up-to-date easily. However, the host will have to duplicate each BUM frame (head-end replication) as many times as there are remote VTEPs. This is quite reasonable if you have a dozen of them. This may become out-of-hand if you have thousands of them. Cumulus vxfld daemon is an example of use of this strategy (in the head-end replication mode).

Unicast with static L2 entries When the associations of MAC addresses and VTEPs are known, it is possible to pre-populate the FDB and disable learning:
# ip -6 link add vxlan100 type vxlan \
>   id 100 \
>   dstport 4789 \
>   local 2001:db8:1::1 \
>   nolearning
# bridge fdb append 00:00:00:00:00:00 dev vxlan100 dst 2001:db8:2::1
# bridge fdb append 00:00:00:00:00:00 dev vxlan100 dst 2001:db8:3::1
# bridge fdb append 50:54:33:00:00:09 dev vxlan100 dst 2001:db8:2::1
# bridge fdb append 50:54:33:00:00:0a dev vxlan100 dst 2001:db8:2::1
# bridge fdb append 50:54:33:00:00:0b dev vxlan100 dst 2001:db8:3::1
Thanks to the nolearning flag, source-address learning is disabled. Therefore, if a MAC is missing, the frame will always be sent using the all-zero entries. The all-zero entries are still needed for broadcast and multicast traffic (e.g. ARP and IPv6 neighbor discovery). This kind of setup works well to provide virtual L2 networks to virtual machines (no L3 information available). You need some glue to update the FDB entries. BGP EVPN with Cumulus Quagga is an example of use of this strategy (see VXLAN: BGP EVPN with Cumulus Quagga for additional information).

Unicast with static L3 entries In the previous example, we had to keep the all-zero entries for ARP and IPv6 neighbor discovery to work correctly. However, Linux can answer to neighbor requests on behalf of the remote nodes4. When this feature is enabled, the default entries are not needed anymore (but you could keep them):
# ip -6 link add vxlan100 type vxlan \
>   id 100 \
>   dstport 4789 \
>   local 2001:db8:1::1 \
>   nolearning \
>   proxy
# ip -6 neigh add 2001:db8:ff::11 lladdr 50:54:33:00:00:09 dev vxlan100
# ip -6 neigh add 2001:db8:ff::12 lladdr 50:54:33:00:00:0a dev vxlan100
# ip -6 neigh add 2001:db8:ff::13 lladdr 50:54:33:00:00:0b dev vxlan100
# bridge fdb append 50:54:33:00:00:09 dev vxlan100 dst 2001:db8:2::1
# bridge fdb append 50:54:33:00:00:0a dev vxlan100 dst 2001:db8:2::1
# bridge fdb append 50:54:33:00:00:0b dev vxlan100 dst 2001:db8:3::1
This setup totally eliminates head-end replication. However, protocols relying on multicast won t work either. With some automation, this is a setup that should work well with containers: if there is a registry keeping a list of all IP and MAC addresses in use, a program could listen to it and adjust the FDB and the neighbor tables. The VXLAN backend of Docker s libnetwork is an example of use of this strategy (but it also uses the next method).

Unicast with dynamic L3 entries Linux can also notify a program an (L2 or L3) entry is missing. The program queries some central registry and dynamically adds the requested entry. However, for L2 entries, notifications are issued only if:
  • the destination MAC address is not known,
  • there is no all-zero entry in the FDB, and
  • the destination MAC address is not a multicast or broadcast one.
Those limitations prevent us to do a unicast with dynamic L2 entries scenario. First, let s create the VXLAN device with the l2miss and l3miss options5:
ip -6 link add vxlan100 type vxlan \
   id 100 \
   dstport 4789 \
   local 2001:db8:1::1 \
   nolearning \
   l2miss \
   l3miss \
   proxy
Notifications are sent to programs listening to an AF_NETLINK socket using the NETLINK_ROUTE protocol. This socket needs to be bound to the RTNLGRP_NEIGH group. The following is doing exactly that and decodes the received notifications:
# ip monitor neigh dev vxlan100
miss 2001:db8:ff::12 STALE
miss lladdr 50:54:33:00:00:0a STALE
The first notification is about a missing neighbor entry for the requested IP address. We can add it with the following command:
ip -6 neigh replace 2001:db8:ff::12 \
    lladdr 50:54:33:00:00:0a \
    dev vxlan100 \
    nud reachable
The entry is not permanent so that we don t need to delete it when it expires. If the address becomes stale, we will get another notification to refresh it. Once the host receives our proxy answer for the neighbor discovery request, it can send a frame with the MAC we gave as destination. The second notification is about the missing FDB entry for this MAC address. We add the appropriate entry with the following command6:
bridge fdb replace 50:54:33:00:00:0a \
    dst 2001:db8:2::1 \
    dev vxlan100 dynamic
The entry is not permanent either as it would prevent the MAC to migrate to the local VTEP (a dynamic entry cannot override a permanent entry). This setup works well with containers and a global registry. However, there is small latency penalty for the first connections. Moreover, multicast and broadcast won t be available in the underlay network. The VXLAN backend for flannel, a network fabric for Kubernetes, is an example of this strategy.

Decision There is no one-size-fits-all solution. You should consider the multicast solution if:
  • you are in an environment where multicast is available,
  • you are ready to operate (and scale) a multicast network,
  • you need multicast and broadcast inside the virtual segments,
  • you don t have L2/L3 addresses available beforehand.
The scalability of such a solution is pretty good if you take care of not putting all VXLAN interfaces into the same multicast group (e.g. use the last byte of the VNI as the last byte of the multicast group). When multicast is not available, another generic solution is BGP EVPN: BGP is used as a controller to ensure distribution of the list of VTEPs and their respective FDBs. As mentioned earlier, an implementation of this solution is Cumulus Quagga. I explore this option in a separate post: VXLAN: BGP EVPN with Cumulus Quagga. If you operate in a container-like environment where L2/L3 addresses are known beforehand, a solution using static and/or dynamic L2 and L3 entries based on a central registry and no source-address learning would also fit the bill. This provides a more security-tight solution (bound resources, MiTM attacks dampened down, inability to amplify bandwidth usage through excessive broadcast). Various environment-specific solutions are available7 or you can build your own.

Other considerations Independently of the chosen strategy, here are a few important points to keep in mind when implementing a VXLAN overlay.

Isolation While you may expect VXLAN interfaces to only carry L2 traffic, Linux doesn t disable IP processing. If the destination MAC is a local one, Linux will route or deliver the encapsulated IP packet. Check my post about the proper isolation of a Linux bridge.

Encryption VXLAN enforces isolation between tenants, but the traffic is totally unencrypted. The most direct solution to provide encryption is to use IPsec. Some container-based solutions may come with IPsec support out-of-the box (notably Docker s libnetwork, but flannel has plan for it too). This is quite important for a deployment over a public cloud.

Overhead The format of a VXLAN-encapsulated frame is the following: VXLAN encapsulation VXLAN adds a fixed overhead of 50 bytes. If you also use IPsec, the overhead depends on many factors. In transport mode, with AES and SHA256, the overhead is 56 bytes. With NAT traversal, this is 64 bytes (additional UDP header). In tunnel mode, this is 72 bytes. See Cisco IPsec Overhead Calculator Tool. Some users will expect to be able to use an Ethernet MTU of 1500 for the overlay network. Therefore, the underlay MTU should be increased. If it is not possible, ensure the inner MTU (inside the containers or the virtual machines) is correctly decreased8.

IPv6 While all the examples above are using IPv6, the ecosystem is not quite ready yet. The multicast L2-only strategy works fine with IPv6 but every other scenario currently needs some patches (1, 2, 3). On top of that, IPv6 may not have been implemented in VXLAN-related tools:

Multicast Linux VXLAN implementation doesn t support IGMP snooping. Multicast traffic will be broadcasted to all VTEPs unless multicast MAC addresses are inserted into the FDB.

  1. This is one possible implementation. The bridge is only needed if you require some form of source-address learning for local interfaces. Another strategy is to use MACVLAN interfaces.
  2. The underlay multicast network may still need some central components, like rendez-vous points for PIM-SM protocol. Fortunately, it s possible to make them highly available and scalable (e.g. with Anycast-RP, RFC 4610).
  3. For this example and the following ones, a patch is needed for the ip command (to be included in 4.11) to use IPv6 for transport. In the meantime, here is a quick workaround:
    # ip -6 link add vxlan100 type vxlan \
    >   id 100 \
    >   dstport 4789 \
    >   local 2001:db8:1::1 \
    >   remote 2001:db8:2::1
    # bridge fdb append 00:00:00:00:00:00 \
    >   dev vxlan100 dst 2001:db8:3::1
    
  4. You may have to apply an IPv6-related patch to the kernel (to be included in 4.12).
  5. You have to apply an IPv6-related patch to the kernel (to be included in 4.12) to get appropriate notifications for missing IPv6 addresses.
  6. Directly adding the entry after the first notification would have been smarter to avoid unnecessary retransmissions.
  7. flannel and Docker s libnetwork were already mentioned as they both feature a VXLAN backend. There are also some interesting experiments like BaGPipe BGP for Kubernetes which leverages BGP EVPN and is therefore interoperable with other vendors.
  8. There is no such thing as MTU discovery on an Ethernet segment.

23 April 2017

Shirish Agarwal: Debconf 17 in Montreal

Debconf Montreal 17 logo Before I start, debconf registration opened about a week back depending upon what time-frame you are in. So, if you have used and contributed to Debian or are curious to know and try out Debian, this would be right time to register for the conference. For those who are either financially weak (like yours truly) or those who are a minority in any way can avail those
sponsorships, more information about the sponsorship/travel bursary can be found here. Also make sure that you read, understand and accept the Debconf Code of conduct before trying to register. Even if Debconf does give the bursary, getting a visa for Canada might be tricky but then what would be life without any challenges. In the hierarchy of wants and needs, I would say attending debconf would come in the want category and not the needs category as needs are simply food, clothes and roof over one s head, thankfully all of which have been there for as long as I remember or know. To dream is necessary for life and being part of debconf would be part of that dream. While I have put up couple of talks, I have asked if somebody would be willing to tackle/share some basic things in Debconf 17 as well. Let s see if I m able to get any response to that. There was a recent Debian BSP in Montreal. While I don t know how many bugs were fixed, although UDD shows only 32 bugs remain to be fixed before Stretch could be released. Indian cricket team To put things into perspective, I had gone to a friend s place at Market Yard and came to know of a local cricket league which is being played in the city on the likes/basis of the IPL . The amount they will get is INR 88,000 (which is 1,361 USD, around the cost of ticket to and fro from Pune to Montreal and back as of date) and similar and 10 teams of 11-15 people are vying from it. The winning team will get that amount along with a trophy and signed certificates. Of course at the highest level i.e. the IPL, each night the players would probably earn a million or more for each play. One IPL season pay is what 90% of the people don t/can t earn in their whole lifetime. Montreal Metro image from Wikimedia Commons Apart from the conference, I have shared about the northern lights and the snacks on stand thing, there is one thing more to explore and that is the Montreal Metro . While I was re-reading the wiki entry today I was very much reminded about Pune Metro and Metro Samvaad that I attended couple of weeks before. That too as it was not held too far from my house, around 500 metres so could attend. The only hitch I see is the question mark on fate of 6,133 trees . I have shared one probable solution but then it has its own challenges . In the end, Metro is the need of the hour as vehicles in pune are reaching the 5 million mark while residents are around 3 million plus something depending on whether you chose to believe the 2011 census or the more recentish world population review . In either case/scenario I do not see any improvement unless consistent public transport is available. While Buses don t run on time, it is hoped that the metro would run on time. Some of the similarities between the two system, apart from the fact that from what I read Montreal Metro is mostly subway while Pune Metro would be mostly elevated. There is the idea/thought of having POI (Points of Interest) and free shows etc. in order to attract young and old to travel to destinations to and fro. There are a lot of theatre groups, stand-up comedians, rock and roll as well as classical musical concerts for which people like me would be willing to travel week-ends to attend such shows. Of course, this is all in the head, we will know the reality starting this May and commercial journeys if everything goes as planned in end 2018.
Filed under: Miscellenous Tagged: #Canadian Visa, #Debconf Montreal 2017, #Debconf registration, #Debian BSP, #Debian Stretch, #Environment, #Montreal Metro, #Montreal Subway, #planet-debian, #Points of Interest, #Pune Metro, #travel bursary, Debian, Economics

13 April 2017

Antoine Beaupr : New approaches to network fast paths

With the speed of network hardware now reaching 100 Gbps and distributed denial-of-service (DDoS) attacks going in the Tbps range, Linux kernel developers are scrambling to optimize key network paths in the kernel to keep up. Many efforts are actually geared toward getting traffic out of the costly Linux TCP stack. We have already covered the XDP (eXpress Data Path) patch set, but two new ideas surfaced during the Netconf and Netdev conferences held in Toronto and Montreal in early April 2017. One is a patch set called af_packet, which aims at extracting raw packets from the kernel as fast as possible; the other is the idea of implementing in-kernel layer-7 proxying. There are also user-space network stacks like Netmap, DPDK, or Snabb (which we previously covered). This article aims at clarifying what all those components do and to provide a short status update for the tools we have already covered. We will focus on in-kernel solutions for now. Indeed, user-space tools have a fundamental limitation: if they need to re-inject packets onto the network, they must again pay the expensive cost of crossing the kernel barrier. User-space performance is effectively bounded by that fundamental design. So we'll focus on kernel solutions here. We will start from the lowest part of the stack, the af_packet patch set, and work our way up the stack all the way up to layer-7 and in-kernel proxying.

af_packet v4 John Fastabend presented a new version of a patch set that was first published in January regarding the af_packet protocol family, which is currently used by tcpdump to extract packets from network interfaces. The goal of this change is to allow zero-copy transfers between user-space applications and the NIC (network interface card) transmit and receive ring buffers. Such optimizations are useful for telecommunications companies, which may use it for deep packet inspection or running exotic protocols in user space. Another use case is running a high-performance intrusion detection system that needs to watch large traffic streams in realtime to catch certain types of attacks. Fastabend presented his work during the Netdev network-performance workshop, but also brought the patch set up for discussion during Netconf. There, he said he could achieve line-rate extraction (and injection) of packets, with packet rates as high as 30Mpps. This performance gain is possible because user-space pages are directly DMA-mapped to the NIC, which is also a security concern. The other downside of this approach is that a complete pair of ring buffers needs to be dedicated for this purpose; whereas before packets were copied to user space, now they are memory-mapped, so the user-space side needs to process those packets quickly otherwise they are simply dropped. Furthermore, it's an "all or nothing" approach; while NIC-level classifiers could be used to steer part of the traffic to a specific queue, once traffic hits that queue, it is only accessible through the af_packet interface and not the rest of the regular stack. If done correctly, however, this could actually improve the way user-space stacks access those packets, providing projects like DPDK a safer way to share pages with the NIC, because it is well defined and kernel-controlled. According to Jesper Dangaard Brouer (during review of this article):
This proposal will be a safer way to share raw packet data between user space and kernel space than what DPDK is doing, [by providing] a cleaner separation as we keep driver code in the kernel where it belongs.
During the Netdev network-performance workshop, Fastabend asked if there was a better data structure to use for such a purpose. The goal here is to provide a consistent interface to user space regardless of the driver or hardware used to extract packets from the wire. af_packet currently defines its own packet format that abstracts away the NIC-specific details, but there are other possible formats. For example, someone in the audience proposed the virtio packet format. Alexei Starovoitov rejected this idea because af_packet is a kernel-specific facility while virtio has its own separate specification with its own requirements. The next step for af_packet is the posting of the new "v4" patch set, although Miller warned that this wouldn't get merged until proper XDP support lands in the Intel drivers. The concern, of course, is that the kernel would have multiple incomplete bypass solutions available at once. Hopefully, Fastabend will present the (by then) merged patch set at the next Netdev conference in November.

XDP updates Higher up in the networking stack sits XDP. The af_packet feature differs from XDP in that it does not perform any sort of analysis or mangling of packets; its objective is purely to get the data into and out of the kernel as fast as possible, completely bypassing the regular kernel networking stack. XDP also sits before the networking stack except that, according to Brouer, it is "focused on cooperating with the existing network stack infrastructure, and on use-cases where the packet doesn't necessarily need to leave kernel space (like routing and bridging, or skipping complex code-paths)." XDP has evolved quite a bit since we last covered it in LWN. It seems that most of the controversy surrounding the introduction of XDP in the Linux kernel has died down in public discussions, under the leadership of David Miller, who heralded XDP as the right solution for a long-term architecture in the kernel. He presented XDP as a fast, flexible, and safe solution. Indeed, one of the controversies surrounding XDP was the question of the inherent security challenges with introducing user-provided programs directly into the Linux kernel to mangle packets at such a low level. Miller argued that whatever protections are expected for user-space programs also apply to XDP programs, comparing the virtual memory protections to the eBPF (extended BPF) verifier applied to XDP programs. Those programs are actually eBPF that have an interesting set of restrictions:
  • they have a limited size
  • they cannot jump backward (and thus cannot loop), so they execute in predictable time
  • they do only static allocation, so they are also limited in memory
XDP is not a one-size-fits-all solution: netfilter, the TC traffic shaper, and other normal Linux utilities still have their place. There is, however, a clear use case for a solution like XDP in the kernel. For example, Facebook and Cloudflare have both started testing XDP and, in Facebook's case, deploying XDP in production. Martin Kafai Lau, from Facebook, presented the tool set the company is using to construct a DDoS-resilience solution and a level-4 load balancer (L4LB), which got a ten-times performance improvement over the previous IPVS-based solution. Facebook rolled out its own user-space solution called "Droplet" to detect hostile traffic and deploy blocking rules in the form of eBPF programs loaded in XDP. Lau demonstrated the way Facebook deploys a three-part chained eBPF program: the first part allows debugging and dumping of packets, the second is Droplet itself, which drops undesirable traffic, and the last segment is the load balancer, which mangles the packets to tweak their destination according to internal rules. Droplet can drop DDoS attacks at line rate while keeping the architecture flexible, which were two key design requirements. Gilberto Bertin, from Cloudflare, presented a similar approach: Cloudflare has a tool that processes sFlow data generated from iptables in order to generate cBPF (classic BPF) mitigation rules that are then deployed on edge routers. Those rules are created with a tool called bpfgen, part of Cloudflare's BSD-licensed bpftools suite. For example, it could create a cBPF bytecode blob that would match DNS queries to any example.com domain with something like:
    bpfgen dns *.example.com
Originally, Cloudflare would deploy those rules to plain iptables firewalls with the xt_bpf module, but this led to performance issues. It then deployed a proprietary user-space solution based on Solarflare hardware, but this has the performance limitations of user-space applications getting packets back onto the wire involves the cost of re-injecting packets back into the kernel. This is why Cloudflare is experimenting with XDP, which was partly developed in response to the company's problems, to deploy those BPF programs. A concern that Bertin identified was the lack of visibility into dropped packets. Cloudflare currently samples some of the dropped traffic to analyze attacks; this is not currently possible with XDP unless you pass the packets down the stack, which is expensive. Miller agreed that the lack of monitoring for XDP programs is a large issue that needs to be resolved, and suggested creating a way to mark packets for extraction to allow analysis. Cloudflare is currently in a testing phase with XDP and it is unclear if its whole XDP tool chain will be publicly available. While those two companies are starting to use XDP as-is, there is more work needed to complete the XDP project. As mentioned above and in our previous coverage, massive statistics extraction is still limited in the Linux kernel and introspection is difficult. Furthermore, while the existing actions (XDP_DROP and XDP_TX, see the documentation for more information) are well implemented and used, another action may be introduced, called XDP_REDIRECT, which would allow redirecting packets to different network interfaces. Such an action could also be used to accelerate bridges as packets could be "switched" based on the MAC address table. XDP also requires network driver support, which is currently limited. For example, the Intel drivers still do not support XDP, although that should come pretty soon. Miller, in his Netdev keynote, focused on XDP and presented it as the standard solution that is safe, fast, and usable. He identified the next steps of XDP development to be the addition of debugging mechanisms, better sampling tools for statistics and analysis, and user-space consistency. Miller foresees a future for XDP similar to the popularization of the Arduino chips: a simple set of tools that anyone, not just developers, can use. He gave the example of an Arduino tutorial that he followed where he could just look up a part number and get easy-to-use instructions on how to program it. Similar components should be available for XDP. For this purpose, the conference saw the creation of a new mailing list called xdp-newbies where people can learn how to create XDP build environments and how to write XDP programs.

In-kernel layer-7 proxying The third approach that struck me as innovative is the idea of doing layer-7 (application) proxying directly in the kernel. This comes from the idea that, traditionally, we build firewalls to segregate traffic and apply controls, but as most services move to HTTP, those policies become ineffective. Thomas Graf, presented this idea during Netconf using a Star Wars allegory: what if the Death Star were a server with an API? You would have endpoints like /dock or /comms that would allow you to dock a ship or communicate with the Death Star. Those API endpoints should obviously be public, but then there is this /exhaust-port endpoint that should never be publicly available. In order for a firewall to protect such a system, it must be able to inspect traffic at a higher level than the traditional address-port pairs. Graf presented a design where the kernel would create an in-kernel socket that would negotiate TCP connections on behalf of user space and then be able to apply arbitrary eBPF rules in the kernel. Graf's design of in-kernel proxying In this scenario, instead of doing the traditional transfer from Netfilter's TPROXY to user space, the kernel directly decapsulates the HTTP traffic and passes it to BPF rules that can make decisions without doing expensive context switches or memory copies in the case of simply wanting to refuse traffic (e.g. issue an HTTP 403 error). This, of course, requires the inclusion of kTLS to process HTTPS connections. HTTP2 support may also prove problematic, as it multiplexes connections and is harder to decapsulate. This design was described as a "pure pre-accept() hook". Starovoitov also compared the design to the kernel connection multiplexer (KCM). Tom Herbert, KCM's author, agreed that it could be extended to support this, but would require some extensions in user space to provide an interface between regular socket-based applications and the KCM layer. In any case, if the application does TLS (and lots of them do), kTLS gets tricky because it breaks the end-to-end nature of TLS, in effect becoming a man in the middle between the client and the application. Eric Dumazet argued that HA-Proxy already does things like this: it uses splice() to avoid copying too much data around, but it still does a context switch to hand over processing to user space, something that could be fixed in the general case. Another similar project that was presented at Netdev is the Tempesta firewall and reverse-proxy. The speaker, Alex Krizhanovsky, explained the Tempesta developers have taken one person month to port the mbed TLS stack to the Linux kernel to allow an in-kernel TLS handshake. Tempesta also implements rate limiting, cookies, and JavaScript challenges to mitigate DDoS attacks. The argument behind the project is that "it's easier to move TLS to the kernel than it is to move the TCP/IP stack to user space". Graf explained that he is familiar with Krizhanovsky's work and he is hoping to collaborate. In effect, the design Graf is working on would serve as a foundation for Krizhanovsky's in-kernel HTTP server (kHTTP). In a private email, Graf explained that:
The main differences in the implementation are currently that we foresee to use BPF for protocol parsing to avoid having to implement every single application protocol natively in the kernel. Tempesta likely sees this less of an issue as they are probably only targeting HTTP/1.1 and HTTP/2 and to some [extent] JavaScript.
Neither project is really ready for production yet. There didn't seem to be any significant pushback from key network developers against the idea, which surprised some people, so it is likely we will see more and more layer-7 intelligence move into the kernel sooner rather than later.

Conclusion All of this work aims at replacing a rag-tag bunch of proprietary solutions that recently came up to bypass the Linux kernel TCP/IP stack and improve performance for firewalls, proxies, and other key edge network elements. The idea is that, unless the kernel improves its performance, or at least provides a way to bypass its more complex code paths, people will work around it. With this set of solutions in place, engineers will now be able to use standard APIs to hook high-performance systems into the Linux kernel.
The author would like to thank the Netdev and Netconf organizers for travel assistance, Thomas Graf for a review of the in-kernel proxying section of this article, and Jesper Dangaard Brouer for review of the af_packet and XDP sections. Note: this article first appeared in the Linux Weekly News.

12 April 2017

Vincent Bernat: Proper isolation of a Linux bridge

TL;DR: when configuring a Linux bridge, use the following commands to enforce isolation:
# bridge vlan del dev br0 vid 1 self
# echo 1 > /sys/class/net/br0/bridge/vlan_filtering

A network bridge (also commonly called a switch ) brings several Ethernet segments together. It is a common element in most infrastructures. Linux provides its own implementation. A typical use of a Linux bridge is shown below. The hypervisor is running three virtual hosts. Each virtual host is attached to the br0 bridge (represented by the horizontal segment). The hypervisor has two physical network interfaces: Typical use of Linux bridging with virtual machines The main expectation of such a setup is that while the virtual hosts should be able to use resources from the public network, they should not be able to access resources from the infrastructure network (including resources hosted on the hypervisor itself, like a SSH server). In other words, we expect a total isolation between the green domain and the purple one. That s not the case. From any virtual host:
# ip route add 192.168.14.3/32 dev eth0
# ping -c 3 192.168.14.3
PING 192.168.14.3 (192.168.14.3) 56(84) bytes of data.
64 bytes from 192.168.14.3: icmp_seq=1 ttl=59 time=0.644 ms
64 bytes from 192.168.14.3: icmp_seq=2 ttl=59 time=0.829 ms
64 bytes from 192.168.14.3: icmp_seq=3 ttl=59 time=0.894 ms
--- 192.168.14.3 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2033ms
rtt min/avg/max/mdev = 0.644/0.789/0.894/0.105 ms

Why? There are two main factors of this behavior:
  1. A bridge can accept IP traffic. This is a useful feature if you want Linux to act as a bridge and provide some IP services to bridge users (a DHCP relay or a default gateway). This is usually done by configuring the IP address on the bridge device: ip addr add 192.0.2.2/25 dev br0.
  2. An interface doesn t need an IP address to process incoming IP traffic. Additionally, by default, Linux accepts to answer ARP requests independently from the incoming interface.

Bridge processing After turning an incoming Ethernet frame into a socket buffer, the network driver transfers the buffer to the netif_receive_skb() function. The following actions are executed:
  1. copy the frame to any registered global or per-device taps (e.g. tcpdump),
  2. evaluate the ingress policy (configured with tc),
  3. hand over the frame to the device-specific receive handler, if any,
  4. hand over the frame to a global or device-specific protocol handler (e.g. IPv4, ARP, IPv6).
For a bridged interface, the kernel has configured a device-specific receive handler, br_handle_frame(). This function won t allow any additional processing in the context of the incoming interface, except for STP and LLDP frames or if brouting is enabled1. Therefore, the protocol handlers are never executed in this case. After a few additional checks, Linux will decide if the frame has to be locally delivered:
  • the entry for the target MAC in the FDB is marked for local delivery, or
  • the target MAC is a broadcast or a multicast address.
In this case, the frame is passed to the br_pass_frame_up() function. A VLAN-related check is optionally performed. The socket buffer is attached to the bridge interface (br0) instead of the physical interface (eth0), is evaluated by Netfilter and sent back to netif_receive_skb(). It will go through the four steps a second time.

IPv4 processing When a device doesn t have a protocol-independent receive handler, a protocol-specific handler will be used:
# cat /proc/net/ptype
Type Device      Function
0800          ip_rcv
0011          llc_rcv [llc]
0004          llc_rcv [llc]
0806          arp_rcv
86dd          ipv6_rcv
Therefore, if the Ethernet type of the incoming frame is 0x800, the socket buffer is handled by ip_rcv(). Among other things, the three following steps will happen:
  • If the frame destination address is not the MAC address of the incoming interface, not a multicast one and not a broadcast one, the frame is dropped ( not for us ).
  • Netfilter gets a chance to evaluate the packet (in a PREROUTING chain).
  • The routing subsystem will decide the destination of the packet in ip_route_input_slow(): is it a local packet, should it be forwarded, should it be dropped, should it be encapsulated? Notably, the reverse-path filtering is done during this evaluation in fib_validate_source().
Reverse-path filtering (also known as uRPF, or unicast reverse-path forwarding, RFC 3704) enables Linux to reject traffic on interfaces which it should never have originated: the source address is looked up in the routing tables and if the outgoing interface is different from the current incoming one, the packet is rejected.

ARP processing When the Ethernet type of the incoming frame is 0x806, the socket buffer is handled by arp_rcv().
  • Like for IPv4, if the frame is not for us, it is dropped.
  • If the incoming device has the NOARP flag, the frame is dropped.
  • Netfilter gets a chance to evaluate the packet (configuration is done with arptables).
  • For an ARP request, the values of arp_ignore and arp_filter may trigger a drop of the packet.

IPv6 processing When the Ethernet type of the incoming frame is 0x86dd, the socket buffer is handled by ipv6_rcv().
  • Like for IPv4, if the frame is not for us, it is dropped.
  • If IPv6 is disabled on the interface, the packet is dropped.
  • Netfilter gets a chance to evaluate the packet (in a PREROUTING chain).
  • The routing subsystem will decide the destination of the packet. However, unlike IPv4, there is no reverse-path filtering2.

Workarounds There are various methods to fix the situation. We can completely ignore the bridged interfaces: as long as they are attached to the bridge, they cannot process any upper layer protocol (IPv4, IPv6, ARP). Therefore, we can focus on filtering incoming traffic from br0. It should be noted that for IPv4, IPv6 and ARP protocols, the MAC address check can be circumvented by using the broadcast MAC address.

Protocol-independent workarounds The four following fixes will indistinctly drop IPv4, ARP and IPv6 packets.

Using VLAN-aware bridge Linux 3.9 introduced the ability to use VLAN filtering on bridge ports. This can be used to prevent any local traffic:
# echo 1 > /sys/class/net/br0/bridge/vlan_filtering
# bridge vlan del dev br0 vid 1 self
# bridge vlan show
port    vlan ids
eth0     1 PVID Egress Untagged
eth2     1 PVID Egress Untagged
eth3     1 PVID Egress Untagged
eth4     1 PVID Egress Untagged
br0     None
This is the most efficient method since the frame is dropped directly in br_pass_frame_up().

Using ingress policy It s also possible to drop the bridged frame early after it has been re-delivered to netif_receive_skb() by br_pass_frame_up(). The ingress policy of an interface is evaluated before any handler. Therefore, the following commands will ensure no local delivery (the source interface of the packet is the bridge interface) happens:
# tc qdisc add dev br0 handle ffff: ingress
# tc filter add dev br0 parent ffff: u32 match u8 0 0 action drop
In my opinion, this is the second most efficient method.

Using ebtables Just before re-delivering the frame to netif_receive_skb(), Netfilter gets a chance to issue a decision. It s easy to configure it to drop the frame:
# ebtables -A INPUT --logical-in br0 -j DROP
However, to the best of my knowledge, this part of Netfilter is known to be inefficient.

Using namespaces Isolation can also be obtained by moving all the bridged interfaces into a dedicated network namespace and configure the bridge inside this namespace:
# ip netns add bridge0
# ip link set netns bridge0 eth0
# ip link set netns bridge0 eth2
# ip link set netns bridge0 eth3
# ip link set netns bridge0 eth4
# ip link del dev br0
# ip netns exec bridge0 brctl addbr br0
# for i in 0 2 3 4; do
>    ip netns exec bridge0 brctl addif br0 eth$i
>    ip netns exec bridge0 ip link set up dev eth$i
> done
# ip netns exec bridge0 ip link set up dev br0
The frame will still wander a bit inside the IP stack, wasting some CPU cycles and increasing the possible attack surface. But ultimately, it will be dropped.

Protocol-dependent workarounds Unless you require multiple layers of security, if one of the previous workarounds is already applied, there is no need to apply one of the protocol-dependent fix below. It s still interesting to know them because it is not uncommon to already have them in place.

ARP The easiest way to disable ARP processing on a bridge is to set the NOARP flag on the device. The ARP packet will be dropped as the very first step of the ARP handler.
# ip link set arp off dev br0
# ip l l dev br0
8: br0: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 50:54:33:00:00:04 brd ff:ff:ff:ff:ff:ff
arptables can also drop the packet quite early:
# arptables -A INPUT -i br0 -j DROP
Another way is to set arp_ignore to 2 for the given interface. The kernel will only answer to ARP requests whose target IP address is configured on the incoming interface. Since the bridge interface doesn t have any IP address, no ARP requests will be answered.
# sysctl -qw net.ipv4.conf.br0.arp_ignore=2
Disabling ARP processing is not a sufficient workaround for IPv4. A user can still insert the appropriate entry in its neighbor cache:
# ip neigh replace 192.168.14.3 lladdr 50:54:33:00:00:04 dev eth0
# ping -c 1 192.168.14.3
PING 192.168.14.3 (192.168.14.3) 56(84) bytes of data.
64 bytes from 192.168.14.3: icmp_seq=1 ttl=49 time=1.30 ms
--- 192.168.14.3 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.309/1.309/1.309/0.000 ms
As the check on the target MAC address is quite loose, they don t even need to guess the MAC address:
# ip neigh replace 192.168.14.3 lladdr ff:ff:ff:ff:ff:ff dev eth0
# ping -c 1 192.168.14.3
PING 192.168.14.3 (192.168.14.3) 56(84) bytes of data.
64 bytes from 192.168.14.3: icmp_seq=1 ttl=49 time=1.12 ms
--- 192.168.14.3 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.129/1.129/1.129/0.000 ms

IPv4 The earliest place to drop an IPv4 packet is with Netfilter3:
# iptables -t raw -I PREROUTING -i br0 -j DROP
If Netfilter is disabled, another possibility is to enable strict reverse-path filtering for the interface. In this case, since there is no IP address configured on the interface, the packet will be dropped during the route lookup:
# sysctl -qw net.ipv4.conf.br0.rp_filter=1
Another option is the use of a dedicated routing rule. Compared to the reverse-path filtering option, the packet will be dropped a bit earlier, still during the route lookup.
# ip rule add iif br0 blackhole

IPv6 Linux provides a way to completely disable IPv6 on a given interface. The packet will be dropped as the very first step of the IPv6 handler:
# sysctl -qw net.ipv6.conf.br0.disable_ipv6=1
Like for IPv4, it s possible to use Netfilter or a dedicated routing rule.

About the example In the above example, the virtual host get ICMP replies because they are routed through the infrastructure network to Internet (e.g. the hypervisor has a default gateway which also acts as a NAT router to Internet). This may not be the case. If you want to check if you are vulnerable despite not getting an ICMP reply, look at the guest neighbor table to check if you got an ARP reply from the host:
# ip route add 192.168.14.3/32 dev eth0
# ip neigh show dev eth0
192.168.14.3 lladdr 50:54:33:00:00:04 REACHABLE
If you didn t get a reply, you could still have issues with IP processing. Add a static neighbor entry before checking the next step:
# ip neigh replace 192.168.14.3 lladdr ff:ff:ff:ff:ff:ff dev eth0
To check if IP processing is enabled, check the bridge host s network statistics:
# netstat -s   grep "ICMP messages"
    15 ICMP messages received
    15 ICMP messages sent
    0 ICMP messages failed
If the counters are increasing, it is processing incoming IP packets. One-way communication still allows a lot of bad things, like DoS attacks. Additionally, if the hypervisor happens to also act as a router, the reach is extended to the whole infrastructure network, potentially exposing weak devices (e.g. PDU) exposing an SNMP agent. If one-way communication is all that s needed, the attacker can also spoof its source IP address, bypassing IP-based authentication.

  1. A frame can be forcibly routed (L3) instead of bridged (L2) by brouting the packet. This action can be triggered using ebtables.
  2. For IPv6, reverse-path filtering needs to be implemented with Netfilter, using the rpfilter match.
  3. If the br_netfilter module is loaded, net.bridge.bridge-nf-call-ipatbles sysctl has to be set to 0. Otherwise, you also need to use the physdev match to not drop IPv4 packets going through the bridge.

Next.

Previous.