Linux

If less is more, maybe nothing is the most¹. Since most appliances are just Linux with a wrapper, let’s get rid of that wrapper. We’ll be left with more.

All humor aside, having the simplest thing possible that works is almost always the best solution. I’ve found after several years in production that a simple Debian router/traffic shaper is equal to or better than any appliance.

Preparation

Network

The first problem is how to mock up the WAN/LAN networks for this new router.

For the outside interface, let’s pretend that your existing LAN is the internet. Just plug eth0 into whatever you’ve already got.

For the inside network, we will overlay a new LAN on top of your existing one. This is analogous to having different computers use the same physical wires but configuring different IP networks. They may see each other’s traffic at the physical layer, but will ignore each other logically.

OS

Create a new Debian instance with two interfaces, both on the existing LAN. On bare-metal, install a minimal Debian instance by using the netinst image and selecting only ‘common system tools’ near the bottom.

In a LAN/WAN situation eth0 is traditionally WAN². So leave the first interface with its default DHCP settings and configure the second one with a static address.

Assuming that your existing LAN is 192.168.1.* we’ll overlay our new LAN as 192.168.2.*

sudo vi /etc/systemd/network/eth1.network

[Match]
Name=eth1

[Network]
Address=192.168.2.1/24

sudo systemctl reload systemd-networkd

Installation

We’ll need the nft tools to do the basics and make sure curl is handy as well.

sudo apt install nftables curl

Configuration

Forwarding

The first step is to enable forwarding, the basic job of all routers.

# as root
echo "net.ipv4.ip_forward=1" >> /etc/sysctl.conf
sysctl -p
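As far as I know, newer Debian releases (13 among them) no longer read /etc/sysctl.conf at boot, so a drop-in file under /etc/sysctl.d is the safer place for this setting. Here is a minimal sketch. The function name and the file name 90-router-forwarding.conf are my own choices, not anything the system requires.

```shell
# Sketch: persist IPv4 forwarding in a drop-in file instead of /etc/sysctl.conf.
# The file name is arbitrary. The directory is a parameter so the function can
# be tried against a scratch directory before touching /etc.
write_forwarding_dropin() {
    dropin_dir="${1:-/etc/sysctl.d}"
    printf '%s\n' 'net.ipv4.ip_forward=1' > "$dropin_dir/90-router-forwarding.conf"
}

# As root:
#   write_forwarding_dropin
#   sysctl --system               # reload every sysctl.d file
#   sysctl net.ipv4.ip_forward    # show the live value
```

Either way, the last command should report a value of 1 once forwarding is on.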

It’s worth knowing that the system, as is, doesn’t spend much time thinking about where a packet came from. If someone hands it a packet, it consults the route table and sends it on. This sounds dangerous, but in practice it’s rarely a problem, and you can look into policy-based routing if things get complicated.

Masquerade

If one side is a private network (such as 192.168.*) you probably need to masquerade. This is different than the type attended by the Opera Ghost.

sudo vi /etc/nftables.conf

This is the default Debian file. Add WAN and LAN at the top and a table at the bottom with the masquerade details.

It’s traditional to name this table ‘nat’ and the chain ‘postrouting’. Those names are arbitrary but make sense, as that’s its type and where it’s hooked.

#!/usr/sbin/nft -f

flush ruleset

# Define WAN and LAN 
define WAN = eth0
define LAN = eth1

table inet filter {
	chain input {
		type filter hook input priority filter; policy accept;
	}
	chain forward {
		type filter hook forward priority filter; policy accept;
	}
	chain output {
		type filter hook output priority filter; policy accept;
	}
}
table ip nat {
    chain postrouting {
        type nat hook postrouting priority srcnat; policy accept;
               
        # if the output interface is WAN, masquerade the traffic.
        oif $WAN masquerade 
    }
}

systemctl enable --now nftables.service
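A typo in this file means the service fails to load anything, so it’s worth parse-checking first. The sketch below uses nft’s check mode (-c) and only loads the file if the check passes. The function name load_ruleset is just an illustration.

```shell
# Sketch: parse-check the config with nft's check mode (-c), then load it
# only if the check passes. Run as root.
load_ruleset() {
    conf="${1:-/etc/nftables.conf}"
    if ! nft -c -f "$conf"; then
        echo "syntax check failed, ruleset left unchanged" >&2
        return 1
    fi
    nft -f "$conf"
}

# Usage:
#   load_ruleset && nft list table ip nat
```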

Test

To see if things are working so far, take another system and replace its .1 LAN address with something on the .2 network, then add the new router as its gateway.

# Replace your existing IP 
sudo ip addr replace 192.168.2.50/24 dev eth0

# Replace your existing route
sudo ip route replace default via 192.168.2.1 dev eth0

ping 8.8.8.8

Firewall

Right now, you’re letting everything in and out, shouting ‘packets want to be free!’. But this is the real world, so let’s add ‘firewall’ to this router’s list of jobs. This is still a router, though, so allow controlled ping and traceroute. Letting the related ICMP errors through keeps path MTU discovery working, which helps avoid fragmentation problems, and the diagnostic benefits outweigh attempts at obscurity.

Default Block

We are going to add a chain and edit two others. Create the chain early_drop at the top to get rid of any obvious garbage in the early prerouting stage. Then edit the input (packets destined to the router itself), and the forward chains (packets being sent on).

table inet filter {    

    chain early_drop {
        type filter hook prerouting priority filter; policy accept;
        ct state invalid drop
    }

    chain input {

        # Change the default policy to drop
        type filter hook input priority 0; policy drop;

        # Allow local and already established connections 
        iif lo accept
        ct state established,related accept

        # Allow standard ping with rate limiting and traceroute
        icmp type echo-request limit rate 5/second accept
        icmp type { destination-unreachable, time-exceeded } accept

        # Respond to Linux traceroute
        iif $WAN udp dport 33434-33534 reject with icmp type port-unreachable 

        # IPv6 equivalents 
        icmpv6 type { echo-request, nd-neighbor-solicit, nd-neighbor-advert, packet-too-big, time-exceeded } accept
    }

    chain forward {

        # Change the default policy to drop 
        type filter hook forward priority filter; policy drop;

        # Accept local and already established connections 
        ct state established,related accept

        # Accept connections from LAN to WAN
        iif $LAN oif $WAN accept
    }
}

nft -f /etc/nftables.conf

Congrats! you’ve just secured things and possibly locked yourself out! (don’t close your ssh session yet)
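If you’re doing this over SSH, a safety net helps. The sketch below saves the running ruleset, applies the new one, and starts a background timer that puts the old rules back unless you cancel it. The function name, the 60 second default and the rollback_pid variable are all my own inventions for illustration.

```shell
# Sketch: apply a new ruleset, but restore the previous one automatically
# unless the timer is cancelled. Run as root.
apply_with_rollback() {
    new_conf="${1:-/etc/nftables.conf}"
    wait_secs="${2:-60}"
    backup=$(mktemp)

    # Save the running ruleset, prefixed with a flush so that loading the
    # backup replaces everything in one atomic step.
    { echo "flush ruleset"; nft list ruleset; } > "$backup"

    nft -f "$new_conf" || return 1

    # Background timer: loads the backup when it fires.
    ( sleep "$wait_secs"; nft -f "$backup" ) > /dev/null 2>&1 &
    rollback_pid=$!
    echo "Rollback in ${wait_secs}s. If you can still log in, cancel with: kill $rollback_pid"
}
```

Run apply_with_rollback, open a second SSH session to prove you can still get in, then kill the timer. If you can’t get in, wait out the timer and the old rules return.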

Accept SSH

To allow remote management, append an SSH rule to the bottom of your input chain. It’s best to start with the LAN for this.

    chain input {
        ...
        ...

        # Accept SSH from the LAN
        iif $LAN tcp dport "ssh" accept
    }

Limit by IP

You may want to limit access based on IPs. The simplest and most performant way is with a set. That’s basically a list of IPs and networks to compare incoming packets against. We’ll put the details in its own file to keep the main config clean.

sudo mkdir -p /etc/nftables/sets
sudo vi /etc/nftables/sets/work_ips.nft

table inet filter {
    set work_ips {
        type ipv4_addr
        flags interval
        elements = {
            1.2.3.4,        # Hot in Cleveland
            5.6.7.8,        # WKRP in Cincinnati
            10.11.12.0/24,  # Buzz Beer headquarters
        }
    }
}

Put an include right after the definitions and use it in a rule like this.

sudo vi /etc/nftables.conf

#!/usr/sbin/nft -f

flush ruleset

# Define WAN and LAN 
define WAN = eth0
define LAN = eth1

# Include all the config files in the folder
include "/etc/nftables/sets/*.nft"

table inet filter {

    chain input {
        ...
        ...

        # Accept SSH from the LAN
        iif $LAN tcp dport "ssh" accept
        
        # Accept from work to SSH
        iif $WAN ip saddr @work_ips tcp dport "ssh" accept
    }

Reload the rules and take a look. You’ll see your new set displayed.

nft -f /etc/nftables.conf

sudo nft list ruleset
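You can also change a set on the running system without a reload, which is handy when you need to let one more address in right now. A minimal sketch follows. The function name work_ip is hypothetical, and 203.0.113.7 in the usage lines is only a documentation placeholder.

```shell
# Sketch: add or remove an address in the running work_ips set without a
# reload. Run as root. Changes made this way are lost on the next
# "nft -f /etc/nftables.conf", so record them in work_ips.nft as well.
work_ip() {
    action="$1"    # add or delete
    addr="$2"      # a single address or a CIDR block
    case "$action" in
        add|delete) nft "$action" element inet filter work_ips "{ $addr }" ;;
        *) echo "usage: work_ip add|delete ADDRESS" >&2; return 2 ;;
    esac
}

# Usage:
#   work_ip add 203.0.113.7
#   nft list set inet filter work_ips
```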

Note: You may have noticed that both the main and included files contain a table inet filter { block. These don’t replace each other; rather, they are additive. That’s an nft thing.

Limit by GeoIP

Sometimes you don’t know what your IP will be, but you’re pretty sure you’re not leaving the country. In addition, the vast majority of attacks come from just a few countries. The wirefalls/geo-nft script will help by downloading a map of IPs to countries.

sudo mkdir -p /etc/nftables/geo-nft
cd /etc/nftables/geo-nft
sudo curl -LO https://raw.githubusercontent.com/wirefalls/geo-nft/master/geo-nft.sh
sudo chmod +x geo-nft.sh
sudo ./geo-nft.sh

Create a us.nft that incorporates the ipv4 data like this.

sudo vi /etc/nftables/sets/us.nft

include "/etc/nftables/geo-nft/countrysets/US.ipv4" 

table inet filter {
    set US_ipv4 {
        type ipv4_addr
        flags interval
        auto-merge
        elements = $US.ipv4
    }
}

Make use of it just like any set.

...
...
# Accept connections from US to SSH
iif $WAN ip saddr @US_ipv4 tcp dport "ssh" accept
...
...

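Reload and observe the now much larger set. But don’t let the size scare you. I’m told lookups stay fast whether the set holds ten entries or ten thousand, since the kernel keeps sets in indexed structures (a tree for interval sets like this one, as far as I know) rather than walking a list.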

nft -f /etc/nftables.conf
nft list ruleset | more

If you see an error about the Message size like this:

netlink: Error: Could not process rule: Message too long Please, rise /proc/sys/net/core/wmem_max on the host namespace. Hint: 4194304 bytes

You’ll need to increase your netlink buffer size. Do this on the host if running in an instance or container. Jumping up to a 4M buffer should insulate you from needing to change it again even as your set sizes change.

sudo vi /etc/sysctl.d/99-nftables-netlink.conf

net.core.wmem_max=4194304
net.core.wmem_default=524288

sudo sysctl -p /etc/sysctl.d/99-nftables-netlink.conf

Combining Sets

Sometimes you want to combine sets, like the worst 5 countries for cyber-attacks. The unfab 5. Just include them all with a semicolon after each include, and you can flatten them into a super-set. (That’s a hidden gotcha - the files don’t end with a newline and using ‘;’ isn’t well known)

sudo vi /etc/nftables/sets/unfab.nft

include "/etc/nftables/geo-nft/countrysets/CN.ipv4";
include "/etc/nftables/geo-nft/countrysets/IN.ipv4";
include "/etc/nftables/geo-nft/countrysets/NE.ipv4";
include "/etc/nftables/geo-nft/countrysets/NG.ipv4"; 
include "/etc/nftables/geo-nft/countrysets/RU.ipv4"; 

define UNFAB = { $CN.ipv4, $IN.ipv4, $NE.ipv4, $NG.ipv4, $RU.ipv4 }

table inet filter {
    set UNFAB_ipv4 {
        type ipv4_addr
        flags interval
        auto-merge
        elements = $UNFAB
    }
}

Inverting Sets

When you want to accept connections from everyplace and just exclude a few countries, it’s a lot easier to just list the exclusions. You do that with an inverse meaning accept any connection not on this list. If you’ve created the unfab set above, you’d use it inversely like this:

# Accept everyone but the unfab to connect to SSH
ip saddr != @UNFAB_ipv4 tcp dport "ssh" accept

Port Forwarding

Let’s say we are going to forward web traffic to a back-end server. This actually requires two things: a new prerouting chain with an entry that catches incoming traffic to those ports and rewrites it with the back-end server’s address, and a forward chain entry that accepts the traffic.

sudo mkdir /etc/nftables/forwards

sudo vi /etc/nftables/forwards/web_server.nft

define WEB_SRV = 192.168.2.10

table ip nat {
    chain prerouting {
        type nat hook prerouting priority dstnat;

        # Rewrite any incoming traffic so it has the web server's address instead
        iif $WAN tcp dport { "http", "https" } dnat to $WEB_SRV
    }
}

table inet filter {
    chain forward {
        type filter hook forward priority filter;

        # Accept the web server traffic so it can be forwarded on.
        iif $WAN ip daddr $WEB_SRV tcp dport { "http", "https" } accept

        # Or maybe only accept traffic from the US
        iif $WAN ip saddr @US_ipv4 ip daddr $WEB_SRV tcp dport { "http", "https" } accept        
    }
}

Add an include in the main file after the sets and reload.

...
...

# Include any sets
include "/etc/nftables/sets/*.nft"

# Include any port forwards
include "/etc/nftables/forwards/*.nft"

...
...

nft -f /etc/nftables.conf

Change the IP and gateway on your web server temporarily and it should test correctly.

Note: It’s tempting to add more rules to the prerouting section. Resist this urge, however. According to the docs, some parts of a flow skip prerouting so you should “…never use this chain for filtering”³. The only exception being to drop obvious junk as we did at the beginning.

Port Rewriting

Say you’re forwarding the external port 222 to an internal server on port 22. It’s just a matter of adding that to the rewrite. The forward chain stays the same.

redefine EXTERNAL_PORT = 222
redefine INTERNAL_IP   = 192.168.2.4
redefine INTERNAL_PORT = 22

table ip nat {
    chain prerouting {
        type nat hook prerouting priority dstnat;
        iif $WAN tcp dport $EXTERNAL_PORT dnat to $INTERNAL_IP:$INTERNAL_PORT
    }
}
table inet filter {
    chain forward {
        type filter hook forward priority filter;
        iif $WAN ip daddr $INTERNAL_IP tcp dport $INTERNAL_PORT accept
    }
}

Note: Last time we defined the (probably) unique name for the variable of WEB_SRV, so that it wouldn’t conflict with other variables in other includes. But you can also just use redefine in a template and save a lot of editing.
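To show what using the template could look like in practice, here is a sketch that stamps out one forward file per service from the redefine block above. The function name make_forward, its argument order and the file naming are all hypothetical.

```shell
# Sketch: write one port-forward file from the redefine template above.
# Arguments: name, external port, internal IP, internal port, and optionally
# an output directory (defaults to the forwards folder used on this page).
make_forward() {
    name="$1"; ext_port="$2"; int_ip="$3"; int_port="$4"
    out_dir="${5:-/etc/nftables/forwards}"
    cat > "$out_dir/$name.nft" <<EOF
redefine EXTERNAL_PORT = $ext_port
redefine INTERNAL_IP   = $int_ip
redefine INTERNAL_PORT = $int_port

table ip nat {
    chain prerouting {
        type nat hook prerouting priority dstnat;
        iif \$WAN tcp dport \$EXTERNAL_PORT dnat to \$INTERNAL_IP:\$INTERNAL_PORT
    }
}
table inet filter {
    chain forward {
        type filter hook forward priority filter;
        iif \$WAN ip daddr \$INTERNAL_IP tcp dport \$INTERNAL_PORT accept
    }
}
EOF
}

# Usage (as root), matching the example above:
#   make_forward ssh_box 222 192.168.2.4 22
```

The escaped dollar signs keep the nft variables intact in the generated file, while the shell fills in only the three redefine lines.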

Flow and Hardware Acceleration

Linux can bypass most of the nftables/conntrack path after a flow is established. If you have a decent ethernet card you can even get hardware acceleration.

Add A Flow Table

A flowtable is a kernel-managed acceleration table you can add to your normal inet filter table. It’s a “fastpath” that you can hand flows to once you’ve already looked at them. This saves you from checking every packet when it’s already part of an established flow.

This will mostly offload web browsing/streaming and other normal TCP/UDP NAT traffic.

Add this at the top of the table inet filter.

table inet filter {

    flowtable ft {
            hook ingress priority filter;
            devices = { $LAN, $WAN };
    }
    ...
    ...

Then modify the forward chain to use it.

chain forward {

        type filter hook forward priority filter; policy drop;

        # Offload established/related flows
        ct state established,related flow add @ft accept

        # Keep this line as well. Traffic the flowtable cannot take (ICMP,
        # for example) still needs it, and it lets you comment the offload
        # rule above in and out easily
        ct state established,related accept

        # Accept connections from LAN to WAN
        iif $LAN oif $WAN accept
}

If you want to dig in more, you can add named counters to the rules above. Comment out the offload rule and the flow_normal counter should increase steadily. Swap them and the flow_offload counter should increase much more slowly, as only the first packets of each flow get counted. The rest of the flow bypasses the chain.

        # Named counters must be declared once inside table inet filter,
        # for example:  counter flow_offload { }  and  counter flow_normal { }
        ct state established,related flow add @ft counter name flow_offload accept
        ct state established,related counter name flow_normal accept

Enable Hardware Offloading

Pro cards from Intel, Mellanox, Chelsio and others come with drivers that provide hardware acceleration to the kernel. Check with this command.

ethtool -k eth0 | grep offload

You’re mostly looking for hw-tc-offload: on as nftables flow offload can sometimes leverage TC hardware offload underneath. Enable the hardware offload with the flag offload.

flowtable ft {
        hook ingress priority filter;
        devices = { $LAN, $WAN };
        flags offload;
}
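Before adding the offload flag, it can save some head-scratching to check both interfaces, since the flowtable lists both. A small sketch, assuming the ethtool package is installed. The function name has_tc_offload is my own.

```shell
# Sketch: report whether a NIC currently has hw-tc-offload switched on,
# which is the feature the text above says to look for.
has_tc_offload() {
    ethtool -k "$1" | grep -q 'hw-tc-offload: on'
}

# Usage:
#   for dev in eth0 eth1; do
#       has_tc_offload "$dev" && echo "$dev: on" || echo "$dev: off"
#   done
```

If a card reports off without the [fixed] marker, the driver may let you switch it on with ethtool -K and the same feature name.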

Note: if you saw l2-fwd-offload: on then you have a NIC that does autonomous Layer-2 forwarding/switching and you either don’t need to read this page, or need to follow up with another more advanced page!

eBPF and XDP

If this isn’t fast enough, you can use eBPF to take action on packets before the rest of the kernel starts working on them. With XDP the program runs in the network driver, and if your card supports XDP offload it can even run on the network card itself!

This is mostly useful for dropping trash and mitigating denial of service attacks without bothering the CPU. There’s also a case where you move packets around in a Kubernetes cluster without looking at them closely, for the sake of speed. Cloudflare uses the xt_bpf extension to embed BPF programs in netfilter rules, but it’s not easy.

It’s not generally useful, as in most cases you’ll have to look at the details of the packet anyway. In that case the best you can do is the flowtable you just created.

If you’re under DDoS attack, though, consider xdp-filter, and for large-scale load balancing you might look at the other eBPF solutions.

If you’re running in a container of some kind, you can pass a card directly to the instance (SR-IOV or full PCI) to take advantage of it.

High Availability

This is best done with a pair of routers that share a virtual IP managed by keepalived. It’s fairly simple to deploy as you just:

Set up the first router with all the configuration you need. (let’s assume you’ve done this already)
Install keepalived with a simple config.
Clone and tweak so the IPs, Hostname, Config works.

Change Admin IPs

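Your router still has a DHCP WAN address and that’s fine. But you’ll need to select something static for the virtual WAN IP address you’re about to create. I’ve used 192.168.1.2, but adjust as needed. The virtual LAN address will be 192.168.2.1, so if a router’s own eth1 is already using that address, move it to a free one first (the DNS host records later on this page assume 192.168.2.2 and 192.168.2.3).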

Install Keepalived

# Just install the core server, without perl and ipvsadm
sudo apt install --no-install-recommends  keepalived

Create a Virtual LAN and WAN Address

sudo vi /etc/keepalived/keepalived.conf

You’ll note that we’ve set the state to “BACKUP” and we’ll use that on both. This keeps them from flapping during any intermittent issues.

vrrp_instance WAN {
    state BACKUP
    interface eth0
    virtual_router_id 51
    virtual_ipaddress { 192.168.1.2/24 }
}

vrrp_instance LAN {
    state BACKUP
    interface eth1
    virtual_router_id 52
    virtual_ipaddress { 192.168.2.1/24 }
}

Add Firewall Rules

Add this at the bottom of your chain input section so you accept VRRP traffic. 224.0.0.18 is a reserved IPv4 multicast address specifically designated for the Virtual Router Redundancy Protocol.

...
...

table inet filter {

    chain input {

        ...
        ...

        # Allow VRRP on both relevant interfaces. 224.0.0.18 is the VRRP multicast address
        iif { $WAN, $LAN } ip protocol vrrp accept
        iif { $WAN, $LAN } ip daddr 224.0.0.18 accept
    }
}

Test and Clone

Start the service and observe the log to see the status messages.

sudo systemctl start keepalived.service

sudo journalctl -u keepalived

If things go well, you’ll see it start as backup, then quickly enter master when it doesn’t find any peers to say otherwise. An ip a will show the virtual IPs in place.

Clone the box (or the config), change the LAN IP so it doesn’t conflict, and bring both up. You can experiment by taking the service up and down to watch the IP shift.
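While you experiment, it helps to have a quick way to ask each router whether it currently holds the virtual address. A minimal sketch, where the function name holds_vip is hypothetical:

```shell
# Sketch: succeed if the given virtual IP is currently configured on the
# given interface of this router.
holds_vip() {
    vip="$1"; dev="$2"
    # -o prints one address per line; match the exact address before the "/"
    ip -o -4 addr show dev "$dev" | grep -qF " $vip/"
}

# Usage, on either router:
#   holds_vip 192.168.2.1 eth1 && echo "holding the LAN VIP" || echo "standing by"
#
# Then stop keepalived on the holder and run it again on both:
#   sudo systemctl stop keepalived.service
```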

Active vs Passive

This is an Active/Passive arrangement. For the most part, there’s no advantage to Active/Active. Either system can move traffic faster than the uplink permits. If you do have more traffic than either one can move on its own, you can’t take one down.

The main benefit of Active/Active is that you know it works. You don’t want to be surprised during an outage. It allows you to gracefully drain traffic for controlled maintenance. Also, you can’t always get a 200G system.

To set up Active/Active for capacity you create a team of three or more. Each is MASTER of a separate VIP and they all back each other up. DHCP passes out the gateway addresses round-robin to balance load. Keep in mind that traffic shaping with tc happens after nftables, so you can’t shape your way out of a capacity issue.

Keeping Them Synced

How do you keep them in sync? You can probably just remember to update rules only on router-1 and then run rsync. Though you’ll need to grant yourself (or a service account) permissions.

# On both routers

## Change the owner of the nftable data
sudo chown -R ${USER} /etc/nftables*

## Allow yourself to run nft without a password prompt so you can automate the reload
echo "$USER ALL=(root) NOPASSWD: /usr/sbin/nft" | sudo tee /etc/sudoers.d/nft
sudo chmod 440 /etc/sudoers.d/nft

# On your primary router

## Generate a key and copy it to the other router (make sure dns resolves it correctly)
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ''
ssh-copy-id router-2

## Run a sync 
rsync --archive --delete --inplace --verbose  /etc/nftables.conf /etc/nftables router-2:/etc/

## Trigger a reload on the other router
ssh router-2 sudo nft -f /etc/nftables.conf

You can probably put those last two commands into the bash file go-go-geo-nft to save typing them out. Even better, create a systemd watcher that detects changes and syncs automatically.

Election and Priority

The IP will only move when one system goes offline. If both come back up perfectly at once, they’ll have an election and the highest management IP wins. You can override this with an explicit priority or script it if you have reason to.

You can also change one to state MASTER and it will take over the role whenever it’s up. This is useful when you have a hardware preference or are running additional software that can’t do multi-master.

Elections are held via multicast (the 224 address), but unicast is also supported if you’re in a cloud environment that blocks multicast.

If you have other software running and want to bind it to the virtual ip before it becomes instantiated, look at adding net.ipv4.ip_nonlocal_bind=1 to the /etc/sysctl.conf. Though for things like dnsmasq it’s not needed.

Connection Tracking

The only other issue with failing over is connection state. When you fail over, any session-dependent applications (like SSH) or games get dropped. This isn’t a large issue as the web is (mostly) stateless. But you can deploy conntrackd if desired.

Scaling Up

Single Device

These days, you can expect 100–400 concurrent connections per client device at peak, so you’ll outgrow a single router faster than you think. On a single router, you can optimize some kernel settings.

# Create a drop-in configuration file in /etc/sysctl.d for these settings
net.core.netdev_max_backlog=250000
net.core.somaxconn=65535
net.ipv4.ip_local_port_range=1024 65535
net.ipv4.tcp_tw_reuse=1

net.netfilter.nf_conntrack_max=2000000
net.netfilter.nf_conntrack_tcp_timeout_established=86400
net.netfilter.nf_conntrack_generic_timeout=120
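To judge whether the conntrack limit above is anywhere near being hit, you can compare the live count against the maximum. This sketch reads the two kernel counters, which should exist once connection tracking is loaded. The function name conntrack_usage is my own.

```shell
# Sketch: print how full the connection tracking table is.
# The directory is a parameter only so the function can be tried
# against sample files.
conntrack_usage() {
    proc_dir="${1:-/proc/sys/net/netfilter}"
    count=$(cat "$proc_dir/nf_conntrack_count")
    max=$(cat "$proc_dir/nf_conntrack_max")
    echo "$count of $max tracked connections ($((count * 100 / max))%)"
}

# Usage:
#   conntrack_usage
```

If the percentage sits high at peak, raise nf_conntrack_max or start thinking about the options in the next sections.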

Smart Switches

On an old-style campus WAN, you have (expensive) layer 3 switches and routers that use ECMP (equal-cost multipath routing) to spread traffic across multiple paths. This has wider compatibility than aggregating ports with LAG. You’d think you could just distribute multiple gateways as part of DHCP, but clients handle that randomly or not at all.

Become More Active

A more modern and cost effective strategy is to move from an Active/Passive HA pair, to an Active/Active/Active set of peers.

Create 3 or more routers
Create 3 or more VRRP instances, where a different router is MASTER in each and the others are BACKUP
Distribute the gateway via DHCP in round-robin to spread the load

IPv6 To The Rescue

If you can handle something new, however, use an IPv6‑first design.

Design a full IPv6‑first + IPv4 fallback addressing plan
Browsers attempt IPv6 first
Mobile OSes heavily favor IPv6
Major content providers publish IPv6 natively
Less NAT to handle.

DNS and DHCP

DHCP and DNS are optional, but sometimes expected in small environments and frequently under other duties as assigned for such systems. Dnsmasq is the go-to for most. I’ve found that particular software doesn’t scale well beyond a few thousand clients, but if you’ve gotten to that point, you’d want a dedicated system anyway. So it’s fine for small scale.

This is covered in many other locations but I’ll toss a sample config here just in case.

# Stop and disable systemd-resolved if the default Debian instance installed it
sudo systemctl disable --now systemd-resolved.service
sudo apt install dnsmasq

DNS

You’ll need to accept DNS requests by adding another accept at the bottom. UDP is traditional, but TCP is also used these days. Add this at the bottom of your input chain.

table inet filter {
    chain input {
        ...
        ...

        # Allow DNS queries 
        iif $LAN udp dport "domain" accept
        iif $LAN tcp dport "domain" accept
    }
}

By default, dnsmasq just fires up and starts work by caching DNS queries. It looks at your /etc/resolv.conf to see where to forward requests and your /etc/hosts for known hosts. You may want to tighten things up a bit with these.

sudo vi /etc/dnsmasq.conf

Just add these to the bottom.

# Don't forward plain names or non-routable IPs
domain-needed
bogus-priv

# Listen only on the LAN 
interface=lo
interface=eth1          

# Upstream DNS servers to forward requests to
server=8.8.8.8
server=8.8.4.4

# Cache size (1000 is a good start)
cache-size=1000

# Don't use the hosts file for entries
no-hosts

Create a separate file for host entries as well. This will let you sync more easily later. Since Debian’s packaging of dnsmasq makes use of a drop folder, let’s go ahead and embrace that and add entries in the native format.

sudo vi /etc/dnsmasq.d/hosts.conf

host-record=gateway,192.168.2.1
host-record=router-1,192.168.2.2
host-record=router-2,192.168.2.3
host-record=some-host,192.168.2.10
host-record=some-other-host,192.168.2.11
...
...

# Restart so dnsmasq reads the new config file.
sudo systemctl restart dnsmasq

# Tell the system to use itself for resolution as well.
echo "nameserver 127.0.0.1" | sudo tee /etc/resolv.conf
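A bad line in either config file stops dnsmasq from starting, which takes DNS down for the whole LAN. dnsmasq has a --test option that only checks syntax, so a cautious restart might look like the sketch below. The function name is hypothetical, and passing --conf-dir by hand is my assumption about how to make the check cover the drop folder as well.

```shell
# Sketch: syntax-check the dnsmasq configuration, including the drop
# folder, and restart the service only if the check passes. Run as root.
check_and_restart_dnsmasq() {
    if ! dnsmasq --test --conf-dir=/etc/dnsmasq.d; then
        echo "dnsmasq config check failed, service left running" >&2
        return 1
    fi
    systemctl restart dnsmasq
}

# Usage:
#   check_and_restart_dnsmasq && getent hosts gateway
```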

If you went to the trouble to set up sync above for nft, you can piggyback on that for dns.

# On both boxes
sudo test -f /etc/dnsmasq.d/hosts.conf || sudo touch /etc/dnsmasq.d/hosts.conf
sudo chgrp -R ${USER} /etc/dnsmasq.d/hosts.conf
sudo chmod -R 775 /etc/dnsmasq.d/hosts.conf
echo "$USER ALL=(root) NOPASSWD: /bin/systemctl restart dnsmasq, /bin/systemctl reload dnsmasq" | sudo tee /etc/sudoers.d/dnsmasq
sudo chmod 440 /etc/sudoers.d/dnsmasq

# On router-1, you can send the file and restart (reload doesn't load data from hosts.conf)
rsync --archive --delete --inplace --no-times --verbose  /etc/dnsmasq.d/hosts.conf router-2:/etc/dnsmasq.d/
ssh router-2 sudo systemctl restart dnsmasq


DHCP

You can pass out addresses as well. Just follow up the last bit with this, though you may want to leave these lines commented out until you’re ready, to avoid confusing the overlay networks.

dhcp-range=192.168.2.100,192.168.2.200,12h
dhcp-option=option:router,192.168.2.1
dhcp-authoritative

Bonus Points

You have a couple of things worth automating: changes to nft rules and updates to the GeoIP database.

Automate GeoIP

Those IP sets are refreshed on the 1st of every month on db-ip.com, so we should refresh shortly after. On Debian 13, systemd timers are preferred (cron isn’t even installed).

sudo vi /etc/systemd/system/geo-nft-update.service

[Unit]
Description=Update geo-nft IP datasets
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
ExecStart=/etc/nftables/geo-nft/geo-nft.sh

sudo vi /etc/systemd/system/geo-nft-update.timer

[Unit]
Description=Monthly geo-nft update (3rd day)

[Timer]
OnCalendar=*-*-03 03:30:00
Persistent=true
RandomizedDelaySec=30m

[Install]
WantedBy=timers.target

# Just enable the timer. We don't want the service to run unless the timer starts it.
sudo systemctl daemon-reload
sudo systemctl enable --now geo-nft-update.timer

Config Sync

You can also use systemd to watch for file changes and trigger a sync and reload. Note the User= setting. This must be your account (or whatever account you used in setting up the sync and ssh above) to work as the manual process did. Alternately, create a service account.

sudo vi /etc/systemd/system/router-sync.service

[Unit]
Description=Sync router configuration to router-2
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
User=YOU

ExecStart=/usr/bin/sudo /usr/sbin/nft -f /etc/nftables.conf
ExecStart=/usr/bin/rsync --archive --delete --inplace --no-times /etc/nftables.conf /etc/nftables router-2:/etc/
ExecStart=/usr/bin/ssh router-2 sudo nft -f /etc/nftables.conf

ExecStart=/usr/bin/sudo systemctl restart dnsmasq
ExecStart=/usr/bin/rsync --archive --delete --inplace --no-times /etc/dnsmasq.d/hosts.conf router-2:/etc/dnsmasq.d/
ExecStart=/usr/bin/ssh router-2 sudo systemctl restart dnsmasq

sudo vi /etc/systemd/system/router-sync.path

[Unit]
Description=Watch router config files

[Path]
PathChanged=/etc/nftables.conf
PathModified=/etc/nftables
PathChanged=/etc/dnsmasq.d/hosts.conf

Unit=router-sync.service

[Install]
WantedBy=multi-user.target

sudo systemctl daemon-reload
sudo systemctl enable --now router-sync.path

I put both services in this watcher as an example. A better design is a separate watch for each, to reduce churn if you have frequent changes.

Replace GeoIP

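The tool we’re using downloads the full set of countries as a .csv and then converts them into nft definitions. It’s a bit slow and heavy. You can replace it with something like this and get a more compact (and perhaps better) CIDR format.

#!/bin/bash
# Build an nft set for one country from ipdeny.com's CIDR list. Run as root.
set -euo pipefail

COUNTRY=ad
OUT=/etc/nftables/sets/${COUNTRY}.nft

tmp=$(mktemp)

{
    # Sets must live inside a table, like the other files in sets/
    echo "table inet filter {"
    echo "    set ${COUNTRY}_ipv4 {"
    echo "        type ipv4_addr"
    echo "        flags interval"
    echo "        elements = {"

    # -f makes curl fail on an HTTP error instead of saving the error page
    curl -fsS "https://www.ipdeny.com/ipblocks/data/countries/${COUNTRY}.zone" | \
        awk '{print "            "$1","}' | sed '$ s/,$//'

    echo "        }"
    echo "    }"
    echo "}"
} > "$tmp"

# Only replace the live file once the download and formatting succeeded
mv "$tmp" "$OUT"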

Troubleshooting

In a container, forwarding sometimes just stops. Restarting nftables has no effect, but restarting systemd-networkd fixes it.

Incus especially likes to replace the bridge interfaces periodically.

sudo journalctl -u systemd-networkd -u NetworkManager --since "1 day ago"

May 28 12:55:50 shire1 systemd-networkd[486]: physSf5Ar4: Interface name change detected, renamed to vethe8067f8a.
May 28 12:55:50 shire1 systemd-networkd[486]: physnLFcl3: Interface name change detected, renamed to vethd880fdd5.
May 28 12:55:50 shire1 systemd-networkd[486]: vethc3f8a16b: Link UP
May 28 12:55:50 shire1 systemd-networkd[486]: vethc3f8a16b: Link DOWN
May 28 12:55:50 shire1 systemd-networkd[486]: veth243c40db: Link UP
May 28 12:55:50 shire1 systemd-networkd[486]: veth243c40db: Link DOWN
May 28 12:55:51 shire1 systemd-networkd[486]: veth24f2e389: Link UP
May 28 12:55:51 shire1 systemd-networkd[486]: vethff99fb0d: Link UP
May 28 12:55:51 shire1 systemd-networkd[486]: veth24f2e389: Gained carrier
May 28 12:55:51 shire1 systemd-networkd[486]: vethff99fb0d: Gained carrier

This causes a loss of the static mappings to the interfaces, which breaks how iif and oif in your rules work, since they match on the interface index. You must instead use iifname and oifname, which match on the interface name each time. It’s slightly slower, but needed in some containers.

sudo sed -i.bak -E \
    -e 's/\biif (\$WAN|\$LAN|\{ \$WAN)/iifname \1/g' \
    -e 's/\boif (\$WAN|\$LAN|\{ \$WAN)/oifname \1/g' \
    /etc/nftables.conf
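The command above only touches the main file, but the port-forward files from earlier also use iif $WAN. A sketch of the same substitution wrapped in a function so it can be pointed at any list of files (the function name is my own):

```shell
# Sketch: apply the same iif/oif to iifname/oifname substitution to every
# file given as an argument, leaving a .bak copy of each. Needs GNU sed.
use_ifname_matches() {
    sed -i.bak -E \
        -e 's/\biif (\$WAN|\$LAN|\{ \$WAN)/iifname \1/g' \
        -e 's/\boif (\$WAN|\$LAN|\{ \$WAN)/oifname \1/g' \
        "$@"
}

# Usage (as root):
#   use_ifname_matches /etc/nftables.conf /etc/nftables/forwards/*.nft
#   nft -f /etc/nftables.conf
```

Lines that already use iifname or oifname are left alone, so it’s safe to run more than once.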

¹ I’ve read Rem Koolhaas meant this as mockery, so let’s think of it as “Simplify, then add lightness” as Chapman would say.
² I don’t have a good attribution for this other than the many articles that assert this too.
³ https://wiki.nftables.org/wiki-nftables/index.php/Configuring_chains#Base_chain_types

Last modified May 28, 2026: Updated linux firewall for conatiner iffname (a7a9fa5)