Linux
If less is more, maybe nothing is the most[1]. Since most appliances are just Linux with a wrapper, let’s get rid of that wrapper. We’ll be left with more.
All humor aside, having the simplest thing possible that works is almost always the best solution. I’ve found after several years in production that a simple Debian router/traffic shaper is equal to or better than any appliance.
Preparation
Network
The first problem is how to mock-up the WAN/LAN networks for this new router.
For the outside interface, let’s pretend that your existing LAN is the internet. Just plug eth0 in to whatever you’ve already got.
For the inside network, we will overlay a new LAN on top of your existing one. This is analogous to having different computers use the same physical wires but configuring different IP networks. They may see each other’s traffic at the physical layer, but will ignore each other logically.
OS
Create a new Debian instance with two interfaces, both on the existing LAN. On bare-metal, install a minimal Debian instance by using the netinst image and selecting only ‘common system tools’ near the bottom.
In a LAN/WAN situation, eth0 is traditionally the WAN[2]. So leave the first interface with its default DHCP settings and configure the second one with a static address.
Assuming that your existing LAN is 192.168.1.* we’ll overlay our new LAN as 192.168.2.*
sudo vi /etc/systemd/network/eth1.network
[Match]
Name=eth1
[Network]
Address=192.168.2.1/24
sudo systemctl reload systemd-networkd
Installation
We’ll need the nft tools to do the basics and make sure curl is handy as well.
sudo apt install nftables curl
Configuration
Forwarding
The first step is to enable forwarding, the basic job of all routers.
# as root
echo "net.ipv4.ip_forward=1" >> /etc/sysctl.conf
sysctl -p
It’s worth knowing that the system, as configured, doesn’t spend much time thinking about where a packet came from. If someone hands it a packet, it consults the route table and sends it on. This sounds dangerous, but in practice it’s rarely a problem, and you can look into policy-based routing if things get complicated.
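To confirm that forwarding really is switched on, you can read the live value the kernel exposes under /proc. This is a minimal sketch; the function name `forwarding_enabled` and its optional path argument are made up for illustration and are not part of any tool.

```shell
# Return success when IPv4 forwarding is switched on.
# The optional argument (an assumption, handy for testing) overrides the /proc path.
forwarding_enabled() {
    flag_file="${1:-/proc/sys/net/ipv4/ip_forward}"
    [ "$(cat "$flag_file" 2>/dev/null)" = "1" ]
}

# Usage:
#   forwarding_enabled && echo "forwarding is on" || echo "forwarding is off"
```

If it reports off after a reboot, the setting was probably applied at runtime but not persisted.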
Masquerade
If one side is a private network (such as 192.168.*) you probably need to masquerade. This is different than the type attended by the Opera Ghost.
sudo vi /etc/nftables.conf
This is the default Debian file. Add WAN and LAN at the top and a table at the bottom with the masquerade details.
It’s traditional to name this table ‘nat’ and the chain ‘postrouting’. Those names are arbitrary, but they make sense as that’s its type and where it’s hooked.
#!/usr/sbin/nft -f
flush ruleset
# Define WAN and LAN
define WAN = eth0
define LAN = eth1
table inet filter {
chain input {
type filter hook input priority filter; policy accept;
}
chain forward {
type filter hook forward priority filter; policy accept;
}
chain output {
type filter hook output priority filter; policy accept;
}
}
table ip nat {
chain postrouting {
type nat hook postrouting priority srcnat; policy accept;
# if the output interface is WAN, masquerade the traffic.
oif $WAN masquerade
}
}
sudo systemctl enable --now nftables.service
Test
To see if things are working so far, take another system and replace its .1 LAN address with something on the .2 network and add the new router as its gateway.
# Replace your existing IP
sudo ip addr replace 192.168.2.50/24 dev eth0
# Replace your existing route
sudo ip route replace default via 192.168.2.1 dev eth0
ping 8.8.8.8
Firewall
Right now, you’re letting everything in and out, shouting ‘packets want to be free!’. But this is the real world so let’s add firewall to this router’s list of jobs. This is still a router, though. So allow controlled ping and traceroute. This helps avoid fragmentation and provides diagnostic benefits that outweigh attempts at obscurity.
Default Block
We are going to add a chain and edit two others. Create the chain early_drop at the top to get rid of any obvious garbage in the early prerouting stage. Then edit the input (packets destined to the router itself), and the forward chains (packets being sent on).
table inet filter {
chain early_drop {
type filter hook prerouting priority filter; policy accept;
ct state invalid drop
}
chain input {
# Change the default policy to drop
type filter hook input priority 0; policy drop;
# Allow local and already established connections
iif lo accept
ct state established,related accept
# Allow standard ping with rate limiting and traceroute
icmp type echo-request limit rate 5/second accept
icmp type { destination-unreachable, time-exceeded } accept
# Respond to Linux traceroute
iif $WAN udp dport 33434-33534 reject with icmp type port-unreachable
# IPv6 equivalents
icmpv6 type { echo-request, nd-neighbor-solicit, nd-neighbor-advert, packet-too-big, time-exceeded } accept
}
chain forward {
# Change the default policy to drop
type filter hook forward priority filter; policy drop;
# Accept local and already established connections
ct state established,related accept
# Accept connections from LAN to WAN
iif $LAN oif $WAN accept
}
}
sudo nft -f /etc/nftables.conf
Congrats! You’ve just secured things and possibly locked yourself out! (Don’t close your SSH session yet.)
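Since a bad reload can cut you off, it may help to have nft check the file before loading it. The sketch below uses nft’s `-c` (check) flag; the `safe_apply` name and the `NFT` override variable are assumptions for illustration.

```shell
# Command used to invoke nft; override NFT for a dry run or for testing.
NFT="${NFT:-sudo nft}"

# Check the file first and load it only if the check passes.
safe_apply() {
    conf="${1:-/etc/nftables.conf}"
    if ! $NFT -c -f "$conf"; then
        echo "check failed, running ruleset left untouched" >&2
        return 1
    fi
    $NFT -f "$conf"
}

# Usage:
#   safe_apply /etc/nftables.conf
```

The check only catches rules that fail to parse or validate. A rule set that is valid but blocks your own access will still load, so keeping that SSH session open remains the real safety net.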
Accept SSH
To allow remote management, append an SSH rule to the bottom of your input chain. It’s best to start with the LAN for this.
chain input {
...
...
# Accept SSH from any source on the LAN
iif $LAN tcp dport "ssh" accept
}
Limit by IP
You may want to limit access based on IPs. The simplest and most performant way is with a set. That’s basically a list of IPs and networks to compare incoming packets against. We’ll put the details in its own file to keep the main config clean.
sudo mkdir -p /etc/nftables/sets
sudo vi /etc/nftables/sets/work_ips.nft
table inet filter {
set work_ips {
type ipv4_addr
flags interval
elements = {
1.2.3.4, # Hot in Cleveland
5.6.7.8, # WKRP in Cincinnati
10.11.12.0/24, # Buzz Beer headquarters
}
}
}
Put an include right after the definitions and use it in a rule like this.
vi /etc/nftables.conf
#!/usr/sbin/nft -f
flush ruleset
# Define WAN and LAN
define WAN = eth0
define LAN = eth1
# Include all the config files in the folder
include "/etc/nftables/sets/*.nft"
table inet filter {
chain input {
...
...
# Accept SSH from any source on the LAN
iif $LAN tcp dport "ssh" accept
# Accept from work to SSH
iif $WAN ip saddr @work_ips tcp dport "ssh" accept
}
Reload the rules and take a look. You’ll see your new set displayed.
sudo nft -f /etc/nftables.conf
sudo nft list ruleset
Note: You may have noticed that both the main and included files contain a table inet filter { block. These don’t replace each other; rather, they are additive. That’s an nft thing.
Limit by GeoIP
Sometimes you don’t know what your IP will be, but you’re pretty sure you’re not leaving the country. In addition, the vast majority of attacks come from just a few countries. The wirefalls/geo-nft script will help by downloading a map of IPs to countries.
sudo mkdir -p /etc/nftables/geo-nft
cd /etc/nftables/geo-nft
# The directory is owned by root, so the download and chmod need sudo too
sudo curl -LO https://raw.githubusercontent.com/wirefalls/geo-nft/master/geo-nft.sh
sudo chmod +x geo-nft.sh
sudo ./geo-nft.sh
Create a us.nft that incorporates the ipv4 data like this.
sudo vi /etc/nftables/sets/us.nft
include "/etc/nftables/geo-nft/countrysets/US.ipv4"
table inet filter {
set US_ipv4 {
type ipv4_addr
flags interval
auto-merge
elements = $US.ipv4
}
}
Make use of it just like any set.
...
...
# Accept connections from US to SSH
iif $WAN ip saddr @US_ipv4 tcp dport "ssh" accept
...
...
Reload and observe the now much larger set. But don’t let the size scare you. Set lookups don’t walk the list one entry at a time, so I’m told matching against ten thousand entries costs about the same as matching against ten.
sudo nft -f /etc/nftables.conf
sudo nft list ruleset | more
If you see an error about the Message size like this:
netlink: Error: Could not process rule: Message too long Please, rise /proc/sys/net/core/wmem_max on the host namespace. Hint: 4194304 bytes
You’ll need to increase your netlink buffer size. Do this on the host if running in an instance or container. Jumping up to a 4M buffer should insulate you from needing to change it again even as your set sizes change.
sudo vi /etc/sysctl.d/99-nftables-netlink.conf
net.core.wmem_max=4194304
net.core.wmem_default=524288
sudo sysctl -p /etc/sysctl.d/99-nftables-netlink.conf
Combining Sets
Sometimes you want to combine sets, like the worst 5 countries for cyber-attacks. The unfab 5. Just include them all with a semicolon after each include, and you can flatten them into a super-set. (That’s a hidden gotcha - the files don’t end with a newline and using ‘;’ isn’t well known)
sudo vi /etc/nftables/sets/unfab.nft
include "/etc/nftables/geo-nft/countrysets/CN.ipv4";
include "/etc/nftables/geo-nft/countrysets/IN.ipv4";
include "/etc/nftables/geo-nft/countrysets/NE.ipv4";
include "/etc/nftables/geo-nft/countrysets/NG.ipv4";
include "/etc/nftables/geo-nft/countrysets/RU.ipv4";
define UNFAB = { $CN.ipv4, $IN.ipv4, $NE.ipv4, $NG.ipv4, $RU.ipv4 }
table inet filter {
set UNFAB_ipv4 {
type ipv4_addr
flags interval
auto-merge
elements = $UNFAB
}
}
Inverting Sets
When you want to accept connections from everyplace and just exclude a few countries, it’s a lot easier to just list the exclusions. You do that with an inverse match, meaning accept any connection not on this list. If you’ve created the unfab set above, you’d use it inversely like this:
# Accept everyone but the unfab to connect to SSH
ip saddr != @UNFAB_ipv4 tcp dport "ssh" accept
Port Forwarding
Let’s say we are going to forward web traffic to a back-end server. This actually requires two things: a new prerouting chain and entry to catch incoming traffic to those ports and rewrite it with the back-end server’s address, and a forward chain entry where you accept the traffic.
sudo mkdir /etc/nftables/forwards
sudo vi /etc/nftables/forwards/web_server.nft
define WEB_SRV = 192.168.2.10
table ip nat {
chain prerouting {
type nat hook prerouting priority dstnat;
# Rewrite any incoming traffic so it has the web server's address instead
iif $WAN tcp dport { "http", "https" } dnat to $WEB_SRV
}
}
table inet filter {
chain forward {
type filter hook forward priority filter;
# Accept the web server traffic so it can be forwarded on.
iif $WAN ip daddr $WEB_SRV tcp dport { "http", "https" } accept
# Or, instead of the rule above, only accept traffic from the US
# iif $WAN ip saddr @US_ipv4 ip daddr $WEB_SRV tcp dport { "http", "https" } accept
}
}
Add an include in the main file after the sets and reload.
...
...
# Include any sets
include "/etc/nftables/sets/*.nft"
# Include any port forwards
include "/etc/nftables/forwards/*.nft"
...
...
sudo nft -f /etc/nftables.conf
Change the IP and gateway on your web server temporarily and it should test correctly.
Note: It’s tempting to add more rules to the prerouting section. Resist this urge, however. According to the docs, some parts of a flow skip prerouting so you should “…never use this chain for filtering”[3]. The only exception is dropping obvious junk, as we did at the beginning.
Port Rewriting
Say you’re forwarding the external port 222 to an internal server on port 22. It’s just a matter of adding that to the rewrite. The forward chain stays the same.
redefine EXTERNAL_PORT = 222
redefine INTERNAL_IP = 192.168.2.4
redefine INTERNAL_PORT = 22
table ip nat {
chain prerouting {
type nat hook prerouting priority dstnat;
iif $WAN tcp dport $EXTERNAL_PORT dnat to $INTERNAL_IP:$INTERNAL_PORT
}
}
table inet filter {
chain forward {
type filter hook forward priority filter;
iif $WAN ip daddr $INTERNAL_IP tcp dport $INTERNAL_PORT accept
}
}
Note: Last time we defined the (probably) unique name for the variable of WEB_SRV, so that it wouldn’t conflict with other variables in other includes. But you can also just use redefine in a template and save a lot of editing.
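The redefine template lends itself to a small generator, so each new forward is one command rather than an edited copy. A minimal sketch, assuming the same chains as the example above; the `make_forward` name and the output file name in the usage line are hypothetical.

```shell
# Print an nft port-forward file for one service.
# Arguments: external port, internal IP, internal port.
make_forward() {
    ext_port="$1"; int_ip="$2"; int_port="$3"
    cat <<EOF
redefine EXTERNAL_PORT = ${ext_port}
redefine INTERNAL_IP = ${int_ip}
redefine INTERNAL_PORT = ${int_port}

table ip nat {
    chain prerouting {
        type nat hook prerouting priority dstnat;
        iif \$WAN tcp dport \$EXTERNAL_PORT dnat to \$INTERNAL_IP:\$INTERNAL_PORT
    }
}

table inet filter {
    chain forward {
        type filter hook forward priority filter;
        iif \$WAN ip daddr \$INTERNAL_IP tcp dport \$INTERNAL_PORT accept
    }
}
EOF
}

# Usage (file name is only an example):
#   make_forward 222 192.168.2.4 22 | sudo tee /etc/nftables/forwards/ssh_internal.nft
```

The escaped dollar signs keep the nft variables literal in the output, while the three shell arguments are substituted into the redefine lines.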
Flow and Hardware Acceleration
Linux can bypass most of the nftables/conntrack path after a flow is established. If you have a decent Ethernet card, you can even get hardware acceleration.
Add A Flow Table
A flowtable is a kernel-managed acceleration table you can add to your normal inet filter table. It’s a “fastpath” that you can toss flows to that you’ve already looked at. This saves you from checking every packet when it’s already part of an established flow.
This will mostly offload web browsing/streaming and other normal TCP/UDP NAT traffic.
Add this at the top of the table inet filter.
table inet filter {
flowtable ft {
hook ingress priority filter;
devices = { $LAN, $WAN };
}
...
...
Then modify the forward chain to use it.
chain forward {
type filter hook forward priority filter; policy drop;
# Offload established/related flows
ct state established,related flow add @ft accept
# Keep this rule: traffic the flowtable cannot take (anything that is not
# TCP or UDP) falls through to it, and it lets you comment the offload
# rule above in and out easily
ct state established,related accept
# Accept connections from LAN to WAN
iif $LAN oif $WAN accept
}
If you want to dig in more, you can add named counters to the rules above. Comment out the offload rule and flow_normal_counter should increase steadily. Swap them and flow_offload_counter should increase much more slowly, as only the first packets of each flow get counted. The rest of the flow bypasses the chain.
# Named counters must exist first, e.g. nft add counter inet filter flow_offload_counter
ct state established,related flow add @ft counter name flow_offload_counter accept
ct state established,related counter name flow_normal_counter accept
Enable Hardware Offloading
Pro cards from Intel, Mellanox, Chelsio and others come with drivers that provide hardware acceleration to the kernel. Check with this command.
ethtool -k eth0 | grep offload
You’re mostly looking for hw-tc-offload: on as nftables flow offload can sometimes leverage TC hardware offload underneath. Enable the hardware offload with the flag offload.
flowtable ft {
hook ingress priority filter;
devices = { $LAN, $WAN };
flags offload;
}
Note: if you saw l2-fwd-offload: on then you have a NIC that does autonomous Layer-2 forwarding/switching and you either don’t need to read this page, or need to follow up with another more advanced page!
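If you have several interfaces to check, the ethtool output can be tested in a script instead of read by eye. A small sketch, assuming the usual `hw-tc-offload: on` line format; the function name `has_hw_tc_offload` is invented for this example.

```shell
# Read `ethtool -k` output on stdin and succeed if hw-tc-offload is on.
has_hw_tc_offload() {
    grep -Eq '^[[:space:]]*hw-tc-offload:[[:space:]]+on'
}

# Usage:
#   for dev in eth0 eth1; do
#       ethtool -k "$dev" | has_hw_tc_offload && echo "$dev: candidate for flags offload"
#   done
```

A card that reports the feature as off and fixed cannot be switched on, so there is no point adding the offload flag for it.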
eBPF and XDP
If this isn’t fast enough, you can use eBPF to take action on packets before the kernel starts working on them. If your card supports XDP, you can do that while they are still in the network card!
This is mostly useful for dropping trash and mitigating denial of service attacks without bothering the CPU. There’s also a case where you move packets around in a Kubernetes cluster without looking at them closely, for the sake of speed. Cloudflare uses the xt_bpf extension to embed BPF programs in iptables rules, but it’s not easy.
It’s not generally useful, as in most cases you’ll have to look at the details of the packet anyway. In this case the best you can do is the flowtable you just created.
Though if you’re under DDoS attacks, consider xdp-filter, and for large-scale load balancing you might look at the other eBPF solutions.
If you’re running in a container of some kind, you can pass a card directly to the instance (SR-IOV or full PCI) to take advantage of it.
High Availability
This is best done with a pair of routers that share a virtual IP managed by keepalived. It’s fairly simple to deploy as you just:
- Set up the first router with all the configuration you need. (let’s assume you’ve done this already)
- Install keepalived with a simple config.
- Clone and tweak so the IPs, Hostname, Config works.
Change Admin IPs
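Your router still has a DHCP WAN address, and that’s fine. But you’ll need to select something static for the virtual WAN IP address you’re about to create. I’ve used 192.168.1.2, but adjust as needed. On the LAN side, 192.168.2.1 becomes the virtual address, so each router’s own eth1 address needs to move to something else first (the DNS examples later use 192.168.2.2 and 192.168.2.3).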
Install Keepalived
# Just install the core server, without perl and ipvsadm
sudo apt install --no-install-recommends keepalived
Create a Virtual LAN and WAN Address
sudo vi /etc/keepalived/keepalived.conf
You’ll note that we’ve set the state to “BACKUP” and we’ll use that on both. This keeps them from flapping during any intermittent issues.
vrrp_instance WAN {
state BACKUP
interface eth0
virtual_router_id 51
virtual_ipaddress {
192.168.1.2/24
}
}
vrrp_instance LAN {
state BACKUP
interface eth1
virtual_router_id 52
virtual_ipaddress {
192.168.2.1/24
}
}
Add Firewall Rules
Add this at the bottom of your chain input section so you accept VRRP traffic. 224.0.0.18 is a reserved IPv4 multicast address specifically designated for the Virtual Router Redundancy Protocol.
...
...
table inet filter {
chain input {
...
...
# Allow VRRP on both relevant interfaces. 224.0.0.18 is the VRRP multicast address
iif { $WAN, $LAN } ip protocol vrrp accept
iif { $WAN, $LAN } ip daddr 224.0.0.18 accept
}
}
Test and Clone
Start the service and observe the log to see the status messages.
sudo systemctl start keepalived.service
sudo journalctl -u 'keepalived*'
If things go well, you’ll see it start as backup, then quickly enter master when it doesn’t find any peers to say otherwise. An ip a will show the virtual IPs in place.
Clone the box (or the config), change the LAN IP so it doesn’t conflict, and bring both up. You can experiment by taking the service up and down to watch the IP shift.
Active vs Passive
This is an Active/Passive arrangement. For the most part, there’s no advantage to Active/Active. Either system can move traffic faster than the uplink permits. If you do have more traffic than either one can move on its own, you can’t take one down.
The main benefit of Active/Active is that you know it works. You don’t want to be surprised during an outage. It allows you to gracefully drain traffic for controlled maintenance. Also, you can’t always get a 200G system.
To set up Active/Active for capacity, you create a team of three or more. Each is MASTER of a separate VIP and they all back each other up. DHCP passes out the gateway addresses round-robin to balance load. Keep in mind that traffic shaping with tc happens after nftables, so you can’t shape your way out of a capacity issue.
Keeping Them Synced
How do you keep them in sync? You can probably just remember to update rules only on router-1 and then run rsync. Though you’ll need to grant yourself (or a service account) permissions.
# On both routers
## Change the owner of the nftable data
sudo chown -R ${USER} /etc/nftables*
## Allow yourself to run nft without a password prompt so you can automate the reload
echo "$USER ALL=(root) NOPASSWD: /usr/sbin/nft" | sudo tee /etc/sudoers.d/nft
sudo chmod 440 /etc/sudoers.d/nft
# On your primary router
## Generate a key and copy it to the other router (make sure dns resolves it correctly)
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ''
ssh-copy-id router-2
## Run a sync
rsync --archive --delete --inplace --verbose /etc/nftables.conf /etc/nftables router-2:/etc/
## Trigger a reload on the other router
ssh router-2 sudo nft -f /etc/nftables.conf
You can probably put those last two commands into the bash file go-go-geo-nft to save typing them out. Even better, create a systemd watcher that detects changes and syncs automatically.
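Here is one way that helper script might look. It is only a sketch: the `run` and `sync_router` names and the `DRY_RUN` switch are assumptions, and it adds a local `nft -c` check so a config that doesn’t parse is never pushed to the peer.

```shell
# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Check locally, copy the config to the peer, then reload it there.
sync_router() {
    peer="${1:-router-2}"
    run sudo nft -c -f /etc/nftables.conf || return 1
    run rsync --archive --delete --inplace /etc/nftables.conf /etc/nftables "${peer}:/etc/" || return 1
    run ssh "$peer" sudo nft -f /etc/nftables.conf
}

# Usage:
#   DRY_RUN=1 sync_router router-2   # show what would happen
#   sync_router router-2             # do it
```

Stopping at the first failure means a broken rsync never triggers a reload of a half-copied config on the other router.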
Election and Priority
The IP will only move when one system goes offline. If both come back up perfectly at once, they’ll have an election and the highest management IP wins. You can override this with an explicit priority or script it if you have reason to.
You can also change one to state MASTER and it will take over the role whenever it’s up. This is useful when you have a hardware preference or are running additional software that can’t do multi-master.
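For the explicit priority route, the only change is one extra `priority` line per instance, with a higher number on the router you prefer. The generator below is an illustrative sketch; `vrrp_instance` as a shell function and the value 150 in the usage line are assumptions, not taken from a tested deployment.

```shell
# Print a vrrp_instance block with an explicit priority (higher wins the election).
# Arguments: instance name, interface, virtual router id, virtual IP, priority.
vrrp_instance() {
    name="$1"; iface="$2"; vrid="$3"; vip="$4"; prio="$5"
    cat <<EOF
vrrp_instance ${name} {
    state BACKUP
    interface ${iface}
    virtual_router_id ${vrid}
    priority ${prio}
    virtual_ipaddress {
        ${vip}
    }
}
EOF
}

# Usage on the preferred router:
#   vrrp_instance WAN eth0 51 192.168.1.2/24 150
```

Bear in mind that keepalived preempts by default, so a higher-priority router takes the address back when it returns. The `nopreempt` keyword changes that if you would rather the address stay put.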
Elections are held via multicast (the 224 address), but unicast is also supported if you’re in a cloud environment that blocks multicast.
If you have other software running and want to bind it to the virtual ip before it becomes instantiated, look at adding net.ipv4.ip_nonlocal_bind=1 to the /etc/sysctl.conf. Though for things like dnsmasq it’s not needed.
Connection Tracking
The only other issue with failing over is connection state. When you fail over, any session-dependent applications (like SSH) or games get dropped. This isn’t a large issue as the web is (mostly) stateless. But you can deploy conntrackd if desired.
Scaling Up
Single Device
These days, you can expect 100–400 concurrent connections per client device at peak, so you can outgrow a single router faster than you think. On a single device, you can optimize some kernel settings.
# Create a drop-in configuration file in /etc/sysctl.d for these settings
net.core.netdev_max_backlog=250000
net.core.somaxconn=65535
net.ipv4.ip_local_port_range=1024 65535
net.ipv4.tcp_tw_reuse=1
net.netfilter.nf_conntrack_max=2000000
net.netfilter.nf_conntrack_tcp_timeout_established=86400
net.netfilter.nf_conntrack_generic_timeout=120
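To judge whether nf_conntrack_max is sized sensibly, you can compare it with the number of entries currently tracked. A rough sketch; the `conntrack_usage_pct` name is invented here, and the /proc files it reads by default only exist once connection tracking is loaded.

```shell
# Print conntrack table usage as a whole-number percentage.
# Arguments default to the live kernel values; pass numbers to experiment.
conntrack_usage_pct() {
    count="${1:-$(cat /proc/sys/net/netfilter/nf_conntrack_count)}"
    max="${2:-$(cat /proc/sys/net/netfilter/nf_conntrack_max)}"
    echo $(( count * 100 / max ))
}

# Usage:
#   echo "conntrack table is $(conntrack_usage_pct)% full"
```

If the figure sits near 100 at peak, new connections are at risk of being dropped and the limit is worth raising.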
Smart Switches
On an old-style campus WAN, you have (expensive) layer 3 switches and routers that use ECMP (equal-cost multi-path routing) to spread traffic across multiple paths. This has wider compatibility than aggregating ports with LAG. You’d think you could just distribute multiple gateways as part of DHCP, but clients handle that randomly or not at all.
Become More Active
A more modern and cost effective strategy is to move from an Active/Passive HA pair, to an Active/Active/Active set of peers.
- Create 3 or more routers
- Create 3 or more VRRP instances, where a different router is MASTER in each and the others are BACKUP
- Distribute gateways via DHCP in round-robin to distribute load
IPv6 To The Rescue
If you can handle something new, however, use an IPv6-first design.
- Design a full IPv6‑first + IPv4 fallback addressing plan
- Browsers attempt IPv6 first
- Mobile OSes heavily favor IPv6
- Major content providers publish IPv6 natively
- Less NAT to handle.
DNS and DHCP
DHCP and DNS are optional, but sometimes expected in small environments and frequently under other duties as assigned for such systems. Dnsmasq is the go-to for most. I’ve found that particular software doesn’t scale well beyond a few thousand clients, but if you’ve gotten to that point, you’d want a dedicated system anyway. So it’s fine for small scale.
This is covered in many other locations but I’ll toss a sample config here just in case.
# Remove systemd-resolved if installed by the default debian instance
sudo systemctl disable --now systemd-resolved.service
sudo apt install dnsmasq
DNS
You’ll need to accept DNS requests by adding another accept at the bottom. UDP is traditional, but TCP is also used these days. Add this at the bottom of your input chain.
table inet filter {
chain input {
...
...
# Allow DNS queries
iif $LAN udp dport "domain" accept
iif $LAN tcp dport "domain" accept
}
}
By default, dnsmasq just fires up and starts work by caching DNS queries. It looks at your /etc/resolv.conf to see where to forward requests and your /etc/hosts for known hosts. You may want to tighten things up a bit with these.
sudo vi /etc/dnsmasq.conf
Just add these to the bottom.
# Don't forward plain names or non-routable IPs
domain-needed
bogus-priv
# Listen only on the LAN
interface=lo
interface=eth1
# Upstream DNS servers to forward requests to
server=8.8.8.8
server=8.8.4.4
# Cache size (1000 is a good start)
cache-size=1000
# Don't use the hosts file for entries
no-hosts
Create a separate file for host entries as well. This will let you sync more easily later. Since Debian’s packaging of dnsmasq makes use of a drop folder, let’s go ahead and embrace that and add entries in the native format.
sudo vi /etc/dnsmasq.d/hosts.conf
host-record=gateway,192.168.2.1
host-record=router-1,192.168.2.2
host-record=router-2,192.168.2.3
host-record=some-host,192.168.2.10
host-record=some-other-host,192.168.2.11
...
...
# Restart so dnsmasq reads the new config file.
sudo systemctl restart dnsmasq
# Tell the system to use itself for resolution as well.
echo "nameserver 127.0.0.1" | sudo tee /etc/resolv.conf
If you went to the trouble to set up sync above for nft, you can piggyback on that for DNS.
# On both boxes
sudo test -f /etc/dnsmasq.d/hosts.conf || sudo touch /etc/dnsmasq.d/hosts.conf
sudo chgrp ${USER} /etc/dnsmasq.d/hosts.conf
sudo chmod 664 /etc/dnsmasq.d/hosts.conf
echo "$USER ALL=(root) NOPASSWD: /bin/systemctl restart dnsmasq, /bin/systemctl reload dnsmasq" | sudo tee /etc/sudoers.d/dnsmasq
sudo chmod 440 /etc/sudoers.d/dnsmasq
# On router-1, you can send the file and restart (reload doesn't load data from hosts.conf)
rsync --archive --delete --inplace --no-times --verbose /etc/dnsmasq.d/hosts.conf router-2:/etc/dnsmasq.d/
ssh router-2 sudo systemctl restart dnsmasq
DHCP
You can pass out addresses as well. Just follow up the last bit with this. Though you may want to leave these commented out until you’re ready, to avoid confusing the overlay networks.
dhcp-range=192.168.2.100,192.168.2.200,12h
dhcp-option=option:router,192.168.2.1
dhcp-authoritative
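With the input policy set to drop, DHCP requests from clients won’t reach dnsmasq unless the input chain accepts them, just as with DNS. The sketch below prints the rule to add at the bottom of your input chain, assuming the `$LAN` definition from earlier; the function name `dhcp_input_rule` is made up for illustration.

```shell
# Print the input-chain rule that lets LAN clients reach the DHCP server (UDP port 67).
dhcp_input_rule() {
    printf '%s\n' \
        '# Allow DHCP requests from the LAN' \
        'iif $LAN udp dport 67 accept'
}

# Usage: print the lines, then paste them next to the DNS rules in chain input
#   dhcp_input_rule
```

Replies leave through the output chain, which is still set to accept, so only the incoming side needs a rule.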
Bonus Points
You have a couple of things worth automating. Changes to nft rules and updates to the GeoIP database.
Automate GeoIP
Those IP sets are refreshed the 1st of every month on db-ip.com. We should refresh accordingly. On Debian 13, systemd timers are preferred (cron isn’t even installed).
sudo vi /etc/systemd/system/geo-nft-update.service
[Unit]
Description=Update geo-nft IP datasets
Wants=network-online.target
After=network-online.target
[Service]
Type=oneshot
ExecStart=/etc/nftables/geo-nft/geo-nft.sh
sudo nano /etc/systemd/system/geo-nft-update.timer
[Unit]
Description=Monthly geo-nft update (3rd day)
[Timer]
OnCalendar=*-*-03 03:30:00
Persistent=true
RandomizedDelaySec=30m
[Install]
WantedBy=timers.target
# Just enable the timer. We don't want the service to run unless the timer starts it.
sudo systemctl daemon-reload
sudo systemctl enable --now geo-nft-update.timer
Config Sync
You can also use systemd to watch for file changes and trigger a sync and reload. Note the User= setting. This must be your account (or whatever account you used in setting up the sync and ssh above) to work as the manual process did. Alternately, create a service account.
sudo vi /etc/systemd/system/router-sync.service
[Unit]
Description=Sync router configuration to router-2
After=network-online.target
Wants=network-online.target
[Service]
Type=oneshot
User=YOU
ExecStart=/usr/bin/sudo /usr/sbin/nft -f /etc/nftables.conf
ExecStart=/usr/bin/rsync --archive --delete --inplace --no-times /etc/nftables.conf /etc/nftables router-2:/etc/
ExecStart=/usr/bin/ssh router-2 sudo nft -f /etc/nftables.conf
ExecStart=/usr/bin/sudo systemctl restart dnsmasq
ExecStart=/usr/bin/rsync --archive --delete --inplace --no-times /etc/dnsmasq.d/hosts.conf router-2:/etc/dnsmasq.d/
ExecStart=/usr/bin/ssh router-2 sudo systemctl restart dnsmasq
sudo vi /etc/systemd/system/router-sync.path
[Unit]
Description=Watch router config files
[Path]
PathChanged=/etc/nftables.conf
PathModified=/etc/nftables
PathChanged=/etc/dnsmasq.d/hosts.conf
Unit=router-sync.service
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now router-sync.path
I put both services in this watcher as an example. A better design is a separate watcher for each, to reduce churn if you have frequent changes.
Replace GeoIP
The tool we’re using downloads the full set of countries as a .csv and then converts them into nft definitions. It’s a bit slow and heavy. You can replace it with something like this and get a more compact (and perhaps better) CIDR format.
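COUNTRY=ad
OUT=/etc/nftables/sets/${COUNTRY}.nft
tmp=$(mktemp)
{
# Wrap the set in the table so the file works with the sets/*.nft include
echo "table inet filter {"
echo " set ${COUNTRY}_ipv4 {"
echo "  type ipv4_addr"
echo "  flags interval"
echo "  elements = {"
# One network per line, comma separated, with no comma after the last one
curl -s https://www.ipdeny.com/ipblocks/data/countries/${COUNTRY}.zone | \
awk '{print "   "$1","}' | sed '$ s/,$//'
echo "  }"
echo " }"
echo "}"
} > "$tmp"
mv "$tmp" "$OUT"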
Troubleshooting
In a container, forwarding just stops. Restarting nftables has no effect, but restarting systemd-networkd fixes it.
Incus especially likes to replace the bridge interfaces periodically.
sudo journalctl -u systemd-networkd -u NetworkManager --since "1 day ago"
May 28 12:55:50 shire1 systemd-networkd[486]: physSf5Ar4: Interface name change detected, renamed to vethe8067f8a.
May 28 12:55:50 shire1 systemd-networkd[486]: physnLFcl3: Interface name change detected, renamed to vethd880fdd5.
May 28 12:55:50 shire1 systemd-networkd[486]: vethc3f8a16b: Link UP
May 28 12:55:50 shire1 systemd-networkd[486]: vethc3f8a16b: Link DOWN
May 28 12:55:50 shire1 systemd-networkd[486]: veth243c40db: Link UP
May 28 12:55:50 shire1 systemd-networkd[486]: veth243c40db: Link DOWN
May 28 12:55:51 shire1 systemd-networkd[486]: veth24f2e389: Link UP
May 28 12:55:51 shire1 systemd-networkd[486]: vethff99fb0d: Link UP
May 28 12:55:51 shire1 systemd-networkd[486]: veth24f2e389: Gained carrier
May 28 12:55:51 shire1 systemd-networkd[486]: vethff99fb0d: Gained carrier
This causes a loss of the static mappings to the interfaces, which breaks how iif and oif in your rules work. You must instead use iifname and oifname, which are matched dynamically by name. It’s slightly slower, but needed in some containers.
sudo sed -i.bak -E \
-e 's/\biif (\$WAN|\$LAN|\{ \$WAN)/iifname \1/g' \
-e 's/\boif (\$WAN|\$LAN|\{ \$WAN)/oifname \1/g' \
/etc/nftables.conf
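The command above only touches the main file, while the forwards and any other includes also use iif and oif. One option is to preview the same substitutions across every file before committing to them. A sketch, assuming GNU sed (for `\b`) and the directory layout used on this page; the `to_ifname` name is hypothetical.

```shell
# Rewrite iif/oif to iifname/oifname on stdin, using the same expressions as above.
to_ifname() {
    sed -E \
        -e 's/\biif (\$WAN|\$LAN|\{ \$WAN)/iifname \1/g' \
        -e 's/\boif (\$WAN|\$LAN|\{ \$WAN)/oifname \1/g'
}

# Preview what would change across the main file and every include:
#   for f in /etc/nftables.conf /etc/nftables/*/*.nft; do
#       to_ifname < "$f" | diff -u "$f" - || true
#   done
```

Because it reads stdin and writes stdout, nothing on disk changes until you are happy with the diff.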
[1] I’ve read Rem Koolhaas meant this as mockery, so let’s think of it as “Simplify, then add lightness” as Chapman would say.
[2] I don’t have a good attribution for this other than the many articles that assert this too.
[3] https://wiki.nftables.org/wiki-nftables/index.php/Configuring_chains#Base_chain_types