
One problem with having decent iptables compatibility layers for nftables is that there's a dearth of documentation for the new stuff; even 2024 Stack Overflow questions (and answers) just use the iptables commands underneath. So, when I first set up the "obvious" nft commands to redirect :80 and :443 to internal non-privileged ports (which the "quadlet" PublishPort= directive would then re-namespace internally so that nginx would see them as 80 and 443 again) it worked, but only from off-box: connections from on-box (to the DNS name and thus the outside IP address) didn't go through that path. Not great, but most of my searching turned up incorrect suggestions of turning on route_localnet, so I just left it that way.

Then a Christmas Day power outage led me to discover that there was a race condition between the primary nftables service starting up and populating the "generic" tables, and my service adding to those tables, which caused the forwarding rules to fail to start.1 At least the command error showed up clearly in journalctl.

Application-specific tables

The one other service that I found using nftables is sshguard2 - which doesn't depend on the nftables service starting at all. Instead, sshguard.service does nft add table in an ExecStartPre and a corresponding nft delete table in ExecStopPost - then the sshguard command itself just adds a blacklist chain and updates an "attackers" set directly. This doesn't need any user-space parts of nftables to be up, just the kernel.

I updated my service file to do the same sort of thing, but all expanded out inline - used ExecStartPre to add table (and add chain), a matching delete table in ExecStopPost, and then applied the actual port redirects with ExecStart. The main rules were unchanged - a prerouting tcp dport 443 redirect worked the way it did before, only handling outside traffic; the new trick was to add an oifname "lo" tcp dport 443 redirect to an output chain, since locally-initiated traffic doesn't go through prerouting. (Likewise for port 80 since I still have explicit 301 redirects that promote http to https.)
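A minimal sketch of what such a unit could look like, written out by a shell snippet so it can be inspected before installing. The table name webfwd, the target ports 8080 and 8443, and the /usr/sbin/nft path are assumptions for illustration, not the actual values; the unit name reuses pod-forward-web-ports.service. RemainAfterExit=yes is also an assumption: as far as I know, a oneshot unit without it goes inactive as soon as the last ExecStart finishes, which would run ExecStopPost and delete the table right away.

```shell
# Write a sketch of the unit into a scratch directory; copy it to
# /etc/systemd/system/ by hand once it looks right.
unit_dir="${UNIT_DIR:-$(mktemp -d)}"
unit_file="$unit_dir/pod-forward-web-ports.service"

cat > "$unit_file" <<'EOF'
[Unit]
Description=Redirect ports 80/443 to unprivileged ports (sketch)

[Service]
Type=oneshot
# Assumption: keeps the unit "active" so ExecStopPost only runs on stop.
RemainAfterExit=yes
# Hypothetical table name "webfwd"; only the kernel side of nftables is needed.
# A literal ";" has to be written as "\;" in a systemd Exec line.
ExecStartPre=/usr/sbin/nft add table ip webfwd
ExecStartPre=/usr/sbin/nft add chain ip webfwd prerouting { type nat hook prerouting priority -100 \; }
ExecStartPre=/usr/sbin/nft add chain ip webfwd output { type nat hook output priority -100 \; }
# Outside traffic arrives through prerouting.
ExecStart=/usr/sbin/nft add rule ip webfwd prerouting tcp dport 80 counter redirect to :8080
ExecStart=/usr/sbin/nft add rule ip webfwd prerouting tcp dport 443 counter redirect to :8443
# Locally-initiated traffic skips prerouting, so match it in the output hook.
ExecStart=/usr/sbin/nft add rule ip webfwd output oifname lo tcp dport 80 counter redirect to :8080
ExecStart=/usr/sbin/nft add rule ip webfwd output oifname lo tcp dport 443 counter redirect to :8443
ExecStopPost=/usr/sbin/nft delete table ip webfwd

[Install]
WantedBy=multi-user.target
EOF

echo "wrote $unit_file"
```

Because the table belongs to this unit alone, deleting it on stop removes every rule in one step, and nothing here depends on the "generic" tables from nftables.service existing yet.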

The cherry on top was to add counter to all of the rules - so I could see that each rule was firing for the appropriate kind of traffic, just by running curl commands bracketed by nft list table ... |grep counter and seeing the numbers go up.
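The bracketing check can be wrapped in a small filter. This is only a sketch: it assumes the usual "counter packets N bytes M" text in nft list table output, and the function name counter_packets, the table name webfwd, and the URL are hypothetical.

```shell
# Read `nft list table ...` output on stdin; for each rule that has a
# counter, print the packet count followed by the rest of the rule.
counter_packets() {
  sed -n 's/^[[:space:]]*\(.*\) counter packets \([0-9][0-9]*\) bytes [0-9][0-9]*\(.*\)$/\2 \1\3/p'
}

# Usage (as root; table name and URL are placeholders):
#   nft list table ip webfwd | counter_packets
#   curl -sS -o /dev/null https://example.com/
#   nft list table ip webfwd | counter_packets
```

Running the curl from on-box should bump only the output-chain rules, and running it from another machine should bump only the prerouting ones.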

Victory, mostly

The main reason for fixing localhost is that it let me run a bunch of "the websites are serving what I expect them to be" tests locally rather than on yet another machine. This is a bit of all-eggs-in-one-basket, but I'll probably just include a basic "is that machine even up" check on a specifically-off-site linode, or maybe a friend's machine - really, if the web server machine is entirely offline I don't need precise detail of each service anyway.


  1. This is the "normal" systemd problem, where After= means "after that other service successfully starts" which isn't good enough for trivial .service files that don't explicitly cooperate. It might be possible to fix this by moving the nft -f /etc/nftables.conf from ExecStart= to ExecStartPre= (since it's a Type=oneshot service anyway) so that it exits before the job is considered "started" but the sshguard-inspired workaround is cleaner anyway. 

  2. sshguard is a simple, configurable "watchdog" that detects things like repeated ssh authorization failures (by reading the logs in realtime) and responds with various levels of blackhole routes, using nftables to drop the packets. Professional grade, as long as you have an alternate way back in if it locks you out. 

The ultimate goal of the Popular Web servers discussion was to actually make up my mind as to what to actually run. The diversity of options made me realize that SSL termination and web serving was an inherently modular thing, and since I wanted some amount of isolation for it anyway, this would be a good opportunity to get comfortable with podman.

What shape is the modularity?

The interface has three basic legs:

  • Listen on tcp ports 80 and 443
  • Read a narrow collection of explicitly exported files
  • (later) connect to other "service" containers.

(Some of the options in the survey, like haproxy, only do the "listen" and "connect" parts, but that reduces the "read files" part to running a "static files only" file server container (which has access to the collection of files) and having haproxy connect to that. For the first pass I'm not actually going to do that, but it's good to know in advance that this "shape" works.)

Listening ports

If I'm running this without privileges, how is it going to use traditionally "reserved" ports? Options include

  • have systemd listen on them and pass a filehandle in to the container
  • run a socat service to do the listening and reconnecting
  • lower /proc/sys/net/ipv4/ip_unprivileged_port_start from 1024 to 79
  • use firewall rules to "translate" those ports to some higher numbered ones.

I actually used the last one: a pair of nft commands, run from a Type=oneshot systemd service file, each of the form add rule ip nat PREROUTING tcp dport 80 redirect to (the unprivileged target port), one for port 80 and one for port 443. This seemed like the simplest bit of limited privilege to apply to this problem, as well as being efficient (no packet copying outside the kernel, just NAT address rewriting) - but do let me know if there's some other interface that would also do this.
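For concreteness, the pair of rules might be generated like this. It is a sketch under assumptions: the target ports 8080 and 8443 and the helper name redirect_rules are hypothetical, and it assumes the ip nat table and its PREROUTING chain already exist when the rules are applied.

```shell
# Print one nft rule per "public:internal" port pair, in the
# command-per-line form that `nft -f -` accepts.
redirect_rules() {
  for pair in "$@"; do
    public="${pair%%:*}"     # text before the colon
    internal="${pair##*:}"   # text after the colon
    echo "add rule ip nat PREROUTING tcp dport $public redirect to :$internal"
  done
}

# Dry run: just show the rules.
redirect_rules 80:8080 443:8443
# To apply for real (as root): redirect_rules 80:8080 443:8443 | nft -f -
```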

Reading a set of files

docker and podman both have simple "volume" (actually "bind mount") support to mount an outside directory into the container; this also gives us some administrative options on the outside, like moving around the disks that the files are on, or combining multiple directories, without changing the internals at all.

Currently, the directory is mounted as /www inside the container, and I went with the convention of /www/example.com to have a directory for each FQDN. (For now this means a bunch of copy&paste in the nginx.conf but eventually it should involve some more automation than that, though possibly on the outside.)1

In order to enable adding new sites without restarting the container, the nginx.conf is also mounted from the outside, as a single-file bind mount. Using podman exec to run nginx -s reload applies the changes without restarting the container, and allows for automatic generation of the config from outside, without allowing the container itself access to change the configuration.

Connecting to other containers

(Details to follow, pending actually using this feature; for now it's sufficient to know that the general model makes sense.)

Why podman over docker?

podman has a bunch of interesting advantages over docker:

  • actual privilege isolation - docker itself manages access to a service that does all of the work as root; podman actually makes much more aggressive use of namespaces, and doesn't have a daemon at all, which also makes it easier to manage the containers themselves.
  • podman started enough later than docker that they were able to make better design choices simply by looking at things that went wrong with docker and avoiding them, while still maintaining enough compatibility that it remained easy to translate experience with one into success with the other - from a unix perspective, less "emacs vs vi" and more "nvi vs vim".

Mount points

Originally I did the obvious Volume mount of nginx.conf from the git checkout into /etc/nginx/nginx.conf inside the container. Inconveniently - but correctly2 - doing git pull to change that file does the usual atomic-replace, so there's a new file (and new inode number) but the old mount point is still pointing to the old inode.

The alternative approach is to mount a subdirectory with the conf file in it, and then symlink that file inside the container.3
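The inode behaviour is easy to reproduce without any containers. The sketch below uses a scratch directory and made-up file contents: it replaces a file by write-then-rename, shows that the inode number changes, and shows that a symlink resolved through the directory still finds the new file, which is the property the directory-mount-plus-symlink layout relies on.

```shell
workdir="$(mktemp -d)"
mkdir "$workdir/conf"
echo "version 1" > "$workdir/conf/nginx.conf"
inode_before="$(ls -i "$workdir/conf/nginx.conf" | awk '{print $1}')"

# Atomic replace: write a new file, then rename it over the old name.
echo "version 2" > "$workdir/conf/nginx.conf.tmp"
mv "$workdir/conf/nginx.conf.tmp" "$workdir/conf/nginx.conf"
inode_after="$(ls -i "$workdir/conf/nginx.conf" | awk '{print $1}')"

# A single-file bind mount stays attached to inode_before; the name now
# refers to inode_after.
echo "inode before: $inode_before, after: $inode_after"

# A symlink into the directory looks the name up again on every open,
# so it sees the replacement.
ln -s conf/nginx.conf "$workdir/nginx.conf"
cat "$workdir/nginx.conf"
```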

LetsEncrypt

We need the certbot and python3-certbot-nginx packages installed in the pod. python3-certbot-nginx handles adjusting the nginx config during certbot operation (see github:certbot/certbot for the guts of it.)

Currently, we stuff these into the primary nginx pod, because it needs to control the live webserver to show that it controls the live webserver.

When used interactively, certbot tells you that "Certbot has set up a scheduled task to automatically renew this certificate in the background." What this actually means is that it provides a crontab entry (in /etc/cron.d/certbot) and a system timer (certbot.timer) which is great... except that in our podman config, we run nginx as pid 1 of the container, don't run systemd, and don't even have cron installed. Not a problem - we just create the crontab externally, and have it run certbot under podman periodically.
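A sketch of what the external job might run. The container name systemd-nginxpod is the one used in the commands later in this post; the function name, script path, and schedule are hypothetical, and using --deploy-hook to reload nginx after a successful renewal is an assumption about what the setup needs rather than a record of what the crontab actually does.

```shell
# Meant to live in a small host-side script that a crontab entry calls, e.g.
# (hypothetical path and schedule):
#   17 3 * * * /usr/local/bin/renew-certs.sh
renew_certs() {
  # certbot runs inside the container; the deploy hook only fires when a
  # certificate was actually renewed, and also runs inside the container.
  podman exec systemd-nginxpod certbot renew --quiet \
    --deploy-hook "nginx -s reload"
}

# In the real script, finish with a call to: renew_certs
```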

Quadlets

Quadlets are just a new type of systemd "Unit file" with a new [Container] section; everything from the podman commandline should be expressible in the .container file. For the nginx case, we just need Image=, PublishPort=4, and a handful of Volume= stanzas.

Note that if you could run the podman commands as you, the .container Unit can also be a systemd "User Unit" that doesn't need any additional privileges (possibly a loginctl enable-linger but with Ubuntu 24.04 I didn't actually need that.)
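As an illustration, a .container file along these lines could be written out as below. Everything specific in it is an assumption: the image name, the host ports 8080 and 8443, and the host paths are placeholders. The file name nginxpod.container is chosen because, as far as I know, Quadlet names the container systemd- plus the unit name by default, which would give the systemd-nginxpod name used in the commands in this post.

```shell
# Write the sketch to a scratch directory; for a user unit the real
# location would be ~/.config/containers/systemd/.
quadlet_dir="${QUADLET_DIR:-$(mktemp -d)}"
quadlet_file="$quadlet_dir/nginxpod.container"

cat > "$quadlet_file" <<'EOF'
[Unit]
Description=nginx in rootless podman (sketch)

[Container]
# Hypothetical image; it needs nginx plus the certbot packages.
Image=localhost/nginx-certbot:latest
# host-port:container-port, with unprivileged ports on the host side.
PublishPort=8080:80
PublishPort=8443:443
# Hypothetical host paths: per-FQDN content, and a directory holding nginx.conf.
Volume=/srv/www:/www:ro
Volume=/srv/nginx-conf:/etc/nginx/outside

[Install]
WantedBy=default.target
EOF

echo "wrote $quadlet_file"
# After copying it into place:
#   systemctl --user daemon-reload
#   systemctl --user start nginxpod.service
```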

Walkthrough of adding a new site on a new FQDN

DNS

Start with DNS. Register a domain (in this case, thok.site), get it pointed to your nameserver5, and have that nameserver point to the webserver.

Certbot

Once certbot is registered,

$ podman exec systemd-nginxpod certbot certonly --nginx --domain thok.site

takes a little while and then gets the certificate. Note that at this point, the nginx running in that pod knows nothing about the domain; certbot is doing all the work.

Get the site content

I have a Makefile that runs git clone to get the site source, or git pull if it's already present, and then uses ssite build to generate the HTML in a separate directory (that the nginx pod has mounted.)
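The clone-or-pull step, pulled out of Make into plain shell as a sketch. The function name and its arguments are hypothetical, and the ssite build step is left out because its arguments depend on the site layout.

```shell
# Clone the site source on the first run, update it on later runs.
fetch_site() {
  repo_url="$1"
  checkout_dir="$2"
  if [ -d "$checkout_dir/.git" ]; then
    git -C "$checkout_dir" pull
  else
    git clone "$repo_url" "$checkout_dir"
  fi
}

# Usage (placeholder URL and path):
#   fetch_site https://example.com/thok.site.git ./src/thok.site
```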

Update the common nginx.conf

Currently nginx.conf is generated with cogapp, so it's just a matter of adding

# [[[cog https("thok.site") ]]]
# [[[end]]]

and rerunning cogapp to expand it in place.

Kick nginx

make reload in the same Makefile, which just does

$ podman exec systemd-nginxpod nginx -s reload

Done! Check it...

At this point, the site is live. (Yes, the very site you're reading this on; the previous sites all had debugging steps that made the notes a lot less clear, so I didn't have a clean set of directions previously...) Check it in a client browser, and add it to whatever monitoring you have.

Conclusions

So we now have a relatively simple path from "an idea and some writing" to "live website with basic presentation of content". A bit too much copy-and-paste currently, and the helper Makefile really needs to be parameterized or become an outright standalone tool. (Moving certbot to a separate pod also needs investigating.) Now back to the original tasks of moving web servers off of old hardware, and (instead of pontificating) actually blogging!


  1. Not yet as automated as I'd like, but currently using Ned Batchelder's Cog to write macros in python and have them update the nginx config in-place in the same file. Eliminates a bunch of data entry errors, but isn't quite an automatic "find the content directories and infer the config from them" - but it is the kind of rope that could become that. 

  2. While this is a little messy for a single config file, it would be a reasonable direction to skip the symlinks and just have a top-level config file inside the container include subdir/*.conf to pick up all of the (presumably generated) files there, one per site. This is only an organizational convenience, the resulting configuration is identical to having the same content in-line, and it's not clear there's any point to heading down that path instead of just generating them automatically from the content and never directly editing them in the first place. 

  3. The PublishPort option just makes the "local" aliases for ports 80 and 443 appear inside the container as 80 and 443; there's a separate pod-forward-web-ports.service that runs the nftable commands (with root) as a "oneshot" systemd System Service. 

  4. In my case, that means "update the zone file with emacs, so it auto-updates the serial number" and then push it to my CVS server; then get all of the actual servers to CVS pull it and reload bind. 

I posted1 a poll on mastodon:

What's your choice for an internet-facing web server, in 2024? Security over performance, ease (or lack) of configuration is a bonus; presence in Debian or Ubuntu preferred but anything I can build from source (so, probably not written in go) and has a good CVE story is of interest.

In the initial set I included Apache, nginx, lighttpd, caddy, and webfs (based on them showing up in popcon.) So far nginx is in the lead with caddy and Apache surprisingly close to tied, but the fascinating bit was the followups about servers that either I hadn't heard of, or didn't realize qualified. (It got boosted early on by Tim Bray and Glyph, which got it much broader attention than I expected, and which I believe really helped reach people who provided some of the more unusual followups.)

Twisted Python

Twisted is actually packaged and has decades of usage - it just wasn't tagged with the httpd virtual package in debian so it didn't come up in my original search. (It also doesn't currently include any config to run by default, but really it would just be a basic .service file to invoke twisted web with some arguments.) Glyph points out that twisted's TLS support is in C, but that parsing HTTP with C in 2024 is just asking for trouble.

Kestrel+YARP

Kestrel is the web server component of DotNet Core - this combination handles all of the app service front end traffic for Azure.

YARP itself is a standalone MIT-licensed reverse proxy written in C# (nothing to do with Edgar Wright.)

Thanks to Blake Coverett for pointing this one out (they used it under Debian and Ubuntu in production!) but the DotNet ecosystem is pretty far outside my comfort zone/tech bubble.

OpenBSD httpd

OpenBSD ships a default http server with strong security and simple configuration. This does look solid and would be high on the list if I were running OpenBSD - there's some risk that it uses OpenBSD's advanced isolation features in ways that a naïve Linux port might not get right, but if I find an active one I'll look further.

There's an AsiaBSDCon 2015 paper which describes the history of it replacing nginx (which itself replaced an Apache 1 fork) as the native OpenBSD web server; this includes a long discussion of their attempts to harden nginx that are worth a look in terms of secure software development challenges.

haproxy

Marko Karppinen pointed out that haproxy (which is packaged but doesn't Provides: httpd either) actually works directly as a web server - no direct file support, but it can terminate HTTPS connections and pass the connections on to HTTP backends. (As of haproxy 2.8, acme.sh can update a running haproxy directly, without disruptive restarts.)

Traefik

Gigantos pointed out that Traefik can also terminate HTTPS directly, and has builtin ACME (Let's Encrypt) support as well as being able to do service discovery instead of needing direct per-site configuration - depending on the shape of those providers that might not end up being less work but it's arguably putting the information in a more correct place.

NGINX Unit

PointlessOne suggested that for very dynamic backends, NGINX Unit was worth a look - it supports a huge variety of languages while still having attention on security and performance.

Apache with mod_md

Most of the comments on apache were about how "it still works" and has had decades of attention, but Marcus Bointon pointed out mod_md, which adds ACME support directly as an Apache Module (shipped with apache since 2.4.30, which predates Ubuntu 20.04, so it's been around for a while), defaulting to Let's Encrypt. (He goes on to complain about the lack of HTTP/3 support, but from my perspective it's evidence that Apache isn't standing still after all.)

Lighttpd

There was actually one vote against lighttpd from Chris Siebenmann as having stagnated too much to seriously consider for new deployments. (It does still get active development but I'm going for general impressions here and this one was interesting.)

h2o

FunkyBob chimed in near the end of the survey with h2o (in front of Django.) h2o turns out to be

  • MIT licensed
  • Written in C
  • Responds reasonably to CVEs
  • Used to do releases on github but now takes the interesting approach that ... each commit to master branch is considered stable and ready for general use ...
  • Packaged in ubuntu and debian (also without Provides: httpd), but it's a 2018 version with a bunch of cherry-picked fixes that look like upstream, so I'm not sure how "actually" up-to-date that version is (late 2023 best-case though.)
  • Also available as a library, which is common in go projects but a lot more unusual in C servers.

I thought I'd never heard of it, but I'd starred it on github at some unknown point.

Conclusions

Primarily Confirmation

  • There's more life in Apache than I'd realized (mod_md in particular)
  • nginx is still the mainstream choice
  • caddy is definitely up-and-coming with an enthusiastic community

Actual final numbers: 491 people responded.

  • 19% Apache
  • 53% nginx
  • 2% lighttpd
  • 21% Caddy
  • 0% webfs
  • 4% other/explain

Unexpected Highlights

Not going to do another survey on them, but I was pleased (and surprised) at the number of serious alternatives that turned up, including a few things that I knew about but didn't realize were legitimate answers to my question:

  • Twisted Web (including python3-txacme)
  • Nginx UNIT
  • haproxy
  • Traefik
  • Kestrel+YARP (dotnet)

Personal Decisions

Part of the motivation for the survey was that I was stuck on an upgrade path for some old blogs and project sites. While that sounds low-value, it's also my playground for professional builds and recommendations, so I take it way more seriously than I probably should...

While the survey results didn't give me a final answer (nor were they intended to) they did reduce some fretting and led me to a more direct plan:

  • put a bounded amount of time into building caddy to my latest-from-source standards
  • prototype something with Twisted Web, particularly for the fast path "idea → domain registration → publication" projects, and see how it feels for more conventional use
  • fall back to nginx if I don't get anywhere in a week.

What Actually Happened

Since I wanted to get at least one blog up and running quickly to publish this article, I took a shorter path:

  • Installed blag which is probably the least-effort markdown blog to get going2
  • Used my draft caddy-in-podman notes to do a quick nginx-in-podman, rootless
  • Used nftables NAT support to forward 80/443 to the podman published ports.

That's just on my laptop but by the time you read this it'll be transplanted to a real server.

The key here is that the nginx-in-podman bit is just the server:

  • it bind-mounts nginx.conf
  • it bind-mounts a multiple-domain content directory

so content and operation are relatively separated, a new server can be tested with the live content, and more importantly - if I succeed in my caddy building efforts, I can drop in a caddy-in-podman container and "effortlessly" swap from nginx to caddy without any real sysadmin effort beyond a podman stop/podman run (which also leaves me a quick path to rolling back to the working version.) Yes, this is the whole promise of container-based modularity, but I needed to see it scale down without a bunch of larger scale complexity.3


  1. Three entire months ago, in July 2024. Simplifying the short-mastodon-rant to large-blog-rant pipeline is the entire thing I was trying to kick off here... 

  2. But see followup discussion on static site tools

  3. Does Kubernetes bring anything to an environment with 2 or 3 containers? I'm not prepared to find out just yet.