One problem with having decent `iptables` compatibility layers for
`nftables` is that there's a dearth of documentation for the new
stuff; even 2024 stackoverflow questions (and answers) just use the
`iptables` commands underneath. So, when I first set up the "obvious"
`nft` commands to redirect `:80` and `:443` to internal non-privileged
ports (which the "quadlet" `PublishPort=` directive would then
re-namespace internally so that `nginx` would see them as `80` and
`443` again) it worked, but only from off-box: connections (to the
DNS name and thus the outside IP address) from on-box didn't go
through that path. Not great, but most of my searching turned up
incorrect suggestions of turning on `route_localnet`, so I just left
it that way.
Then a Christmas Day power outage led me to discover that there was a
race condition between the primary `nftables` service starting up and
populating the "generic" tables, and my service adding to those
tables, which caused the forwarding rules to fail to start.[1]
At least the command error showed up clearly in `journalctl`.
Application-specific tables
The one other service that I found using `nftables` is `sshguard`[2],
which doesn't depend on the `nftables` service starting at all.
Instead, `sshguard.service` does `nft add table` in an `ExecStartPre`
and a corresponding `nft delete table` in `ExecStopPost`; then the
`sshguard` command itself just adds a `blacklist` chain and updates an
"attackers" set directly. This doesn't need any user-space parts of
`nftables` to be up, just the kernel.
I updated my service file to do the same sort of thing, but all
expanded out inline: I used `ExecStartPre` to `add table` (and
`add chain`), a matching `delete table` in `ExecStopPost`, and then
applied the actual port redirects with `ExecStart`. The main rules
were unchanged: a `prerouting tcp dport 443 redirect` worked the way
it did before, only handling outside traffic; the new trick was to add
an `oifname "lo" tcp dport 443 redirect` to an `output` chain, since
locally-initiated traffic doesn't go through `prerouting`. (Likewise
for port `80`, since I still have explicit `301` redirects that
promote `http` to `https`.)
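A minimal sketch of a unit in that shape follows. The table name `webfwd`, the target ports `8080` and `8443`, and the `/usr/sbin/nft` path are assumptions for illustration, not the contents of the real file.

```ini
[Unit]
Description=Redirect ports 80/443 to unprivileged ports (sketch)

[Service]
Type=oneshot
# Keep the unit "active" after the commands finish, so that
# ExecStopPost only runs when the unit is actually stopped.
RemainAfterExit=yes
# Application-specific table and chains; only the kernel side of
# nftables is needed, not nftables.service.
ExecStartPre=/usr/sbin/nft add table ip webfwd
ExecStartPre=/usr/sbin/nft add chain ip webfwd prerouting '{ type nat hook prerouting priority -100; }'
ExecStartPre=/usr/sbin/nft add chain ip webfwd output '{ type nat hook output priority -100; }'
# Outside traffic arrives via prerouting...
ExecStart=/usr/sbin/nft add rule ip webfwd prerouting tcp dport 80 counter redirect to :8080
ExecStart=/usr/sbin/nft add rule ip webfwd prerouting tcp dport 443 counter redirect to :8443
# ...while locally-initiated traffic only passes through output.
ExecStart=/usr/sbin/nft add rule ip webfwd output oifname lo tcp dport 80 counter redirect to :8080
ExecStart=/usr/sbin/nft add rule ip webfwd output oifname lo tcp dport 443 counter redirect to :8443
ExecStopPost=/usr/sbin/nft delete table ip webfwd

[Install]
WantedBy=multi-user.target
```

Quoting each chain definition as a single argument keeps `systemd` from treating the `;` as a command separator. Deleting the whole table on stop removes the chains and rules in one step.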
The cherry on top was to add `counter` to all of the rules, so I
could see that each rule was firing for the appropriate kind of
traffic, just by running `curl` commands bracketed by
`nft list table ... |grep counter` and seeing the numbers go up.
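The same check can be scripted. This is a small sketch, assuming the usual `counter packets N bytes M` form in the `nft list` output; the helper name `count_packets` and the table name `webfwd` in the usage comment are placeholders.

```sh
#!/bin/sh
# Sum the "packets" values of every counter found in `nft list table`
# output read from stdin.
count_packets() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "counter" && $(i + 1) == "packets")
                total += $(i + 2)
    }
    END { print total + 0 }'
}

# Usage (needs root, so left commented out here):
#   before=$(nft list table ip webfwd | count_packets)
#   curl -s -o /dev/null https://example.com/
#   after=$(nft list table ip webfwd | count_packets)
#   echo "counters advanced by $((after - before)) packets"
```

Summing all counters is the crudest version; grepping for a single rule first would show which path (prerouting or output) a given `curl` actually took.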
Victory, mostly
The main reason for fixing localhost is that it lets me run a bunch of "the websites are serving what I expect them to be" tests locally rather than on yet another machine. This is a bit of all-eggs-in-one-basket, but I'll probably just add a basic "is that machine even up" check on a specifically off-site linode, or maybe a friend's machine; really, if the web server machine is entirely offline I don't need precise detail of each service anyway.
1. This is the "normal" systemd problem, where `After=` means "after that other service successfully starts", which isn't good enough for trivial `.service` files that don't explicitly cooperate. It might be possible to fix this by moving the `nft -f /etc/nftables.conf` from `ExecStart=` to `ExecStartPre=` (since it's a `Type=oneshot` service anyway) so that it exits before the job is considered "started", but the `sshguard`-inspired workaround is cleaner anyway.
2. sshguard is a simple, configurable "watchdog" that detects things like repeated ssh authorization failures (by reading the logs in realtime) and responds with various levels of blackhole routes, using `nftables` to drop the packets. Professional grade, as long as you have an alternate way back in if it locks you out.
The ultimate goal of the Popular Web servers discussion was to
actually make up my mind as to what to run. The diversity of options
made me realize that SSL termination and web serving are an
inherently modular thing, and since I wanted some amount of isolation
for it anyway, this would be a good opportunity to get comfortable
with `podman`.
What shape is the modularity?
The interface has three basic legs:
- Listen on tcp ports 80 and 443
- Read a narrow collection of explicitly exported files
- (later) connect to other "service" containers.
(Some of the options in the survey, like `haproxy`, only do the
"listen" and "connect" parts, but that reduces the "read files" part
to running a "static files only" file server container (which has
access to the collection of files) and having `haproxy` connect to
that. For the first pass I'm not actually going to do that, but
it's good to know in advance that this "shape" works.)
Listening ports
If I'm running this without privileges, how is it going to use traditionally "reserved" ports? Options include
- have `systemd` listen on them and pass a filehandle in to the container
- run a `socat` service to do the listening and reconnecting
- lower `/proc/sys/net/ipv4/ip_unprivileged_port_start` from 1024 to 79
- use firewall rules to "translate" those ports to some higher numbered ones.
I actually used the last one: a pair of `nft` commands, run in a
`Type=oneshot` `systemd` service file, each of the form
`add rule ip nat PREROUTING tcp dport 80 redirect to` (the
corresponding unprivileged target port). This seemed like the
simplest bit of limited privilege to apply to this problem, as well
as being efficient (no packet copying outside the kernel, just NAT
address rewriting), but do let me know if there's some other
interface that would also do this.
Reading a set of files
`docker` and `podman` both have simple "volume" (actually "bind
mount") support to mount an outside directory into the container; this
also gives us some administrative options on the outside, like moving
around the disks that the files are on, or combining multiple
directories, without changing the internals at all.
Currently, the directory is mounted as `/www` inside the container,
and I went with the convention of `/www/example.com` to have a
directory for each FQDN. (For now this means a bunch of copy&paste in
the `nginx.conf`, but eventually it should involve some more
automation than that, though possibly on the outside.)[1]
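For illustration, one site's worth of that copy&paste might look roughly like this. The certificate paths follow certbot's usual `/etc/letsencrypt/live/` layout; everything else is an assumed minimal shape rather than the real config.

```nginx
# Redirect plain http to https for this FQDN.
server {
    listen 80;
    server_name example.com;
    return 301 https://$host$request_uri;
}

# Serve static content from the per-FQDN directory under /www.
server {
    listen 443 ssl;
    server_name example.com;
    root /www/example.com;
    ssl_certificate     /etc/letsencrypt/live/example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem;
}
```

Only the FQDN varies from site to site, which is what makes this a good candidate for generating rather than hand-editing.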
In order to enable adding new sites without restarting the container,
the `nginx.conf` is also mounted from the outside, as a single-file
bind mount. Using `exec` to run `nginx -s reload` avoids restarting
the container to apply the changes, and allows for automatic
generation of the config from outside without allowing the container
itself access to change the configuration.
Connecting to other containers
(Details to follow, pending actually using this feature; for now it's sufficient to know that the general model makes sense.)
Why podman over docker?
`podman` has a bunch of interesting advantages over `docker`:
- actual privilege isolation: `docker` itself manages access to a service that does all of the work as root; `podman` actually makes much more aggressive use of namespaces, and doesn't have a daemon at all, which also makes it easier to manage the containers themselves.
- `podman` started enough later than `docker` that they were able to make better design choices simply by looking at things that went wrong with `docker` and avoiding them, while still maintaining enough compatibility that it remained easy to translate experience with one into success with the other; from a unix perspective, less "`emacs` vs `vi`" and more "`nvi` vs `vim`".
Mount points
Originally I did the obvious `Volume` mount of `nginx.conf` from the
git checkout into `/etc/nginx/nginx.conf` inside the container.
Inconveniently (but correctly[2]) doing `git pull` to change that
file does the usual atomic-replace, so there's a new file (and new
inode number) but the old mount point is still pointing to the old
inode.
The alternative approach is to mount a subdirectory with the conf file in it, and then symlink that file inside the container.[3]
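The inode change is easy to see without any containers involved. This sketch replaces a file by rename in a scratch directory and compares inode numbers; the file names are just placeholders.

```sh
#!/bin/sh
# Show that replace-by-rename leaves the name pointing at a new inode.
dir=$(mktemp -d)
echo "old config" > "$dir/nginx.conf"
before=$(ls -i "$dir/nginx.conf" | awk '{print $1}')

# Write the new content to a temporary file in the same directory,
# then rename it over the original name.
echo "new config" > "$dir/nginx.conf.tmp"
mv "$dir/nginx.conf.tmp" "$dir/nginx.conf"
after=$(ls -i "$dir/nginx.conf" | awk '{print $1}')

echo "inode before: $before, after: $after"
rm -rf "$dir"
```

A single-file bind mount is attached to the "before" inode, so it keeps showing the old content. A bind mount of the directory looks names up afresh, which is why mounting the subdirectory works.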
LetsEncrypt
We need the `certbot` and `python3-certbot-nginx` packages installed
in the pod. `python3-certbot-nginx` handles adjusting the `nginx`
config during `certbot` operation (see github:certbot/certbot for the
guts of it.)
Currently, we stuff these into the primary `nginx` pod, because
`certbot` needs to control the live webserver to show that it
controls the live webserver.
When used interactively, `certbot` tells you that "Certbot has set up
a scheduled task to automatically renew this certificate in the
background." What this actually means is that it provides a crontab
entry (in `/etc/cron.d/certbot`) and a system timer (`certbot.timer`),
which is great... except that in our podman config, we run `nginx` as
pid 1 of the container, don't run `systemd`, and don't even have
`cron` installed. Not a problem: we just create the crontab
externally, and have it run certbot under podman periodically.
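A sketch of what that external crontab entry could look like, as a line in the crontab of the user that owns the container. The schedule is arbitrary, `systemd-nginxpod` is the container name used in the walkthrough below, and the `--deploy-hook` reload is an assumption about how renewed certificates should get picked up.

```
# m h dom mon dow  command
17 3 * * *  podman exec systemd-nginxpod certbot renew --quiet --deploy-hook "nginx -s reload"
```

Depending on the setup, rootless `podman` run from cron may also need `XDG_RUNTIME_DIR` set in the crontab to find the user's containers.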
Quadlets
Quadlets are just a new type of `systemd` "Unit file" with a new
`[Container]` section; everything from the `podman` commandline
should be expressible in the `.container` file. For the `nginx` case,
we just need `Image=`, `PublishPort=`[4], and a handful of `Volume=`
stanzas.
Note that if you could run the `podman` commands as you, the
`.container` Unit can also be a `systemd` "User Unit" that doesn't
need any additional privileges (possibly a `loginctl enable-linger`,
but with Ubuntu 24.04 I didn't actually need that.)
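A minimal sketch of such a `.container` file, saved as `~/.config/containers/systemd/nginxpod.container` for the user-unit case. The image name, the host paths under `/srv/sites`, and the `8080`/`8443` host ports are placeholders; the real image would also need the certbot packages installed.

```ini
[Unit]
Description=nginx in rootless podman (sketch)

[Container]
# Placeholder image; substitute one that includes certbot.
Image=docker.io/library/nginx:stable
# Host port : container port. The nft redirect rules send 80/443 here.
PublishPort=8080:80
PublishPort=8443:443
# Per-FQDN content directories, read-only inside the container.
Volume=/srv/sites/www:/www:ro
# Directory (not single-file) mount, so a replaced conf file is seen.
Volume=/srv/sites/conf:/etc/nginx/conf:ro

[Install]
WantedBy=default.target
```

After `systemctl --user daemon-reload`, this shows up as `nginxpod.service`. Quadlet names the container `systemd-` plus the unit name by default, which is where a name like `systemd-nginxpod` comes from.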
Walkthrough of adding a new site on a new FQDN
DNS
Start with DNS. Register a domain (in this case, `thok.site`), get it
pointed to your nameserver[5], and have that nameserver point to the
webserver.
Certbot
Once certbot is registered,
$ podman exec systemd-nginxpod certbot certonly --nginx --domain thok.site
takes a little while and then gets the certificate. Note that at this
point, the `nginx` running in that pod knows nothing about the domain;
`certbot` is doing all the work.
Get the site content
I have a Makefile that runs `git clone` to get the site source, or
`git pull` if it's already present, and then uses `ssite build` to
generate the HTML in a separate directory (that the `nginx` pod has
mounted.)
Update the common nginx.conf
Currently `nginx.conf` is generated with `cogapp`, so it's just a
matter of adding
# [[[cog https("thok.site") ]]]
# [[[end]]]
and rerunning `cogapp` to expand it in place.
Kick nginx
Run `make reload` in the same Makefile, which just does
$ podman exec systemd-nginxpod nginx -s reload
Done! Check it...
At this point, the site is live. (Yes, the very site you're reading this on; the previous sites all had debugging steps that made the notes a lot less clear, so I didn't have a clean set of directions previously...) Check it in a client browser, and add it to whatever monitoring you have.
Conclusions
So we now have a relatively simple path from "an idea and some
writing" to "live website with basic presentation of content". A bit
too much copy-and-paste currently, and the helper `Makefile` really
needs to be parameterized or become an outright standalone tool.
(Moving `certbot` to a separate pod also needs investigating.) Now
back to the original tasks of moving web servers off of old hardware,
and ~~pontificating~~ actually blogging!
1. Not yet as automated as I'd like, but currently using Ned Batchelder's Cog to write macros in python and have them update the `nginx` config in-place in the same file. This eliminates a bunch of data entry errors, but isn't quite an automatic "find the content directories and infer the config from them"; it is the kind of rope that could become that, though.
2. In general, you want to "replace" a file by creating a temporary (in the same directory) then renaming it to the correct name; this causes the name to point to the new inode, and no one ever sees a partial version, only the new one or the old one, because no "partial" file even exists: it's just a substitution in the name to inode mapping. There are a couple of edge cases, though: if the existing file has permissions that you can't recreate, if the existing file has hardlinks, or this one, where it has bind mounts. Some editors, like `emacs`, have options to detect the multiple-hard-links case and trade off preserving the links against never corrupting the file; this mechanism won't detect bind mounts, though in theory you could find them in `/proc/mounts`.
3. While this is a little messy for a single config file, it would be a reasonable direction to skip the symlinks and just have a top-level config file inside the container `include subdir/*.conf` to pick up all of the (presumably generated) files there, one per site. This is only an organizational convenience: the resulting configuration is identical to having the same content in-line, and it's not clear there's any point to heading down that path instead of just generating them automatically from the content and never directly editing them in the first place.
4. The `PublishPort` option just makes the "local" aliases for ports 80 and 443 appear inside the container as 80 and 443; there's a separate `pod-forward-web-ports.service` that runs the `nftables` commands (with root) as a "oneshot" `systemd` System Service.
5. In my case, that means "update the zone file with emacs, so it auto-updates the serial number" and then push it to my CVS server; then get all of the actual servers to CVS pull it and reload bind.
I posted[1] a poll on mastodon:
What's your choice for an internet-facing web server, in 2024? Security over performance, ease (or lack) of configuration is a bonus; presence in Debian or Ubuntu preferred but anything I can build from source (so, probably not written in go) and has a good CVE story is of interest.
In the initial set I included Apache, `nginx`, `lighttpd`, `caddy`,
and `webfs` (based on them showing up in popcon.) So far `nginx` is
in the lead with `caddy` and Apache surprisingly close to tied, but
the fascinating bit was the followups about servers that either I
hadn't heard of, or didn't realize qualified. (It got boosted early
on by Tim Bray and Glyph, which got it much broader attention than I
expected, and which I believe really helped reach people who provided
some of the more unusual followups.)
Twisted Python
Twisted is actually packaged and has decades of usage; it just wasn't
tagged with the `httpd` virtual package in debian so it didn't come up
in my original search. (It also doesn't currently include any config
to run by default, but really it would just be a basic `.service` file
to invoke `twisted web` with some arguments.) Glyph points out that
twisted's TLS support is in C, but that parsing HTTP with C in 2024 is
just asking for trouble.
Kestrel+YARP
Kestrel is the web server component of DotNet Core; this combination handles all of the app service front end traffic for Azure.
YARP itself is a standalone MIT-licensed reverse proxy written in C# (nothing to do with Edgar Wright.)
Thanks to Blake Coverett for pointing this one out (they used it under Debian and Ubuntu in production!) but the DotNet ecosystem is pretty far outside my comfort zone/tech bubble.
OpenBSD httpd
OpenBSD ships a default http server with strong security and simple configuration. This does look solid and would be high on the list if I were running OpenBSD - there's some risk that it uses OpenBSD's advanced isolation features in ways that a naïve Linux port might not get right, but if I find an active one I'll look further.
There's an AsiaBSDCon 2015 paper which describes the history of it
replacing `nginx` (which itself replaced an Apache 1 fork) as the
native OpenBSD web server; this includes a long discussion of their
attempts to harden `nginx` that is worth a look in terms of secure
software development challenges.
haproxy
Marko Karppinen pointed out that `haproxy` (which is packaged but
doesn't `Provides: httpd` either) actually works directly as a web
server: no direct file support, but it can terminate HTTPS
connections and pass the connections on to HTTP backends. (As of
haproxy 2.8, acme.sh can update a running `haproxy` directly, without
disruptive restarts.)
Traefik
Gigantos pointed out that Traefik can also terminate HTTPS directly, and has builtin ACME (Let's Encrypt) support as well as being able to do service discovery instead of needing direct per-site configuration - depending on the shape of those providers that might not end up being less work but it's arguably putting the information in a more correct place.
NGINX Unit
PointlessOne suggested that for very dynamic backends, NGINX Unit was worth a look - it supports a huge variety of languages while still having attention on security and performance.
Apache with mod_md
Most of the comments on apache were about how "it still works" and had
decades of attention, but Marcus Bointon pointed out `mod_md`, which
adds ACME support directly as an Apache Module (shipped with apache
since 2.4.30, which predates Ubuntu 20.04, so it's been around for a
while), defaulting to Let's Encrypt. (He goes on to complain about
the lack of HTTP/3 support, but from my perspective it's evidence that
Apache isn't standing still after all.)
Lighttpd
There was actually one vote against lighttpd from Chris Siebenmann as having stagnated too much to seriously consider for new deployments. (It does still get active development but I'm going for general impressions here and this one was interesting.)
h2o
FunkyBob chimed in near the end of the survey with h2o (in front of Django.) h2o turns out to be
- MIT licensed
- Written in C
- Responds reasonably to CVEs
- Used to do releases on github but now takes the interesting approach that "... each commit to master branch is considered stable and ready for general use ..."
- Packaged in ubuntu and debian (also without `Provides: httpd`), but it's a 2018 version with a bunch of cherry-picked fixes that look like upstream, so I'm not sure how "actually" up-to-date that version is (late 2023 best-case though.)
- Also available as a library, which is common in go projects but a lot more unusual in C servers.
I thought I'd never heard of it, but I'd starred it on github at some unknown point.
Conclusions
Primarily Confirmation
- There's more life in Apache than I'd realized (`mod_md` in particular)
- `nginx` is still the mainstream choice
- `caddy` is definitely up-and-coming with an enthusiastic community
Actual final numbers: 491 people responded.
- 19% Apache
- 53% `nginx`
- 2% `lighttpd`
- 21% Caddy
- 0% `webfs`
- 4% other/explain
Unexpected Highlights
Not going to do another survey on them, but I was pleased (and surprised) at the number of serious alternatives that turned up, including a few things that I knew about but didn't realize were legitimate answers to my question:
- Twisted Web (including `python3-txacme`)
- Nginx UNIT
- `haproxy`
- Traefik
- Kestrel+YARP (dotnet)
Personal Decisions
Part of the motivation for the survey was that I was stuck on an upgrade path for some old blogs and project sites. While that sounds low-value, it's also my playground for professional builds and recommendations, so I take it way more seriously than I probably should...
While the survey results didn't give me a final answer (nor were they intended to) they did reduce some fretting and led me to a more direct plan:
- put a bounded amount of time into building `caddy` to my latest-from-source standards
- prototype something with Twisted Web, particularly for the fast path "idea → domain registration → publication" projects, and see how it feels for more conventional use
- fall back to `nginx` if I don't get anywhere in a week.
What Actually Happened
Since I wanted to get at least one blog up and running quickly to publish this article, I took a shorter path:
- Installed blag which is probably the least-effort markdown blog to get going[2]
- Used my draft caddy-in-podman notes to do a quick nginx-in-podman, rootless
- Used `nftables` NAT support to forward 80/443 to the podman published ports.
That's just on my laptop but by the time you read this it'll be transplanted to a real server.
The key here is that the nginx-in-podman bit is just the server:
- it bind-mounts nginx.conf
- it bind-mounts a multiple-domain content directory
so content and operation are relatively separated, a new server can be
tested with the live content, and more importantly, if I succeed in
my caddy building efforts, I can drop in a caddy-in-podman container
and "effortlessly" swap from nginx to caddy without any real sysadmin
effort beyond a `podman stop`/`podman run` (which also leaves me a
quick path to rolling back to the working version.) Yes, this is
the whole promise of container-based modularity, but I needed to see
it scale down without a bunch of larger scale complexity.[3]