Network - Deep Dive

The mechanics behind the network overview: how each path actually works, and what broke along the way.

Devices

Most of the "servers" are not separate machines: five of them are VMs hosted on the single core host. Hardware footprint is deliberately small.

Role	Where it runs	Notes
Core server	Bare metal	Hosts the Docker tiers, OpenVPN, DNS, Suricata, and the VirtualBox VMs
SIEM (Wazuh)	VM on the core	Bridged to the LAN; the heaviest of the VMs
Windows 11	VM on the core	Bridged; Sysmon + Wazuh agent, an endpoint-detection target
Ubuntu lab	VM on the core	Bridged; general Linux experimentation
Fedora lab	VM on the core	Bridged; a second distro family (different package manager, SELinux defaults)
REMnux	VM on the core	NAT-isolated, not bridged - malware analysis, kept off the LAN by design
Workstation	Wired client	Monitored Windows desktop
Laptop (remote room)	Wi-Fi client	Reaches the LAN via a secondary AP -> switch -> powerline link
Offensive-security laptop	Wi-Fi / wired client	Training and tooling
Mobile	Wi-Fi client	Phone running a Kali chroot
Edge router	Bare metal	RouterOS; NAT, firewall, ships syslog to the SIEM

Ingress - two paths, neither a wide-open port

A single reverse proxy (Traefik) terminates all inbound TLS with a wildcard ACME certificate (*.example.com). External reachability uses two deliberate paths, and no general inbound port is ever opened:

VPN (the only inbound NAT). The edge firewall forwards exactly one WAN port: the VPN listener. Once on the tunnel you are on the LAN. Sensitive services (the forge, dashboards) sit behind allowlist middleware restricting them to LAN + VPN ranges, so they are unreachable from the open internet even though the proxy is internet-adjacent.
Outbound tunnels for public services. Select internal services are published through an outbound-initiated tunnel, so they are reachable publicly with zero inbound ports and no port-forwarding: the tunnel daemon dials out, nothing dials in. (Several providers offer this model.)
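The allowlist check that guards the sensitive services can be sketched in a few lines. This is only an illustration of the idea, not the proxy's middleware: the two subnets below are hypothetical placeholders (the real LAN and VPN ranges are not listed here), and `is_allowed` is a made-up helper name.

```python
import ipaddress

# Hypothetical ranges for illustration only; substitute the real LAN and VPN subnets.
ALLOWED_RANGES = [
    ipaddress.ip_network("192.168.1.0/24"),  # assumed LAN subnet
    ipaddress.ip_network("10.8.0.0/24"),     # assumed VPN client pool
]


def is_allowed(client_ip: str) -> bool:
    """Return True if client_ip falls inside any allowlisted range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED_RANGES)
```

A request from a public address fails this check, which is the behaviour described above: the proxy is reachable, but the sensitive routes behind it are not.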

The tunnel is a deliberate trade, not a free lunch: it does put a third-party provider in the path for the handful of services I publish. I take that trade with eyes open. I am already trusting my ISP with every packet that leaves the house, and a public resolver (1.1.1.1) as a DNS fallback - so for a couple of public-facing pages, a provider that lets me keep zero inbound surface is the side of the line I would rather be on. The privacy half of that worry - what my ISP can see of my DNS - is handled separately, by running my own resolver with DoH (below).

Admin SSH is key-only on per-host non-default ports.

Certificates - one wildcard, auto-fanned out

All TLS in the lab is a single wildcard certificate (*.example.com). The reverse proxy obtains and auto-renews it with a DNS-01 challenge (proving control through the DNS provider's API), so renewal needs no inbound port 80 - which matters, because the lab opens no HTTP port to the internet at all.

The catch: the VPN server and the DoH resolver also need that cert, and neither sits behind the proxy. Rather than give each its own ACME client, the renewal is fanned out - when the proxy rewrites its cert store, two systemd path watchers notice and run small extract-and-reload scripts:

proxy renews *.example.com  ->  writes its cert store
        |
        |-- path watcher --> extract --> DoH resolver  --> restart (skips if unchanged)
        `-- path watcher --> extract --> OpenVPN server --> reload

One renewal keeps every TLS consumer current, with zero manual steps. (VPN client profiles embed the CA chain rather than the server cert, so they keep working across renewals - they only need reissuing if the VPN's own PKI changes.)
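A minimal sketch of what one extract step might look like, including the "skips if unchanged" behaviour. It assumes the cert store is a JSON file shaped like Traefik's acme.json (a resolver name mapping to a "Certificates" list with base64-encoded "certificate" fields); that layout, the function names, and the target path are assumptions for illustration, not the actual scripts.

```python
import base64
import json
from pathlib import Path


def extract_cert(store_json: str, domain: str) -> bytes:
    """Return the PEM bytes for `domain` from an acme.json-style store (assumed layout)."""
    store = json.loads(store_json)
    for resolver in store.values():
        for entry in resolver.get("Certificates") or []:
            if entry["domain"]["main"] == domain:
                return base64.b64decode(entry["certificate"])
    raise KeyError(f"no certificate for {domain}")


def write_if_changed(pem: bytes, target: Path) -> bool:
    """Write pem to target only when it differs; True means the consumer needs a restart."""
    if target.exists() and target.read_bytes() == pem:
        return False  # unchanged: skip the restart
    tmp = target.with_suffix(target.suffix + ".tmp")
    tmp.write_bytes(pem)
    tmp.replace(target)  # rename is atomic when tmp and target share a filesystem
    return True
```

The caller would restart or reload the consumer only when `write_if_changed` returns True, so a store rewrite that did not actually change the certificate causes no service interruption.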

Egress

Outbound is treated as carefully as inbound. The worked example is the CI platform's two runners:

internal runner - DNS-gapped, runs untrusted build/test jobs;
external runner - has egress, runs only the credentialed publish step.

Jobs are routed by tag, so untrusted build steps never share an egress path with the publish step. This is "controlled boundaries, not blanket controls": the only component with general egress is the one narrow step that needs it. See CI/CD Publishing.
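The routing rule can be modelled as a lookup that must resolve to exactly one runner. The tag names ("build", "test", "publish") and the `Runner` type are hypothetical; the CI platform's real configuration is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Runner:
    name: str
    tags: frozenset
    has_egress: bool


# Hypothetical tag assignments mirroring the two runners described above.
RUNNERS = [
    Runner("internal", frozenset({"build", "test"}), has_egress=False),
    Runner("external", frozenset({"publish"}), has_egress=True),
]


def pick_runner(job_tag: str) -> Runner:
    """Return the single runner that owns job_tag; refuse ambiguous or unknown tags."""
    matches = [r for r in RUNNERS if job_tag in r.tags]
    if len(matches) != 1:
        raise LookupError(f"tag {job_tag!r} matched {len(matches)} runners")
    return matches[0]
```

Failing closed on an unknown tag is the point of the sketch: a job that is not explicitly tagged for the external runner never lands on the one with egress.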

DNS

Self-hosted in two layers on the core server:

Unbound - recursive resolver for LAN, VPN, and containers, with blocklisting of known-bad domains;
dnsproxy - a DNS-over-HTTPS frontend for clients that want encrypted resolution.

Internal names resolve to internal addresses (split-horizon), so the same hostname works on-LAN and over VPN without exposing anything publicly. Resolver endpoints are loopback aliases on the host, decoupled from any single NIC so DNS survives reboots and link changes.

DNS failover

Self-hosting DNS on the core server creates a single point of failure: LAN clients are pointed at the core's resolver, so if the core is down the LAN loses name resolution - and, in practice, the internet with it. A maintenance reboot that left the core half-up made that risk concrete rather than theoretical.

The fix deliberately lives on the edge router, not the core - failover has to run on something other than the box that fails. The router (RouterOS Netwatch) probes the core's DoH listener on a short interval and drives a simple up/down state machine:

Down - the instant the listener stops answering, the router repoints LAN DNS at two public resolvers (Quad9 + Cloudflare). This is "do no harm": degraded mode loses local names and the core's own filtering, but the LAN keeps its internet access.
Up (auto-revert) - once the probe sees the listener healthy again, the router hands DNS back to the core, behind a short debounce so a flapping reboot cannot yank it back and forth.

The probe is a bare TCP connection to the port the router already uses for live DoH, so it adds no new traffic pattern and needs no DNS itself to run. Because the failover is autonomous and on an independent device, the core can fail, flap, or sit broken and the LAN is protected regardless - no heroics required from the core itself.
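The up/down logic is small enough to sketch. This is a Python model of the behaviour, not the RouterOS Netwatch configuration: the debounce is expressed as a count of consecutive healthy probes (an assumption; the real debounce may be time-based), and the names `DnsFailover` and `tcp_probe` are illustrative.

```python
import socket
from dataclasses import dataclass

CORE = "core"      # LAN DNS points at the core's resolver
PUBLIC = "public"  # LAN DNS points at the public fallback resolvers


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Bare TCP connect: True if the listener accepts, False on refusal or timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


@dataclass
class DnsFailover:
    up_debounce: int = 3     # healthy probes required before reverting (assumed value)
    target: str = CORE
    healthy_streak: int = 0

    def observe(self, probe_ok: bool) -> str:
        """Feed one probe result and return where LAN DNS should point."""
        if not probe_ok:
            self.healthy_streak = 0
            self.target = PUBLIC  # fail over immediately, no debounce on the way down
        else:
            self.healthy_streak += 1
            if self.target == PUBLIC and self.healthy_streak >= self.up_debounce:
                self.target = CORE  # auto-revert only after a stable run of healthy probes
        return self.target
```

The asymmetry is deliberate: going down is instant because a dead resolver hurts immediately, while coming back up waits for a streak so a core that flaps during a reboot does not drag the LAN back and forth with it.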

A DNS-server change on a router is also a textbook hijack indicator, so each failover logs a marker and pushes a dedicated alert the moment it happens - the operational change leaves a clear trail. See Security Stack.

Monitoring fan-in

Telemetry converges on the SIEM VM:

Endpoint agents on the core server and the client machines report to the SIEM.
The edge router ships syslog to the SIEM (with custom decoders/rules for router events).
Suricata runs on the core server and feeds network IDS alerts in.

So a single pane sees host events, network events, and perimeter events. Detail in Security Stack.

Gotchas

The port was already taken - by something I did not know was listening

Symptom. When I stood up my own resolver, it would not start: the bind on port 53 failed. Nothing else I had installed was a DNS server, so on the face of it the port should have been free.

Root cause. systemd-resolved ships with a stub listener on 127.0.0.53:53 by default, and it had quietly been the box's resolver all along - /etc/resolv.conf pointed at it. So port 53 was occupied by a service I had never consciously configured. My resolver and the stub were fighting over the same socket.

Fix. Take the stub off the port and hand DNS to my own resolver:

# /etc/systemd/resolved.conf
DNSStubListener=no

then point /etc/resolv.conf at the resolver's own (loopback-alias) address rather than at 127.0.0.53. The resolver binds cleanly and is now the single authority for name resolution on the host.

Lesson. "Nothing is listening" is an assumption, not a fact - check it (ss -tulpn) before blaming your own config. On a modern systemd box, DNS is already being handled by something whether you asked for it or not; self-hosting means first displacing the default, not just adding a service next to it.
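The same check can be scripted as a pre-flight step before starting a service. A minimal sketch, assuming a hypothetical helper named `port_is_free`; it only reports "address in use" and deliberately re-raises anything else (for example the permission error an unprivileged user gets on port 53), so the two cases are not confused.

```python
import errno
import socket


def port_is_free(host: str, port: int, kind: int = socket.SOCK_DGRAM) -> bool:
    """Try to bind (host, port); False means another socket already holds it."""
    with socket.socket(socket.AF_INET, kind) as s:
        try:
            s.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return False
            raise  # e.g. EACCES on a privileged port: a different problem, surface it
    return True
```

This tells you that the port is taken, not by whom; `ss -tulpn` remains the tool for naming the process that holds it.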

Incident - one cert expiry took down all LAN DNS

Symptom. After a maintenance window the whole LAN quietly lost name resolution - and with it, the internet. Nothing had obviously "crashed"; from a client's seat DNS simply stopped answering, with no single service to point at.

Root cause - coupling, not a crash. The core ran two roles entangled into one fragile chain: network infrastructure (resolver + DoH) and application platform (reverse proxy + containers). A TLS certificate expired. That broke the container runtime's DNS - its resolver address sat on the default bridge with a routing asymmetry, so queries arrived but answers left by the wrong interface. Broken DNS stopped the proxy renewing the cert; the DoH frontend depended on that same cert and died with it - and DoH was what the LAN resolved through. One expiry, every layer downstream falling over in sequence. And because the application tier had no expressed startup ordering, a reboot would cheerfully bring it up against an infrastructure tier that was not ready.

Fix - tier it, and make the dependency graph explicit. The platform is now split into tiers behind a Docker-independent anchor:

A oneshot systemd unit runs before Docker, Unbound and dnsproxy and pins the resolver and proxy endpoints to loopback aliases - addresses that exist before any NIC or the container runtime, so DNS no longer rides a bridge that can vanish.
A health-gated tier-0 target flips to "ready" only once DNS actually answers and the proxy returns healthy. The application tier declares Requires= / After= on that target, so tier 1 can never start against a broken tier 0.
A last-known-good snapshot of the rendered config is promoted only after the full stack confirms healthy - a bad boot falls back to the last config that worked rather than promoting itself.
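The promotion rule in the last item can be sketched as follows. The function name, the file paths, and the shape of the health checks are assumptions for illustration; the actual checks (DNS answering, proxy healthy) are described above but not shown.

```python
import shutil
from pathlib import Path
from typing import Callable, Iterable


def promote_if_healthy(
    candidate: Path,
    last_good: Path,
    checks: Iterable[Callable[[], bool]],
) -> Path:
    """Return the config to keep running with.

    The candidate overwrites the last-known-good snapshot only when every
    health check passes; otherwise the snapshot is left untouched and returned.
    """
    if all(check() for check in checks):
        shutil.copyfile(candidate, last_good)
        return candidate
    return last_good
```

Because the snapshot is only ever written after a full set of passing checks, a bad boot cannot overwrite the one config known to work.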

Back-test. Rebuilt and rebooted the host four times to exercise it; the last two came up clean end to end, with no manual intervention. The earlier runs are the point - they surfaced the ordering bugs the fix then closed.

Lesson. A reliability failure is rarely the thing that "broke" - here it was a routine cert expiry - it is the coupling that lets one failure cascade. The fix was not a louder cert alarm; it was removing the coupling: an explicit dependency graph, a DNS endpoint that does not depend on the thing it serves, and a config that only advances when the whole stack agrees it is healthy.

network · egress is constrained too

sed -n '/default deny/,/Deny Telnet/p' roles/ufw/tasks/main.yml

- name: Set UFW default deny incoming
  default: deny
- name: Set UFW default allow outgoing
  direction: outgoing
# ... then explicit egress denials:
- name: Deny FTP outbound
  rule: deny
  direction: out
- name: Deny Telnet outbound
  rule: deny
  direction: out

# default-deny inbound is table stakes; the interesting part is constrained egress - FTP/Telnet (and SSH) are denied outbound, so a compromised service on the box cannot quietly phone out over them.