I Cannot SSH into My Server Anymore (and That's Fine)

https://soap.coffee/~lthms/posts/i-cannot-ssh-into-my-server-anymore.html

116•TheWiggles•1mo ago

Comments

lawrencegripper•3w ago

I’ve been down a similar journey with Fedora CoreOS and have loved it.

The predictability and drop in toil is so nice.

https://blog.gripdev.xyz/2024/03/16/in-search-of-a-zero-toil...

samhclark•3w ago

Yeah, this is closer to what I do, too. I was surprised not to see a Containerfile in the linked github repo in the article (https://github.com/lthms/tinkerbell)

I found working with normal `dnf` and normal config files much easier than dealing with Ignition and Butane. Plus, working with your image in CI/CD instead of locally fixed my ZFS instability. When Fedora kernel updates, but ZFS doesn't support that version yet, now it fails in GitHub Actions and the container is never built, so there's no botched update that my NAS mistakenly picks up.

stryan•3w ago

Quadlets are a real game changer for this type of small-to-medium scale declarative hosting. I've been pushing for them at work over ugly `docker compose in systemd units` service management and moved my home lab over to using them for everything. The latter is a similar setup to OP except with OpenSUSE MicroOS instead of Fedora CoreOS and I'm not so brave as to destroy and rebuild my VPS's whenever I make a change :) . On the other hand, MicroOS (and I'm assuming FCOS) reboots automatically to apply updates with rollback if needed so combined with podman auto-update you can basically just spin up a box, drop the files on, and let it take care of itself (at least until a container update requires manual intervention).

A few things in the article I think might help the author:

1. Podman 4 and newer (which FCOS should definitely have) uses netavark for networking. A lot of older tutorials and articles were written back when Podman used CNI for its networking and didn't have DNS enabled unless you specifically installed it. I think the default `podman` network is still set up with DNS disabled. Either way, you don't have to use a pod if you don't want to anymore, you can just attach both containers to the same network and it should Just Work.

2. You can run the generator manually with "/usr/lib/systemd/system-generators/podman-system-generator --dry-run" to check Quadlet validity and output. Should be faster than daemon-reload'ing all the time or scanning the logs.

And as a bit of self-promotion: for anyone who wants to use Quadlets like this but doesn't want to rebuild their server whenever they make a change, I've created a tool called Materia[0] that can install, remove, template, and update Quadlets and other files from a Git repository.

[0] https://github.com/stryan/materia

plagiarist•3w ago

Do you know if it is possible to run a quadlet as an ephemeral systemd-sysuser? That would solve all my current problems.

stryan•3w ago

Not sure I'm following; you want to create an ephemeral system account and run a root-less Podman container as it? I don't think that's something supported out of the box but you may be able to jury rig something together by putting the quadlets directly in `/etc/containers/systemd/users/` instead of putting them in a home directory (since I'm assuming this is a systemd-sysuser created account and thus without a home).

plagiarist•3w ago

Yes, that's it. Have things running isolated by a sysuser as well as in a rootless container. I would be running containers for LAN software (like forgejo) where I'd rather have the data on disk or in a podman volume instead of in a home directory.

amluto•3w ago

> I’ve later learned that restarting a container that is part of a pod will have the (to me, unexpected) side-effect to restart all the other containers of that pod.

Anyone know why this is? Or, for that matter, why Kubernetes seems to work like this too?

I have an application for which the natural solution would be to create a pod and then, as needed, create and destroy containers within the pod. (Why? Because I have some network resources that don’t really virtualize, so they can live in one network namespace. No bridges.)

But despite containerd and Podman and Kubernetes kind-of-sort-of supporting this, they don’t seem to actually want to work this way. Why not?

stryan•3w ago

Yeah I was a little confused at this line; as far as I can tell you can restart containers that are a part of a Podman pod without restarting the whole pod just fine. I just verified this on one of my MicroOS boxes running Podman v5.7.1.

Podman was changing pretty fast for a while so it could be an older version thing, though I'd assume FCOS is on Podman 5 by now.

gucci-on-fleek•3w ago

> Anyone know why this is?

In Podman, a pod is essentially just a single container; each "container" within a pod is just a separate rootfs. So from that perspective, it makes sense, since you can't really restart half of a container. (But I think that it might be possible to restart individual containers within a pod; but if any container within a pod fails, then I think that the whole pod will automatically restart)

> Why? Because I have some network resources that don’t really virtualize, so they can live in one network namespace.

You can run separate containers in the same network namespace with the "--network" option [0]. You can either start one container with its own automatic netns and then join the other containers to it with "--network=container:<name>", or you can manually create a new netns with "podman network create <name>" and then join all the containers to it with "--network=<name>".

[0]: https://docs.podman.io/en/latest/markdown/podman-run.1.html#...

amluto•3w ago

> You can run separate containers in the same network namespace with the "--network" option [0].

Oh, right, thanks. I think I did notice that last time I dug into this. But:

> or you can manually create a new netns with "podman network create <name>" and then join all the containers to it with "--network=<name>".

I don’t think this has the desired effect at all. And the docs for podman network connect don’t mention pods at all, which is odd. In general, I have not been very impressed by podman.

Incidentally, apptainer seems to have a more or less first class ability to join an existing netns, and it supports CNI. Maybe I should give it a try.

gucci-on-fleek•3w ago

> > or you can manually create a new netns with "podman network create <name>" and then join all the containers to it with "--network=<name>".

> I don’t think this has the desired effect at all.

Well I'm not entirely sure what effect you're wanting here, but I use this option for some of the containers that I run, and it makes it so that all containers in that network can reach each other, while anything outside that network can't. You can also use "--network=ns:/run/user/$UID/netns/<file-name>" to join a container to a manually created network namespace (created with "ip netns add <file-name>") if you need more control.

amluto•3w ago

I think you are confusing a logical network with a network namespace. A logical network is a construct of Docker or CNI or Netavark or whatever that acts kind of like a LAN. A network namespace is a collection of processes that the kernel treats as being one logical machine for networking purposes.

When you make a docker network that is attached to three containers with hostnames a, b, and c, then those hostnames logically belong to the “network” (and your container engine may put considerable effort into making it possible for them to resolve each other and communicate), but there is one network namespace each for a, b, and c, for three total.

In Podman, but apparently not Docker, you can do --network container:foo to join another container’s netns. I assume that the reason that Podman and containerd support this low-level feature is that they both support some form of pod.
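To make the distinction concrete, here is a minimal sketch, assuming a Linux host with /proc mounted. The kernel exposes each process's network namespace as a symlink at /proc/<pid>/ns/net, and two processes share a netns exactly when those links are identical. The helper names (`netns_id`, `same_netns`, `child_netns_id`) are made up for illustration.

```python
import os
import subprocess
import sys


def netns_id(pid="self"):
    """Return the kernel's identifier for a process's network namespace.

    On Linux this is a string of the form 'net:[<inode number>]'.
    """
    return os.readlink(f"/proc/{pid}/ns/net")


def same_netns(pid_a, pid_b):
    """True when both processes live in the same network namespace."""
    return netns_id(pid_a) == netns_id(pid_b)


def child_netns_id():
    """Spawn an ordinary child process and report the netns it sees.

    A plain child inherits its parent's namespaces, so this should match
    the parent's value unless something unshared the namespace in between.
    """
    out = subprocess.run(
        [sys.executable, "-c",
         "import os; print(os.readlink('/proc/self/ns/net'))"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()
```

Containers attached to the same logical network would each report a different identifier here, while containers joined with --network container:foo would report the same one.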

gucci-on-fleek•3w ago

I think that you're correct that I'm mistaken; thanks for the correction (and sorry for the mistake).

amluto•3w ago

No worries. And your reply caused me to re-remember the --network container:id podman feature, which I might actually try using.

(I generally dislike podman because of stuff like this:

https://github.com/containers/buildah/issues/6460#issuecomme...

Podman is absolutely nicer than Docker in that it integrates better with systemd (if that’s your cup of tea) and it’s less painful to use for non-root development purposes.)

nagaiaida•3w ago

> In Podman, but apparently not Docker, you can do --network container:foo to join another container’s netns

that is precisely how you accomplish the same thing in docker

https://docs.docker.com/engine/network/#container-networks

kace91•3w ago

>Anyone know why this is? Or, for that matter, why Kubernetes seems to work like this too?

Pods are specifically not meant to be treated as VMs, but as single application/deployment units.

Among other things, if a container goes down you don’t know if it corrupted shared state (leaving sockets open or whatever). So you don’t know if the pod is healthy after restart. Also reviving it might not necessarily work, if the original startup process relied on some boot order. So to guarantee a return to healthy you need to restart the whole thing.

amluto•3w ago

> Among other things, if a container goes down you don’t know if it corrupted shared state (leaving sockets open or whatever).

This is not a thing. A program that opens a socket and crashes does not leak that socket for the lifetime of the network namespace. (Keep in mind that ordinary non-containerized servers usually have exactly one network namespace. If a program crashes, you restart it. Sure, CLOSE_WAIT is a thing, but it’s neither permanent nor usually a big deal.)
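A rough way to check this claim on a single machine, sketched under the assumption of a POSIX system with Python 3: a child process binds a listening socket, is killed without closing it, and the parent then binds the same port. No SO_REUSEADDR is set, so the second bind only succeeds if the port really was released. The function name `crash_then_rebind` is hypothetical, and this only covers a listening socket with no established connections.

```python
import socket
import subprocess
import sys

# Child program: bind a listening socket on an OS-chosen port, report the
# port, then kill itself without ever closing the socket.
CHILD = r"""
import os, signal, socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("127.0.0.1", 0))
s.listen()
print(s.getsockname()[1], flush=True)
os.kill(os.getpid(), signal.SIGKILL)
"""


def crash_then_rebind():
    """Return (port, child_returncode) after rebinding the dead child's port."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD],
                            stdout=subprocess.PIPE, text=True)
    port = int(proc.stdout.readline())
    proc.wait()  # the child is gone; the kernel has closed its descriptors
    proc.stdout.close()

    # Deliberately no SO_REUSEADDR: bind() raises OSError if the port is
    # still held by anything.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", port))
        s.listen()
    return port, proc.returncode
```

Connections that were established at crash time can still linger in states like TIME_WAIT for a while, which is the temporary effect the comment alludes to.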

kace91•3w ago

Fair, it was probably a bad specific example.

The general point remains that a container can leave behind inconsistent state (lockfiles, application level stuff in shared volumes, whatever).

The larger point is that if something broke by container going down, it is not necessarily solved just by container going back up. Satisfying boot order requirements is another example.

The system relies on the "pod healthy/not healthy" contract, with pod-level restart as a fix when unhealthy; introducing a spectrum of readiness like 'half-broken-but-internally-attempting-rebuild' would make everything more complex, both for the orchestrator deciding when to reset, and for the dev who no longer has a single point of entry for 'make sure that we're ready to go'.

amluto•3w ago

I’m talking specifically about network namespaces. There are no lockfiles there. There may or may not be a “boot order” but this would be strictly the order of startup of containers within that netns/pod.

esseph•3w ago

The general idea is you want a single application per pod, unless you need a sidecar service to live in the same pod of each instance of your app.

You are normally running several instances of your frontend so that it can crash without impacting the user experience, or so it can get deployed to in a rolling manner, etc.

amluto•3w ago

I’m fine with this being the general idea. But it seems a bit unfortunate to make it be the only idea.

> You are normally running several instances of your frontend so that it can crash without impacting the user experience, or so it can get deployed to in a rolling manner, etc.

Err, the classic way to do this is to hand off the listening socket from one server instance to the next. You can’t do this if your orchestration tools insist on tearing down the entire network namespace to update the server. Sure, you can use fancy load balancers or software defined networking or firewall kludges to hand off something that functions like a listening socket, but it kind of feels like we lost the plot somehow. The old techniques work, and they often worked at the appropriate scale for the application. Why are we building new systems that can’t be made to work well without extra layers?

In any event, the feature I want isn’t rocket science. I think Kubernetes would need to add two special kinds of Pods:

1. A joinable Pod that explicitly permits other Pods to join with it (this would be a genuine Pod with some special attributes).

2. A subsidiary Pod that depends on a joinable Pod and joins its network namespace. This would almost be a real pod except that it would have no network namespace of its own and hence no normal managed hostname or addresses.

#2 is a bit weird, but there’s precedent. A hostNetwork: true Pod is already weird in exactly the same way.
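For readers who haven't seen the listening-socket handoff mentioned above, here is a minimal sketch of one variant, assuming a POSIX system: the old server opens the listener, starts its successor with the descriptor inherited, and then closes its own copy. The socket never closes, so clients connecting during the switch simply wait in the backlog. The names `SUCCESSOR` and `hand_off_and_query` are invented for this example; real servers often pass the descriptor over a Unix socket or via systemd socket activation instead.

```python
import socket
import subprocess
import sys

# Successor server: adopts an already-listening socket by descriptor number,
# accepts one connection, and answers it.
SUCCESSOR = r"""
import socket, sys
listener = socket.socket(fileno=int(sys.argv[1]))
conn, _ = listener.accept()
conn.sendall(b"served by successor")
conn.close()
"""


def hand_off_and_query():
    """Hand a live listener to a successor process, then query it as a client."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen()
    port = listener.getsockname()[1]

    # pass_fds keeps the descriptor open, under the same number, in the child.
    successor = subprocess.Popen(
        [sys.executable, "-c", SUCCESSOR, str(listener.fileno())],
        pass_fds=[listener.fileno()],
    )
    # The old server drops its copy; the socket lives on in the successor.
    listener.close()

    chunks = []
    with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
        while True:
            data = client.recv(1024)
            if not data:  # successor closed the connection
                break
            chunks.append(data)
    successor.wait(timeout=5)
    return b"".join(chunks)
```

None of this works if the network namespace holding the socket is torn down between versions, which is the complaint being made.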

esseph•3w ago

The kubernetes design is really around stateless apps. Integrated application load balancer, multiple copies of the application running simultaneously, etc. You can fit stateful things in there but it used to be very difficult.

andrewmcwatters•3w ago

I concede that this is the state of the art in secure deployments, but I’m from a different age where people remoted into colocated hardware, or at least managed their VPSs without destroying them every update.

As a result, I think developers are forgetting filesystem cleanliness because if you end up destroying an entire instance, well it’s clean isn’t it?

It also results in people not knowing how to do basic sysadmin work, because everything becomes devops.

The bigger problem I have with this, is the logical conclusion is to use “distroless” operating system images with vmlinuz, an init, and the minimal set of binaries and filesystem structure you need for your specific deployment, and rarely do I see anyone actually doing this.

Instead, people are using a hodgepodge of containers with significant management overhead, that actually just sit on like Ubuntu or something. Maybe alpine. Or whatever Amazon distribution is used on ec2 now. Or of course, like in this article, Fedora CoreOS.

One day, I will work with people who have a network issue and don’t know how to look up ports in use. Maybe that’s already the case, and I don’t know it.

irishcoffee•3w ago

> The bigger problem I have with this, is the logical conclusion is to use “distroless” operating system images with vmlinuz, an init, and the minimal set of binaries and filesystem structure you need for your specific deployment, and rarely do I see anyone actually doing this.

In the few jobs I’ve had over 20 years, this is common in the embedded space, usually using Yocto. Really powerful, really obnoxious toolchain.

bitwize•3w ago

What you describe is from the "pets" era of server deployment, and we are now deep into the "cattle" era. Train yourself on destroying and redeploying, and building observability into the stack from the outset, rather than managing a server through ssh. Every shop you go to professionally is going to work like this. Eventually, Linux desktops will work like this also, especially with all the work going into systemd to support movable home directories, immutable OS images with modular updates, and so forth.

andrewmcwatters•3w ago

I already do this professionally, and when something is broken, we collectively as an industry have no idea why except for rolling back to a previous deployment because we have no time for system introspection, nor do we really want to spend engineering hours figuring it out. Just nuke it.

The bigger joke is everyone behaves like they have a ranch for all this cattle infrastructure.

In reality, the largest clients by revenue in the world have PetSmart. And frankly many of them, a fish bowl.

bigstrat2003•3w ago

> What you describe is from the "pets" era of server deployment, and we are now deep into the "cattle" era.

You still need to be able to work with individual servers. Saying "they're cattle, not pets" is just being a lazy sysadmin.

32kb•3w ago

I don't think this viewpoint is very pragmatic. "Pet" and "cattle" approaches solve different scales of problems. Shops should be adaptable to using either for the right job.

crawshaw•3w ago

The idea that an "observability stack" is going to replace shell access on a server does not resonate with me at all. The metrics I monitor with prometheus and grafana are useful, vital even, but they are always fighting the last war. What I need are tools for when the unknown happens.

The tool that manages all my tools is the shell. It is where I attach a debugger, it is where I install iotop and use it for the first time. It is where I cat out mysterious /proc and /sys values to discover exotic things about cgroups I only learned about 5 minutes prior in obscure system documentation. Take it away and you are left with a server that is resilient against things you have seen before but lacks the tools to deal with the future.

gear54rus•3w ago

Agreed, this sounds like some complicated ass-backwards way to do what k8s already does. If it's too big for you, just use k3s or k0s and you will still benefit from the absolutely massive ecosystem.

But instead we go with multiple moving parts all configured independently? CoreOS, Terraform and a dependence on Vultr thing. Lol.

Never in a million years would I think it's a good idea to disable SSH access. Like why? Keys and non-standard port already bring China login attempts to like 0 a year.

ValdikSS•3w ago

>What I need are tools for when the unknown happens.

There are tools which show what happens per process/thread and inside the kernel. Profiling and tracing.

Check Yandex's Perforator, Google Perfetto. Netflix also has one, forgot the name.

reactordev•3w ago

Or… you build a container, that runs exactly what you specify. You print your logs, traces, metrics home so you can capture those stack traces and error messages so you can fix it and make another container to deploy.

You’ll never attach a debugger in production. Not going to happen. Shell into what? Your container died when it errored out and was restarted as a fresh state. Any “Sherlock Holmes” work would be met with a clean room. We have 10,000 nodes in the cluster - which one are you going to ssh into to find your container to attach a shell to it to somehow attach a debugger?

toast0•3w ago

> We have 10,000 nodes in the cluster - which one are you going to ssh into to find your container to attach a shell to it to somehow attach a debugger?

You would connect to any of the nodes having the problem.

I've worked both ways; IMHO, it's a lot faster to get to understanding in systems where you can inspect and change the system as it runs than in systems where you have to iterate through adding logs and trying to reproduce somewhere else where you can use interactive tools.

My work environment changed from an Erlang system where you can inspect and change almost everything at runtime to a Rust system in containers where I can't change anything and can hardly inspect the system. It's so much harder.

IgorPartola•3w ago

Say you are debugging a memory leak in your own code that only shows up in production. How do you propose to do that without direct access to a production container that is exhibiting the problem, especially if you want to start doing things like strace?

joshuamorton•3w ago

I will say that, with very few exceptions, this is how a lot of $BigCo manage every day. When I run into an issue like this, I will do a few things:

- Rollback/investigate the changelog between the current and prior version to see which code paths are relevant

- Use our observability infra that is equivalent to `perf`, but samples ~everything, all the time, again to see which codepaths are relevant

- Potentially try to push additional logging or instrumentation

- Try to better repro in a non-prod/test env where I can do more aggressive forms of investigation (debugger, sanitizer, etc.) but where I'm not running on production data

I certainly can't strace or run raw CLI commands on a host in production.

reactordev•3w ago

Combined with stack traces of the events, this is the way.

If you have a memory leak, wrap the suspect code in more instrumentation. Write unit tests that exercise that suspect code. Load test that suspect code. Fix that suspect code.

I’ll also add that while I build clusters and throw away the ssh keys, there are still ways to gain access to a specific container to view the raw logs and execute commands but like all container environments, it’s ephemeral. There’s spice access.
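As one illustration of "wrap the suspect code in more instrumentation", here is a small sketch using Python's standard tracemalloc module. The leak itself (`_cache`, `suspect_handler`) is a made-up stand-in, and `top_growth` is a hypothetical helper; the idea is only that two allocation snapshots taken around the suspect path show which source lines grew.

```python
import tracemalloc

_cache = []  # hypothetical leak: grows on every call and is never trimmed


def suspect_handler(payload):
    """Stand-in for the suspect code path; it never releases its buffer."""
    _cache.append(bytearray(payload))
    return len(payload)


def top_growth(calls=1_000, limit=3):
    """Run the suspect code between two snapshots and report the biggest growers."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    for _ in range(calls):
        suspect_handler(b"x" * 1024)
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # Each entry records which source line allocated how many extra bytes.
    return after.compare_to(before, "lineno")[:limit]
```

Printing the returned entries gives file, line number, and size difference, which can be shipped home with the rest of the logs instead of being inspected over a shell.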

zinodaur•3w ago

> I certainly can't strace or run raw CLI commands on a host in production.

Have you worked the other way before? Where you have ssh access to machines (lots of them, when you need to do something big) that have all of your secrets, can talk to all of your dbs, and you can just compile + rsync binaries on to them to debug/repro/repair?

To me, being without those capabilities just feels crippling

joshuamorton•3w ago

> Have you worked the other way before? Where you have ssh access to machines (lots of them, when you need to do something big) that have all of your secrets, can talk to all of your dbs, and you can just compile + rsync binaries on to them to debug/repro/repair?

A lot of the problems I enjoy solving specifically relate to consistently minimizing privilege, not from a security perspective (though there are obvious upsides to this), but from a debugging/clarity perspective. If you have a relatively small and statically verifiable set of (networked) dependencies, and minimize which resources which containers can access, reasoning about the system as a whole becomes a lot easier.

I can think of lots of cases where I've witnessed good outcomes from moving towards more fine-grained resource access, and very few cases where life has gotten better by saying "everyone has access to everything".

zinodaur•3w ago

> A lot of the problems I enjoy solving specifically relate to consistently minimizing privilege

You are my perfect foil :)

> very few cases where life has gotten better by saying "everyone has access to everything"

I should have been more clear - I like the dev env where people have access to the things they are responsible for. E.g., as a maintainer/operator of service X, you can do all the things service X can do. So it's not like random employees are running binaries that interact with your db - only the small set of experts responsible for maintaining that service (also the people most inclined to be cautious, since they own the impact).

It does require you to trust the people operating their services, and requires those people to be careful and competent, but it can yield spectacular results.

The hacker thing mentioned by a sibling comment is definitely true though. I airgap my work machine, never browse the web on it and require fingerprint scans whenever sshing/rsyncing in to prod, but even then it's pretty sketch.

I feel like it's important to remember how powerful it is though - I want something like ssh/rsync access to a machine with a vlan tag that only lets it perform "safe" db/service interactions - hashing PII and stopping writes. But instead I get "observability" and half assed webuis, stale/redacted datalakes, and minutes long read-eval-print loop iterations with a coworker PR stamp required each iteration

reactordev•3w ago

If you can do those things in production, so can Lee Hong Quag in North Korea. I’d rather not have that capability in production and rely on proper CI/CD to deploy resources into the cloud. The way you like to work is like giving hackers a complete jump box into your organization. You are bound to get hacked, it’s only a matter of time.

ValdikSS•3w ago

>It is where I attach a debugger, it is where I install iotop and use it for the first time. It is where I cat out mysterious /proc and /sys values to discover exotic things about cgroups I only learned about 5 minutes prior in obscure system documentation.

It is, SSH is indeed the tool for that, but that's because until recently we did not have better tools and interfaces.

Once you try newer tools, you don't want to go back.

Here's the example of my fairly recent debug session:

    - Network is really slow on the home server, no idea why
    - Try to just reboot it, no changes
    - Run kernel perf, check the flame graph
    - Kernel spends A LOT of time in nf_* (netfilter functions, iptables)
    - Check iptables rules
    - sshguard has banned 13000 IP addresses in its table
    - Each network packet travels through all the rules
    - Fix: clean the rules/skip the table for established connections/add timeouts

You don't need debugging facilities for many issues. You need observability and tracing.

Instead of debugging the issue for tens of minutes at least, I just used an observability tool which showed me the path in 2 minutes.
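Why 13000 individual rules hurt can be shown with a toy model. This is a user-space sketch in Python, not a description of how netfilter is implemented: it only contrasts checking a packet against one rule per banned address with a single hash-set lookup. The address range and function names are invented for the example.

```python
import ipaddress
import timeit

# Hypothetical ban list of 13,000 addresses, mirroring the size in the anecdote.
BANNED = [ipaddress.ip_address("10.0.0.0") + i for i in range(13_000)]
BANNED_SET = set(BANNED)


def linear_chain_drop(src):
    """One rule per banned address: the packet is compared against each in turn."""
    for banned in BANNED:
        if src == banned:
            return True
    return False


def set_match_drop(src):
    """A single rule backed by a hash set: one lookup per packet."""
    return src in BANNED_SET


def compare(packets=200):
    """Time both strategies for a source that is not banned (the worst case)."""
    src = ipaddress.ip_address("192.0.2.1")
    linear = timeit.timeit(lambda: linear_chain_drop(src), number=packets)
    hashed = timeit.timeit(lambda: set_match_drop(src), number=packets)
    return linear, hashed
```

Legitimate traffic is the worst case for the linear version, because a packet that matches nothing has to walk the entire list before it is accepted.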

crawshaw•3w ago

How did you use tracing to check the current state of a machine’s iptables rules?

ValdikSS•3w ago

In this case I used the `perf` utility, but only because the server does not have a proper observability tool.

Take a look at this Netflix presentation, especially on the screenshots of their web interface tool: https://archives.kernel-recipes.org/wp-content/uploads/2025/...

crawshaw•3w ago

That is a command line tool run over ssh. If you have invented a new way to run command line tools, that’s great (and very possible, writing a service that can fork+exec and map stdio), but it is the equivalent to using ssh. You cannot run commands using traces.

charcircuit•3w ago

With that mindset anything is equivalent to ssh. The command line is not the pinnacle of user interfaces and giving admins full control of the machine isn't the pinnacle of security either.

We need to accept that UNIX did not get things right decades ago and be willing to evolve UX and security to a better place.

crawshaw•3w ago

Happy to try an alternative. Traces I have tried, and it is not an alternative.

IgorPartola•3w ago

See I would not reboot the server first before figuring out what is happening. You lose a lot of info by doing that and the worst thing that can happen is that the problem goes away for a little bit.

butvacuum•3w ago

most failstates aren't worth preserving in a SMB environment. In larger environments or ones equipped for it a snapshot can be taken before rebooting, should the issue repeat.

Once is chance, twice is coincidence, three times makes a pattern.

Ferret7446•3w ago

Alternatively, if it doesn't happen again it's not worth fixing, if it does happen again then you can investigate it when it happens again.

gerdesj•3w ago

To be fair, turning it off and on again is unreasonably effective.

I recently diagnosed and fixed an issue with Veeam backups that suddenly stopped working part way through the usual window and stopped working from that point on. This particular setup has three sites (prod, my home and DR), and five backup proxies. Anyway, I read logs and Googled somewhat. I rebooted the backup server - no joy, even though it looked like the issue was there. I restarted the proxies and things started working again.

The error was basically: there are no available proxies, even though they were all available (but not working but not giving off "not working" vibes).

I could bother with trying to look for what went wrong but life is too short. This is the first time that pattern has happened to me (I'll note it down mentally and it was logged in our incident log).

So, OK, I'll agree that a reboot should not generally be the first option. Whilst sciencing it or nerding harder is the purist approach, often a cheeky reboot gets the job done. However, do be aware that a Windows box will often decide to install updates if you are not careful 8)

akerl_•3w ago

No, you didn’t diagnose and fix an issue.

You just temporarily mitigated it.

abrookewood•3w ago

Sometimes that is enough - especially for home machines etc.

akerl_•3w ago

I’ve got no problem with somebody choosing to mitigate something instead of fixing it. But it’s just incorrect to apply a blind mitigation and declare that you’ve diagnosed the problem.

butvacuum•3w ago

what's the ROI on that?

-- leadership

rurban•3w ago

Turning it off and on again is risky. I recently upgraded a robot in Australia, had problems with systemd, so I turned it off. And had to wait a few weeks until it could be turned on again, because tailscaled was not set up persistently, the routing was not set up properly (over a phone), the machine had some problems,...

High risk, low reward. But of course the ultimate test if it's properly setup.

rurban•3w ago

But on the other hand, with my tiny hard real-time embedded controllers, a power cycle is the best option. No persistent state, fast power up, reboot in milliseconds. Every little SW error causes a reboot, no problem at all.

ValdikSS•3w ago

I've debugged so many issues in my life that sometimes I'd prefer things to just work, and if reboot helps to at least postpone the problem, I'd choose that :D

butvacuum•3w ago

seriously, and sometimes it's just not worth investigating. which means it's never going to get fixed, and I'd rather go home than create another ticket that'll just get stale and age out.

galleywest200•3w ago

My job as a DevOps engineer is to ensure customer uptime. If rebooting is the fastest, we do that. Figuring out the why is the primary developers’ jobs.

This is also a good reason to log everything all the time in a human readable way. You can get services up and then triage at your own pace after.

My job may be different than others’ as I work at an ITSP and we serve business phone lines. When business phones do not work it is immediately clear to our customers. We have to get them back up not just for their business but for the ability for them to dial 911.

tolciho•3w ago

> This is also a good reason to log everything all the time in a human readable way. You can get services up and then triage at your own pace after.

Unless, hypothetically, the logging velocity tickles kernel bugs and crashes the system, but only when the daemon is started from cron and not elsewhere. Hypothetically, of course.

Or when the system stops working two weeks after launch because "logging everything" has filled up the disk, and took two weeks to do so. This also means important log messages (perhaps that the other end is down) might be buried in 200 lines of log noise and backtrace spam per transaction, which in turn might delay debugging and fixing, or at least isolating at which end of the tube the problem resides.
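The disk-filling failure mode is usually handled by capping log size. A minimal sketch with Python's standard logging.handlers.RotatingFileHandler, where the logger name, file name, and size limits are arbitrary values chosen for the example:

```python
import logging
import logging.handlers
import os
import tempfile


def make_bounded_logger(path, max_bytes=10_000, backups=2, name="bounded_demo"):
    """Logger whose files on disk stay under about max_bytes * (backups + 1)."""
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo output out of the root logger
    logger.addHandler(handler)
    return logger


def disk_usage_after(lines=5_000, max_bytes=10_000, backups=2):
    """Write many log lines and return the total bytes left on disk."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "service.log")
        logger = make_bounded_logger(path, max_bytes, backups)
        for i in range(lines):
            logger.info("transaction %d processed", i)
        # Detach and close the handler so the files are flushed and released.
        for handler in logger.handlers[:]:
            handler.close()
            logger.removeHandler(handler)
        return sum(
            os.path.getsize(os.path.join(tmp, f)) for f in os.listdir(tmp)
        )
```

Rotation bounds disk use but does nothing for the signal-to-noise problem; the oldest lines are simply discarded, important or not.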

zufallsheld•2w ago

Yeah, and then it probably isn't the developers job to fix that but rather the DevOps engineer's one.

Also saying "the developer has to fix this" is something we tried to abolish when talking about DevOps. What about shared responsibility? Bridging the knowledge gap.

gerdesj•3w ago

I fail to understand how your approach is different to your parent.

perf is a shell tool. iptables is a shell tool. sshguard is a log reader and ultimately you will use the CLI to take action.

If you are advocating newer tools, look into nft - iptables is sooo last decade 8) I've used the lot: ipfw, ipchains, iptables and nftables. You might also try fail2ban - it is still worthwhile even in the age of the massively distributed botnet, and covers more than just ssh.

I also recommend a VPN and not exposing ssh to the wild.

Finally, 13,000 addresses in an ipset is nothing particularly special these days. I hope sshguard is making a properly optimised ipset table and that you are running appropriate hardware.

My home router is a pfSense jobbie running on a rather elderly APU4 based box and it has over 200,000 IPs in its pfBlocker-NG IP block tables and about 150,000 records in its DNS tables.

ValdikSS•3w ago

>perf is a shell tool. iptables is a shell tool. sshguard is a log reader and ultimately you will use the CLI to take action.

Well yes, and to be honest in this case I did that all over SSH: run `perf`, generate flame graph, copy the .svg to the PC over SFTP, open it in the file viewer.

What I really wanted is a web interface which will just show me EVERYTHING it knows about the system in the form of charts and graphs, so I can just skim through it and check visually if everything is all right, without using the shell and each individual command.

Take a look at the Netflix presentation, especially at their web interface screenshots: https://archives.kernel-recipes.org/wp-content/uploads/2025/...

>look into nft - iptables is sooo last decade

It doesn't matter in this context: iptables is using new netfilter (I'm not using iptables-legacy), and this exact scenario is 100% possible with native netfilter nft.

>Finally, 13,000 addresses in an ipset is nothing particularly special these days

Oh, the other day I had just 70 `iptables -m set --match-set` rules, and did you know how inefficient the source/destination address hashing algorithm for the set match apparently is?! It was debugged with perf as well, but I wish I just had it as a dashboard picture from the start.

I'm talking about ~4Gbit/s sudden limitation on a 10Gbit link.

gerdesj•3w ago

>Oh, the other day I had just 70 `iptables -m set --match-set` rules, and did you know how inefficient the source/destination address hashing algorithm for the set match apparently is?! It was debugged with perf as well!

>I'm talking about ~4Gbit/s sudden limitation on a 10Gbit link.

I think you need to look into things if 70 IPs in a table are causing issues, such that a 10Gb link ends up at four Gb/s. I presume that if you remove the ipset, that 10Gb/s is restored?

Testing throughput and latency is also quite a challenge - how do you do it?

gerdesj•3w ago

"What I really wanted is a web interface which will just show me EVERYTHING it knows about the system in a form of charts, graphs, so I can just skim through it and check if everything allright visually, without using the shell and each individual command."

Yes, we all want that. I've been running monitoring systems for over 30 years and it is quite a tricky thing to get right. .1.3.6.1.4.1.33230 is my company's enterprise number, which I registered a while back.

The thing is that even though we are now in 2026, monitoring is still a hard problem. There are, however, lots of tools, way more than we had back in the day. But just as a saw can rip your finger off instead of cutting a piece of wood, well, I'm sure you can fill in the blanks.

Back in the day we had a thing called Ethereal which was OK and nearly got buried. However you needed some impressive hardware to use it. Wireshark is a modern marvel and we all have decent hardware. SNMP is still relevant too.

Although we have stonking hardware these days, you do also have to be aware of the effects of "watching". All those stats have to be gathered and stashed somewhere and be analysed etc. That requires some effort from the system that you are trying to watch. That's why things like SNMP and RRD were invented.
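
The round-robin idea behind RRD can be illustrated in a few lines: storage is fixed up front, and new samples overwrite the oldest ones, so the cost of "watching" stays bounded. This is only a sketch of the concept, not how RRDtool itself is implemented; the class and method names are made up for the example.

```python
from collections import deque


class RoundRobinSeries:
    """Fixed-size store of (timestamp, value) samples; oldest is dropped first."""

    def __init__(self, slots):
        # A deque with maxlen silently discards the oldest item when full.
        self._samples = deque(maxlen=slots)

    def record(self, timestamp, value):
        self._samples.append((timestamp, value))

    def average(self):
        """Mean of the values currently retained, or None if empty."""
        if not self._samples:
            return None
        return sum(value for _, value in self._samples) / len(self._samples)

    def __len__(self):
        return len(self._samples)
```

However many samples are recorded, memory use never grows past `slots` entries.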

Anyway, it is 2026 and IT is still properly hard (as it damn well should be)!

kalaksi•3w ago

> What I really wanted is a web interface which will just show me EVERYTHING it knows about the system in the form of charts and graphs, so I can just skim through it and check visually if everything is all right, without using the shell and each individual command.

For this reason, I've created Lightkeeper: https://github.com/kalaksi/lightkeeper to simplify repetitive tasks and provide an efficient view for monitoring. Also has graphs as a recent addition, but screenshots don't show it. You can also drop to a terminal with a hotkey any time.

Ironically, it works over SSH without any additional daemons.

kelnos•3w ago

That only works if the people who built the observability tool have thought of everything. They haven't, of course; no one can.

It's great that you were able to solve this problem with your observability tools. But nothing will ever be as comprehensive as what you can do with shell access.

I don't get what the big deal is here. Just... use shell access when you need it. If you have other things in place that let you easily debug and fix some classes of issues, great. But some things might be easier to fix with shell access, and you could very easily run into something you can't figure out without ssh.

Completely disabling shell access is just making things harder for you. You don't get brownie points or magical benefits from denying yourself that.

DANmode•3w ago

> That only works if the people who built the observability tool have thought of everything. They haven't, of course; no one can.

But the tool draws data from forums of people who have had the problem I’m having before.

johnisgood•3w ago

Your example is a shell debugging session. You ran perf, checked iptables, inspected sshguard - all via SSH (or locally). The "observability tool" here is shell access to system utilities.

This proves the parent's point: when the unknown happens, you need a shell.

jeffbee•3w ago

I guess the question is why your observability stack isn't exposing proc and sys for you.

crawshaw•3w ago

Mine (prometheus) doesn’t because there are a lot of high-dimensional values to track in /proc and /sys that would blow out storage on a time-series database. Even if they did though, they could not let me actively inject changes to a cgroup. What do you suggest I try that does?

jeffbee•3w ago

Experience from another company where I (and you) worked suggests that having the endpoints to expose the system metrics, without actually collecting and storing them, is the way to go.
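
A rough sketch of what "expose without storing" could look like: an HTTP endpoint that reads a file under /proc or /sys only when someone asks for it, so nothing lands in a time-series database. The port, the size cap and every name below are assumptions for illustration; a real deployment would also need authentication and a much tighter allow list.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

ALLOWED_ROOTS = {"proc": Path("/proc"), "sys": Path("/sys")}
MAX_BYTES = 64 * 1024  # assumed cap so huge pseudo-files cannot be streamed out


def read_metric(url_path, roots=ALLOWED_ROOTS):
    """Map a URL path such as '/proc/loadavg' to file contents, read on demand.

    Returns None for anything that is missing or outside the allowed roots.
    """
    parts = [p for p in url_path.split("/") if p]
    if not parts or parts[0] not in roots:
        return None
    root = roots[parts[0]].resolve()
    target = root.joinpath(*parts[1:]).resolve()
    if root not in target.parents:  # rejects '..' and symlink escapes
        return None
    if not target.is_file():
        return None
    with target.open("rb") as handle:
        return handle.read(MAX_BYTES)


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = read_metric(self.path)
        if body is None:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9180):  # hypothetical port, bound to localhost only
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

Calling `serve()` and then fetching `http://127.0.0.1:9180/proc/loadavg` would return the current load average. This only covers the read side; it does nothing for the "inject changes to a cgroup" case raised above.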

crawshaw•3w ago

Years of debugging in that company’s restricted environments solidified my desire for shell access to production environments. I was there a month before I was hunting for breadcrumbs in a BINARY_INFO log that I had five minutes to grab before it was deleted.

jeffbee•3w ago

Well, it's funny you mention that, because one of my projects was a service that let users temporarily install binary info log collectors triggered by predicates, remotely, which at least I thought was a better model than ssh into the host or, for the advanced caveman, pdsh into many hosts. I don't really see a reason why I can't do that for gRPC, either ...

But, anyway, remote command and control of observability really is a thing in the industry, not just at one company.

cyberax•3w ago

Because you're holding it wrong!

The dashboards are something that looks cool, but they usually are not really helpful for debugging. What you're looking for is per-request tracing and logging, so you can grab a request ID and trace it (get log messages associated with it) through multiple levels of the stack. Even maybe across different services.

Debuggers are great, but they are not a good option for production traffic.
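
For what per-request logging can look like in practice, here is a minimal sketch with Python's standard `logging` and `contextvars` modules. The service name, log format and function names are invented for the example; a real setup would also forward the ID to downstream services, typically in a request header.

```python
import contextvars
import logging
import uuid

# Holds the ID of the request currently being handled ("-" outside a request).
request_id = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every log record."""

    def filter(self, record):
        record.request_id = request_id.get()
        return True


def make_logger(stream=None):
    logger = logging.getLogger("svc")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(request_id)s %(name)s %(message)s"))
    handler.addFilter(RequestIdFilter())
    logger.handlers = [handler]
    return logger


def lookup_user(logger):
    # No ID is passed in; the filter picks it up from the context variable.
    logger.info("querying user store")


def handle_request(logger, incoming_id=None):
    """Reuse the caller's ID if one was passed along, otherwise mint a new one."""
    token = request_id.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("request accepted")
        lookup_user(logger)
    finally:
        request_id.reset(token)
```

Every line logged while handling a request carries the same ID, so grepping for that one ID pulls out the full trail.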

raggi•3w ago

Yep.

Observability stacks are a similar blind alley to containers: they solve a handful of defined problems and immediately fall down on their own KPIs around events handled/prevented in place, efficiency, and being easier to use than what came before.

cryptonector•3w ago

The problem lies in surveillance and others understanding what you did. Say your security department records every shell interaction with prod services: how does one then review and understand what happened? This is a fairly tricky problem. Perhaps throw it at an LLM, but it'd have to be well trained to look for malicious actions.

gucci-on-fleek•3w ago

Fedora IoT [0] is a nice intermediate solution. Despite its name, it's really good for servers, since it's essentially just the Fedora Atomic Desktops (Silverblue/Kinoite) without any of the desktop stuff. It gets you atomic updates, a container-centric workflow, and easy rollbacks; but it's otherwise a regular server, so you can install RPMs, ssh into it, create user accounts, and similar. This is what I do for my personal server, and I'm really happy with it.

[0]: https://fedoraproject.org/iot/

dorfsmay•3w ago

Perfect timing for me, I've just been spending my side-project time in the last few weeks on building the smallest possible VMs with different glibc distros exactly for this, running podman containers, and comparing results.

starttoaster•3w ago

So it's AWS Fargate with a different name? That's cool for cloud hosted stuff. But if you're on prem, or manage your own VPSes, then you need SSH access.

npodbielski•3w ago

Last year I bought a mobo with IPMI, so in theory I could have forgotten about SSH and just inspected the startup logs if it failed to start.

Though I must say I am not brave enough, and my family uses it, so I prefer to have just one broken service instead of the entire machine.

But it is possible.

denkmoon•3w ago

Except you've replaced something good with something worse. IPMI really isn't an improvement over having SSH to the system. It definitely has more security holes.

npodbielski•3w ago

Can you SSH into a broken grub? Can you change BIOS settings? Also, giving access like that outside your home network is not a good idea. So security issues do not matter that much.

denkmoon•3w ago

I didn't say not to use IPMI, I said it's not a security improvement over SSH. For exactly the reason you point out, giving access outside your home. Nothing wrong with exposing SSH provided you take it seriously and know what you're doing. Nobody in their right mind would ever put IPMI on anything but a protected isolated network.

yigalirani•3w ago

real programmers can ssh to their servers

libHacker•3w ago

It's true. There's no reason to disable ssh. If you need it, it's there. If not, just don't use it.

npodbielski•3w ago

That looks interesting. The idea of configuring the server at boot via systemd would probably mean that migrating from machine to machine would be very easy. For me it always meant at least two days of careful planning, copying of files, testing and fixes, because I always forgot about some obscure config change I had made somewhere, like adding a DNS entry or disabling the default SMTP on Debian.

skeptic_ai•3w ago

You can try docker compose with Watchtower. Then you just deploy to a new branch: dev, prod. On the server side, a counterpart fetches updates from git; if anything has changed, it runs docker compose, which builds your image and puts it live.

Worked well for me for a few years.

Problems: when you have issues you need to look into the Portainer logs to see why it failed.

That’s one big problem; I'd prefer something like Jenkins to build it instead.

And if you have more groups of docker compose files, you just add another sh script to do this polling on the main infrastructure git repo, which on a git change will spawn new git watchers.
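
The server-side watcher described here boils down to a polling loop. A rough sketch, assuming the repo is already cloned on the server, the remote is called `origin`, and `docker compose` is on the PATH; the function names, branch name and interval are made up for illustration:

```python
import subprocess
import time


def git(repo, *args):
    """Run a git command inside repo and return its stripped stdout."""
    result = subprocess.run(
        ["git", "-C", repo, *args], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def needs_deploy(local_rev, remote_rev):
    """Deploy only when the checked-out commit differs from the remote branch."""
    return local_rev != remote_rev


def poll_once(repo, branch="prod"):
    git(repo, "fetch", "origin", branch)
    local_rev = git(repo, "rev-parse", "HEAD")
    remote_rev = git(repo, "rev-parse", f"origin/{branch}")
    if not needs_deploy(local_rev, remote_rev):
        return False
    git(repo, "merge", "--ff-only", f"origin/{branch}")
    # Rebuild images and restart whatever changed.
    subprocess.run(["docker", "compose", "up", "-d", "--build"], cwd=repo, check=True)
    return True


def watch(repo, branch="prod", interval=60):
    while True:
        try:
            poll_once(repo, branch)
        except subprocess.CalledProcessError as exc:
            # A failed fetch or build is reported and retried on the next tick.
            print(f"deploy step failed: {exc}")
        time.sleep(interval)
```

The failure output here only goes to stdout, which is exactly the "where do I look when it breaks" weakness mentioned above.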

hahahahhaah•3w ago

It is not just fine, it is best practice. SSH is the 2020s version of the 2010s' driving to the data centre.

gerdesj•3w ago

I'm glad you have a stack that works for you. The great thing is we have choice and it was not always so. I suggest that you be careful of the DevOps way. Sometimes a "pet" is the way to go, especially if you only have one. If you have a thundering herd then you'll be hand rolling your own nonsense with the best of the cloudy cowboys and have an out-of-service sign that says "they did it" for when the lights wink out!

I also notice that the word security does not grace your blog posting. That is a sure sign of the DevOps Way 8) You might look into the sysadmin way. It's boring, to be sure: all that fussing over security and the like!

You could look into VPNs for access to your gear. IPsec, OpenVPN or WireGuard seems to keep most baddies away simply because it is a lot of effort to even engage with one. There are a huge number of ways that a VPN can be configured. Then you have ssh, which can be very securely configured (or not).

You can also use firewalls and I'm sure you do. If you have a static IP at home then simply filter for that. Make use of allow/deny lists - there are loads for firewalls of all sorts.

Dumping remote shell access is not useful.

VladVladikoff•3w ago

This is right up there with people who are happy to not have root access on their phones.

Zopieux•3w ago

"tech blogger not reinvent NixOS but badly" challenge (impossible)

deathanatos•3w ago

> [Kubernetes…] Managed clusters could make things easier, but they aren’t cheap. That would defeat the initial motivation behind retiring moana.

Kubernetes is perfectly fine running on a "cluster" with a single node; this seems to be written under the misconception that k8s requires >1 node. It doesn't, though obviously a single- or two-node cluster will not maintain a majority if a node goes down. For self-hosting, though, that might be perfectly acceptable.

(My own self-hosted server was a single-node k8s cluster.)
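
The majority arithmetic, spelled out. This assumes a majority-based consensus store behind the control plane (etcd uses Raft, which needs a strict majority of members); it says nothing about worker nodes, and the function names are just for illustration.

```python
def quorum(members):
    """Smallest strict majority of a group of `members` voters."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can be lost while a majority still remains."""
    return members - quorum(members)


for n in (1, 2, 3, 5):
    print(n, quorum(n), tolerated_failures(n))
# 1 1 0
# 2 2 0
# 3 2 1
# 5 3 2
```

So going from one member to two buys no fault tolerance at all; three is the first size that survives a loss.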

k_bx•3w ago

I am also slowly preparing myself for the world where there is no SSH into the server machine. I am following what's happening around IncusOS. Already sold on Incus for my containers, it does make sense on paper: safe auto-updates, no manual key management, all you need is managed via API in a cluster (usually).

mvdwoord•3w ago

Can you tell me a bit more about how you use Incus? Is it just personal use, or otherwise? What type of workloads do you run on it, and how is your networking setup / experience?

k_bx•3w ago

I use it for professional use, running production services which don't require 99.999 availability and have relatively low traffic, basically some internal dashboards and tools.

I develop my programs as deb/systemd packages and deploy in "fat" ubuntu/debian incus containers.

I have a cluster with three machines where one is used for build containers and a second for production containers. I am looking forward to having time to set up ZFS streaming incremental backups of the containers.

For big servers I use Proxmox, which is great, but Incus (and IncusOS) feel a bit more futuristic, where Proxmox is a more bullet-proof enterprise solution.

lynx97•3w ago

Stockholm syndrome

jo-m•3w ago

I have a somewhat similar setup on the application layer (rootless podman, quadlet), but it's NixOS, and there still is SSH ;)

https://github.com/jo-m/fluffy

ThePowerOfFuet•3w ago

>Now that tinkerbell is up and running, I cannot even SSH into it. In fact, nothing can.

What's the point? Disable password auth, key only, and leave it be until the day you need a shell.

Then nobody but you can SSH into it.

