Maintenance strategy for low-volume appliance deployments?

TL;DR: looking for a low-overhead, maintainable hardware + software stack for low-volume appliance deployments behind NAT.

I’m in a software/ops business; we’re a small shop, but we’re dipping our toes into building our own appliances. At this stage, everything we need could probably be served by a RasPi4 or an APU2. A GigE port is the crucial component, since the entire value comes from the device simply being present on-site and pushing some bytes around (sustained 10–20 Mbps), possibly close to 24×7.

So in terms of HW, we’re looking for:

  • Off-the-shelf, relatively easy to source in EU
  • OK to make some compromises on CPU, memory, disk, but NOT networking
  • Not looking for mass volume, at this stage maybe a dozen deployments
  • Lowest possible price is not that important

…But of course, the above requirements can shift in the future.

In terms of software, our existing backend stack is all Python in Docker on AWS, but we’re likely to build from scratch for these devices, and we’re open to trying different approaches. What we’re mainly concerned with is:

  • the ability to iterate fast
  • smooth rollouts/rollbacks
  • better resource utilisation (Python is not great here)
  • remote management – many of these devices are going to sit behind a NAT, sometimes with absolutely NO option to accept external connections (see the sketch after this list)
  • the ability to recover from a partial screwup (minimise the risk of complete bricking)
  • the hassle with OS/third-party security updates
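On the NAT point: with no inbound connectivity at all, the usual answer is to have the device initiate everything – either a persistent outbound tunnel or plain polling against the C&C backend. As a rough illustration of the polling variant (the endpoint URL, device ID, and command shape below are all made up, not a real API), something like:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// Command is whatever the C&C backend tells the device to do next.
// (Hypothetical shape – adjust to your own API.)
type Command struct {
	Action  string `json:"action"`            // e.g. "noop", "update"
	Version string `json:"version,omitempty"` // target binary version for "update"
}

func main() {
	client := &http.Client{Timeout: 30 * time.Second}
	for {
		// Outbound-only: the device initiates every connection,
		// so the NAT/firewall never has to accept inbound traffic.
		resp, err := client.Get("https://cc.example.com/api/v1/next-command?device=dev-001")
		if err != nil {
			log.Printf("poll failed: %v (will retry)", err)
		} else {
			var cmd Command
			if err := json.NewDecoder(resp.Body).Decode(&cmd); err != nil {
				log.Printf("bad response: %v", err)
			} else if cmd.Action == "update" {
				log.Printf("update requested: version %s", cmd.Version)
				// hand off to an updater like the one sketched further down
			}
			resp.Body.Close()
		}
		time.Sleep(60 * time.Second) // polling interval; tune to taste
	}
}
```

The same pattern works over a persistent WebSocket or a reverse SSH tunnel if minute-level polling turns out to be too coarse.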

I’ve been looking at gokrazy, but the platform support at this time is somewhat limited (and no RasPi4), and that would lock us into basically a single supported device (apu2c4). Alpine is also looking great – small, hardened, and somewhat familiar (to anyone who’s been working a lot with Docker).

I’m also concerned with remote management. I have the most experience with Ansible, and honestly it’s because of that experience that I’m fairly certain I would prefer something much simpler and more lightweight – but I’d rather avoid building a tool in-house. The basic requirements are just to pull the new binary, restart the service, execute a healthcheck, and roll the hell back if it broke things. The rest could probably sit in the application, since it’d be driven by a centralised C&C backend and otherwise remain stateless. Of course this leaves the question of OS updates wide open.
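That pull/restart/healthcheck/rollback loop is small enough that it could live in a tiny updater rather than a full config-management tool. A minimal sketch, assuming the service runs under systemd, exposes a local health endpoint, and the previous binary is kept around for rollback (the paths, unit name, and health URL are hypothetical placeholders):

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"os/exec"
	"time"
)

const (
	binPath   = "/opt/appliance/bin/appliance"     // live binary (hypothetical path)
	newPath   = "/opt/appliance/bin/appliance.new" // freshly downloaded binary
	oldPath   = "/opt/appliance/bin/appliance.old" // previous version, kept for rollback
	healthURL = "http://127.0.0.1:8080/healthz"    // hypothetical local health endpoint
	service   = "appliance.service"                // hypothetical systemd unit
)

func restart() error {
	return exec.Command("systemctl", "restart", service).Run()
}

func healthy() bool {
	// Give the service a moment to come up, then probe it a few times.
	for i := 0; i < 5; i++ {
		time.Sleep(3 * time.Second)
		resp, err := http.Get(healthURL)
		if err == nil && resp.StatusCode == http.StatusOK {
			resp.Body.Close()
			return true
		}
	}
	return false
}

func main() {
	// Keep the current binary around, promote the new one, restart.
	if err := os.Rename(binPath, oldPath); err != nil {
		fmt.Fprintln(os.Stderr, "backup failed:", err)
		os.Exit(1)
	}
	if err := os.Rename(newPath, binPath); err != nil {
		os.Rename(oldPath, binPath) // restore and bail
		fmt.Fprintln(os.Stderr, "swap failed:", err)
		os.Exit(1)
	}
	if err := restart(); err == nil && healthy() {
		fmt.Println("update OK")
		return
	}
	// Healthcheck failed: roll the hell back.
	os.Rename(oldPath, binPath)
	restart()
	fmt.Fprintln(os.Stderr, "update rolled back")
	os.Exit(1)
}
```

An A/B partition scheme extends the same swap-and-fall-back idea to the whole OS image, which is one way to start closing the “OS updates” gap as well.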

I’d appreciate any thoughts / insight / war stories.

submitted by /u/rollc_at