Post-mortem: We identified an issue with resource locking that prevented resources in our pool (e.g. database and cache access, message broker, API gateway) from being released when the autoscaling group detected a drop in service usage and deactivated some of the instances. When usage increased again at certain periods of the day and more instances were launched, the resource pool was still holding several locks from the old instances; it became overloaded and could not supply the new instances with access to the resources they needed, taking the service down. Once the disruption was identified, a backup instance was brought up and the resource pool was reset to serve the application during the maintenance period.
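To make the failure mode concrete, here is a minimal sketch (hypothetical names, not our actual pool code) of how locks acquired without any timeout end up orphaned when the holding instances are scaled down before releasing them:

```python
import time


class ResourcePool:
    """Hypothetical resource pool: each resource is either free or locked by one instance."""

    def __init__(self, resource_ids):
        self.free = set(resource_ids)
        self.locks = {}  # resource_id -> (holder_instance_id, acquired_at)

    def acquire(self, instance_id):
        """Hand any free resource to the caller, or None if everything is locked."""
        if not self.free:
            return None  # pool exhausted: every resource is held by some instance
        resource_id = self.free.pop()
        self.locks[resource_id] = (instance_id, time.time())
        return resource_id

    def release(self, resource_id, instance_id):
        holder, _ = self.locks.get(resource_id, (None, None))
        if holder == instance_id:
            del self.locks[resource_id]
            self.free.add(resource_id)


pool = ResourcePool([f"db-conn-{i}" for i in range(4)])

# Old instances acquire resources, then get scaled down without ever calling release():
for i in range(4):
    pool.acquire(f"old-instance-{i}")

# Later, under higher load, a new instance comes up and finds the pool exhausted:
assert pool.acquire("new-instance-0") is None  # orphan locks starve the new instance
```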
What we changed: Autoscaling has been reconfigured and its rules rewritten. A minimum of 3 instances (instead of 1) will always be running behind a failover-enabled main load balancer that alerts us and stops routing requests to any instance that starts failing. The clearing system has been moved to a separate server with no public access. Patches were applied to prevent the resource pool from holding orphan locks, including code changes to how locks are released and timeouts on the locks themselves. Additional health checks were added. All CapSettle environments were moved to the new setup with no downtime during the maintenance window.
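For the lock patches, the idea is roughly the following (a simplified sketch under assumed names, an illustrative TTL, and an assumed release-on-shutdown hook; not the production code): every lock carries a lease that expires, and an instance releases its locks on graceful shutdown, so a terminated instance can no longer starve the pool.

```python
import atexit
import time


class LeasedResourcePool:
    """Sketch of the patched pool: locks expire after a TTL and are released on shutdown."""

    def __init__(self, resource_ids, ttl_seconds=60):  # 60 s is an illustrative TTL
        self.ttl = ttl_seconds
        self.free = set(resource_ids)
        self.locks = {}  # resource_id -> (holder_instance_id, lease_expires_at)

    def _reclaim_expired(self):
        now = time.time()
        for resource_id, (_, expires_at) in list(self.locks.items()):
            if expires_at <= now:  # holder never renewed the lease: treat the lock as orphaned
                del self.locks[resource_id]
                self.free.add(resource_id)

    def acquire(self, instance_id):
        self._reclaim_expired()
        if not self.free:
            return None
        resource_id = self.free.pop()
        self.locks[resource_id] = (instance_id, time.time() + self.ttl)
        return resource_id

    def release(self, resource_id, instance_id):
        holder, _ = self.locks.get(resource_id, (None, None))
        if holder == instance_id:
            del self.locks[resource_id]
            self.free.add(resource_id)

    def release_all(self, instance_id):
        """Called at graceful shutdown so a scaled-down instance leaves no locks behind."""
        for resource_id, (holder, _) in list(self.locks.items()):
            if holder == instance_id:
                self.release(resource_id, instance_id)


pool = LeasedResourcePool([f"db-conn-{i}" for i in range(4)], ttl_seconds=1)
pool.acquire("old-instance-0")
atexit.register(pool.release_all, "old-instance-0")  # clean-shutdown path
time.sleep(1.1)                                       # even if the shutdown hook never runs...
assert pool.acquire("new-instance-0") is not None     # ...the expired lease is reclaimed
```

The lease length is a trade-off: too short and healthy instances can lose locks mid-work; too long and orphaned locks linger almost as badly as before.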
What we want to achieve: ensure the service is always up, even in case of minor issues; ensure the service stays up under high load; ensure high load does not affect order clearing; ensure similar issues are identified and fixed faster in the future, with minimal to zero downtime.
Next steps: we will keep monitoring the new setup, and in particular the resource pool, to confirm that locks are garbage-collected correctly and the service stays available.
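As an illustration of what that tracking can look like (a hypothetical helper matching the sketches above, not an existing tool), a check that flags locks outliving their lease:

```python
import time


def lock_health(locks, free, now=None):
    """Hypothetical metric: how many locks are held and how many have outlived their lease.

    `locks` maps resource_id -> (holder_instance_id, lease_expires_at) and `free` is the
    set of unlocked resources, the same shapes used in the sketches above. A lock that
    stays past its lease means the reclaim path is not doing its job.
    """
    now = time.time() if now is None else now
    past_lease = sum(1 for _, expires_at in locks.values() if expires_at <= now)
    return {
        "locks_held": len(locks),
        "resources_free": len(free),
        "locks_past_lease": past_lease,  # alert if this is ever non-zero
    }


# Example reading on a healthy pool: one live lock, two free resources, nothing past lease.
print(lock_health(locks={"db-conn-0": ("instance-a", time.time() + 60)},
                  free={"db-conn-1", "db-conn-2"}))
```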