Some users unable to access Hyra

Incident Report for Hyra

Postmortem

On Sunday 7th January at approximately 6am UK time, Hyra experienced an outage that prevented some of our customers from accessing our services.

This outage lasted approximately 30 minutes.

What caused the outage?

Our developer applied network changes to our subnets, which removed the old IP addresses from the machines in our database cluster. Once these changes were applied, the machines were assigned new IP addresses from the new range.

However, the DNS records that the machines in the cluster use to reach each other still pointed to the previous IP addresses. As these addresses were no longer reachable, our database began to fail over.
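A check along the following lines could flag this kind of mismatch before a subnet change is applied. It is a minimal sketch, not the tooling we run: the function names, hostnames and subnet shown are hypothetical, and it assumes the cluster hostnames resolve to IPv4 addresses.

```python
import ipaddress
import socket
from typing import Callable, Iterable


def resolve_ipv4(hostname: str) -> list[str]:
    """Resolve a hostname to its IPv4 addresses using the system resolver."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def find_stale_records(
    hostnames: Iterable[str],
    new_subnet: str,
    resolve: Callable[[str], list[str]] = resolve_ipv4,
) -> dict[str, list[str]]:
    """Return hostnames that still resolve to addresses outside new_subnet."""
    network = ipaddress.ip_network(new_subnet)
    stale: dict[str, list[str]] = {}
    for host in hostnames:
        outside = [
            address
            for address in resolve(host)
            if ipaddress.ip_address(address) not in network
        ]
        if outside:
            stale[host] = outside
    return stale
```

Any hostname returned by find_stale_records still points at an address outside the new range and would need its record updated before the old addresses are removed. The resolve parameter exists so the check can be exercised with a fake resolver instead of live DNS.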

As all of our IP addresses had been changed, there were not enough reachable nodes in our replica set to continue operation. This caused a panic at the API level, which resulted in 500 errors at the edge.
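To illustrate why losing contact between nodes stops the database, here is a minimal sketch of a majority check. This report does not specify the exact election rule our database uses, so the strict-majority rule and the has_quorum name are assumptions made for illustration only.

```python
def has_quorum(total_members: int, reachable_members: int) -> bool:
    """Return True when a strict majority of cluster members can be reached.

    Assumption: the cluster needs more than half of its members to keep
    operating. The real rule depends on the database in use.
    """
    if total_members <= 0:
        raise ValueError("total_members must be positive")
    return reachable_members > total_members // 2
```

Under this assumption, a three-node cluster keeps running with two reachable nodes but stops once only one node can be reached, which matches the situation where every node's peers became unreachable at the same time.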

A second issue then appeared. Our database is set up to listen only on certain IP addresses and interfaces. Because our DNS resolution was no longer correct, some of our nodes could not listen on all of their interfaces, which blocked access to our database from some of our external services hosted on Railway.

How was it fixed?

For now, we have re-attached our previous subnet to the machines, alongside the new network. This ensures that the previous DNS resolution still works.

At a later date, our developer will remove the legacy subnet and complete a full migration to the new subnet.

To resolve the issue with DNS resolution on our external interface, we have manually added the public IP address of the machine to the network configuration.

How will we prevent this in the future?

We will prevent similar downtime in the future by applying changes incrementally. We will also provide a replica network to ensure packets can be routed during the migration.
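As a rough sketch of what applying changes incrementally can look like, the loop below migrates one node at a time and stops as soon as a node fails its health check. The migrate_incrementally function and its migrate and is_healthy callbacks are hypothetical placeholders, not our actual migration tooling.

```python
from typing import Callable, Sequence


def migrate_incrementally(
    nodes: Sequence[str],
    migrate: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Migrate nodes one at a time and stop at the first unhealthy node.

    Returns the nodes that were migrated and passed their health check.
    """
    completed: list[str] = []
    for node in nodes:
        migrate(node)
        if not is_healthy(node):
            # Halt here so the remaining nodes stay on the old network.
            break
        completed.append(node)
    return completed
```

Stopping at the first failure keeps the remaining nodes on the old network, so only one node is affected while the problem is investigated.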

Posted Jan 07, 2024 - 06:49 GMT

Resolved

This incident has been resolved.

Posted Jan 07, 2024 - 06:34 GMT

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jan 07, 2024 - 06:32 GMT

Update

We are continuing to work on a fix for this issue.

Posted Jan 07, 2024 - 06:15 GMT

Identified

The issue has been identified and a fix is being implemented.

Posted Jan 07, 2024 - 06:09 GMT

Investigating

We are currently investigating this issue.

Posted Jan 07, 2024 - 06:01 GMT

This incident affected: Authentication (Login In & Sign Up), Workspaces (Home, Activity, Assignments), Staff Management (Views, Profiles), and Workspace Admin.