Update - A small number (fewer than 5 per day) of Terraform Runs continue to be affected by sporadic DNS errors. Next week we'll continue root cause analysis in pursuit of a conclusive fix.
Feb 25, 00:57 UTC
Update - Having observed error rates today, we have determined that the DNS configuration changes did not improve the situation. These changes have been rolled back, and co-tenancy has been reduced further. We will continue to monitor DNS errors closely.
Feb 22, 00:49 UTC
Update - The new DNS configuration for Terraform Runs has been rolled out. We will be monitoring closely for DNS errors today.
Feb 21, 18:25 UTC
Update - Reducing co-tenancy density has brought the DNS failure rate to near zero. We have additional configuration changes planned for early next week which should eliminate the remaining errors.
Feb 18, 00:27 UTC
Monitoring - The DNS config rollback has significantly reduced the rate of DNS failures in Terraform runs as compared to yesterday. We believe the root cause of the failures is packet loss stemming from a performance bottleneck in our isolation layer. We have reduced the co-tenancy density in that layer in the hopes that this reduces pressure on the bottleneck and stops the failures altogether. We will be continually monitoring this issue throughout the day and will continue to update this incident as we have more information.
Feb 17, 17:04 UTC
Update - An alternative DNS config rolled out today proved to exacerbate the rate of DNS lookup failures. That config change has now been rolled back on the Terraform workers.

We have now determined that the problem is not DNS specific, but is in fact related to packet loss within the virtual networking of our isolation layer.

We will continue to dig in and update this incident as we make progress on the root cause analysis.
Feb 16, 22:52 UTC
Investigating - We are performing an ongoing investigation into sporadic DNS lookup failures occurring within Terraform Runs and Packer Builds. Our initial attempts at mitigating this issue via DNS configuration changes have thus far not proven effective in preventing the lookup failures. We will update this incident with the results of our continued investigation as we have more information.
Feb 16, 16:38 UTC
Atlas: Operational
Packer Builds: Operational
Terraform Runs: Operational
Logging: Operational
SCADA: Operational
File Storage: Operational
GitHub: Operational
AWS ec2-us-east-1: Operational
AWS ec2-us-west-1: Operational
Twilio Outgoing SMS: Operational
Twilio REST API: Operational
Application compilation: Operational
System Metrics (Month / Week / Day): Atlas Availability, File Storage Availability, Logging Availability, SCADA Availability
Past Incidents
Feb 24, 2017

No incidents reported.

Feb 23, 2017

No incidents reported.

Feb 22, 2017
Resolved - We have narrowed the root cause down to an issue with a deployment tool, and we have queued up a fix to prevent this from happening again.
Feb 22, 16:45 UTC
Investigating - A problem with a deploy caused a momentary outage in our frontend web application. We have recovered and are investigating the root cause.
Feb 22, 15:57 UTC
Feb 20, 2017

No incidents reported.

Feb 19, 2017
Resolved - The database upgrade is complete and successful. All services are restored and performing normally.
Feb 19, 21:17 UTC
Identified - In the next few minutes we will be performing a mandatory database upgrade, which will have the following effects on service:

* Terraform Runs and Packer Builds will queue up for a window of time.
* Frontends may respond with errors as the database reboots.

We expect the upgrade event will take less than 15 minutes. We will update this incident once the maintenance completes, or if anything unexpected happens during the upgrade.
Feb 19, 20:56 UTC
Feb 16, 2017
Resolved - Terraform worker capacity has been increased to address queue delays. Queue wait times are normal.
Feb 16, 22:03 UTC
Identified - Terraform queue times are again higher than normal. We are in the process of actively spinning up capacity to address this.
Feb 16, 18:44 UTC
Resolved - Terraform queue wait times have returned to normal. Investigation suggests that the delays were caused not by a failure but by a capacity issue. We will scale up Terraform worker capacity to prevent issues like this from occurring in the future.
Feb 16, 15:26 UTC
Investigating - We're investigating queue delays in processing Terraform Runs.
Feb 16, 15:04 UTC
Feb 15, 2017

No incidents reported.

Feb 14, 2017
Resolved - A follow-on fix has been deployed which further improves the performance of Terraform page loads. Performance has been stable.
Feb 14, 19:31 UTC
Monitoring - A mitigating fix is out now. Performance has returned to normal and we are continuing to monitor.
Feb 14, 17:44 UTC
Investigating - Performance has dropped again. Our team is continuing to investigate.
Feb 14, 17:27 UTC
Monitoring - The fix has been deployed and we're monitoring the improving frontend performance.
Feb 14, 17:21 UTC
Identified - We are tracking an issue affecting performance of Terraform Enterprise frontend page loads. We're in the process of deploying a fix.
Feb 14, 17:09 UTC
Feb 13, 2017

No incidents reported.

Feb 12, 2017

No incidents reported.

Feb 11, 2017

No incidents reported.