Update - A small number (fewer than 5 per day) of Terraform Runs continue to be affected by sporadic DNS errors. Next week we'll continue root cause analysis in pursuit of a conclusive fix.
Feb 25, 00:57 UTC
Update - Having observed error rates today, we have determined that the DNS configuration changes did not improve the situation. These changes have been rolled back, and co-tenancy has been reduced further. We will continue to monitor DNS errors closely.
Feb 22, 00:49 UTC
Update - The new DNS configuration for Terraform Runs has been rolled out. We will be monitoring closely for DNS errors today.
Feb 21, 18:25 UTC
Update - Lowering the co-tenancy density has reduced the DNS failure rate to near zero. We have some additional config changes planned for early next week which should take care of the remaining few errors.
Feb 18, 00:27 UTC
Monitoring - The DNS config rollback has significantly reduced the rate of DNS failures in Terraform Runs compared to yesterday. We believe the root cause of the failures is packet loss stemming from a performance bottleneck in our isolation layer. We have reduced the co-tenancy density in that layer in the hope that this relieves pressure on the bottleneck and stops the failures altogether. We will monitor this issue throughout the day and update this incident as we have more information.
Feb 17, 17:04 UTC
Update - An alternative DNS config rolled out today increased the rate of DNS lookup failures. That config change has now been rolled back on the Terraform workers.
We have now determined that the problem is not DNS-specific, but is in fact related to packet loss within the virtual networking of our isolation layer.
We will continue to dig in and update this incident as we make progress on the root cause analysis.
Feb 16, 22:52 UTC
Investigating - We are investigating sporadic DNS lookup failures occurring within Terraform Runs and Packer Builds. Our initial attempts to mitigate this issue via DNS configuration changes have not yet proven effective in preventing the lookup failures. We will update this incident with the results of our continued investigation as we have more information.
Feb 16, 16:38 UTC