A few months back I received a red alert that our authentication service was failing health checks. Not good. We use a third-party open-source platform to handle auth, deployed with Elastic Container Service (ECS) in AWS. Checking the logs on the ECS dashboard, it looked like the service was stuck in a loop, repeatedly trying to deploy new tasks, each of which failed to start.
Checking our webapp, I discovered I was still able to log in with no issues, so the problematic containers hadn’t made their way to production, or at least not yet. We got quite lucky that these containers failed to start. If they had started and been released, we would have had an outage on our hands, as almost every endpoint in our backend relies on this service for authentication.
Digging a bit deeper into the logs, it turned out the containers were not starting because a required env variable was not present. But what had changed? This service had been running fine, untouched, for months, and the env variables it uses had never been updated.
It’s not abnormal for ECS to deploy new containers to a cluster, whether that’s for maintenance or auto-scaling. Being able to do this automatically, with minimal setup and no downtime, is one of the benefits of using ECS. But in our case there was a fatal flaw in the setup that caused these issues: not using a specific version tag on the container image.
The Dangers of Using "Latest" Tags
Back when I originally deployed this service I copy+pasted the container image URL from the service’s dev docs into the task definition and went on my way with setting up everything else. It’s not uncommon for Docker deployment instructions in dev docs to use the “latest” tag to specify the version as anyone setting up and trying out the application for the first time probably wants the newest version.
But setting up a service for production versus trying something out locally is a very different process. There can be breaking changes between major versions of applications, and if you don’t pin a specific version you run the risk of finding out about these changes the hard way.
In our case, the “latest” version of the auth service when we originally set it up was 4.X. When ECS decided to deploy new containers, it pulled whatever “latest” pointed to in the Docker registry, which by then was 5.X, a new major version that was completely different from 4.X. Even though many images maintain backwards compatibility between versions, you can never be sure of this without reading the release notes and testing.
Version Pinning: The Simple Fix
The solution to all of this was to update the task definition to use a fixed version of the container image to ensure that any time it gets re-deployed, the code running is the exact same as before. It seems obvious to say now but it’s also something that can be overlooked especially when you get in the habit of always using the “latest” tag when testing containers out locally.
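To make the idea concrete, here is a minimal sketch of a shell check that decides whether an image reference is pinned. The image names below (such as `example/auth-service:4.2.1`) are made up for illustration and are not the service or version from this story. The assumption is that a reference counts as pinned only if it has an explicit tag other than “latest”, or a digest.

```shell
# Returns 0 if the image reference is pinned to an explicit, non-"latest" tag
# or to a digest; returns 1 otherwise. Image names used with it here are
# hypothetical examples.
is_pinned() {
  local ref="$1"

  # A digest reference (name@sha256:...) always points at the same image.
  case "$ref" in
    *@sha256:*) return 0 ;;
  esac

  # Look only at the last path component so that a registry port
  # (e.g. localhost:5000/app) is not mistaken for a tag.
  local last="${ref##*/}"
  case "$last" in
    *:latest) return 1 ;;  # explicit "latest"
    *:*)      return 0 ;;  # some other explicit tag
    *)        return 1 ;;  # no tag at all, which Docker treats as "latest"
  esac
}

is_pinned "example/auth-service:4.2.1" && echo "pinned"
is_pinned "example/auth-service:latest" || echo "not pinned"
```

A check like this could be run against the image values in a task definition before registering it. One way to read those values is `aws ecs describe-task-definition` with a `--query` on `taskDefinition.containerDefinitions[].image`, though how you wire that into a pipeline depends on your setup.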
When I was triaging the issue I had to jump on a call with AWS support to figure out how to get ECS out of the infinite loop of trying to deploy new tasks that kept failing. They told me that they get support tickets every day for issues caused by containers deployed using the “latest” tag, and that their number one piece of advice is to always version your images when using ECS.
Other Lessons Learned
1. Use a private container registry, such as AWS’s Elastic Container Registry (ECR), for production services.
Another problem we faced when fixing this service was that ECS began hitting rate limits from the Docker registry, because it was trying to pull the image so many times while stuck in the deployment loop.
Even an hour after stopping the re-deployment loop we were still rate limited and couldn’t pull the image from the Docker registry into ECS. By uploading copies of the images to ECR you can cut out one more potential problem and be sure that the images with the specific versions you want are always available. You can do this pretty easily with the Docker and AWS CLIs by pulling the image (with the specific version), tagging the image, logging into ECR, and pushing the new tag.
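The four steps above can be sketched as a pair of shell functions. Everything passed in (image name, version, account ID, region, repository name) is a placeholder you would swap for your own values, and the sketch assumes the ECR repository already exists and that both the Docker and AWS CLIs are installed and configured.

```shell
# Build the ECR image reference for an account, region, repository, and tag.
ecr_image_ref() {
  local account_id="$1" region="$2" repo="$3" tag="$4"
  echo "${account_id}.dkr.ecr.${region}.amazonaws.com/${repo}:${tag}"
}

# Copy one pinned image from a public registry into ECR.
# Usage: mirror_to_ecr SOURCE_IMAGE TAG ACCOUNT_ID REGION ECR_REPO
mirror_to_ecr() {
  local source_image="$1" tag="$2" account_id="$3" region="$4" repo="$5"
  local target registry
  target="$(ecr_image_ref "$account_id" "$region" "$repo" "$tag")"
  registry="${target%%/*}"  # everything before the first "/" is the registry host

  # 1. Pull the exact version you have tested, never "latest".
  docker pull "${source_image}:${tag}" || return 1

  # 2. Tag the local image with its ECR name, keeping the same version tag.
  docker tag "${source_image}:${tag}" "$target" || return 1

  # 3. Log Docker in to ECR using a short-lived password from the AWS CLI.
  aws ecr get-login-password --region "$region" \
    | docker login --username AWS --password-stdin "$registry" || return 1

  # 4. Push the tagged image to ECR.
  docker push "$target"
}

# Example call with made-up values (left commented out so nothing runs by accident):
# mirror_to_ecr "example/auth-service" "4.2.1" "123456789012" "us-east-1" "auth-service"
```

Once the image is in ECR, the task definition can point at the ECR reference instead of the public one, so ECS no longer depends on the public registry being reachable or within its rate limits.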
2. Make sure the container versions you use for local development match what you’re running in production.
This is another one that seems obvious to say now, but in practice it’s quite easy to throw a “latest” tag (or a version different from what’s running in production) into a docker-compose file that’s used for local development. Not having these versions match can often lead to code that “works on my machine!” breaking once it hits production.
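One rough way to catch this kind of drift is to compare the image in the compose file against the reference you know is running in production. The sketch below is a plain text scan of `image:` lines, not a real YAML parser, and the file contents and image names in the example are hypothetical.

```shell
# Print every image reference declared in a compose file, one per line.
# This is a simple text scan of "image:" lines, not a YAML parser.
compose_images() {
  sed -n 's/^[[:space:]]*image:[[:space:]]*//p' "$1" \
    | tr -d "\"'" \
    | while read -r ref _; do echo "$ref"; done  # drop trailing comments
}

# Succeed only if the compose file uses exactly the given image reference.
compose_uses_image() {
  compose_images "$1" | grep -qxF -- "$2"
}

# Example with a throwaway compose file and made-up image names.
example_file="$(mktemp)"
printf 'services:\n  auth:\n    image: example/auth-service:latest\n' > "$example_file"
if ! compose_uses_image "$example_file" "example/auth-service:4.2.1"; then
  echo "local auth image does not match production"
fi
rm -f "$example_file"
```

A check along these lines could run in CI or as a pre-commit hook, with the production reference supplied from wherever your deployment configuration lives.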
We were lucky to avoid an outage caused by using the “latest” tag for a key service running in production. If it hadn’t been for a required env variable introduced in the new version, the container would have started successfully, the health checks would have passed, and it would have been deployed to production, causing just about everything to go down while we figured out how to fix the issue. It’s easy to get used to always using the “latest” tag for local development, but setting up a production service is a very different process.