Why do we need serverless Chef?
Managing a set of servers using Chef can save you lots of time and frustration, but it has its limits.
Our team kept bumping up against one of these limits: while developers worked simultaneously on multiple git branches, our chef-server stored a single set of configurations, which was used for all our servers.
Since we uploaded changes to the chef-server cookbook by cookbook and role by role, often from feature branches, it was easy for the data on our chef-server to get out of sync with what was in GitHub. We thought of locking the chef-server to our ‘main’ branch by allowing uploads only through an automated CD (Continuous Deployment) process, but this prevented us from quickly testing cookbook changes on individual nodes in the production environment. It also prevented us from making pull-requests for use in our code-review and testing process.
What we all really wanted was an ability to apply git branches to individual nodes. This too has its dangers and pitfalls. It’s possible for another git branch to get behind the main branch and miss out on critical updates. We also would lose the certainty that all servers were provisioned using the same configuration. We decided that, with some additional monitoring to show where we had deployed a branch other than ‘main,’ these issues were worth the gains we would have from finding a way to run serverless chef.
What did we have to overcome?
Our chef-server provided a single-point-of-truth for:
- node configurations (knife node)
- cookbooks and roles (knife path-based commands)
- finding servers by role and recipe (knife search)
- performing remote operations on multiple servers (knife ssh)
In order to run without a chef-server, we had to find relatively stable workarounds for each of these points. Here’s how we handled them.
Node configurations
We decided that we didn’t really need per-node configurations. We didn’t store them in our git repo anyway and they added unnecessary complexity. If a node required its own configuration, we could create a new role just for it.
What we did need was a way of storing an association between a node and its role. We already had a distributed key/value (k/v) store (HashiCorp Consul) installed on all nodes in our infrastructure, so we decided to store the association there. This allowed nodes to know about the role of every other node.
We then needed to load the data into a format that our existing Chef recipes would work with. We created a custom recipe that would run on all nodes, loading data about other nodes from Consul and creating the objects for Chef to use. Here’s a quick peek at part of that process:
Node.new.tap do |node|
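A fuller, self-contained sketch of that kind of loading step might look like the following. It is not our actual recipe. It assumes role associations live under keys like `nodes/<name>/role` and per-node JSON under `nodes/<name>/data`. The key layout, the `NodeInfo` struct (a stand-in for a real Chef node object) and the `build_nodes` helper are all hypothetical, and a plain Hash stands in for Consul k/v so the example runs without any gems.

```ruby
require 'json'

# Hypothetical stand-in for a Chef node object, so the sketch runs without Chef.
NodeInfo = Struct.new(:name, :role, :attributes)

# kv is a Hash standing in for Consul k/v; the key layout is an assumption.
def build_nodes(kv)
  names = kv.keys.map { |key| key.split('/')[1] }.uniq
  names.map do |name|
    NodeInfo.new.tap do |node|
      node.name = name
      node.role = kv.fetch("nodes/#{name}/role")
      # Per-node JSON is optional; default to an empty object.
      node.attributes = JSON.parse(kv.fetch("nodes/#{name}/data", '{}'))
    end
  end
end

kv = {
  'nodes/web1/role' => 'web',
  'nodes/web1/data' => '{"port": 8080}',
  'nodes/db1/role'  => 'database'
}
nodes = build_nodes(kv)
```

In a real recipe the Hash would be replaced by reads from the k/v store, and the struct by whatever object the existing recipes expect.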
The k/v store also allowed us to store a JSON object for each node, with data such as custom ports. These would be useful when writing configurations on other nodes. Hopefully in future, we’ll be able to rewrite recipes in a way that requires less data to be stored here. This could be achieved by sourcing port info from registered Consul services, but that’s something for another post.
Cookbooks and Roles
For ease of operation, we created a bash script that:
- Uses a deploy key to pull down a shallow clone of a particular branch of our chef git repo into a temp dir (single-branch mode with a depth of 1).
- Reads data from locally written config files and writes a simple JSON config file for Chef to use.
- Calls chef-client from the cloned directory with the “local,” “node-name” and “json-attributes” flags.
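The three steps above can be sketched roughly as follows. Our script is written in bash; this version is in Ruby and only builds the command lines, so they can be inspected before anything runs. The repository URL, branch name, node name and attribute contents are placeholders, and deploy-key handling is left out. The `git clone` and `chef-client` options shown are the standard long-form flags.

```ruby
require 'json'
require 'tmpdir'

# Builds a shallow, single-branch clone command for one branch.
def clone_command(repo_url, branch, dest)
  ['git', 'clone', '--depth', '1', '--single-branch', '--branch', branch, repo_url, dest]
end

# Writes the JSON attributes file that chef-client reads via --json-attributes.
# A run list with a single role is an assumption made for illustration.
def write_attributes(path, role)
  File.write(path, JSON.pretty_generate('run_list' => ["role[#{role}]"]))
  path
end

# Builds the chef-client invocation for local (serverless) mode.
def chef_command(node_name, json_path)
  ['chef-client', '--local-mode', '--node-name', node_name, '--json-attributes', json_path]
end

Dir.mktmpdir do |tmp|
  repo  = File.join(tmp, 'chef-repo')
  attrs = write_attributes(File.join(tmp, 'node.json'), 'web')
  steps = [clone_command('git@github.com:example/chef-repo.git', 'main', repo),
           chef_command('web1', attrs)]
  steps.each { |cmd| puts cmd.join(' ') }
  # To actually run them:
  #   system(*steps[0]) && Dir.chdir(repo) { system(*steps[1]) }
end
```

Building each command as an array rather than a single string avoids shell quoting problems when a branch or node name contains unusual characters.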
Finding servers by role and recipe
Since we had node information stored in Consul k/v, we were able to create a rake task that iterates through the run-lists of all roles/*.json files to create an inverted index of roles and recipes, allowing for relatively quick searches. This search would be even faster if we temporarily cached the inverted index, but for a first iteration, loading and parsing this data on each call ensures the data is up to date, keeps the process simple and has been fast enough for our purposes. The approach can become problematic if a role exists on one branch and not another, because the index only reflects the role files in the branch that is checked out.
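As an illustration of the inverted-index idea, here is a minimal sketch. It assumes each role file is a standard Chef role JSON with `name` and `run_list` fields, and that node-to-role associations arrive as a simple Hash. The helper names `build_index` and `find_nodes` are made up for this example, and nested roles are not expanded.

```ruby
require 'json'

# Maps each run-list item (e.g. "recipe[nginx]") to the roles that include it.
def build_index(roles_dir)
  index = Hash.new { |hash, key| hash[key] = [] }
  Dir.glob(File.join(roles_dir, '*.json')).sort.each do |file|
    role = JSON.parse(File.read(file))
    role_name = role['name'] || File.basename(file, '.json')
    Array(role['run_list']).each { |item| index[item] << role_name }
  end
  index
end

# Returns the names of nodes whose role includes the given run-list item.
# node_roles is a Hash of node name => role name (assumed to come from k/v).
def find_nodes(index, node_roles, item)
  # A search for "role[x]" also matches nodes assigned role x directly.
  direct = item[/\Arole\[(.+)\]\z/, 1]
  roles = index.fetch(item, []) | [direct].compact
  node_roles.select { |_node, role| roles.include?(role) }.keys.sort
end
```

Calling `find_nodes(build_index('roles'), node_roles, 'recipe[nginx]')` would then list every node whose role pulls in that recipe directly.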
Performing remote operations on multiple servers
We built a further script on top of the “find” command above, and made use of the same library that’s behind “knife ssh” (net/ssh/multi) to run commands across multiple servers simultaneously.
Chickens and Eggs
We found a few edge-cases where we were unsure of where we should start. Here is how we handled some of them.
Bootstrapping New Nodes
For serverless-chef to work, we needed our Chef code on each node, but how could we pull the git repo before a deploy key was available? We wanted Chef itself to create the user and key needed to pull the chef-repo from GitHub. To break this cycle, we created a rake task that zipped the current repo, copied it over to the node and then called our serverless-chef bash script with a “bootstrap” option. This option skipped the call to “git clone” and included the “chef-license-accept” command to ensure a smooth first run.
When the Key/Value Store Is Unavailable
Our first thought was that it would be good to keep running even when Consul was unavailable, but this meant that configuration files could be rewritten with incorrect data. The easiest and safest option seemed to be to fail hard when Consul was unavailable. We discovered that it was also good to have a configuration parameter to override this requirement, especially during the bootstrap phase.
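A minimal sketch of this fail-hard behaviour with an override could look like the following. The `KVUnavailable` error class, the `load_node_data` helper and the `allow_missing` flag are hypothetical names chosen for the example, and the k/v lookup is passed in as a callable so the sketch does not depend on any particular Consul client.

```ruby
# Raised by the fetcher when the k/v store cannot be reached (hypothetical).
class KVUnavailable < StandardError; end

# fetcher is any callable that returns node data or raises KVUnavailable.
# allow_missing is the override, intended mainly for the bootstrap phase.
def load_node_data(fetcher, allow_missing: false)
  fetcher.call
rescue KVUnavailable => e
  # Default: re-raise so the run stops before any config file is rewritten.
  raise unless allow_missing
  warn "k/v store unavailable (#{e.message}); continuing without node data"
  {}
end
```

Keeping the default strict means that a forgotten flag results in a failed run rather than configuration files written from incomplete data.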
Creating New Roles
Originally we thought it would be a good idea to halt a chef-client run immediately if an unknown role for a node was found in the Key/Value store. This resulted in Chef runs failing everywhere because someone deployed a new role on a new branch to a single node.
We eventually decided that the simplest way to handle this was to treat unknown roles as equivalent to the base role. This allowed chef-client to keep running when a role was only available on a single branch.
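The fallback itself can be very small. In this sketch the role name `base` and the `resolve_role` helper are assumptions. In practice the list of known roles would come from the role files in the branch that is checked out.

```ruby
# Name of the role every node can fall back to (assumed for illustration).
BASE_ROLE = 'base'

# Returns the stored role if this branch knows it, otherwise the base role.
def resolve_role(stored_role, known_roles)
  known_roles.include?(stored_role) ? stored_role : BASE_ROLE
end

known = %w[base web database]
resolve_role('web', known)        # known role, used as-is
resolve_role('experimental', known) # unknown on this branch, falls back to base
```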
Migrating Legacy Nodes
In order to smoothly and safely roll out serverless-chef across more than one hundred servers, we needed to ensure that “serverless nodes” and “legacy nodes” could get metadata about each other. This meant that the same custom recipes for pulling and pushing node metadata needed to be included in the legacy Chef repository. An extra challenge was ensuring that legacy nodes did not get registered from both the Chef-server and the k/v store, which we accomplished by excluding k/v-sourced nodes whose IP addresses were already known to Chef.
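The de-duplication step can be sketched like this. The shape of the node records (hashes with `:name` and `:ip` keys) and the `merge_nodes` helper are assumptions made for the example.

```ruby
# Combines nodes known to the chef-server with nodes found in the k/v store,
# dropping any k/v entry whose IP address the chef-server already knows.
def merge_nodes(chef_nodes, kv_nodes)
  known_ips = chef_nodes.map { |node| node[:ip] }
  chef_nodes + kv_nodes.reject { |node| known_ips.include?(node[:ip]) }
end

chef_nodes = [{ name: 'legacy1', ip: '10.0.0.1' }]
kv_nodes   = [{ name: 'legacy1', ip: '10.0.0.1' }, { name: 'new1', ip: '10.0.0.2' }]
merged = merge_nodes(chef_nodes, kv_nodes)
```

Matching on IP address rather than node name keeps the check working even if a node is registered under slightly different names in the two systems.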
Bonus Features We Added Later
Once our infrastructure was fully migrated and running relatively stably, we added the following features to our serverless-chef script:
- Log level. Passing this on to chef-client made debugging much easier.
- Process lock. Of course, chef-client has its own run lock, but adding one to our script ensured that the repo doesn’t get pulled again when the script is already running. We used flock for this.
- Avoiding pulling the same code repeatedly. By comparing the commit hashes of the local and GitHub heads, we were able to pull only when changes existed in the GitHub repo.
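The last two features can be sketched in Ruby as follows. Our script uses the `flock` command from bash. This version uses Ruby's `File#flock` instead, and the helper names are hypothetical. The hash comparison assumes the remote head is read with `git ls-remote`, whose output starts with the commit hash.

```ruby
# Runs the block only if no other process holds the lock.
# Returns true if the block ran, false if the lock was already taken.
def with_run_lock(lock_path)
  File.open(lock_path, File::RDWR | File::CREAT, 0o644) do |file|
    # LOCK_NB makes flock return false immediately instead of waiting.
    return false unless file.flock(File::LOCK_EX | File::LOCK_NB)
    yield
    true
  end
end

# Extracts the commit hash from `git ls-remote <remote> <ref>` output.
def remote_head(ls_remote_output)
  ls_remote_output.to_s.split.first
end

# True when the remote head is known and differs from the local head.
def pull_needed?(local_head, ls_remote_output)
  remote = remote_head(ls_remote_output)
  !remote.nil? && remote != local_head.strip
end

# Possible usage inside a checked-out repo:
#   local  = `git rev-parse HEAD`
#   remote = `git ls-remote origin refs/heads/main`
#   with_run_lock('/tmp/serverless-chef.lock') do
#     system('git', 'pull') if pull_needed?(local, remote)
#   end
```

If `git ls-remote` returns nothing, for example because the network is down, `pull_needed?` returns false and the run continues with the code already on disk.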
Running Chef without a central server is not the easiest thing to do but it can be extremely beneficial to developers working in particular environments. If done carefully, moving away from a single central Chef server can also improve stability.
I hope this article helps you and I wish you all the best on your own adventures with serverless-chef.