Build and Deploy System Requirements
Guiding Principles
- Cattle, not pets
- Least access
- Prevent casual browsing of secret information
- Script or server should only have exactly enough permissions to do what it needs
- Everything encrypted
- In transit (TLS/SSL)
- On disk (typically AWS encrypted EBS)
- IaC (Infrastructure as Code)
- This includes the job itself: the job must be defined as source-controlled code. With Jenkins, that means a Jenkinsfile (Jenkins Pipeline); with CodePipeline, a buildspec.yml plus a CloudFormation template that defines the CodePipeline and CodeBuild instances (see the sketch after this list)
- Build & run locally the same as production (reduce “works on my laptop” syndrome)
- Clear purpose
- Server / environment serves a clear purpose (e.g. preprod runs the same code as prod)
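To make the "job as code" principle concrete, here is a minimal buildspec.yml sketch for a CodeBuild project that builds, tests, and publishes a container image. The my-service name, the ECR_REGISTRY environment variable, and the run-tests.sh script are placeholders, and the CloudFormation template that defines the CodePipeline/CodeBuild resources would live in the same repository.

```yaml
# buildspec.yml -- hypothetical sketch; repository, image, and script names are placeholders
version: 0.2

phases:
  pre_build:
    commands:
      # Log in to ECR so the built image can be pushed later
      - aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS --password-stdin "$ECR_REGISTRY"
  build:
    commands:
      # Tag the image with the commit SHA so the same artifact can be promoted later (BODM)
      - docker build -t "$ECR_REGISTRY/my-service:$CODEBUILD_RESOLVED_SOURCE_VERSION" .
      # Run the service's test suite inside the image that will be shipped
      - docker run --rm "$ECR_REGISTRY/my-service:$CODEBUILD_RESOLVED_SOURCE_VERSION" ./run-tests.sh
  post_build:
    commands:
      - docker push "$ECR_REGISTRY/my-service:$CODEBUILD_RESOLVED_SOURCE_VERSION"
```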
Build Requirements
- Standard branch & merge strategy (e.g. master with PRs; no direct commit to master)
- Ref: QA requirements (e.g. code coverage, test reports, etc.)
- Source code is downloaded from GitHub via a GitHub SSH key (which is stored and managed as a secret / credential)
Build Server Requirements
- The underlying file system must be encrypted
- For EC2 instances: The latest AMI (<= 24 hours old). This is how we patch the OS, stay current with base software like Splunk, prevent pets, and prove IaC
- For Docker containers: Alpine-based image (exceptions require approval). Use the "latest" tag (to automatically keep up with security patches)
- Add the bare minimum set of software required to build your service. Use latest versions (to keep up with security patches)
- Run as an IAM role with the least privileges necessary to complete the build. Build servers do not directly have the power to affect production; they must assume a special production-deployment role in order to deploy artifacts (see the sketch after this list)
- Human access to the build system must be tied to corp auth (typically SAML)
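A sketch (assuming CodeBuild; Jenkins would be analogous) of the least-privilege split described above: the build role itself has no production permissions and can only assume a dedicated production-deployment role. The my-service-build and my-service-prod-deploy names are placeholders, and the deploy role (with a trust policy back to this role) plus the normal build-time permissions (ECR push, logs, artifact buckets) are assumed to be defined elsewhere in the same template.

```yaml
# Hypothetical CloudFormation snippet -- role names are placeholders
Resources:
  BuildRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: my-service-build
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: codebuild.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: assume-prod-deploy-only
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              # No direct production permissions here -- the only path to prod
              # is assuming the dedicated production-deployment role
              - Effect: Allow
                Action: sts:AssumeRole
                Resource: !Sub arn:aws:iam::${AWS::AccountId}:role/my-service-prod-deploy
```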
Artifact Image Requirements
- Must be a container. EC2 instances are no longer supported.
- Install the minimum set of software required for the service to run
- Do not leave compilers installed in the image
- Where possible, leave interpreted-language runtimes (e.g. Python, Ruby, Perl) out of the image
Container Instance Requirements
- Run as an IAM role with the least privileges necessary for the service to bootstrap and run
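A minimal sketch of how this looks in an ECS task definition: the task role carries only the permissions the service needs at runtime, while the execution role is limited to pulling the image and writing logs. The two role resources are assumed to be defined elsewhere in the same template, and the image, port, and memory values are placeholders.

```yaml
# Hypothetical CloudFormation snippet -- roles, image, and sizes are placeholders
Parameters:
  ImageUri:
    Type: String   # full ECR image URI produced by the build, passed in at deploy time

Resources:
  TaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: my-service
      TaskRoleArn: !GetAtt ServiceTaskRole.Arn         # runtime permissions only (e.g. its own Parameter Store path)
      ExecutionRoleArn: !GetAtt TaskExecutionRole.Arn  # ECR pull + CloudWatch Logs only
      ContainerDefinitions:
        - Name: my-service
          Image: !Ref ImageUri
          Memory: 512
          PortMappings:
            - ContainerPort: 8080
```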
Secret Information
- No use of static AWS API keys (use IAM roles instead)
- Passwords and other secrets must be stored either as Jenkins credentials or as secrets in AWS Parameter Store
- Adhere to the least-access principle here (see the sketch after this list)
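A sketch of least access applied to Parameter Store: the service's task role (assumed to be defined elsewhere in the same template) may read only its own parameter path. The /my-service/* path is a placeholder; if the parameters are encrypted with a customer-managed KMS key, kms:Decrypt on that key would also be required.

```yaml
# Hypothetical CloudFormation snippet -- the parameter path is a placeholder
Resources:
  ReadOwnParametersPolicy:
    Type: AWS::IAM::Policy
    Properties:
      PolicyName: read-own-parameters
      Roles:
        - !Ref ServiceTaskRole
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          # The service can read only secrets under its own path, nothing else
          - Effect: Allow
            Action:
              - ssm:GetParameter
              - ssm:GetParametersByPath
            Resource: !Sub arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter/my-service/*
```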
Artifacts
- Either a container image in ECR or appropriate binary/binaries in nexus1
- ECR repositories are owned by their services
- The repository used for deployment must be read-only for everyone except the build system
- Only the build system can publish artifacts to that area; developers must not have push access (otherwise a developer could push an image straight to prod). See the sketch after this list
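One way (a sketch, not the only option) to enforce the read-only rule: an ECR repository policy that denies image pushes from any principal other than the build role. The repository and role names are placeholders.

```yaml
# Hypothetical CloudFormation snippet -- repository and role names are placeholders
Resources:
  ServiceRepository:
    Type: AWS::ECR::Repository
    Properties:
      RepositoryName: my-service
      RepositoryPolicyText:
        Version: "2012-10-17"
        Statement:
          # Pulls are governed by normal IAM; pushes are denied for everyone
          # except the build system's role
          - Sid: DenyPushExceptBuildRole
            Effect: Deny
            Principal: "*"
            Action:
              - ecr:PutImage
              - ecr:InitiateLayerUpload
              - ecr:UploadLayerPart
              - ecr:CompleteLayerUpload
            Condition:
              StringNotLike:
                aws:PrincipalArn: !Sub arn:aws:iam::${AWS::AccountId}:role/my-service-build
```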
Deployment
- Rolling upgrades with rollback on failure (driven by a shallow health check); see the sketch after this list
- Gated deployments: Deploy to a pre-prod environment, perform a fast (<60 seconds) smoke test, then deploy to prod
- Currently, the deployment artifact must be a container. Minor exceptions are allowed (such as scheduled lambdas)
- Containers must run in an existing ECS cluster (typically the global cluster)
- Deployment code must be a CloudFormation template (which contains all service-specific resources such as the ECS service, lambdas, KMS keys, IAM roles, container configuration, load balancer configuration, etc.)
- It must also include the construction of dependent services like RDS and manage the creation of passwords, storing them as secrets and allowing the depending service to retrieve them
- Only builds from master that have passed all gates can be promoted and pushed to pre-prod and then prod. Branch builds don't get deployed anywhere (today), but their artifacts can be pushed (under a different name) so they can be used by others
- BODM: Build Once, Deploy Many
- The artifact that was built and tested must be the same artifact deployed to prod. In other words, don't build, deploy, and test develop, then merge to master and build, test, and deploy that. When something is "accepted" (by automatic or manual gates), that same artifact (in our case, a container image) must be the thing that's deployed to production. Otherwise you're not deploying what you tested.
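A sketch of the rolling-upgrade and rollback-on-failure requirement for an ECS service, using the ECS deployment circuit breaker and a shallow target-group health check. The cluster name, port, desired count, and /healthcheck path are placeholders, and the task definition is assumed to be the one sketched earlier.

```yaml
# Hypothetical CloudFormation snippet -- cluster, port, and health check path are placeholders
Parameters:
  VpcId:
    Type: AWS::EC2::VPC::Id

Resources:
  Service:
    Type: AWS::ECS::Service
    Properties:
      Cluster: global-cluster              # the existing shared ECS cluster
      TaskDefinition: !Ref TaskDefinition  # task definition defined elsewhere in the template
      DesiredCount: 2
      DeploymentConfiguration:
        MinimumHealthyPercent: 100         # keep full capacity while rolling
        MaximumPercent: 200
        DeploymentCircuitBreaker:
          Enable: true
          Rollback: true                   # roll back automatically if new tasks fail the health check
      LoadBalancers:
        - ContainerName: my-service
          ContainerPort: 8080
          TargetGroupArn: !Ref TargetGroup

  TargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      VpcId: !Ref VpcId
      Port: 8080
      Protocol: HTTP
      HealthCheckPath: /healthcheck        # the shallow health check used to judge the rollout
```

The same template, with the same image URI, would be deployed first to pre-prod, smoke-tested, and then deployed to prod, which is what keeps Build Once, Deploy Many honest.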
Feedback
- Team must have insight into every build (status, reason, logs)
Operability
- The build system SLA must at minimum match the SLA of the services it deploys
- e.g. if the service is in production with a 24x7x365 99.99% SLA, then the build system is expected to match, so that patches to production aren't affected by build server downtime
- Similarly, monitoring needs to be in place to ensure this is met and that the right people are notified if an outage occurs.