xRunBooks for SRE

  • AWS Add Lifecycle Policy to S3 Buckets: Attaching lifecycle policies to AWS S3 buckets automates the management of object lifecycles in your storage buckets. A lifecycle policy defines rules that determine the actions taken on objects based on their age or other criteria. These actions include transitioning objects to different storage classes, such as moving infrequently accessed data to lower-cost storage tiers or archiving it to Glacier, and setting expiration dates for objects. This lets you optimize storage costs, manage data retention efficiently, and comply with regulatory requirements or business policies regarding data expiration. This runbook finds all buckets without a lifecycle policy and attaches one to them.

  • AWS Change AWS EBS Volume To GP3 Type: This runbook can be used to change the type of an EBS volume to GP3 (General Purpose 3). GP3 volumes have a number of advantages over their predecessors and are ideal for a wide variety of applications that require high performance at low cost.

  • AWS Change AWS Route53 TTL: For a record in a hosted zone, a lower TTL means that more queries arrive at the name servers because the cached values expire sooner. If you configure a higher TTL for your records, the intermediate resolvers cache the records for a longer time. As a result, the name servers receive fewer queries, which reduces the charges for the DNS queries answered. However, a higher TTL slows the propagation of record changes because the previous values are cached for longer periods. This runbook can be used to configure a higher TTL value.

  • AWS Create IAM User with policy: Creates a new IAM user with a security policy and sends a confirmation to Slack.

  • AWS Delete EBS Volume Attached to Stopped Instances: EBS (Elastic Block Store) volumes are attached to EC2 instances as storage devices. Unused EBS volumes keep accruing costs even when the EC2 instances they are attached to are no longer running. These volumes should be deleted if the instances they are attached to are no longer required. This runbook finds such volumes and deletes them.

  • AWS Delete EBS Volume With Low Usage: This runbook identifies low-usage Amazon Elastic Block Store (EBS) volumes and deletes them in order to lower your AWS bill. Usage is calculated using the VolumeUsage metric, which measures the percentage of the total storage space that is currently being used by an EBS volume. This metric is reported as a percentage value between 0 and 100.

  • AWS Delete ECS Clusters with Low CPU Utilization: ECS is a managed service that allows users to run Docker containers on AWS, making it easier to manage and scale containerized applications. However, running ECS clusters with low CPU utilization can result in wasted resources and unnecessary costs. AWS charges for the resources allocated to a cluster, regardless of whether they are fully utilized. By deleting clusters that are not being fully utilized, you can reduce the number of resources being allocated and lower the overall cost of running ECS. Deleting unused or low-utilization clusters can also free up resources for other applications that require more processing power. This runbook identifies such clusters and deletes them.

  • AWS Delete AWS ELBs With No Targets Or Instances: ELBs are used to distribute incoming traffic across multiple targets or instances, but if those targets or instances are no longer in use, the ELBs may be unnecessary and can be deleted to save costs. Deleting ELBs with no targets or instances is a simple but effective way to optimize costs in your AWS environment. By identifying and removing these unused ELBs, you can reduce the number of resources you are paying for and avoid unnecessary charges. This runbook helps you identify ELBs of all types (Network, Application, and Classic) that don't have any target groups or instances attached to them.

  • AWS Delete IAM profile: This runbook is the inverse of Create IAM user with profile: it removes the profile, the login, and then the IAM user itself.

  • AWS Delete Old EBS Snapshots: Amazon Elastic Block Store (EBS) snapshots are created incrementally. The initial snapshot includes all the data on the disk, and subsequent snapshots store only the blocks on the volume that have changed since the prior snapshot. Unchanged data is not stored again, but referenced from the previous snapshot. This runbook finds and deletes old EBS snapshots, thereby lowering storage costs.

  • AWS Delete RDS Instances with Low CPU Utilization: Deleting RDS instances with low CPU utilization is a cost optimization strategy that involves identifying RDS instances with consistently low CPU usage and deleting them to save costs. This approach helps to eliminate unnecessary costs associated with running idle database instances that are not being fully utilized. This runbook helps us to find and delete such instances.

  • AWS Delete Unattached AWS EBS Volumes: This runbook can be used to delete all unattached EBS Volumes within an AWS region. You can delete an Amazon EBS volume that you no longer need. After deletion, its data is gone and the volume can't be attached to any instance. So before deletion, you can store a snapshot of the volume, which you can use to re-create the volume later.

  • AWS Delete Unused AWS Secrets: This runbook can be used to delete unused secrets in AWS.

  • AWS Delete Unused AWS Log Streams: CloudWatch retains empty log streams after the data retention period has passed. Those log streams should be deleted in order to save costs. This runbook finds log streams that have been unused for more than a threshold number of days and helps you delete them.

  • AWS Delete Unused NAT Gateways: This runbook searches for unused NAT gateways in all regions and deletes them.

  • AWS Delete Unused Route53 HealthChecks: When we associate health checks with an endpoint, Amazon Route53 sends health check requests to the endpoint IP address. These health checks validate that the endpoint IP addresses are operating as intended. There may be multiple reasons that health checks are left unused. For example: a health check was mistakenly configured against your application by another customer; a health check was configured from your account for testing purposes but wasn't deleted when testing was complete; a health check was based on domain names, so requests were sent due to DNS caching; or the Elastic Load Balancing service updated its public IP addresses due to scaling, and the IP addresses were reassigned to your load balancer. This runbook finds such health checks and deletes them to save AWS costs.

  • AWS AWS Detach EC2 Instance from ASG: This runbook can be used to detach an instance from an Auto Scaling group. You can remove (detach) an instance that is in the InService state from an Auto Scaling group. After the instance is detached, you can manage it independently from the rest of the Auto Scaling group. By detaching an instance, you can move it out of one Auto Scaling group and attach it to a different group. For more information, see Attach EC2 instances to your Auto Scaling group.

  • AWS AWS EC2 Disk Cleanup: This runbook locates large files in an EC2 instance and backs them up to a given S3 bucket. Afterwards, it deletes the backed-up files and sends a message to a specified Slack channel. It uses SSH and Linux commands to perform these steps.

  • AWS AWS Ensure Redshift Clusters have Paused Resume Enabled: This runbook finds Redshift clusters that don't have pause and resume enabled and schedules pause and resume for those clusters.

  • AWS AWS Get unhealthy EC2 instances from ELB: This runbook can be used to list unhealthy EC2 instances from an ELB. Sometimes it is difficult to determine from Activity History alone why Amazon EC2 Auto Scaling didn't terminate an unhealthy instance. You can find further details about an unhealthy instance's state, and how to terminate that instance, by checking a few extra things.

  • AWS List unused Amazon EC2 key pairs: This runbook finds all EC2 key pairs that are not used by an EC2 instance and notifies a Slack channel about them. Optionally, it can delete the key pairs based on user configuration.

  • AWS Release Unattached AWS Elastic IPs: A disassociated Elastic IP address remains allocated to your account until you explicitly release it. AWS imposes a small hourly charge for Elastic IP addresses that are not associated with a running instance. This runbook can be used to release those unattached AWS Elastic IP addresses.

  • AWS AWS Restart unhealthy services in a Target Group: This runbook restarts unhealthy services in a target group. The restart command is provided via a tag attached to the instance.

  • AWS Copy AMI to All Given AWS Regions: This runbook can be used to copy an AMI from one region to multiple AWS regions using unSkript legos with AWS CLI commands. The list of available regions is also retrieved using AWS CLI commands.

  • AWS Delete Unused AWS NAT Gateways: This runbook can be used to identify and remove any unused NAT Gateways. This allows us to adhere to best practices and avoid unnecessary costs. NAT gateways are used to connect a private instance with outside networks. When a NAT gateway is provisioned, AWS charges you based on the number of hours it was available and the data (GB) it processes.

  • AWS Detach EC2 Instance from ASG: This runbook can be used to detach an instance from an Auto Scaling group. You can remove (detach) an instance that is in the InService state from an Auto Scaling group. After the instance is detached, you can manage it independently from the rest of the Auto Scaling group. By detaching an instance, you can move it out of one Auto Scaling group and attach it to a different group. For more information, see Attach EC2 instances to your Auto Scaling group.

  • AWS Detect ECS failed deployment: This runbook checks whether there is a failed deployment in progress for a service in an ECS cluster. If it finds one, it sends the list of stopped tasks associated with this deployment, along with their stopped reasons, to Slack.

  • AWS Enforce Mandatory Tags Across All AWS Resources: This runbook can be used to enforce mandatory tags across all AWS resources. It gets all the untagged resources in the given region, discovers the tag keys used in that region, and attaches the mandatory tags to all the untagged resources.

  • AWS Handle AWS EC2 Instance Scheduled to retire: To avoid unexpected interruptions, it's good practice to check whether any EC2 instances are scheduled to retire. This runbook can be used to list the EC2 instances that are scheduled to retire. To handle the instance retirement, the user can stop and start the instance before the retirement date. That action moves the instance over to a more stable host.

  • AWS Monitor AWS DynamoDB provision capacity: This runbook can be used to collect CloudWatch data on the provisioned capacity of AWS DynamoDB.

  • AWS Resize EBS Volume: This runbook resizes the EBS volume to a specified size. It can be attached to disk-usage-related CloudWatch alarms to do the appropriate resizing. It also extends the filesystem to use the new volume size.

  • AWS Resize list of pvcs.: This runbook can be used to resize a list of PVCs in a namespace. By default, it resizes all PVCs in the namespace.

  • AWS Resize PVC: This runbook resizes the PVC to the input size.

  • AWS Restart AWS EC2 Instances: This runbook can be used to restart AWS EC2 instances.

  • AWS Launch AWS EC2 from AMI: This runbook can be used to launch an AWS EC2 instance from an AMI in the given region.

  • AWS Troubleshooting Your EC2 Configuration in a Private Subnet: This runbook can be used to troubleshoot EC2 instance configuration in a private subnet. It captures the VPC ID for a given instance ID, uses the VPC ID to get the Internet Gateway details, and then tries to SSH into the instance and connect to the internet.

  • Jenkins Fetch Jenkins Build Logs: This runbook fetches the logs for a given Jenkins job and posts them to a Slack channel.

  • Jira Jira Visualize Issue Time to Resolution: This runbook uses the Panel library to visualize the time it takes for issues to close over a specific timeframe.

  • Kubernetes k8s: Delete Evicted Pods From All Namespaces: This runbook shows and deletes evicted pods. If the user provides a namespace as input, it only collects pods from that namespace; otherwise, it selects pods from all namespaces.

  • Kubernetes k8s: Get kube system config map: This runbook fetches the kube system config map for a k8s cluster and publishes the information on a Slack channel.

  • Kubernetes k8s: Get candidate nodes for given configuration: This runbook gets the nodes in a k8s cluster that match a given configuration (storage, cpu, memory, pod_limit).

  • Kubernetes Kubernetes Log Healthcheck: This runbook checks the logs of every pod in a namespace for warning messages.

  • Kubernetes k8s: Pod Stuck in CrashLoopBackoff State: This runbook checks whether any Pods are in the CrashLoopBackoff state in a given k8s namespace. If it finds any, it tries to determine why each Pod is in that state.

  • Kubernetes k8s: Pod Stuck in ImagePullBackOff State: This runbook checks whether any Pods are in the ImagePullBackOff state in a given k8s namespace. If it finds any, it tries to determine why each Pod is in that state.

  • Kubernetes k8s: Pod Stuck in Terminating State: This runbook checks whether any Pods are in the Terminating state in a given k8s namespace. If it finds any, it tries to recover them by resetting the finalizers of each Pod.

  • Kubernetes k8s: Resize List of PVCs: This runbook resizes a list of Kubernetes PVCs.

  • Kubernetes k8s: Resize PVC: This runbook resizes a Kubernetes PVC.

  • Kubernetes Rollback Kubernetes Deployment: This runbook can be used to roll back a Kubernetes Deployment.

  • Postgresql Display long running queries in a PostgreSQL database: This runbook collects the long-running queries from a database and sends a message to the specified Slack channel. Poorly optimized queries and excessive connections can cause problems in PostgreSQL, impacting upstream services.
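
Several of the AWS cleanup runbooks above share a common first step: filter a list of resources down to the ones that are unused or too old. The following is a minimal Python sketch of that filtering step, not the runbooks' actual implementation. It assumes the input dictionaries have the same shape as the "Volumes" and "Snapshots" lists returned by the EC2 DescribeVolumes and DescribeSnapshots APIs; the function names and the sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_unattached_volumes(volumes):
    """Return the IDs of volumes that are not attached to any instance.

    Assumes each item looks like an entry of the "Volumes" list from
    DescribeVolumes: a dict with "VolumeId", "State", and "Attachments".
    """
    return [
        v["VolumeId"]
        for v in volumes
        if v.get("State") == "available" and not v.get("Attachments")
    ]


def find_old_snapshots(snapshots, max_age_days, now=None):
    """Return the IDs of snapshots started more than max_age_days ago.

    Assumes each item has "SnapshotId" and a timezone-aware "StartTime",
    as in the "Snapshots" list from DescribeSnapshots.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [s["SnapshotId"] for s in snapshots if s["StartTime"] < cutoff]


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample_volumes = [
        {"VolumeId": "vol-0aaa", "State": "available", "Attachments": []},
        {
            "VolumeId": "vol-0bbb",
            "State": "in-use",
            "Attachments": [{"InstanceId": "i-0123"}],
        },
    ]
    print(find_unattached_volumes(sample_volumes))
```

With boto3, the input could come from `boto3.client("ec2").describe_volumes()["Volumes"]`, keeping in mind that large accounts return paginated results. Separating the filtering logic from the API calls and the delete step makes it possible to review the candidate list before anything is removed.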
