The objective of this Finder is to identify the ElastiCache for Redis clusters that are eligible for the autoscaling policy that AWS recently introduced. Underutilized clusters can be scaled in with this policy to reduce cost.
How does it work?
This Finder is part of the CloudFix product and reports the savings available on the AWS accounts that you have registered in CloudFix. After one or more AWS accounts are registered in CloudFix:
- The Finder identifies clusters that are eligible for autoscaling policy based on conditions provided by AWS.
- The Finder determines the scaling dimension and calculates the target metric threshold.
The Finder addresses cluster underutilization by handling the following areas:
- Choosing Clusters and Dimensions - Selecting the ElastiCache clusters and the dimension (shard/replica) to which the auto scaling policy will be added.
- Replica Auto Scaling - Setting the auto scaling policy for scaling in/out the number of replicas.
- Shard Auto Scaling - Setting the auto scaling policies for scaling in/out the number of shards.
- Threshold Calculation - Calculating the target metric threshold that determines when scale in/out happens.
Choosing Clusters and Dimensions
To choose the clusters and dimensions, CloudFix follows the eligibility criteria defined by AWS itself:
- Redis (cluster mode enabled) clusters running Redis engine version 6.0 onwards
- Instance type families - R5, R6g, M5, M6g
- Instance sizes - Large, XLarge, 2XLarge
- Auto Scaling in ElastiCache for Redis is not supported for clusters running in Global datastores, Outposts or Local Zones.
- AWS Auto Scaling for ElastiCache for Redis is available in the following regions: US East (N. Virginia), EU (Ireland), Asia Pacific (Mumbai) and South America (Sao Paulo).
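The criteria above can be sketched as a single eligibility check. This is a minimal illustration, not CloudFix's actual code; the function name, the field names in the cluster dict, and the region identifiers (mapped from the region display names listed above) are our own assumptions.

```python
# Hypothetical eligibility check based on the AWS criteria listed above.
# Field names (cluster_mode_enabled, engine_version, node_type, region, ...)
# are illustrative, not CloudFix's actual data model.

ELIGIBLE_FAMILIES = {"r5", "r6g", "m5", "m6g"}
ELIGIBLE_SIZES = {"large", "xlarge", "2xlarge"}
# us-east-1 = N. Virginia, eu-west-1 = Ireland,
# ap-south-1 = Mumbai, sa-east-1 = Sao Paulo
ELIGIBLE_REGIONS = {"us-east-1", "eu-west-1", "ap-south-1", "sa-east-1"}

def is_eligible(cluster: dict) -> bool:
    """Apply AWS's published eligibility criteria for ElastiCache auto scaling."""
    if not cluster.get("cluster_mode_enabled"):
        return False
    # Redis engine version must be 6.0 onwards.
    major = int(cluster["engine_version"].split(".")[0])
    if major < 6:
        return False
    # Node type looks like "cache.r6g.xlarge" -> family "r6g", size "xlarge".
    _, family, size = cluster["node_type"].split(".")
    if family not in ELIGIBLE_FAMILIES or size not in ELIGIBLE_SIZES:
        return False
    # Global datastores, Outposts, and Local Zones are unsupported.
    if cluster.get("global_datastore") or cluster.get("outpost") or cluster.get("local_zone"):
        return False
    return cluster["region"] in ELIGIBLE_REGIONS
```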
The following decisions are considered in order to make the choice:
- ElastiCache instances that don’t have ‘cluster mode’ enabled are skipped because the Redis SDK in use needs to support clustering (AWS Blog post).
- ElastiCache instances that are not eligible for the AWS auto scaling policy are skipped because modifying them could lead to downtime.
- ElastiCache clusters that have a scheduled auto scaling policy attached are skipped because CloudFix lacks the historical metrics needed to develop a proper fixed schedule for scaling the clusters.
- If a cluster already has a scaling policy defined, CloudFix compares the policy it has built from metrics to the existing one. If the existing policy uses a different dimension, CloudFix removes it and installs its own policy. If the existing policy uses the same dimension with a higher or equal threshold and a lower or equal number of shards/replicas, it is kept; otherwise CloudFix replaces it with its own policy to reduce resource usage.
- To keep things simple, CloudFix always creates an auto scaling policy regardless of current usage. This can help save cost if cluster load decreases in the future.
- Since Redis has two scaling dimensions (number of replicas and number of shards), CloudFix uses the dimension with the most scale-in potential, as it scales in the cluster under light load and scales it back to the initial configuration under high load.
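The keep-or-replace rule for an existing policy can be sketched as a small comparison function. This is an illustrative reading of the rule described above, with hypothetical field names, not CloudFix's actual implementation.

```python
# Hypothetical sketch of the policy-comparison rule described above.
# The keys "dimension", "threshold", and "min_capacity" are illustrative.

def should_replace(existing: dict, candidate: dict) -> bool:
    """Return True if CloudFix should replace the cluster's existing policy."""
    # A policy on a different dimension is always removed and replaced.
    if existing["dimension"] != candidate["dimension"]:
        return True
    # Same dimension: keep the existing policy only if it is at least as
    # frugal -- a higher-or-equal threshold and a lower-or-equal minimum
    # number of shards/replicas.
    keep = (existing["threshold"] >= candidate["threshold"]
            and existing["min_capacity"] <= candidate["min_capacity"])
    return not keep
```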
Replica Auto Scaling
A Redis replication group consists of a single primary node, which the application can both read from and write to, and from 1 to 5 read-only replica nodes. Whenever data is written to the primary node, it is also asynchronously updated on the read replica nodes. CloudFix considers the following when auto-scaling replicas:
- CloudFix doesn’t create a replica auto scaling policy if no replicas are configured in the cluster because we can’t save any costs by adding a policy if the number of replicas is already 0.
- The number of replicas is never reduced below 1 replica and 1 primary, in line with AWS requirements. Note: CloudFix doesn’t reduce the number of replicas directly, but rather creates a policy that defines the min and max numbers.
- To scale replicas, CloudFix uses AWS’s predefined ElastiCacheReplicaEngineCPUUtilization metric, which simplifies maintenance.
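A replica policy of this shape would be expressed through the AWS Application Auto Scaling API. The sketch below only builds the request parameters; in practice they would be passed to boto3's `application-autoscaling` client (`register_scalable_target` and `put_scaling_policy`). The function and policy names are illustrative assumptions.

```python
# A minimal sketch of a replica target-tracking policy, assuming the AWS
# Application Auto Scaling API. Only the request parameters are built here;
# they would normally be passed to boto3's register_scalable_target and
# put_scaling_policy calls.

def replica_scaling_requests(replication_group_id: str,
                             target_value: float,
                             max_replicas: int) -> tuple:
    resource_id = f"replication-group/{replication_group_id}"
    dimension = "elasticache:replication-group:Replicas"
    target = {
        "ServiceNamespace": "elasticache",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": 1,  # never fewer than 1 replica (plus the primary)
        "MaxCapacity": max_replicas,
    }
    policy = {
        "PolicyName": f"{replication_group_id}-replica-scaling",  # illustrative name
        "ServiceNamespace": "elasticache",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_value,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ElastiCacheReplicaEngineCPUUtilization",
            },
        },
    }
    return target, policy
```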
Shard Auto Scaling
ElastiCache for Redis with Cluster Mode Enabled works by spreading the cache key space across multiple shards, so the data and the read/write access to it are spread across multiple Redis nodes. Spreading the load over a greater number of nodes both enhances availability and reduces bottlenecks during periods of peak demand, while providing more memory space than a single node could offer. Using online resharding and shard rebalancing, a Redis cluster can be scaled dynamically with no downtime (AWS source). CloudFix considers the following points for this implementation:
- The policy that CloudFix defines considers the cluster’s current number of shards. If the number of shards is 1, CloudFix doesn’t create a shard auto scaling policy at all.
- The pre-defined AWS metrics used in this case are ElastiCachePrimaryEngineCPUUtilization and ElastiCacheDatabaseMemoryUsageCountedForEvictPercentage. This simplifies maintenance for CloudFix and avoids any fine-tuning, as AWS handles that.
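Scaling shards on both predefined metrics would mean one target-tracking policy per metric on the NodeGroups dimension. The sketch below builds those two request dicts under the assumption that each would be passed to boto3's `put_scaling_policy`; the function and policy names are our own.

```python
# A minimal sketch of shard (NodeGroups) target-tracking policies, assuming
# the AWS Application Auto Scaling API. Policy names are illustrative.

def shard_scaling_policies(replication_group_id: str,
                           cpu_target: float,
                           memory_target: float) -> list:
    base = {
        "ServiceNamespace": "elasticache",
        "ResourceId": f"replication-group/{replication_group_id}",
        "ScalableDimension": "elasticache:replication-group:NodeGroups",
        "PolicyType": "TargetTrackingScaling",
    }
    metrics = [
        ("ElastiCachePrimaryEngineCPUUtilization", cpu_target),
        ("ElastiCacheDatabaseMemoryUsageCountedForEvictPercentage", memory_target),
    ]
    policies = []
    for metric, target in metrics:
        policies.append({
            **base,
            "PolicyName": f"{replication_group_id}-{metric}",  # illustrative name
            "TargetTrackingScalingPolicyConfiguration": {
                "TargetValue": target,
                "PredefinedMetricSpecification": {"PredefinedMetricType": metric},
            },
        })
    return policies
```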
Threshold Calculation
CloudFix replays past utilization to detect and adjust to sharp increases in load. It calculates the threshold and the minimum number of shards/replicas based on metric dynamics over the past 4 weeks, in line with the AWS recommendation (refer to the Defining Target Value point in the Auto Scaling ElastiCache for Redis clusters AWS documentation).
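One way to picture the "replay" idea is to check, for each candidate minimum capacity, whether the past four weeks of utilization would have stayed under the target threshold at that capacity. This is purely our own illustration: the linear scaling assumption and the function shape are simplifications, not CloudFix's documented algorithm.

```python
# Purely illustrative sketch of replaying past utilization. Given utilization
# samples (percent) observed at the current shard/replica count, find the
# smallest capacity whose replayed peak stays under the target threshold.
# The assumption that utilization scales inversely with capacity is ours.

def min_capacity_for(samples: list, current_capacity: int,
                     target_value: float) -> int:
    """Smallest capacity whose replayed peak utilization stays under target."""
    peak = max(samples)  # peak percent utilization at the current capacity
    for capacity in range(1, current_capacity + 1):
        # Replay: estimate what utilization would have been at this capacity.
        replayed_peak = peak * current_capacity / capacity
        if replayed_peak <= target_value:
            return capacity
    return current_capacity  # no smaller capacity would have stayed under target
```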