A Developer’s Journey to the Cloud 7: Advanced Database Scaling
After conquering servers with Kubernetes and automating infrastructure with Terraform, I thought I had reached peak scalability. I was wrong.
I had finally reached the summit. Or so I thought. My infrastructure was now code, managed by Terraform. My application was an orchestra, conducted by Kubernetes. I could scale my web servers from three to thirty with a single line change in a YAML file. I could rebuild my entire cloud environment from scratch before my coffee got cold. I had achieved true automation. The system felt less like a fragile application and more like a living, resilient organism.
As more users flocked to the app, I would watch with pride as my Kubernetes cluster effortlessly added more containers to meet the demand. The load balancer distributed the traffic perfectly. The cache absorbed the read-heavy requests. It was a beautiful, self-healing, scalable machine.
But then, the alerts started again. Not the loud, obvious "server down" alerts of my early days, but something more subtle:
P99 Latency Increased
Database Connection Saturation
The application wasn't crashing, but it was getting… sluggish. Pages would occasionally take seconds to load instead of milliseconds. Users were complaining about timeouts during peak hours.
My powerful, infinitely scalable army of application servers was running perfectly. The problem wasn't with the army; it was with what the army was trying to do.
The Bottleneck at the End of the Road
I dove into the monitoring tools provided by my cloud provider, specifically the Amazon RDS dashboard. The picture was immediately clear. My Kubernetes pods were barely breaking a sweat. My cache was healthy. The problem was the one component I thought I had solved: my managed database. Its CPU utilization graph was pinned at 100%. The "Database Connections" metric was hitting the maximum limit.
I had built a ten-lane superhighway of application servers, but it all led to a single-lane toll booth. Every new user, every new post, every write operation in my system had to go through this one, increasingly overwhelmed primary database. My managed service was doing its best; it was strong and reliable, but it was still just one machine.
I had scaled everything else, but I had forgotten to scale my source of truth.
Splitting the Workload: Reads vs. Writes
My first thought was to just throw more money at the problem and upgrade to a bigger database instance. But that was vertical scaling, a temporary fix. The real solution had to be horizontal: scaling across more than one machine.
Most applications, including mine, perform far more read operations than write operations. This led me to read replicas.
A read replica is a read-only copy of your primary database, kept in sync by asynchronous replication, so it trails the primary by a small lag. My cloud provider could create one with a few clicks. The plan: direct all SELECT queries to the replica, leaving the primary free for INSERT, UPDATE, and DELETE operations.
This required a change in my application's database connection logic. I had to make the code smart enough to know which database to talk to.
// dbRouter.js
const { Pool } = require('pg');

// Connection pool for the primary (write) database
const writePool = new Pool({
  connectionString: process.env.PRIMARY_DB_URL,
});

// Connection pool for the read replica database
const readPool = new Pool({
  connectionString: process.env.READ_REPLICA_URL,
});

module.exports = {
  // Route writes (INSERT/UPDATE/DELETE) to the primary
  query: (text, params) => writePool.query(text, params),
  // Route reads (SELECT) to the replica
  select: (text, params) => readPool.query(text, params),
};
Throughout the app, I updated calls:
// Before
const { rows } = await db.query('SELECT * FROM users WHERE id = $1', [userId]);
// After
const { rows } = await db.select('SELECT * FROM users WHERE id = $1', [userId]);
// Writes stay the same
await db.query('UPDATE users SET last_login = NOW() WHERE id = $1', [userId]);
Deploying this change was transformative. CPU utilization on the primary database dropped by 80%, connection limits were no longer an issue, and the application was fast again. By offloading reads to the replica, I had roughly doubled my database capacity.
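One caveat worth noting: because replication is asynchronous, a SELECT issued immediately after a write can hit the replica before the change has arrived and return stale data. A common mitigation is to pin a user's reads to the primary for a short window after they write. Here is a minimal sketch of that idea (the grace window and function names are assumptions, not part of my actual router):

```javascript
// Sketch: read-your-writes routing. After a user writes, send their
// reads to the primary for a short window so they never see a replica
// that hasn't caught up yet.
const REPLICATION_GRACE_MS = 1000; // assumed safe upper bound on lag
const lastWriteAt = new Map();     // userId -> timestamp of last write

// Call this whenever a user performs a write
function recordWrite(userId) {
  lastWriteAt.set(userId, Date.now());
}

// True if this user's reads should still go to the primary
function shouldReadFromPrimary(userId) {
  const t = lastWriteAt.get(userId);
  return t !== undefined && Date.now() - t < REPLICATION_GRACE_MS;
}
```

The `select` helper from the router above could consult `shouldReadFromPrimary` before choosing a pool; a real system might instead track replication lag directly or use session-level consistency features of the database.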
The Final Frontier: Sharding
Read replicas solved most immediate problems, but what happens when write traffic alone exceeds a single primary database? The answer: sharding.
Sharding splits data across multiple independent primary databases. On AWS RDS, this means creating two or more smaller RDS instances (shard-1, shard-2, and so on). The application acts as a router, choosing a database based on a shard key, for example user_id.
function getDbForUser(userId) {
  if (userId % 2 === 0) {
    // Even user IDs go to shard 1
    return dbConnectionForShard1;
  } else {
    // Odd user IDs go to shard 2
    return dbConnectionForShard2;
  }
}

// Fetching a user
const userId = 123; // Odd
const db = getDbForUser(userId); // Returns the shard 2 connection
await db.query('SELECT * FROM users WHERE id = $1', [userId]);
With this strategy, write capacity can keep scaling by adding more RDS instances and updating the routing logic, though re-sharding existing data onto new instances is a significant migration in its own right. The database was no longer a single, magical box; it had become a distributed data tier, just like the application tier.
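The even/odd split generalizes naturally to any number of shards. A minimal sketch of modulo routing follows; the shard count is an assumption for illustration, and a production router would more likely use consistent hashing so that adding a shard doesn't remap most existing keys:

```javascript
// Sketch: generalizing even/odd routing to N shards with modulo.
// SHARD_COUNT is a hypothetical value; changing it later means
// migrating rows between instances, which is why consistent hashing
// is often preferred in practice.
const SHARD_COUNT = 4;

// Map a numeric user ID to a shard index
function getShardForUser(userId) {
  return userId % SHARD_COUNT;
}

// userId 123 -> 123 % 4 = 3, so this user's rows live on shard 3
```

The returned index would select one of an array of connection pools, one per RDS instance, in the same way `getDbForUser` picks between the two shard connections above.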

But even with the database healthy, I noticed a new bottleneck. User registration was still slow, not because of the database, but because my processes were synchronous: creating the user, resizing their profile picture, adding them to a mailing list, and sending a welcome email, all while the user waited.
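The problem is easy to see in code. Here is a sketch of that synchronous flow with the helpers stubbed out (all names and timings are hypothetical; in the real app they hit the database, an image service, a mailing-list API, and an SMTP server):

```javascript
// Sketch of the blocking registration flow: every step completes
// before the response is sent, so the user waits for the sum of them.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const createUser = async (data) => ({ id: 1, ...data }); // fast: one INSERT
const resizeProfilePicture = async (id) => sleep(50);    // slow: image processing
const addToMailingList = async (email) => sleep(50);     // slow: third-party API
const sendWelcomeEmail = async (email) => sleep(50);     // slow: SMTP round trip

async function registerUser(data) {
  const user = await createUser(data);
  await resizeProfilePicture(user.id);
  await addToMailingList(user.email);
  await sendWelcomeEmail(user.email);
  return user; // only now does the user see success
}
```

Only the first step truly needs to finish before responding; the rest could be handed off to a background worker, which is exactly where an event-driven approach comes in.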
My architecture had scaled, but my processes hadn’t. That was the next challenge.
next post: A Developer’s Journey to the Cloud 8: Event-Driven Architecture with RabbitMQ.


