Building Scalable Microservices: Lessons from Production
After building microservices architectures for multiple high-traffic applications, I've learned some hard lessons about what works and what doesn't at scale.
The Reality Check
Microservices aren't a silver bullet. They introduce complexity that you need to be prepared to handle:
- Distributed tracing becomes essential
- Network calls are unreliable
- Data consistency is harder
- Deployment coordination gets complex
But when done right, they provide incredible benefits: independent scaling, technology flexibility, and team autonomy.
Pattern 1: API Gateway
Always use an API gateway as the single entry point:
// api-gateway/src/index.ts
import express from "express";
import { createProxyMiddleware } from "http-proxy-middleware";

const app = express();

// Route to user service
app.use("/api/users", createProxyMiddleware({
  target: process.env.USER_SERVICE_URL,
  changeOrigin: true,
  pathRewrite: { "^/api/users": "" }
}));

// Route to order service
app.use("/api/orders", createProxyMiddleware({
  target: process.env.ORDER_SERVICE_URL,
  changeOrigin: true,
  pathRewrite: { "^/api/orders": "" }
}));

// Start the gateway (PORT is an assumed environment variable, defaulting to 3000)
app.listen(Number(process.env.PORT ?? 3000));
Pattern 2: Circuit Breaker
Protect your services from cascading failures:
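import CircuitBreaker from "opossum";

// fetchUserData is assumed to be an async function defined elsewhere
const options = {
  timeout: 3000,                // treat calls slower than 3s as failures
  errorThresholdPercentage: 50, // open the circuit when 50% of calls fail
  resetTimeout: 30000           // after 30s, let a trial call through (half-open)
};

const breaker = new CircuitBreaker(fetchUserData, options);

breaker.on("open", () => {
  console.log("Circuit breaker opened!");
});

// Returned to callers while the circuit is open or when a call fails
breaker.fallback(() => ({
  error: "Service temporarily unavailable"
}));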
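To make the options above less opaque, here is a minimal sketch of the state machine a circuit breaker runs. It is not how opossum is implemented: it trips on consecutive failures rather than an error percentage, it is synchronous for brevity, and the names `SimpleBreaker`, `failureThreshold`, and `resetTimeoutMs` are hypothetical.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Hypothetical, simplified breaker: illustrates the closed/open/half-open cycle only.
class SimpleBreaker<T> {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly action: () => T,
    private readonly fallback: () => T,
    private readonly failureThreshold: number,
    private readonly resetTimeoutMs: number,
    private readonly now: () => number = () => Date.now() // injectable clock for tests
  ) {}

  getState(): BreakerState {
    return this.state;
  }

  call(): T {
    let trial = false;
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.resetTimeoutMs) {
        return this.fallback(); // fail fast without touching the dependency
      }
      this.state = "half-open"; // reset timeout elapsed: allow one trial call
      trial = true;
    }
    try {
      const result = this.action();
      this.failures = 0;
      this.state = "closed"; // success closes the circuit again
      return result;
    } catch (_err) {
      this.failures += 1;
      // A failed trial call, or too many consecutive failures, opens the circuit
      if (trial || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      return this.fallback();
    }
  }
}
```

The point of the open state is that callers get the fallback immediately instead of piling more requests onto a dependency that is already struggling.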
Pattern 3: Event-Driven Communication
Use events for async communication between services:
// publisher.ts
import { EventEmitter } from "events";

// OrderData, OrderCreatedEvent, and saveOrder are assumed to be defined elsewhere
class OrderService extends EventEmitter {
  createOrder(orderData: OrderData) {
    const order = this.saveOrder(orderData);
    // Emit event instead of direct service call
    this.emit("order.created", {
      orderId: order.id,
      userId: order.userId,
      total: order.total
    });
    return order;
  }
}

// subscriber.ts
class InventoryService {
  constructor(orderService: OrderService) {
    // Arrow function keeps `this` bound to the InventoryService instance
    orderService.on("order.created", (event: OrderCreatedEvent) => this.reserveInventory(event));
  }

  async reserveInventory(event: OrderCreatedEvent) {
    // Handle inventory reservation
  }
}
Pattern 4: Database Per Service
Each service should own its data:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ User │ │ Order │ │ Inventory │
│ Service │ │ Service │ │ Service │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Users │ │ Orders │ │ Inventory │
│ DB │ │ DB │ │ DB │
└─────────────┘ └─────────────┘ └─────────────┘
Pattern 5: Health Checks
Implement proper health checks:
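app.get("/health", async (req, res) => {
  // Run the dependency checks in parallel; each helper is assumed to resolve to "ok" when healthy
  const [database, redis, externalApi] = await Promise.all([
    checkDatabase(),
    checkRedis(),
    checkExternalApi()
  ]);

  const health = {
    uptime: process.uptime(),
    timestamp: Date.now(),
    checks: { database, redis, externalApi }
  };

  // Report 503 if any dependency check did not return "ok"
  const isHealthy = Object.values(health.checks)
    .every(check => check === "ok");
  res.status(isHealthy ? 200 : 503).json(health);
});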
Common Pitfalls to Avoid
1. Too Many Microservices
Start with a monolith and extract services once you have clear boundaries.
2. Synchronous Communication
Prefer async communication via message queues when possible.
3. Shared Databases
Never share databases between services—it creates tight coupling.
4. No Monitoring
Invest in observability from day one: logging, metrics, tracing.
Production Checklist
✅ API Gateway configured
✅ Service discovery implemented
✅ Circuit breakers in place
✅ Distributed tracing setup
✅ Centralized logging
✅ Health checks on all services
✅ Automated deployment pipeline
✅ Database backups automated
✅ Secrets management configured
✅ Rate limiting implemented
Conclusion
Microservices are powerful but complex. Make sure you need them before committing to the architecture. If you do go with microservices, invest heavily in observability, automation, and developer tooling.
Your future self will thank you.
Service boundaries: the decision that determines everything
90% of failed microservices migrations get this single decision wrong. Service boundaries should follow business capabilities, not technical layers. "Orders," "Inventory," and "Pricing" are services. "Database," "API gateway," and "Cache" are not.
The bounded-context heuristic from domain-driven design holds up: if two teams need to coordinate on every release, they should probably own one service, not two. If a single team owns three services that always deploy together, they should probably be one service.
The smell tests for a wrong boundary:
- Cross-service joins that pull the same data five different ways
- A "common" library that every service depends on and breaks every service when it changes
- A single endpoint that requires synchronous calls to 4+ services to respond
- Deploys that require coordinating across teams to land safely
Any one of these is a sign you've sliced the system the wrong way. Two or more is a refactor.
The distributed systems failure modes
Microservices replace a single shared-memory failure mode (the monolith crashed) with seven new failure modes you now have to design for:
- Network partitions. A call between services can hang indefinitely. Default every client to aggressive timeouts (1–3s for sync calls).
- Cascading failures. Service A retries against degraded service B and amplifies the load. Circuit breakers (Hystrix-style or per-language equivalents) are non-negotiable.
- Thundering herds. A cache miss causes every replica to call the upstream simultaneously. Single-flight or request coalescing is the fix.
- Idempotency drift. Network retries cause duplicate writes. Every mutating endpoint needs an idempotency key — design it on day one, not after the first incident.
- Schema evolution. A producer ships a field rename; three consumers break. Use a schema registry (Confluent, Apicurio) and enforce backward-compatible changes in CI.
- Distributed transactions. Two-phase commit doesn't work at scale. Use the Saga pattern with explicit compensating actions, or design around eventual consistency.
- Clock skew. Don't rely on cross-service timestamps for ordering. Use monotonic IDs or vector clocks where ordering matters.
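As one illustration of the idempotency bullet above, here is a minimal sketch of an idempotency-key check around a mutating operation. The in-memory `Map`, the `IdempotentOrderWriter` class, and the `Order` shape are assumptions for illustration; a real service would more likely keep the keys in a shared store with an expiry and guard against concurrent requests carrying the same key.

```typescript
interface Order {
  id: number;
  userId: string;
  total: number;
}

// Hypothetical writer: remembers the result of each idempotency key it has seen.
class IdempotentOrderWriter {
  private readonly seen = new Map<string, Order>();
  private nextId = 1;

  createOrder(idempotencyKey: string, userId: string, total: number): Order {
    // A retried request with the same key gets the original result, not a second write
    const previous = this.seen.get(idempotencyKey);
    if (previous !== undefined) {
      return previous;
    }
    const order: Order = { id: this.nextId++, userId, total };
    this.seen.set(idempotencyKey, order);
    return order;
  }

  // Number of orders actually written
  orderCount(): number {
    return this.nextId - 1;
  }
}
```

The client generates the key once per logical operation and sends the same key on every retry, so a timeout followed by a retry cannot create a duplicate order.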
Each of these is well-documented and well-solved. The teams that struggle are the ones discovering them at 2am instead of designing for them at week zero.
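The single-flight idea from the thundering-herd item can be sketched in a few lines. This is an illustrative, in-process version only: the `SingleFlight` class and its `run` method are hypothetical names, and it coalesces concurrent calls within one replica, so each replica may still make one upstream call of its own.

```typescript
// Hypothetical helper: concurrent callers asking for the same key share one in-flight promise.
class SingleFlight<T> {
  private readonly inflight = new Map<string, Promise<T>>();

  run(key: string, loader: () => Promise<T>): Promise<T> {
    const existing = this.inflight.get(key);
    if (existing !== undefined) {
      return existing; // join the call that is already in progress
    }

    // Remove the entry once the call settles so later misses trigger a fresh load
    const pending = loader().then(
      (value) => {
        this.inflight.delete(key);
        return value;
      },
      (error) => {
        this.inflight.delete(key);
        throw error;
      }
    );
    this.inflight.set(key, pending);
    return pending;
  }
}
```

A cache layer would typically call `run` with the cache key on every miss, so a burst of misses for one key results in a single upstream request per process.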
Observability is not optional
A monolith you can attach a debugger to. A microservices system you cannot. Observability is what replaces the debugger — and "observability" specifically means three pillars, not just one:
- Distributed tracing (OpenTelemetry, Jaeger, Honeycomb). Every request gets a trace ID, every service propagates it, and you can see the full call graph for any user-facing latency.
- Structured logs with shared trace IDs. Plaintext logs are write-only at this scale. JSON logs with trace IDs cost the same to emit and are 10x more valuable to query.
- Metrics with high cardinality. Prometheus is the floor; Honeycomb-style wide events are the ceiling. The difference shows up when you need to ask "why is p99 latency high for users in this specific tier on this specific endpoint."
Skip any one of the three and incident response time triples.
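To show what "structured logs with shared trace IDs" can look like in practice, here is a minimal sketch of a JSON log formatter. The `formatLog` function and its field names are assumptions for illustration, not the API of any particular logging library; in a real system the trace ID would come from the tracing context rather than being passed by hand.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

// Hypothetical formatter: one JSON object per log line, always carrying the trace ID.
function formatLog(
  level: LogLevel,
  message: string,
  traceId: string,
  fields: Record<string, string | number | boolean> = {}
): string {
  return JSON.stringify({
    ...fields, // spread first so caller-supplied fields cannot overwrite the core ones
    timestamp: new Date().toISOString(),
    level,
    message,
    traceId
  });
}
```

A handler would then emit something like `console.log(formatLog("info", "order created", traceId, { orderId: 42 }))`, and every service that logs with the same trace ID can be queried together during an incident.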
Deployment and platform requirements
Microservices increase the deploy frequency you need to support, the orchestration surface, and the security perimeter. Practical requirements:
- A container platform. Kubernetes is the default; ECS/Cloud Run are valid for smaller surface areas.
- A service mesh (Istio, Linkerd) once you cross ~20 services. Below that, library-level retries and mTLS via cert-manager are simpler.
- Centralized secret management (Vault, AWS Secrets Manager, Doppler). Per-service .env files do not scale.
- A real CI/CD platform with parallel pipelines, environment promotion, and rollback. GitHub Actions works; Argo CD or Flux for GitOps once you're past 50 deploys/day.
See our AWS architecture and DevOps services for how we structure these platforms in production engagements.
When microservices are wrong
We've helped more teams migrate off premature microservices than onto them. The pattern: a series-A team adopted microservices because their last company had them, ended up with 11 services and 4 engineers, spent 60% of engineering time on platform work, and lost a year.
Microservices are usually wrong when:
- Your team is under 15 engineers
- Your traffic is under 1M requests/day
- Your data model has heavy cross-entity joins
- You don't have dedicated platform/infra capacity
- You're pre-PMF
A modular monolith — clear module boundaries inside a single deployable — gives you 80% of the architectural discipline of microservices with 20% of the operational cost. We default to modular monoliths for MVP and SaaS engagements and only extract services when scaling pressure makes the trade-off worth it.
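As a rough sketch of what "clear module boundaries inside a single deployable" can mean in code: each module exposes a narrow interface, and other modules depend only on that interface, never on the module's internals or tables. The names `InventoryApi`, `InventoryModule`, and `OrdersModule` are hypothetical, and the in-memory stock map stands in for the module's own storage.

```typescript
// The only surface the inventory module exposes to the rest of the codebase
interface InventoryApi {
  reserve(sku: string, quantity: number): boolean;
}

// Inventory internals (here an in-memory map) stay private to the module
class InventoryModule implements InventoryApi {
  private readonly stock: Map<string, number>;

  constructor(initialStock: Array<[string, number]>) {
    this.stock = new Map(initialStock);
  }

  reserve(sku: string, quantity: number): boolean {
    const available = this.stock.get(sku) ?? 0;
    if (quantity <= 0 || available < quantity) {
      return false;
    }
    this.stock.set(sku, available - quantity);
    return true;
  }
}

// Orders depends on the interface, not on InventoryModule or its data
class OrdersModule {
  constructor(private readonly inventory: InventoryApi) {}

  placeOrder(sku: string, quantity: number): "confirmed" | "rejected" {
    return this.inventory.reserve(sku, quantity) ? "confirmed" : "rejected";
  }
}
```

If inventory is later extracted into its own service, the call sites in `OrdersModule` already go through one interface, although that interface would most likely need to become asynchronous once a network hop is involved.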
When microservices are right
Microservices are right when:
- Independent team autonomy is more valuable than deployment coordination
- Different services have radically different scaling profiles (e.g. realtime + batch)
- Different services have different compliance requirements (e.g. PCI scope isolation)
- You have a platform team that owns the substrate
If two of these apply, start planning the extraction. If three or more apply, the migration is overdue.
Want a second opinion on your architecture?
WH Studio runs architecture reviews as a 1–2 week engagement: current-state diagram, failure-mode analysis, prioritized recommendations, and a realistic migration sequence. Start a conversation or browse our IT consulting and API development practices.
Microservices FAQ
At what team size do microservices start to make sense? Roughly 15+ engineers, organized into 3+ teams that need independent release cadences. Below that, a modular monolith ships faster and breaks less often.
Can we mix monolith and microservices? Yes — and most healthy systems do. A modular monolith with 1–3 extracted services (typically the highest-traffic or most-isolated capabilities) is a stable end state, not an awkward middle.
What's the right size for a single service? Big enough that a single team owns it, small enough that one engineer can hold the whole thing in their head. Typically 10K–50K lines of code; below 5K usually means you over-split.
Should every service have its own database? Yes, conceptually — services should not share tables. Physically co-locating multiple service schemas on one Postgres instance for cost reasons is fine in early stages, as long as the access boundary is enforced in code.
