Containers revolutionized how we deploy applications, but the abstraction has a cost. A poorly configured container can turn a performant application into a bottleneck. This article looks at how to optimize performance in containerized environments.
A container is not a virtual machine. Optimizing as if it were is the first mistake.
The Impact of Containers on Performance
Real overhead
Native application: 100 RPS baseline
Docker application: 95-98 RPS (2-5% overhead)
K8s application: 90-95 RPS (5-10% overhead)
The overhead is small, but configuration mistakes amplify it drastically:
Misconfigured container: 50-70 RPS
→ The problem isn't the container, it's the configuration
Sources of overhead
- Network: overlay networks, NAT, iptables
- Storage: copy-on-write, volumes
- CPU: cgroups, scheduling
- Memory: limits, OOM killer
Configuring Resources Correctly
CPU: requests vs limits
resources:
  requests:
    cpu: "500m"    # Guaranteed by the scheduler
  limits:
    cpu: "1000m"   # Maximum allowed
Common mistakes:
# ❌ Limit too low - constant throttling
limits:
  cpu: "100m"

# ❌ No limit - the container can monopolize the node
limits:
  cpu: null   # equivalent to omitting the limit

# ❌ High request equal to the limit - reserves CPU that mostly sits idle
requests:
  cpu: "2000m"
limits:
  cpu: "2000m"
Recommended configuration:
# ✅ Request based on average usage, limit for peaks
resources:
  requests:
    cpu: "250m"    # Observed average usage
  limits:
    cpu: "1000m"   # Headroom for peaks
CPU Throttling
When a container reaches its CPU limit, it suffers throttling:
No throttling: p99 latency = 50ms
With throttling: p99 latency = 500ms (10x worse)
How to detect:
# Container metrics
cat /sys/fs/cgroup/cpu/cpu.stat
# nr_throttled: number of times throttled
# throttled_time: total time in nanoseconds
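# On cgroup v2, the same counters live in /sys/fs/cgroup/cpu.stat
# (nr_throttled, throttled_usec in microseconds)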
Prometheus query:
rate(container_cpu_cfs_throttled_seconds_total[5m])
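If that query shows sustained throttling, it is worth alerting on it. A minimal sketch of a Prometheus alerting rule, assuming cAdvisor metrics are scraped; the 25% threshold and the names are illustrative:
groups:
  - name: container-cpu
    rules:
      - alert: CPUThrottlingHigh
        # Fraction of CFS periods in which the container was throttled
        expr: |
          rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])
            / rate(container_cpu_cfs_periods_total{container!=""}[5m]) > 0.25
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} is being CPU-throttled"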
Memory: the delicate balance
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "1Gi"
The OOM Killer problem:
Memory usage > limit → OOM kill → the container is killed and restarted
→ Connections drop, in-flight requests fail
Safe configuration:
# ✅ Leave headroom above normal usage
resources:
  requests:
    memory: "512Mi"   # Average usage
  limits:
    memory: "768Mi"   # +50% headroom over the request
JVM in Containers
Older JVMs don't respect container limits:
# Container with a 1GB limit
# An old JVM sees the host's 64GB instead
# and sizes a 16GB heap → instant OOM kill
Solution:
# Use a modern JDK (11+, or 8u191+) that respects cgroup limits
FROM eclipse-temurin:17-jre
# Or configure explicitly
ENV JAVA_OPTS="-XX:MaxRAMPercentage=75.0"
Recommended JVM configuration:
env:
  - name: JAVA_OPTS
    value: >-
      -XX:MaxRAMPercentage=75.0
      -XX:InitialRAMPercentage=50.0
      -XX:+UseG1GC
      -XX:+UseContainerSupport
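To see how the container limit and the percentage flags interact, here is a minimal sketch (the image name is hypothetical): with a 1Gi limit, MaxRAMPercentage=75 caps the heap at roughly 768Mi, leaving about 256Mi for metaspace, thread stacks, and other native memory:
containers:
  - name: app
    image: my-java-app:1.0    # hypothetical image
    resources:
      requests:
        memory: "1Gi"
      limits:
        memory: "1Gi"         # 75% of this → ~768Mi max heap
    env:
      - name: JAVA_OPTS
        value: "-XX:MaxRAMPercentage=75.0 -XX:+UseG1GC"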
Network Optimization
Service mesh overhead
Without service mesh: latency = 5ms
With Istio sidecar: latency = 8-12ms (+60-140%)
When it's worth it:
- Distributed observability
- Mandatory mTLS
- Complex traffic management
When to avoid:
- Ultra-low latency critical
- High volume of internal requests
- Simplicity is priority
DNS lookup
Without connection reuse, each request can trigger a DNS lookup:
Request → DNS lookup (2-5ms) → Connection → Response
Optimization:
# Configure dnsPolicy and ndots
spec:
  dnsPolicy: ClusterFirst
  dnsConfig:
    options:
      - name: ndots
        value: "2"   # Default is 5; lowering it avoids unnecessary search-domain lookups
Connection pooling
# ❌ New connection per request
# TCP handshake + TLS = 50-100ms per request
# ✅ Connection pool
# Reuses established connections
Pool configuration:
// Node.js connection pool (the options match the pg / node-postgres Pool)
const { Pool } = require('pg');

const pool = new Pool({
  max: 20,                        // Maximum connections kept open
  idleTimeoutMillis: 30000,       // Close connections idle for 30s
  connectionTimeoutMillis: 2000   // Fail fast when the pool is exhausted
});
Storage Optimization
Storage types and performance
emptyDir (memory): ~500MB/s
emptyDir (disk): ~100MB/s
hostPath: ~100MB/s
PersistentVolume: ~50-100MB/s (depends on provider)
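For scratch data that needs the fastest option above, a memory-backed emptyDir is a minimal sketch; whatever is written to it counts against the container's memory limit:
spec:
  containers:
    - name: app
      volumeMounts:
        - name: scratch
          mountPath: /tmp/scratch
  volumes:
    - name: scratch
      emptyDir:
        medium: Memory     # tmpfs: fast, but consumes the memory budget
        sizeLimit: 256Mi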
Copy-on-Write overhead
Image layers use CoW:
# ❌ Many layers = lots of CoW
RUN apt-get update
RUN apt-get install -y package1
RUN apt-get install -y package2
# ✅ Fewer layers
RUN apt-get update && \
    apt-get install -y package1 package2 && \
    rm -rf /var/lib/apt/lists/*
Logs and performance
# ❌ Unbounded logs to stdout
# The runtime's log files grow without limit, consuming disk space and I/O

# ✅ Limit container log size
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: app
      # Log rotation is configured in the container runtime, not in the Pod spec
Docker daemon config:
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
Kubernetes-Specific Optimizations
Pod Disruption Budget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
Topology Spread
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-app
Readiness vs Liveness
# Liveness: is the app alive?
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

# Readiness: can the app receive traffic?
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
Common mistake:
# ❌ Readiness probe too heavy
readinessProbe:
  httpGet:
    path: /health      # Checks DB, cache, external APIs
    port: 8080
  periodSeconds: 1     # Every second!
# = DDoS yourself

# ✅ Lightweight readiness check at an adequate frequency
readinessProbe:
  httpGet:
    path: /health/ready   # Basic check only
    port: 8080
  periodSeconds: 5
Graceful Shutdown
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5"]
// The application must also handle SIGTERM
process.on('SIGTERM', async () => {
  console.log('Received SIGTERM, shutting down gracefully');
  // Node's http server.close() takes a callback, so wrap it in a Promise
  await new Promise((resolve) => server.close(resolve));
  await db.close();
  process.exit(0);
});
Optimized Docker Image
Multi-stage build
# Build stage
FROM node:18 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production
# Runtime stage
FROM node:18-slim
WORKDIR /app
COPY --from=builder /app/node_modules ./node_modules
COPY . .
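# A .dockerignore (node_modules, .git, logs) keeps this COPY small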
USER node
CMD ["node", "server.js"]
Base image
# ❌ Heavy image
FROM ubuntu:22.04 # ~77MB
# ✅ Optimized image
FROM alpine:3.18 # ~7MB
# ✅ Distroless (even smaller, more secure)
FROM gcr.io/distroless/nodejs:18 # ~40MB, no shell
Startup time
Large image (500MB): pull = 30-60s
Optimized image (50MB): pull = 3-6s
Performance Monitoring
Essential metrics
# Container-level
- container_cpu_usage_seconds_total
- container_memory_usage_bytes
- container_network_receive_bytes_total
- container_fs_reads_bytes_total
# Application-level
- http_request_duration_seconds
- http_requests_total
- process_resident_memory_bytes
Minimum dashboard
1. CPU Usage vs Request vs Limit
2. Memory Usage vs Request vs Limit
3. CPU Throttling
4. Pod Restarts
5. Network I/O
6. Disk I/O
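The first three panels translate directly into PromQL. A minimal sketch as Prometheus recording rules, assuming cAdvisor and kube-state-metrics are scraped (the rule names are illustrative):
groups:
  - name: container-dashboard
    rules:
      # 1. CPU usage as a fraction of the CPU limit
      - record: pod:cpu_usage_vs_limit:ratio
        expr: |
          sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
            / sum by (namespace, pod) (kube_pod_container_resource_limits{resource="cpu"})
      # 2. Working-set memory (what the OOM killer considers) as a fraction of the limit
      - record: pod:memory_usage_vs_limit:ratio
        expr: |
          sum by (namespace, pod) (container_memory_working_set_bytes{container!=""})
            / sum by (namespace, pod) (kube_pod_container_resource_limits{resource="memory"})
      # 3. Fraction of CFS periods throttled
      - record: pod:cpu_throttling:ratio
        expr: |
          sum by (namespace, pod) (rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m]))
            / sum by (namespace, pod) (rate(container_cpu_cfs_periods_total{container!=""}[5m]))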
Conclusion
Performance in containers depends on:
- Correct resources: requests and limits based on real data
- Avoid throttling: CPU throttling destroys latency
- Memory with headroom: OOM kills cause instability
- Optimized network: DNS, connection pools, conscious service mesh
- Lean images: fewer layers, smaller size, fast startup
Before blaming the container:
- Check for CPU throttling
- Confirm there are no OOM kills
- Analyze network metrics
- Compare with baseline outside the container
A container is a tool, not a villain. Poor container performance is usually poor performance amplified.