Containers revolutionized how we deploy applications, but the abstraction has a cost. A poorly configured container can turn a performant application into a bottleneck. This article looks at how to optimize performance in containerized environments.
A container is not a virtual machine. Optimizing as if it were is the first mistake.
The Impact of Containers on Performance
Real overhead
Native application: 100 RPS baseline
Docker application: 95-98 RPS (2-5% overhead)
K8s application: 90-95 RPS (5-10% overhead)
The overhead is small, but configuration mistakes amplify it drastically:
Misconfigured container: 50-70 RPS
→ The problem isn't the container, it's the configuration
Sources of overhead
- Network: overlay networks, NAT, iptables
- Storage: copy-on-write, volumes
- CPU: cgroups, scheduling
- Memory: limits, OOM killer
Configuring Resources Correctly
CPU: requests vs limits
resources:
  requests:
    cpu: "500m"    # Guaranteed by the scheduler
  limits:
    cpu: "1000m"   # Maximum allowed
Common mistakes:
# ❌ Limit too low - constant throttling
limits:
  cpu: "100m"

# ❌ No limit - the container can monopolize the node
limits:
  cpu: null   # equivalent to omitting the limit

# ❌ High request equal to the limit - reserves CPU that mostly sits idle
requests:
  cpu: "2000m"
limits:
  cpu: "2000m"
Recommended configuration:
# ✅ Request based on average usage, limit for peaks
resources:
  requests:
    cpu: "250m"    # Observed average usage
  limits:
    cpu: "1000m"   # Headroom for peaks
CPU Throttling
When a container reaches its CPU limit, it suffers throttling:
No throttling: p99 latency = 50ms
With throttling: p99 latency = 500ms (10x worse)
How to detect:
# Container metrics
cat /sys/fs/cgroup/cpu/cpu.stat
# nr_throttled: number of times throttled
# throttled_time: total time in nanoseconds
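# On cgroup v2, the same counters live in /sys/fs/cgroup/cpu.stat
# (nr_throttled, throttled_usec in microseconds)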
Prometheus query:
rate(container_cpu_cfs_throttled_seconds_total[5m])
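If that query shows sustained throttling, it is worth alerting on it. A minimal sketch of a Prometheus alerting rule, assuming cAdvisor metrics are scraped; the 25% threshold and the names are illustrative:
groups:
  - name: container-cpu
    rules:
      - alert: CPUThrottlingHigh
        # Fraction of CFS periods in which the container was throttled
        expr: |
          rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])
            / rate(container_cpu_cfs_periods_total{container!=""}[5m]) > 0.25
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} is being CPU-throttled"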
Memory: the delicate balance
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "1Gi"
The OOM Killer problem:
Memory usage > limit → OOM kill → the container is killed and restarted
→ Connections drop, in-flight requests fail
Safe configuration:
# ✅ Leave headroom above normal usage
resources:
  requests:
    memory: "512Mi"   # Average usage
  limits:
    memory: "768Mi"   # +50% headroom over the request
JVM in Containers
Older JVMs don't respect container limits:
# Container with a 1GB limit
# An old JVM sees the host's 64GB instead
# and sizes a 16GB heap → instant OOM kill
Solution:
# Use a modern JDK (11+, or 8u191+) that respects cgroup limits
FROM eclipse-temurin:17-jre
# Or configure explicitly
ENV JAVA_OPTS="-XX:MaxRAMPercentage=75.0"
Recommended JVM configuration:
env:
  - name: JAVA_OPTS
    value: >-
      -XX:MaxRAMPercentage=75.0
      -XX:InitialRAMPercentage=50.0
      -XX:+UseG1GC
      -XX:+UseContainerSupport
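To see how the container limit and the percentage flags interact, here is a minimal sketch (the image name is hypothetical): with a 1Gi limit, MaxRAMPercentage=75 caps the heap at roughly 768Mi, leaving about 256Mi for metaspace, thread stacks, and other native memory:
containers:
  - name: app
    image: my-java-app:1.0    # hypothetical image
    resources:
      requests:
        memory: "1Gi"
      limits:
        memory: "1Gi"         # 75% of this → ~768Mi max heap
    env:
      - name: JAVA_OPTS
        value: "-XX:MaxRAMPercentage=75.0 -XX:+UseG1GC"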
Network Optimization
Service mesh overhead
Without service mesh: latency = 5ms
With Istio sidecar: latency = 8-12ms (+60-140%)
When it's worth it:
- Distributed observability
- Mandatory mTLS
- Complex traffic management
When to avoid:
- Ultra-low latency critical
- High volume of internal requests
- Simplicity is priority
DNS lookup
Without connection reuse, each request can trigger a DNS lookup:
Request → DNS lookup (2-5ms) → Connection → Response
Optimization:
# Configure dnsPolicy and ndots
spec:
  dnsPolicy: ClusterFirst
  dnsConfig:
    options:
      - name: ndots
        value: "2"   # Default is 5; lowering it avoids unnecessary search-domain lookups
Connection pooling
# ❌ New connection per request
# TCP handshake + TLS = 50-100ms per request
# ✅ Connection pool
# Reuses established connections
Pool configuration:
// Node.js connection pool (the options match the pg / node-postgres Pool)
const { Pool } = require('pg');

const pool = new Pool({
  max: 20,                        // Maximum connections kept open
  idleTimeoutMillis: 30000,       // Close connections idle for 30s
  connectionTimeoutMillis: 2000   // Fail fast when the pool is exhausted
});
Storage Optimization
Storage types and performance
emptyDir (memory): ~500MB/s
emptyDir (disk): ~100MB/s
hostPath: ~100MB/s
PersistentVolume: ~50-100MB/s (depends on provider)
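For scratch data that needs the fastest option above, a memory-backed emptyDir is a minimal sketch; whatever is written to it counts against the container's memory limit:
spec:
  containers:
    - name: app
      volumeMounts:
        - name: scratch
          mountPath: /tmp/scratch
  volumes:
    - name: scratch
      emptyDir:
        medium: Memory     # tmpfs: fast, but consumes the memory budget
        sizeLimit: 256Mi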
Copy-on-Write overhead
Image layers use CoW:
# ❌ Many layers = lots of CoW
RUN apt-get update
RUN apt-get install -y package1
RUN apt-get install -y package2
# ✅ Fewer layers
RUN apt-get update && \
    apt-get install -y package1 package2 && \
    rm -rf /var/lib/apt/lists/*
Logs and performance
# ❌ Unbounded logs to stdout
# The runtime's log files grow without limit, consuming disk space and I/O

# ✅ Limit container log size
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: app
      # Log rotation is configured in the container runtime, not in the Pod spec
Docker daemon config:
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
Kubernetes-Specific Optimizations
Pod Disruption Budget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
Topology Spread
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-app
Readiness vs Liveness
# Liveness: is the app alive?
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

# Readiness: can the app receive traffic?
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
Common mistake:
# ❌ Readiness probe too heavy
readinessProbe:
  httpGet:
    path: /health      # Checks DB, cache, external APIs
    port: 8080
  periodSeconds: 1     # Every second!
# = DDoS yourself

# ✅ Lightweight readiness check at an adequate frequency
readinessProbe:
  httpGet:
    path: /health/ready   # Basic check only
    port: 8080
  periodSeconds: 5
Graceful Shutdown
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5"]
// The application must also handle SIGTERM
process.on('SIGTERM', async () => {
  console.log('Received SIGTERM, shutting down gracefully');
  // Node's http server.close() takes a callback, so wrap it in a Promise
  await new Promise((resolve) => server.close(resolve));
  await db.close();
  process.exit(0);
});
Optimized Docker Image
Multi-stage build
# Build stage
FROM node:18 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production
# Runtime stage
FROM node:18-slim
WORKDIR /app
COPY --from=builder /app/node_modules ./node_modules
COPY . .
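# A .dockerignore (node_modules, .git, logs) keeps this COPY small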
USER node
CMD ["node", "server.js"]
Base image
# ❌ Heavy image
FROM ubuntu:22.04 # ~77MB
# ✅ Optimized image
FROM alpine:3.18 # ~7MB
# ✅ Distroless (even smaller, more secure)
FROM gcr.io/distroless/nodejs:18 # ~40MB, no shell
Startup time
Large image (500MB): pull = 30-60s
Optimized image (50MB): pull = 3-6s
Performance Monitoring
Essential metrics
# Container-level
- container_cpu_usage_seconds_total
- container_memory_usage_bytes
- container_network_receive_bytes_total
- container_fs_reads_bytes_total
# Application-level
- http_request_duration_seconds
- http_requests_total
- process_resident_memory_bytes
Minimum dashboard
1. CPU Usage vs Request vs Limit
2. Memory Usage vs Request vs Limit
3. CPU Throttling
4. Pod Restarts
5. Network I/O
6. Disk I/O
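The first three panels translate directly into PromQL. A minimal sketch as Prometheus recording rules, assuming cAdvisor and kube-state-metrics are scraped (the rule names are illustrative):
groups:
  - name: container-dashboard
    rules:
      # 1. CPU usage as a fraction of the CPU limit
      - record: pod:cpu_usage_vs_limit:ratio
        expr: |
          sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
            / sum by (namespace, pod) (kube_pod_container_resource_limits{resource="cpu"})
      # 2. Working-set memory (what the OOM killer considers) as a fraction of the limit
      - record: pod:memory_usage_vs_limit:ratio
        expr: |
          sum by (namespace, pod) (container_memory_working_set_bytes{container!=""})
            / sum by (namespace, pod) (kube_pod_container_resource_limits{resource="memory"})
      # 3. Fraction of CFS periods throttled
      - record: pod:cpu_throttling:ratio
        expr: |
          sum by (namespace, pod) (rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m]))
            / sum by (namespace, pod) (rate(container_cpu_cfs_periods_total{container!=""}[5m]))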
Conclusion
Performance in containers depends on:
- Correct resources: requests and limits based on real data
- Avoid throttling: CPU throttling destroys latency
- Memory with headroom: OOM kills cause instability
- Optimized network: DNS, connection pools, conscious service mesh
- Lean images: fewer layers, smaller size, fast startup
Before blaming the container:
- Check for CPU throttling
- Confirm there are no OOM kills
- Analyze network metrics
- Compare with baseline outside the container
A container is a tool, not a villain. Poor container performance is usually poor performance amplified.