Code vs Architecture: where the real problem lies

"The code is optimized, but the system is still slow." Or: "We refactored everything, but the performance is the same." These symptoms indicate confusion between code problems and architecture problems. This article teaches you to distinguish between the two and attack the right problem.

Optimizing code in a broken architecture is polishing the Titanic while it sinks.

The Fundamental Difference

Code problems

Characteristics:
  - Located in a function/class/module
  - Solved with targeted refactoring
  - Impact limited to the component
  - Don't require design change

Examples:
  - O(n²) algorithm that should be O(n)
  - Unnecessary loop
  - N+1 query
  - Inefficient serialization

Architecture problems

Characteristics:
  - Distributed throughout the system
  - Require design change
  - Systemic impact
  - Not solved with local optimization

Examples:
  - Synchronous communication where it should be asynchronous
  - Excessive coupling between services
  - Wrong database for the use case
  - Missing cache in critical layer

Symptoms of Code Problems

1. Localized hotspot

Trace shows:
  ┌─ Service A: 50ms
  │  └─ Function X: 45ms  ← 90% of time here
  ├─ Service B: 30ms
  └─ Service C: 20ms

Diagnosis:
  Problem located in Function X
  → Solution: Optimize Function X

2. High CPU in one component

Metrics:
  Service A: CPU 95%
  Service B: CPU 15%
  Service C: CPU 10%

Diagnosis:
  Inefficient code in Service A
  → Profiler will identify the function

3. Specific slow query

Slow query log:
  SELECT * FROM orders
  WHERE user_id = ?
  AND status = 'pending'
  Time: 2.5s

EXPLAIN:
  Seq Scan on orders (no index)

Diagnosis:
  Code problem (missing index)
  → Solution: CREATE INDEX
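
The effect of the index can be reproduced in a small sketch. The "Seq Scan" wording above comes from PostgreSQL; this example swaps in SQLite (through Python's built-in sqlite3 module) so it runs without a server. The table layout, the index name idx_orders_user_status and the sample data are assumptions.

```python
import sqlite3


def query_plan(conn, sql, params):
    """Return the plan detail strings SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the detail text is the last column


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders (user_id, status) VALUES (?, ?)",
    [(i % 100, "pending" if i % 3 == 0 else "shipped") for i in range(1000)],
)

SQL = "SELECT * FROM orders WHERE user_id = ? AND status = 'pending'"

# Without an index, SQLite has to scan the whole table.
before = query_plan(conn, SQL, (42,))

# Hypothetical composite index covering both filter columns.
conn.execute("CREATE INDEX idx_orders_user_status ON orders (user_id, status)")
after = query_plan(conn, SQL, (42,))

print("before:", before)
print("after: ", after)
```

The plan before the index reports a table scan, and the plan after it names the new index. The exact wording of the plan text varies between SQLite versions.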

Symptoms of Architecture Problems

1. Uniformly distributed latency

Trace shows:
  ┌─ Service A: 200ms
  │  └─ Call to B: 180ms
  ├─ Service B: 180ms
  │  └─ Call to C: 160ms
  └─ Service C: 160ms
       └─ Call to D: 140ms

Diagnosis:
  Chain of synchronous calls: each service adds only ~20ms of its own
  work and spends the rest waiting on the next hop
  → Solution: Rethink the design (async? aggregation?)

2. All services slow

Metrics under load:
  Service A: p95 = 2s (normal: 100ms)
  Service B: p95 = 1.8s (normal: 80ms)
  Service C: p95 = 1.5s (normal: 50ms)

Diagnosis:
  Systemic saturation, not localized
  → Capacity or design problem
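
The arithmetic behind this diagnosis can be sketched as a ratio of loaded p95 to normal p95 per service. The numbers are the ones from the metrics above; the 5x threshold is an arbitrary assumption for illustration, not a recommended value.

```python
def degradation_ratios(normal_ms, loaded_ms):
    """Return how many times slower each service is under load."""
    return {name: loaded_ms[name] / normal_ms[name] for name in normal_ms}


def looks_systemic(ratios, threshold=5.0):
    """True when every service degraded by at least `threshold` times."""
    return all(ratio >= threshold for ratio in ratios.values())


normal = {"Service A": 100, "Service B": 80, "Service C": 50}
loaded = {"Service A": 2000, "Service B": 1800, "Service C": 1500}

ratios = degradation_ratios(normal, loaded)
print(ratios, looks_systemic(ratios))
```

Every service is 20x to 30x slower, so no single component stands out. A code hotspot would typically show one large ratio and several small ones.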

3. Bottleneck moves when you optimize

Before:
  DB is bottleneck → Add cache
After:
  Cache is bottleneck → Increase cluster
After:
  Network is bottleneck → ???

Diagnosis:
  Architecture doesn't scale
  → Need to redesign, not optimize points

Diagnostic Framework

Step 1: End-to-end trace

Collect complete trace:
  Request → Gateway → Service A → DB
                   → Service B → Cache
                   → Service C → External API

Analyze:
  - Where is the time?
  - Is it localized or distributed?
  - Is there a pattern (always the same component)?
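
To answer "where is the time?", one common approach is to compute each span's self time: its duration minus the time spent in its direct children. The sketch below assumes a simplified span model (the Span class is hypothetical, not a tracing library API) in which children run one after another, and it reuses the numbers from the localized hotspot trace under an assumed 100ms root span.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: float
    children: list = field(default_factory=list)  # list of Span


def self_times(span):
    """Map each span name to its duration minus its direct children's time."""
    own = span.duration_ms - sum(child.duration_ms for child in span.children)
    result = {span.name: own}
    for child in span.children:
        result.update(self_times(child))
    return result


def hottest_span(span):
    """Return (name, self_time_ms) of the span with the largest self time."""
    times = self_times(span)
    name = max(times, key=times.get)
    return name, times[name]


# Assumed structure: one request span wrapping three sequential services.
trace = Span("request", 100, [
    Span("Service A", 50, [Span("Function X", 45)]),
    Span("Service B", 30),
    Span("Service C", 20),
])

print(self_times(trace))
print(hottest_span(trace))
```

If one span holds most of the self time, the problem is localized. If self time is spread thinly across many spans, it is distributed.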

Step 2: Profile individual components

For each slow component:
  - Run profiler (CPU, memory, I/O)
  - Identify top functions
  - Check if it's code or waiting

Example:
  Service A profile shows:
  - 80% of time in http.call()  ← Waiting (architecture)
  - 15% in json.parse()          ← Code
  - 5% in business logic         ← Code
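
One way to check "code or waiting" for a single call is to compare wall-clock time with CPU time. The sketch below uses time.perf_counter and time.process_time from the standard library. The helper names, the 0.5 threshold and the two sample workloads are assumptions, and process_time counts CPU for the whole process, so the comparison is only meaningful for single-threaded work.

```python
import time


def measure(func, *args):
    """Return (wall_seconds, cpu_seconds) for one call of func."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


def classify(wall, cpu, waiting_threshold=0.5):
    """Label a call 'waiting' when CPU time is a small share of wall time."""
    if wall <= 0:
        return "cpu"
    return "waiting" if cpu / wall < waiting_threshold else "cpu"


def simulated_remote_call():
    time.sleep(0.05)  # stands in for a blocking network call


def busy_work():
    sum(i * i for i in range(2_000_000))  # pure computation


if __name__ == "__main__":
    for func in (simulated_remote_call, busy_work):
        wall, cpu = measure(func)
        print(func.__name__, classify(wall, cpu))
```

A call that is mostly waiting points at what it waits for (a dependency, the network, a lock), while a call that is mostly CPU points at the code itself.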

Step 3: Isolation test

Test the component in isolation:
  - Remove dependencies (mock/stub)
  - Apply same load
  - Measure latency

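If the isolated component is fast → Architecture problem (the time is in its dependencies)
If the isolated component is slow → Code problem (the time is in the component itself)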
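
A minimal sketch of such a test, assuming the component receives its dependency as a parameter so it can be swapped for a stub. All names (handle_request, fetch_profile) and the 50ms simulated latency are hypothetical.

```python
import time


def handle_request(fetch_profile):
    """Component under test; its only dependency is injected."""
    profile = fetch_profile(42)
    return {"greeting": "Hello, " + profile["name"]}


def slow_fetch_profile(user_id):
    time.sleep(0.05)  # simulated latency of the real dependency
    return {"name": "user-%d" % user_id}


def stub_fetch_profile(user_id):
    return {"name": "user-%d" % user_id}  # same answer, no waiting


def average_latency(func, dependency, runs=5):
    """Average seconds per call of func(dependency) over `runs` calls."""
    start = time.perf_counter()
    for _ in range(runs):
        func(dependency)
    return (time.perf_counter() - start) / runs


if __name__ == "__main__":
    print("with dependency:", average_latency(handle_request, slow_fetch_profile))
    print("isolated:       ", average_latency(handle_request, stub_fetch_profile))
```

Here the component is fast once the dependency is stubbed, so its own code is not the bottleneck.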

Solutions by Problem Type

For code problems

Inefficient algorithm:
  - Refactor to lower complexity
  - Use appropriate data structure

Slow query:
  - Add index
  - Rewrite query
  - Denormalize if necessary

Serialization:
  - Change format (JSON → Protobuf)
  - Reduce payload

Memory:
  - Object pooling
  - Stream processing
  - Lazy loading

For architecture problems

Excessive synchronous calls:
  - Introduce messaging (async)
  - Aggregate calls (batch)
  - Cache results

Coupling:
  - Separate domains
  - Event-driven architecture
  - CQRS for read/write

Inadequate database:
  - Polyglot persistence
  - Read replicas
  - Specialized database (time-series, graph)

Missing cache:
  - Distributed cache
  - Edge cache (CDN)
  - Layered cache

Practical Examples

Example 1: Looks like code, is architecture

Symptom:
  "Listing API slow (2s)"

Initial analysis:
  Developer assumes: "Database query slow"

Investigation:
  - Query takes 50ms ✓
  - Service takes 2s total
  - Trace shows: 20 calls to image service

Real diagnosis:
  N+1 at service level
  For each product, calls image service

Solution (architecture):
  - Batch: fetch all images in one call
  - Or: URL embedding in product
  - Or: CDN with predictable URL
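
The batch option can be sketched as follows. ImageService is a hypothetical stand-in that only counts round trips; the method names, the URL format and the existence of a batch endpoint are assumptions about the image service.

```python
class ImageService:
    """Stand-in for a remote image service that counts round trips."""

    def __init__(self):
        self.calls = 0

    def get_image(self, product_id):
        self.calls += 1  # one round trip per product
        return "https://images.example.com/%d.jpg" % product_id

    def get_images(self, product_ids):
        self.calls += 1  # one round trip for the whole list
        return {
            pid: "https://images.example.com/%d.jpg" % pid
            for pid in product_ids
        }


def list_products_n_plus_one(product_ids, images):
    """One image call per product: N round trips."""
    return [{"id": pid, "image": images.get_image(pid)} for pid in product_ids]


def list_products_batched(product_ids, images):
    """A single batch call, then a local lookup per product."""
    urls = images.get_images(product_ids)
    return [{"id": pid, "image": urls[pid]} for pid in product_ids]
```

For a listing of 20 products, the first version makes 20 calls to the image service and the second makes one, with the same result.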

Example 2: Looks like architecture, is code

Symptom:
  "Entire system slow under load"

Initial analysis:
  Developer assumes: "We need more servers"

Investigation:
  - Scaling doesn't help
  - All pods high CPU
  - Profile shows: regex in loop

Real diagnosis:
  Catastrophic backtracking in the email validation regex
  Called thousands of times per request

Solution (code):
  - Replace regex with simple validation
  - Or: Rewrite the pattern to avoid nested quantifiers, and compile it once
  - Result: 10x more capacity with same hardware
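
A minimal sketch of the "simple validation" option, assuming a plausibility check is enough for this code path (it is not a full RFC-compliant validator). The function name is hypothetical. It uses plain string operations, so its cost grows linearly with the input length.

```python
def is_plausible_email(value):
    """Cheap plausibility check: one '@', non-empty parts, dotted domain."""
    if any(ch.isspace() for ch in value):
        return False
    local, separator, domain = value.partition("@")
    if not separator or not local or "@" in domain:
        return False
    labels = domain.split(".")
    # At least two labels, and none of them empty ("a..b" or a trailing dot).
    return len(labels) >= 2 and all(labels)
```

Because there is no backtracking, a malformed input cannot make this check dramatically slower than a valid one.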

Example 3: Both problems

Symptom:
  "Checkout slow and unstable"

Investigation:
  Problem 1 (code):
    - Shipping calculation runs 3x (duplicated)
    - Inefficient JSON parsing

  Problem 2 (architecture):
    - 8 synchronous calls to complete
    - No fallback when external service slow
    - No cache for rarely changed data

Solution:
  1. Code fixes (quick):
     - Remove duplication
     - Optimize parsing

  2. Architecture refactor (planned):
     - Aggregate calls
     - Add circuit breaker
     - Implement cache
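
The circuit breaker step can be sketched as a small class: after a number of consecutive failures it stops calling the external service for a while and returns a fallback instead. This is a simplified illustration, not a production implementation. The class, its parameters and the default values are assumptions.

```python
import time


class CircuitBreaker:
    """Skip a failing dependency for a while and serve a fallback instead."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()  # open: do not touch the dependency
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0  # a success closes the circuit again
        return result
```

A checkout could wrap the external shipping call this way and fall back to a default quote while the circuit is open, which addresses the "no fallback when external service slow" finding.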

Decision Tree

   The system is slow
            │
            ▼
┌───────────────────────┐
│ Trace shows bottleneck│
│ in single component?  │
└───────────┬───────────┘
            │
      ┌─────┴─────────────────┐
      │                       │
     Yes                      No
      │                       │
      ▼                       ▼
┌───────────┐         ┌───────────────┐
│ Profile   │         │ Latency is in │
│ the       │         │ waiting (I/O) │
│ component │         │ or CPU?       │
└─────┬─────┘         └───────┬───────┘
      │                       │
      ▼               ┌───────┴───────┐
  High CPU?           │               │
      │            Waiting           CPU
  ┌───┴───┐           │               │
  │       │           ▼               ▼
 Yes      No     Architecture        Code
  │       │     (communication)  (distributed)
  ▼       ▼
Code    Architecture
(local) (I/O bound)
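
The tree has two questions and four outcomes, so it can be written as a small function. This is a sketch: the function name and the two boolean parameters are assumptions, and the answers still have to come from a trace and a profile.

```python
def diagnose(single_component_bottleneck, cpu_bound):
    """Follow the decision tree and return the kind of problem.

    single_component_bottleneck: the trace shows one dominant component.
    cpu_bound: the time is spent on CPU rather than waiting on I/O.
    """
    if single_component_bottleneck:
        return "code (local)" if cpu_bound else "architecture (I/O bound)"
    return "code (distributed)" if cpu_bound else "architecture (communication)"
```

For example, diagnose(True, True) follows the left branch to "code (local)", and diagnose(False, False) follows the right branch to "architecture (communication)".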

Conclusion

Distinguish between code and architecture:

1. Trace first: understand the complete flow
2. Profile the hotspots: CPU or waiting?
3. Test in isolation: is the component fast on its own?
4. Apply the correct solution: don't use a hammer on a screw

The golden rule:

Localized problem → Code optimization
Distributed problem → Architecture review
Both → Code first (faster), architecture later

The most efficient code in the world doesn't matter if the architecture makes it wait.

This article is part of the series on the OCTOPUS Performance Engineering methodology.