About

Senior Software Engineer with 12 years at Google, Zing Health, Ambience Health, and Salt AI — spanning ML infrastructure, healthcare platforms, clinical AI, and AI workflow orchestration.

What I build

Orchestration layers, async task pipelines, EHR integrations, and the monitoring that tells you when something breaks. I've built these systems at Google scale, in Medicare Advantage healthcare, and at AI infrastructure startups.

Why AI + Healthcare

Healthcare IT has real stakes: a broken clinical workflow means a physician can't see a patient's chart. AI makes these problems interesting in a new way, but making AI tools reliable in production is serious engineering. That's the part I focus on.

Where I do my best work

Early-stage or scaling startup

End-to-end ownership, direct impact.

Healthcare or high-stakes domain

Where reliability has real consequences.

AI infrastructure

Making ML models useful in production.

Cross-functional teams

Engineers, clinicians, and researchers together.

David Chang

Senior Software Engineer

Salt AI · Ambience Health · Zing Health · Google
Open to new roles

How I Design Systems

Architecture is about decisions, not tools.

01

Failure-first design

Before sketching the happy path, I ask: what does this look like when it fails? Idempotency, retry semantics, and dead-letter handling are part of the initial design — not afterthoughts.
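
As a minimal sketch of how idempotency, retries, and dead-letter handling fit together, the snippet below uses only the Python standard library. The names `run_task`, `completed`, and `dead_letters` are hypothetical, and the in-memory dict and list stand in for a real result store and dead-letter queue; it illustrates the idea and is not a description of any specific production system.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical in-memory stand-ins for a durable result store and a DLQ.
completed: Dict[str, str] = {}
dead_letters: List[Tuple[str, str]] = []


def run_task(
    task_id: str,
    handler: Callable[[], str],
    max_attempts: int = 3,
    base_delay: float = 0.0,
) -> Optional[str]:
    """Run handler once per task_id, retrying failures with exponential backoff."""
    # Idempotency: a duplicate delivery of a finished task is a no-op.
    if task_id in completed:
        return completed[task_id]

    last_error = ""
    for attempt in range(max_attempts):
        try:
            result = handler()
        except Exception as exc:
            last_error = repr(exc)
            # Back off before the next attempt, but not after the final one.
            if attempt < max_attempts - 1:
                time.sleep(base_delay * (2 ** attempt))
        else:
            completed[task_id] = result
            return result

    # Retries exhausted: park the task for inspection instead of dropping it.
    dead_letters.append((task_id, last_error))
    return None
```

Calling `run_task` twice with the same `task_id` runs the handler only until its first success; a handler that keeps failing ends up in `dead_letters` along with its last error.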

02

Simple over clever

Celery + Redis solves 90% of async task problems without Kafka's operational overhead. I pick the tool that fits the actual problem, not the most impressive one.

03

Observability by design

A system you cannot see is a system you cannot trust. Distributed tracing, structured logging, and meaningful metrics are designed in from the start.
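
To make the structured-logging part concrete, here is a small sketch using only Python's standard `logging` and `json` modules. `JsonFormatter`, `make_logger`, and the `trace_id` field are hypothetical names chosen for illustration; a real deployment would more likely take the trace ID from a tracing library, and this sketch does not cover distributed tracing or metrics.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Hypothetical field, attached by callers through `extra=`.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def make_logger(name: str, stream) -> logging.Logger:
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# Example: every line carries the same trace_id, so logs can be correlated.
buffer = io.StringIO()
log = make_logger("orchestrator", buffer)
log.info("task %s queued", "t1", extra={"trace_id": "abc123"})
```

Because each line is machine-parseable and carries a `trace_id`, log lines from one request can be filtered and joined across services.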

System Design Evolution

v1 · Prototype

Brittle

Synchronous. Single process. No retry logic.

Failure mode: Fails silently. Blocks on slow tasks.

v2 · Async

Better

Celery + Redis. Tasks queued asynchronously. Basic retries.

Failure mode: No idempotency. Partial failures corrupt state.

v3 · Production

Production-grade

Idempotent tasks. Checkpointed state. DLQs. Prometheus. K8s isolation.

Failure modes from v1 and v2 addressed.
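
The step from v2 to v3 can be illustrated with a checkpointed workflow runner. This is a sketch under stated assumptions: `run_workflow`, `load_done`, and `save_done` are hypothetical helpers, and a local JSON file stands in for whatever durable store a real system would use. It shows only the checkpoint-and-resume idea, not DLQs, Prometheus, or Kubernetes isolation.

```python
import json
import os
import tempfile
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def load_done(path: str) -> List[str]:
    """Return the names of steps already recorded as finished."""
    if not os.path.exists(path):
        return []
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def save_done(path: str, done: List[str]) -> None:
    # Write to a temp file and rename, so a crash cannot leave a
    # half-written checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(done, f)
    os.replace(tmp, path)


def run_workflow(steps: List[Step], checkpoint_path: str) -> List[str]:
    """Run steps in order, skipping finished ones; return steps run this time."""
    done = load_done(checkpoint_path)
    executed: List[str] = []
    for name, fn in steps:
        if name in done:
            continue  # finished in an earlier run
        fn()  # if this raises, earlier steps stay checkpointed
        done.append(name)
        save_done(checkpoint_path, done)
        executed.append(name)
    return executed


# Example: the second run finds both steps checkpointed and does nothing.
checkpoint = os.path.join(tempfile.mkdtemp(), "workflow.json")
steps = [("extract", lambda: None), ("load", lambda: None)]
print(run_workflow(steps, checkpoint))  # ['extract', 'load']
print(run_workflow(steps, checkpoint))  # []
```

A crash between a step finishing and its checkpoint being written will re-run that step on resume, which is why checkpointing pairs with idempotent tasks: the repeated step must be safe to execute twice.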

Key Tradeoffs

Option A | vs Option B | I choose | Because
Celery + Redis | Kafka | Celery | Simpler operations, built-in retry primitives, sufficient throughput for workflow fan-out
FastAPI | Django REST | FastAPI | Async-first, lower latency, auto OpenAPI docs; fits orchestration API patterns
Idempotent tasks | Stateful execution | Idempotent | Turns hardware failures into cheap retries instead of corrupted state
Canary rollout | Blue-green deploy | Canary | Detects quality regressions under real traffic before full rollout
K8s namespaces | Separate VMs | Namespaces | Strong isolation guarantees with shared cluster infrastructure, at lower cost
OpenTelemetry | Log correlation | OTel | End-to-end trace spans across microservices; find failures in minutes

Design philosophy

“The best systems are boring. Simple components, clear failure boundaries, and good observability. Save cleverness for the problem you're actually trying to solve.”