AI Workflow Orchestration Platform
Distributed execution system and orchestration layer for AI workflows — converting visual pipelines into production-ready APIs with reliable retry, dependency scheduling, and enterprise connector integrations.
Overview
Salt AI is a platform that lets teams build and deploy AI workflows visually — connecting models, data sources, and enterprise tools into automated pipelines. I joined as a senior engineer to build the backend execution layer that made those workflows production-grade.
My focus was the node runtime and orchestration system: the infrastructure that actually runs workflows reliably, handles failures gracefully, and scales to serve enterprise customers with strict data governance requirements.
Problem
The original workflow system was designed for demos and experimentation, not production. Pipelines would silently fail mid-execution with no retry logic, leaving jobs in ambiguous states. Deploying a workflow as a real API required manual steps, making iteration slow. Each new enterprise integration (Slack, Notion, PostgreSQL, Google Drive) was built ad-hoc, with no consistent connector model. Monitoring was limited to logs — there was no structured visibility into pipeline health, queue depth, or failure rates. Enterprise customers also required isolated execution environments, which the existing architecture didn't support.
Solution
I designed an asynchronous workflow execution system built around Celery and Redis, where each workflow node becomes an idempotent Celery task with explicit dependency edges. FastAPI exposes the orchestration layer as a REST/GraphQL API, allowing visual pipelines to become deployable webhooks or scheduled jobs with a single configuration change.
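The idempotency requirement is the heart of this design. Below is a minimal pure-Python sketch of the idea (in production the function body would live inside a Celery `@app.task` and the result backend would be Redis, not a dict); the names `task_key` and `run_node` are illustrative, not the platform's actual API:

```python
import hashlib
import json

# In production this is the Redis result backend; a dict stands in here.
result_backend: dict[str, str] = {}

def task_key(run_id: str, node_id: str, payload: dict) -> str:
    """Deterministic key: a retried task maps to the same result slot."""
    blob = json.dumps(payload, sort_keys=True)
    return f"{run_id}:{node_id}:" + hashlib.sha256(blob.encode()).hexdigest()[:12]

def run_node(run_id: str, node_id: str, payload: dict, fn) -> str:
    """Execute a workflow node idempotently: if the key already holds a
    result (e.g. the broker redelivered the task after a worker crash),
    return the cached result instead of re-running side effects."""
    key = task_key(run_id, node_id, payload)
    if key in result_backend:
        return result_backend[key]
    result = fn(payload)
    result_backend[key] = result
    return result
```

Because the key is derived from the run, the node, and the payload, redelivery of the same task is a cheap cache hit rather than a duplicated side effect.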
A standardized connector framework abstracted all external integrations behind a common interface, so adding a new connector (e.g. Notion or PostgreSQL) required only implementing that interface — not writing bespoke sync logic. For enterprise deployments, I used AWS VPC and Kubernetes namespaces to run workflows in fully isolated environments.
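A sketch of what such a contract can look like — the `Connector` interface and its method names here are hypothetical stand-ins for the real framework, and `InMemoryConnector` exists only to show the shape an implementation takes:

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable

class Connector(ABC):
    """Common data-ingestion contract: every integration (Slack, Notion,
    PostgreSQL, ...) implements the same methods, so orchestration code
    never special-cases a source."""

    @abstractmethod
    def connect(self, credentials: dict[str, str]) -> None: ...

    @abstractmethod
    def fetch(self, query: dict[str, Any]) -> Iterable[dict[str, Any]]: ...

    @abstractmethod
    def healthcheck(self) -> bool: ...

class InMemoryConnector(Connector):
    """Toy implementation, used here only to demonstrate the contract."""

    def __init__(self, rows: list[dict[str, Any]]):
        self._rows = rows
        self._connected = False

    def connect(self, credentials: dict[str, str]) -> None:
        self._connected = True

    def fetch(self, query: dict[str, Any]) -> Iterable[dict[str, Any]]:
        key, val = next(iter(query.items()))
        return [r for r in self._rows if r.get(key) == val]

    def healthcheck(self) -> bool:
        return self._connected
```

Adding a new integration then means writing one class against this interface; scheduling, retries, and monitoring come for free from the surrounding runtime.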
Architecture
Visual pipelines are converted into production APIs by an orchestration layer that runs tasks asynchronously across distributed workers, with enterprise-grade isolation via Kubernetes and AWS VPC.
The platform follows a producer-consumer model where the API layer accepts workflow triggers, enqueues tasks into Redis-backed Celery queues, and distributed workers pull and execute nodes. Each node execution is idempotent — safe to retry — and results are checkpointed so partial failures can resume from the last successful state. External connectors run as sidecar-style services within the same Kubernetes pod group, keeping data within the customer's security boundary. Prometheus scrapes queue metrics and worker state; Grafana dashboards surface pipeline health.
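The resume-after-partial-failure behavior can be sketched in a few lines. This is a simplified linear-pipeline model (the real system handles DAGs, and the checkpoint store is Redis rather than a dict); `run_pipeline` is an illustrative name:

```python
# Stands in for the Redis checkpoint store.
checkpoints: dict[tuple[str, str], object] = {}

def run_pipeline(run_id: str, nodes):
    """Execute nodes in order, skipping any node whose result is already
    checkpointed. If a node raises, earlier checkpoints survive, so a
    re-triggered run resumes from the last successful state."""
    out = None
    for name, fn in nodes:
        key = (run_id, name)
        if key in checkpoints:
            out = checkpoints[key]  # already done on a previous attempt
            continue
        out = fn(out)
        checkpoints[key] = out
    return out
```

On retry, completed nodes are skipped via their checkpoints, so only the failed node and its downstream work are re-executed.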
Components
FastAPI / GraphQL API
Ingests workflow triggers, manages job lifecycle, exposes webhook endpoints for deployed workflows
Redis Queue
Celery broker and result backend; queues tasks by workflow step with priority and TTL configuration
Celery Workers
Distributed worker pool that executes workflow nodes; each task is idempotent with configurable retry policies
Connector Framework
Standardized interface for external integrations (Slack, Notion, Google Drive, PostgreSQL) — each connector implements a common data ingestion contract
Deployment Pipeline
Docker + Kubernetes manifests auto-generated from workflow definitions; AWS VPC isolation for enterprise tenants
Monitoring Stack
Prometheus metrics from workers and queues; Grafana dashboards for pipeline health; dead-letter queues for failed tasks
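The dead-letter behavior mentioned above is conceptually simple; here is a minimal sketch using an in-process deque in place of the real Redis-backed queues (`MAX_RETRIES`, `process`, and `drain` are illustrative names, not the platform's API):

```python
from collections import deque

MAX_RETRIES = 3
main_queue: deque = deque()
dead_letter: list = []

def process(task: dict, handler):
    """Run one task; on failure either requeue it with an incremented
    retry count or, past MAX_RETRIES, route it to the dead-letter queue
    for inspection and manual replay."""
    try:
        return handler(task["payload"])
    except Exception as exc:
        task["retries"] = task.get("retries", 0) + 1
        if task["retries"] >= MAX_RETRIES:
            dead_letter.append({**task, "error": str(exc)})
        else:
            main_queue.append(task)
        return None

def drain(handler):
    """Consume the queue until empty, collecting successful results."""
    results = []
    while main_queue:
        r = process(main_queue.popleft(), handler)
        if r is not None:
            results.append(r)
    return results
```

The key property: a poison task cannot loop forever or silently vanish — after its retry budget it lands in the DLQ with its error attached, where an operator can inspect and replay it.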
Execution Flow
A trigger (webhook call or schedule) hits the FastAPI layer, which enqueues the workflow's entry nodes into the Redis-backed Celery queues. Workers pull and execute each node idempotently, checkpoint its result, and enqueue downstream nodes once their dependency edges are satisfied. Every run ends in an explicit terminal state — success, or failure after exhausting its retry policy, with the task routed to a dead-letter queue.
Key Technical Decisions
Celery + Redis over Kafka: Kafka is excellent for high-throughput event streaming but adds significant operational complexity (partition management, consumer group coordination). Our workflows have moderate throughput with complex retry semantics and dependency ordering — Celery with Redis provides exactly-enough guarantees with far simpler operations, easier local development, and built-in retry primitives.
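The "built-in retry primitives" amount to per-connector backoff policies (Celery exposes these as task options like `autoretry_for` and `retry_backoff`). A sketch of how a per-connector schedule might be derived — the `RETRY_POLICIES` table and its values are hypothetical, not the production configuration:

```python
RETRY_POLICIES = {
    # hypothetical per-connector settings: base delay (s), cap (s), attempts
    "external_api": {"base": 2.0, "cap": 300.0, "max_attempts": 6},
    "internal":     {"base": 0.5, "cap": 10.0,  "max_attempts": 3},
}

def backoff_schedule(policy_name: str) -> list[float]:
    """Exponential backoff: delay doubles each attempt (base * 2**i),
    capped so a flaky external API never pushes waits past the ceiling."""
    p = RETRY_POLICIES[policy_name]
    return [min(p["base"] * (2 ** i), p["cap"]) for i in range(p["max_attempts"])]
```

Separating external-API policies (long, patient backoff) from internal-error policies (fast, few retries) keeps transient upstream outages from being treated like code bugs, and vice versa.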
Reliability & Scaling
Idempotent task execution: every Celery task is safe to retry; intermediate state checkpointed in Redis
Dead-letter queues (DLQs): failed tasks after max retries are routed to DLQs for inspection and manual replay
Exponential backoff retry policies configured per connector type — external API failures handled separately from internal errors
Health checks on all worker pods; Kubernetes restarts unhealthy workers without workflow interruption
Prometheus metrics on queue depth, worker lag, task failure rates; Grafana alerts on SLO breaches
Workflow execution state machine: explicit terminal states prevent jobs from hanging in ambiguous in-progress state
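The state-machine item above can be sketched with an enum and an explicit transition table (state and function names here are illustrative):

```python
from enum import Enum

class JobState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"   # terminal
    FAILED = "failed"         # terminal
    CANCELLED = "cancelled"   # terminal

TERMINAL = {JobState.SUCCEEDED, JobState.FAILED, JobState.CANCELLED}

# Every legal edge is listed; terminal states have no outgoing edges.
TRANSITIONS = {
    JobState.PENDING: {JobState.RUNNING, JobState.CANCELLED},
    JobState.RUNNING: {JobState.SUCCEEDED, JobState.FAILED, JobState.CANCELLED},
}

def transition(current: JobState, target: JobState) -> JobState:
    """Permit only declared transitions, so a job can neither hang in an
    ambiguous in-progress state nor leave a finished one."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Because the terminal states simply have no entries in `TRANSITIONS`, "stuck" jobs are impossible by construction: every run ends in exactly one explicit terminal state.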
Impact
Significantly reduced pipeline failure rate by introducing idempotent execution and structured retry policies
Reduced connector integration time by over 60% through the standardized connector framework — new integrations went from days to hours
Enabled enterprise deployments with isolated execution environments, unlocking a new customer segment with strict data governance requirements
Improved deployment reliability and environment isolation by converting manual deployment steps into automated Kubernetes manifests
Introduced structured observability (Prometheus/Grafana) where previously only raw logs existed — teams could now identify and resolve incidents in minutes instead of hours
Tech Stack
Backend
Python, FastAPI, GraphQL, Celery
Infrastructure
Docker, Kubernetes, AWS VPC, Prometheus, Grafana
Data
Redis, PostgreSQL
AI / ML