All Projects
Salt AI · Jan 2025 – Mar 2026 · 60% faster integrations

AI Workflow Orchestration Platform

Distributed execution system and orchestration layer for AI workflows — converting visual pipelines into production-ready APIs with reliable retry, dependency scheduling, and enterprise connector integrations.

Python · FastAPI · Celery · Redis · Kubernetes · AWS · Docker · GraphQL

Overview

Salt AI is a platform that lets teams build and deploy AI workflows visually — connecting models, data sources, and enterprise tools into automated pipelines. I joined as a senior engineer to build the backend execution layer that made those workflows production-grade.

My focus was the node runtime and orchestration system: the infrastructure that actually runs workflows reliably, handles failures gracefully, and scales to serve enterprise customers with strict data governance requirements.

Problem

The original workflow system was designed for demos and experimentation, not production. Pipelines would silently fail mid-execution with no retry logic, leaving jobs in ambiguous states. Deploying a workflow as a real API required manual steps, making iteration slow. Each new enterprise integration (Slack, Notion, PostgreSQL, Google Drive) was built ad-hoc, with no consistent connector model. Monitoring was limited to logs — there was no structured visibility into pipeline health, queue depth, or failure rates. Enterprise customers also required isolated execution environments, which the existing architecture didn't support.

Solution

I designed an asynchronous workflow execution system built around Celery and Redis, where each workflow node becomes an idempotent Celery task with explicit dependency edges. FastAPI exposes the orchestration layer as a REST/GraphQL API, allowing visual pipelines to become deployable webhooks or scheduled jobs with a single configuration change.
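To illustrate the dependency-edge scheduling, here is a minimal sketch of how a pipeline's node graph can be resolved into execution waves using Python's standard-library `graphlib`. The node names and `edges` dict are illustrative; in the real system each ready node would be dispatched as a Celery task rather than run inline.

```python
from graphlib import TopologicalSorter

# Illustrative workflow graph: each key is a node, each value the set of
# nodes it depends on. In the real system these edges come from the
# visual pipeline definition.
edges = {
    "fetch_docs": set(),
    "embed": {"fetch_docs"},
    "summarize": {"fetch_docs"},
    "notify": {"embed", "summarize"},
}

ts = TopologicalSorter(edges)
ts.prepare()

order = []
while ts.is_active():
    # Every node in `ready` has all dependencies satisfied, so these
    # could be enqueued as independent Celery tasks in parallel.
    ready = sorted(ts.get_ready())
    order.append(ready)
    ts.done(*ready)

print(order)  # [['fetch_docs'], ['embed', 'summarize'], ['notify']]
```

Each inner list is a wave of nodes that can run concurrently, which is what lets independent branches of a pipeline execute in parallel across the worker pool.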

A standardized connector framework abstracted all external integrations behind a common interface, so adding a new connector (e.g. Notion or PostgreSQL) required only implementing that interface — not writing bespoke sync logic. For enterprise deployments, I used AWS VPC and Kubernetes namespaces to run workflows in fully isolated environments.
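The connector contract can be sketched as an abstract base class; the `Connector` interface, method names, and the stubbed `NotionConnector` below are hypothetical stand-ins for the real framework:

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable


class Connector(ABC):
    """Hypothetical common contract every external integration implements."""

    name: str

    @abstractmethod
    def fetch(self, query: dict[str, Any]) -> Iterable[dict[str, Any]]:
        """Pull records from the external system as normalized dicts."""


class NotionConnector(Connector):
    name = "notion"

    def fetch(self, query):
        # A real implementation would call the Notion API; stubbed here.
        return [{"source": self.name, "id": "page-1", "query": query}]


def ingest(connector: Connector, query: dict[str, Any]) -> list[dict[str, Any]]:
    # Orchestration code only ever sees the Connector interface, so a new
    # integration is just another subclass -- no bespoke sync logic.
    return list(connector.fetch(query))


records = ingest(NotionConnector(), {"database": "docs"})
```

Because the orchestrator depends only on the interface, adding Slack or PostgreSQL support means implementing one class, which is the mechanism behind the integration-time reduction described below.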

Architecture

Visual Pipeline (user-defined nodes) → Orchestration API (FastAPI + GraphQL) → Worker Pool (Celery + Redis) → Result (Webhook / API). Enterprise isolation via Kubernetes namespaces + AWS VPC.

Visual pipelines are converted into production APIs by an orchestration layer that runs tasks asynchronously across distributed workers, with enterprise-grade isolation via Kubernetes and AWS VPC.

The platform follows a producer-consumer model where the API layer accepts workflow triggers, enqueues tasks into Redis-backed Celery queues, and distributed workers pull and execute nodes. Each node execution is idempotent — safe to retry — and results are checkpointed so partial failures can resume from the last successful state. External connectors run as sidecar-style services within the same Kubernetes pod group, keeping data within the customer's security boundary. Prometheus scrapes queue metrics and worker state; Grafana dashboards surface pipeline health.
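The checkpoint-and-resume behavior can be sketched in a few lines. A plain dict stands in for the Redis result backend, and the node names and result strings are illustrative:

```python
# Checkpoints keyed by run + node; a dict stands in for Redis here.
checkpoints: dict[str, str] = {}
executed: list[str] = []  # tracks which nodes actually ran


def run_node(run_id: str, node: str) -> str:
    key = f"{run_id}:{node}"
    if key in checkpoints:
        # Idempotent: a retried or resumed run skips completed work.
        return checkpoints[key]
    result = f"{node}-result"   # stand-in for real node execution
    executed.append(node)
    checkpoints[key] = result   # checkpoint before acking the task
    return result


def run_workflow(run_id: str, nodes: list[str]) -> list[str]:
    return [run_node(run_id, n) for n in nodes]


run_workflow("run-1", ["fetch", "embed"])                       # partial run
results = run_workflow("run-1", ["fetch", "embed", "notify"])   # resume
```

On the second invocation only `notify` executes; the first two nodes are served from checkpoints, which is what lets a partial failure resume from the last successful state instead of replaying the whole pipeline.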

Components

FastAPI / GraphQL API

Ingests workflow triggers, manages job lifecycle, exposes webhook endpoints for deployed workflows

Redis Queue

Celery broker and result backend; queues tasks by workflow step with priority and TTL configuration

Celery Workers

Distributed worker pool that executes workflow nodes; each task is idempotent with configurable retry policies

Connector Framework

Standardized interface for external integrations (Slack, Notion, Google Drive, PostgreSQL) — each connector implements a common data ingestion contract

Deployment Pipeline

Docker + Kubernetes manifests auto-generated from workflow definitions; AWS VPC isolation for enterprise tenants

Monitoring Stack

Prometheus metrics from workers and queues; Grafana dashboards for pipeline health; dead-letter queues for failed tasks

Execution Flow

1. Trigger: webhook / API call
2. FastAPI: validate & enqueue
3. Redis Queue: Celery broker
4. Worker: execute node
5. Connector: external system
6. Result: checkpoint & ack

Key Technical Decisions

Celery + Redis over Kafka

Kafka is excellent for high-throughput event streaming but adds significant operational complexity (partition management, consumer-group coordination). Our workflows have moderate throughput with complex retry semantics and dependency ordering; Celery with Redis provides exactly the guarantees we need, with far simpler operations, easier local development, and built-in retry primitives.

Reliability & Scaling

  • Idempotent task execution: every Celery task is safe to retry; intermediate state checkpointed in Redis

  • Dead-letter queues (DLQs): failed tasks after max retries are routed to DLQs for inspection and manual replay

  • Exponential backoff retry policies configured per connector type — external API failures handled separately from internal errors

  • Health checks on all worker pods; Kubernetes restarts unhealthy workers without workflow interruption

  • Prometheus metrics on queue depth, worker lag, task failure rates; Grafana alerts on SLO breaches

  • Workflow execution state machine: explicit terminal states prevent jobs from hanging in ambiguous in-progress state
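The backoff, DLQ-routing, and terminal-state rules above can be sketched together; the state names, base delay, and cap are illustrative values, not the production configuration:

```python
from enum import Enum


class RunState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"            # terminal
    DEAD_LETTERED = "dead_lettered"    # terminal: routed to DLQ for replay


TERMINAL = {RunState.SUCCEEDED, RunState.DEAD_LETTERED}


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 300.0) -> float:
    """Exponential backoff: 2s, 4s, 8s, ... capped (here at 5 minutes)."""
    return min(cap, base * (2 ** attempt))


def next_state(attempt: int, max_retries: int, succeeded: bool) -> RunState:
    if succeeded:
        return RunState.SUCCEEDED
    if attempt >= max_retries:
        # Exhausted retries: hand off to the dead-letter queue.
        return RunState.DEAD_LETTERED
    return RunState.RUNNING  # retried after backoff_delay(attempt)


schedule = [backoff_delay(a) for a in range(5)]  # [2.0, 4.0, 8.0, 16.0, 32.0]
```

Because every failure path ends in an explicit terminal state, a job can never hang in an ambiguous in-progress state: it either succeeds or lands in the DLQ with its full retry history.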

Impact


Significantly reduced pipeline failure rate by introducing idempotent execution and structured retry policies

Reduced connector integration time by over 60% through the standardized connector framework — new integrations went from days to hours

Enabled enterprise deployments with isolated execution environments, unlocking a new customer segment with strict data governance requirements

Improved deployment reliability and environment isolation by converting manual deployment steps into automated Kubernetes manifests

Introduced structured observability (Prometheus/Grafana) where previously only raw logs existed — teams could now identify and resolve incidents in minutes instead of hours

Tech Stack

Backend

Python · FastAPI · Django · GraphQL · Celery · REST APIs

Infrastructure

Kubernetes · Docker · AWS · AWS VPC · AWS IAM · Prometheus · Grafana

Data

Redis · PostgreSQL

AI / ML

LLM Integrations · AI Workflow Orchestration · Model Pipeline Execution