High-Level Open Scheduler Workflow

System Design · sequence diagram · unknown license

Illustrates the high-level workflow of an open scheduler, from client job submission to resource reservation, instance provisioning, and status updates.

Source: https://github.com/cloudnesia/open-scheduler/blob/0450749fec812206b2560434c70c1b17118a9650/README/workflow.md
Curated by cloudnesia
Scheduler, Job Management, Cloud Orchestration, System Design, Workflow, Distributed Systems, Resource Management

Mermaid source

sequenceDiagram
  participant Client
  participant ControlPlane
  participant Scheduler
  participant DB
  participant NodeAgent
  participant LocalQueue
  participant ProviderDriver
  participant Provider

  Client->>ControlPlane: SubmitJob(jobSpec)
  ControlPlane->>DB: INSERT job status = Pending
  ControlPlane->>Scheduler: notify pending job
  Scheduler->>DB: SELECT nodes snapshot
  Scheduler->>DB: BEGIN TX and reserve node resources
  Scheduler->>DB: UPDATE job.assigned_node and set job.status = Reserved
  Scheduler->>DB: CREATE binding record with status = Reserved
  Note right of Scheduler: generate stable idempotency_token (e.g. job-123-order-1)
  Scheduler->>NodeAgent: Order(order_id, job_id, idempotency_token, spec)
  NodeAgent->>LocalQueue: enqueue Order (persist token)
  LocalQueue->>ProviderDriver: dequeue -> CHECK existing instance by idempotency_token
  alt existing instance found
    ProviderDriver-->>LocalQueue: return existing_instance_id
  else no existing instance
    ProviderDriver->>Provider: POST /1.0/instances (include metadata idempotency_token)
    Provider->>ProviderDriver: returns operation_id
    ProviderDriver->>LocalQueue: persist mapping op_id <-> order_id
    LocalQueue->>ProviderDriver: poll /1.0/operations/op_id until done
    Provider->>ProviderDriver: operation done (success + instance_name)
    ProviderDriver-->>LocalQueue: return created_instance_id
  end
  LocalQueue->>NodeAgent: OrderStatus(order_id, job_id, instance_id, state = Created)
  NodeAgent->>ControlPlane: OrderStatus (via NodeStream)
  ControlPlane->>DB: UPDATE binding status = Created and set instance id
  ControlPlane->>DB: UPDATE job.status = Running
  ControlPlane->>Client: JobStatus(Running, node = ...)

What this diagram shows

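This diagram traces a job end to end through an open scheduler. The client submits a job to the control plane, which persists it as Pending and notifies the scheduler. The scheduler reserves node resources in a database transaction, records a binding, and sends the node agent an order carrying a stable idempotency token. The node agent's local queue hands the order to the provider driver, which either finds an existing instance for that token or creates one through the provider and polls the resulting operation until it completes. The order status then flows back through the control plane, which marks the job as Running and notifies the client.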
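The idempotent provisioning branch (the `alt` block in the diagram) can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `FakeProvider`, `find_by_token`, `create_instance`, and `provision` are hypothetical names, and an in-memory dictionary stands in for the real provider API.

```python
import uuid


class FakeProvider:
    """In-memory stand-in for the cloud provider (hypothetical, for illustration)."""

    def __init__(self):
        self.instances = {}  # instance_id -> metadata dict

    def find_by_token(self, token):
        # Return the id of an instance already tagged with this token, if any.
        for instance_id, metadata in self.instances.items():
            if metadata.get("idempotency_token") == token:
                return instance_id
        return None

    def create_instance(self, spec, token):
        # Store the token in the instance metadata so later lookups can find it.
        instance_id = f"inst-{uuid.uuid4().hex[:8]}"
        self.instances[instance_id] = {"idempotency_token": token, "spec": spec}
        return instance_id


def provision(provider, spec, idempotency_token):
    """Create an instance at most once per idempotency token."""
    existing = provider.find_by_token(idempotency_token)
    if existing is not None:
        return existing  # "existing instance found" branch
    return provider.create_instance(spec, idempotency_token)


provider = FakeProvider()
first = provision(provider, {"image": "example"}, "job-123-order-1")
retry = provision(provider, {"image": "example"}, "job-123-order-1")
```

Here `retry` resolves to the same instance as `first`, so a redelivered order does not create a second instance. A real driver would also need to handle two concurrent requests with the same token, which this sketch ignores.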

When to use it

Use this diagram when designing a distributed job scheduling system, studying resource orchestration workflows, or debugging job lifecycle issues in a cloud-native environment. It shows which component performs each status change (the control plane sets Pending and Running, the scheduler sets Reserved), which helps locate where a stuck job stopped progressing.
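For lifecycle debugging, one option is to check a job's recorded status history against the transitions the diagram shows. The sketch below assumes only the success path from the diagram (Pending, Reserved, Running); the names `ALLOWED_TRANSITIONS` and `first_invalid_transition` are hypothetical, and a real system would likely have additional states such as failures or cancellations.

```python
# Job statuses as they appear in the diagram, with the next status each may move to.
ALLOWED_TRANSITIONS = {
    "Pending": {"Reserved"},
    "Reserved": {"Running"},
    "Running": set(),
}


def first_invalid_transition(history):
    """Return the first (from, to) pair the diagram does not show, or None."""
    for current, following in zip(history, history[1:]):
        if following not in ALLOWED_TRANSITIONS.get(current, set()):
            return (current, following)
    return None
```

For example, `first_invalid_transition(["Pending", "Running"])` flags the pair `("Pending", "Running")`, which would indicate a job that skipped the reservation step.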

How to adapt it for your project

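This workflow can be adapted by swapping in a more sophisticated scheduling algorithm, integrating other cloud providers behind the provider driver, or adding error handling and retry mechanisms. The diagram shows only the success path, so failure branches (for example, a failed provider operation or a reservation that cannot be satisfied) would need to be added. You could also extend it with pre-emption, scaling, or cost optimization steps.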
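As one example of a retry mechanism, the "poll until done" step could be bounded and given exponential backoff. This is a sketch under assumptions: `poll_until_done` and its parameters are hypothetical, and `get_status` stands in for whatever call checks the provider operation, returning `"running"` or `"done"`.

```python
import time


def poll_until_done(get_status, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call get_status() until it reports "done", backing off exponentially.

    Returns the number of polls it took. Raises TimeoutError if the
    operation is still not done after max_attempts polls.
    """
    for attempt in range(max_attempts):
        if get_status() == "done":
            return attempt + 1
        if attempt < max_attempts - 1:
            # Wait base_delay, then 2x, 4x, ... before the next poll.
            sleep(base_delay * (2 ** attempt))
    raise TimeoutError(f"operation not done after {max_attempts} polls")
```

Passing `sleep` as a parameter keeps the function testable without real delays. On `TimeoutError`, the caller could report a failed order status instead of leaving the job in Reserved indefinitely.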

Key concepts

Distributed Job Scheduling
Resource Reservation
Idempotent Provisioning
Control Plane
Node Agent
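
The resource reservation step (reserve node resources, update the job, and create the binding in one transaction) might look something like the sketch below. The schema, the `reserve` function, and the use of SQLite are all assumptions made for illustration; the source does not specify the database or its tables.

```python
import sqlite3


def reserve(conn, job_id, node_id, cpus):
    """Reserve CPUs on a node and bind the job to it in one transaction.

    Returns True on success, False if the node lacks free capacity.
    """
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            cur = conn.execute(
                "UPDATE nodes SET free_cpus = free_cpus - ? "
                "WHERE id = ? AND free_cpus >= ?",
                (cpus, node_id, cpus),
            )
            if cur.rowcount != 1:
                raise LookupError("insufficient capacity")
            conn.execute(
                "UPDATE jobs SET assigned_node = ?, status = 'Reserved' WHERE id = ?",
                (node_id, job_id),
            )
            conn.execute(
                "INSERT INTO bindings (job_id, node_id, status) "
                "VALUES (?, ?, 'Reserved')",
                (job_id, node_id),
            )
        return True
    except LookupError:
        return False


# Hypothetical schema and sample rows for the sketch.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nodes (id TEXT PRIMARY KEY, free_cpus INTEGER NOT NULL);
    CREATE TABLE jobs (id TEXT PRIMARY KEY, assigned_node TEXT, status TEXT NOT NULL);
    CREATE TABLE bindings (job_id TEXT, node_id TEXT, status TEXT NOT NULL);
    INSERT INTO nodes VALUES ('node-a', 4);
    INSERT INTO jobs VALUES ('job-123', NULL, 'Pending');
    INSERT INTO jobs VALUES ('job-456', NULL, 'Pending');
    """
)
```

The conditional `UPDATE` makes the capacity check and the decrement a single statement, so a node cannot be oversubscribed between reading the snapshot and reserving. With the sample data, reserving 3 CPUs for `job-123` succeeds, and a second request for 3 CPUs on the same node is refused without changing any rows.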