Training Intelligent Agents on OCI Agent Hub with Microsoft Agent Lightning


Copyright: Sanjay Basu

How We Built a Self-Improving Supply Chain Intelligence System Using MCP Servers and Oracle Database 26ai

I have spent the better part of three decades watching enterprise AI evolve from rule-based expert systems to the agentic architectures we deploy today. But something fundamentally shifted when Microsoft released Agent Lightning. For the first time, we had a framework that let us train agents using reinforcement learning without ripping apart the agent code we had already written. When I saw the potential to combine this with Oracle’s new Agent Hub platform and the vector capabilities in Database 26ai, I knew we had the ingredients for something genuinely useful.

This article walks through how my team at Oracle Cloud Engineering built a supply chain intelligence agent that learns from its own performance. We deployed it on OCI’s managed LangChain and LangGraph platform, connected it to enterprise data through native MCP servers, and used Oracle Database 26ai as both the knowledge store and the execution environment. The agent now handles complex procurement queries that used to require senior analysts, and it gets measurably better every week.

The Problem We Needed to Solve

Our customer, a multinational manufacturing company, had a procurement team drowning in data. They operated across 47 countries, managed relationships with over 12,000 suppliers, and processed roughly 2.3 million purchase orders annually. Their existing BI dashboards provided retrospective analysis, but what they actually needed was an intelligent assistant that could answer questions like:

“Which suppliers in Southeast Asia have the capacity to absorb a 40% increase in semiconductor orders if our primary vendor in Taiwan experiences disruption?”

“What is the total cost impact if we shift our rare earth sourcing from China to Australia, accounting for shipping, tariffs, quality variance, and lead time changes?”

These questions required the system to pull data from multiple sources, perform calculations, apply business logic, and synthesize an answer that a procurement director could act on immediately. A static RAG pipeline was not going to cut it. We needed an agent that could reason, use tools, and learn from feedback.

Why Agent Lightning Changed Our Approach

Before Agent Lightning, improving an agent meant manual prompt engineering. We would run the agent, review failures, tweak the prompts, and repeat. This worked for simple cases but became unsustainable as agent complexity grew. With a multi-step agent that might invoke five or six tools before producing an answer, tracking down why it made a bad decision required forensic analysis of every intermediate step.

Agent Lightning introduced a different paradigm. Instead of manually debugging prompts, we could define a reward function and let the framework optimize the agent automatically. The framework supports multiple algorithms. Automatic Prompt Optimization uses an LLM to generate critiques and rewrite prompts based on performance patterns. Reinforcement Learning through the VERL backend fine-tunes the underlying model weights. Supervised Fine-Tuning learns from high-reward trajectories.

What made Agent Lightning particularly attractive for our OCI deployment was its framework-agnostic design. We had already built our agent using LangGraph because the cyclic workflow suited the iterative nature of supply chain analysis. Agent Lightning did not require us to rewrite that code. We added a decorator, inserted a few emit statements, and suddenly our existing agent was optimizable.

Architecture Overview

The system we built has four primary layers, each running on OCI infrastructure designed for the specific workload characteristics.

The Agent Layer on OCI Agent Hub

OCI Agent Hub provides managed infrastructure for LangChain and LangGraph applications. We did not have to provision Kubernetes clusters or manage container orchestration for the agent runtime. The platform handles scaling, load balancing, and fault tolerance. Our LangGraph agent runs as a managed service with automatic restarts and health monitoring.

The agent itself implements a ReAct-style loop with specialized nodes for different reasoning phases. The planner node breaks down complex queries into subtasks. The retriever node fetches relevant context from Oracle Database 26ai using vector similarity search. The calculator node performs financial modeling. The validator node checks results against business rules. The synthesizer node produces the final response.

The MCP Server Layer

Model Context Protocol servers provide the bridge between our agent and enterprise systems. MCP is an open standard that Anthropic developed for connecting AI models to external tools and data sources. We implemented three MCP servers for this project.

The first MCP server connects to Oracle Database 26ai and exposes both SQL query capabilities and vector search functions. When the agent needs to find suppliers matching certain criteria, it calls the MCP server with a natural language description, and the server translates that into an optimized SQL query using the database’s built-in AI capabilities.

The second MCP server interfaces with the company’s SAP system for real-time inventory and order data. This required careful attention to authentication and rate limiting since SAP APIs have strict governance requirements.

The third MCP server connects to external market data feeds for commodity pricing, shipping rates, and tariff schedules. This data changes daily, so the server implements caching with appropriate TTLs.
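The caching itself can be very small. Here is a minimal sketch of the TTL pattern that server uses; the class and its interface are illustrative rather than the production implementation:

```python
import time
from typing import Any, Callable

class TTLCache:
    """Minimal TTL cache for daily-changing market data (illustrative)."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, Any]] = {}

    def get_or_fetch(self, key: str, fetch: Callable[[], Any]) -> Any:
        now = time.monotonic()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]           # fresh cache hit, skip the upstream call
        value = fetch()             # miss or stale entry: refetch
        self._store[key] = (now, value)
        return value
```

An MCP tool for commodity pricing would then wrap its upstream API call in get_or_fetch, keyed by commodity symbol, with a TTL matched to the feed's update cadence.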

The Training Layer

Agent Lightning training runs on OCI GPU instances. We use VM.GPU.A10.1 shapes for Automatic Prompt Optimization since that algorithm primarily needs inference capability. For full reinforcement learning runs, we scale up to VM.GPU.A100.4 shapes to handle the policy gradient computations efficiently.

The training jobs run as Kubernetes batch workloads on OKE. We configured the cluster with node pool autoscaling so GPU nodes spin up only when training is active. This reduced our GPU costs by roughly 70% compared to keeping instances running continuously.

The Data Layer with Oracle Database 26ai

Oracle Database 26ai serves as the backbone of this system. The release introduced native vector search capabilities that eliminate the need for separate vector databases. We store embeddings directly alongside relational data, which means a single query can combine semantic similarity matching with traditional SQL filtering.

For our supply chain use case, we embedded supplier capability descriptions, product specifications, historical performance reports, and contract terms. When the agent searches for suppliers capable of handling increased semiconductor orders, the vector search retrieves semantically relevant candidates while SQL filters enforce hard constraints like certification status and geographic location.
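For illustration, here is a plausible shape for the suppliers table, with column names matching the queries shown later. The VECTOR column type and the 768-dimension, FLOAT32 choices are assumptions for this sketch rather than a verbatim copy of our schema:

```python
# Hypothetical DDL for the suppliers table. The embedding lives in the
# same row as the relational attributes, so one query can combine
# VECTOR_DISTANCE ranking with ordinary SQL predicates.
SUPPLIERS_DDL = """
CREATE TABLE suppliers (
    supplier_id          VARCHAR2(32) PRIMARY KEY,
    supplier_name        VARCHAR2(200),
    primary_region       VARCHAR2(64),
    annual_capacity      NUMBER,
    lead_time_days       NUMBER,
    quality_score        NUMBER(4, 2),
    capability_text      CLOB,
    capability_embedding VECTOR(768, FLOAT32)
)
"""
```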

The database also provides the execution environment for complex analytical queries. Oracle’s optimizer handles multi-table joins across billions of rows efficiently, which matters when we need to calculate cost impacts across the entire supply network.

Implementation Details

Instrumenting the LangGraph Agent

The core agent implementation uses LangGraph’s StateGraph to manage the workflow. Here is a simplified version of how we structured the main agent with Agent Lightning instrumentation.

import agentlightning as agl
from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI
from typing import TypedDict, Annotated
import operator

class SupplyChainState(TypedDict):
    query: str
    plan: list[str]
    retrieved_context: list[dict]
    calculations: dict
    validation_status: str
    final_answer: str
    messages: Annotated[list, operator.add]

@agl.rollout
def supply_chain_agent(
    task: dict,
    prompt_template: agl.PromptTemplate
) -> float:
    
    llm = ChatOpenAI(model="gpt-4-turbo", temperature=0.1)
    
    # Build the graph
    workflow = StateGraph(SupplyChainState)
    
    # Add nodes
    workflow.add_node("planner", create_planner_node(llm, prompt_template))
    workflow.add_node("retriever", create_retriever_node())
    workflow.add_node("calculator", create_calculator_node())
    workflow.add_node("validator", create_validator_node())
    workflow.add_node("synthesizer", create_synthesizer_node(llm))
    
    # Define edges
    workflow.set_entry_point("planner")
    workflow.add_edge("planner", "retriever")
    workflow.add_edge("retriever", "calculator")
    workflow.add_edge("calculator", "validator")
    workflow.add_conditional_edges(
        "validator",
        route_after_validation,
        {"retry": "retriever", "complete": "synthesizer"}
    )
    workflow.add_edge("synthesizer", END)
    
    # Compile and run
    app = workflow.compile()
    
    initial_state = {
        "query": task["question"],
        "plan": [],
        "retrieved_context": [],
        "calculations": {},
        "validation_status": "",
        "final_answer": "",
        "messages": []
    }
    
    result = app.invoke(initial_state)
    
    # Emit intermediate steps for training analysis
    agl.emit_object({
        "plan": result["plan"],
        "retrieval_count": len(result["retrieved_context"]),
        "validation_status": result["validation_status"]
    })
    
    # Calculate reward
    reward = compute_reward(
        result["final_answer"],
        task.get("expected_answer"),
        task.get("expected_sources"),
        result
    )
    
    agl.emit_reward(reward)
    return reward

The @agl.rollout decorator tells Agent Lightning to capture this function's execution as a training episode. The prompt_template parameter gets injected automatically with the current optimized version. The emit_object and emit_reward calls create spans that the training algorithm uses to understand what happened during execution.
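The node factories and router referenced in the graph live elsewhere in our codebase. A minimal sketch of two of them, assuming the injected template exposes a format method and the LLM emits one plan step per line, looks like this:

```python
def create_planner_node(llm, prompt_template):
    """Hypothetical factory for the planner node. The prompt it formats
    is the template that Agent Lightning optimizes over training runs."""
    def planner(state: dict) -> dict:
        prompt = prompt_template.format(query=state["query"])
        response = llm.invoke(prompt)
        # One subtask per non-empty line of the model's plan
        plan = [line.strip() for line in response.content.splitlines() if line.strip()]
        return {"plan": plan, "messages": [response]}
    return planner

def route_after_validation(state: dict) -> str:
    """Route failed validations back through retrieval; both 'passed'
    and 'passed_with_warnings' proceed to the synthesizer."""
    return "complete" if state["validation_status"].startswith("passed") else "retry"
```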

Building the Oracle MCP Server

The MCP server for Oracle Database 26ai required careful design to expose the right level of abstraction. We did not want the agent constructing raw SQL since that would bypass the database’s AI query capabilities. Instead, we exposed high-level functions that the database could optimize internally.

from mcp.server.fastmcp import FastMCP
import oracledb

server = FastMCP("oracle-supply-chain")

@server.tool()
async def search_suppliers(
    capability_description: str,
    min_capacity_units: int = 0,
    regions: list[str] | None = None,
    certifications: list[str] | None = None,
    max_results: int = 10
) -> list[dict]:
    """
    Search for suppliers matching capability requirements.
    Uses vector similarity on capability descriptions combined
    with relational filters on capacity, region, and certifications.
    """
    
    async with get_oracle_connection() as conn:
        cursor = conn.cursor()
        
        # Generate embedding for the search query
        embedding = await generate_embedding(capability_description)
        
        # Build the query using Oracle 26ai vector search
        query = """
        SELECT 
            s.supplier_id,
            s.supplier_name,
            s.primary_region,
            s.annual_capacity,
            s.lead_time_days,
            s.quality_score,
            VECTOR_DISTANCE(s.capability_embedding, :query_vector, COSINE) as similarity
        FROM suppliers s
        WHERE s.annual_capacity >= :min_capacity
        """
        
        params = {
            "query_vector": embedding,
            "min_capacity": min_capacity_units
        }
        
        if regions:
            # Binding a Python list to TABLE(:regions) requires a SQL
            # collection type (for example SYS.ODCIVARCHAR2LIST)
            query += " AND s.primary_region IN (SELECT column_value FROM TABLE(:regions))"
            params["regions"] = regions
            
        if certifications:
            query += """
            AND EXISTS (
                SELECT 1 FROM supplier_certifications sc
                WHERE sc.supplier_id = s.supplier_id
                AND sc.certification_type IN (SELECT column_value FROM TABLE(:certs))
            )
            """
            params["certs"] = certifications
        
        query += """
        ORDER BY similarity
        FETCH FIRST :max_results ROWS ONLY
        """
        params["max_results"] = max_results
        
        await cursor.execute(query, params)
        rows = await cursor.fetchall()
        
        results = []
        for row in rows:
            results.append({
                "supplier_id": row[0],
                "name": row[1],
                "region": row[2],
                "capacity": row[3],
                "lead_time": row[4],
                "quality_score": row[5],
                "relevance": 1 - row[6]  # Convert distance to similarity
            })
        
        return results

@server.tool()
async def calculate_sourcing_impact(
    current_supplier_id: str,
    alternative_supplier_id: str,
    annual_volume: int,
    product_category: str
) -> dict:
    """
    Calculate the total cost impact of switching between suppliers.
    Accounts for unit pricing, shipping, tariffs, quality variance,
    and lead time carrying costs.
    """
    
    async with get_oracle_connection() as conn:
        cursor = conn.cursor()
        
        # Use Oracle's analytical functions for complex calculations
        query = """
        WITH cost_comparison AS (
            SELECT
                cs.unit_price as current_price,
                alt.unit_price as alt_price,
                cs.shipping_cost_per_unit as current_shipping,
                alt.shipping_cost_per_unit as alt_shipping,
                cs.tariff_rate as current_tariff,
                alt.tariff_rate as alt_tariff,
                cs.avg_defect_rate as current_defect,
                alt.avg_defect_rate as alt_defect,
                cs.lead_time_days as current_lead,
                alt.lead_time_days as alt_lead,
                :volume as annual_volume,
                0.12 as carrying_cost_rate,  -- 12% annual carrying cost
                50 as daily_demand  -- Average daily demand
            FROM supplier_pricing cs, supplier_pricing alt
            WHERE cs.supplier_id = :current_id
            AND alt.supplier_id = :alt_id
            AND cs.product_category = :category
            AND alt.product_category = :category
        )
        SELECT
            annual_volume * (alt_price - current_price) as unit_price_impact,
            annual_volume * (alt_shipping - current_shipping) as shipping_impact,
            annual_volume * alt_price * (alt_tariff - current_tariff) as tariff_impact,
            annual_volume * alt_price * (alt_defect - current_defect) as quality_impact,
            (alt_lead - current_lead) * daily_demand * alt_price * 
                (carrying_cost_rate / 365) as lead_time_impact
        FROM cost_comparison
        """
        
        await cursor.execute(query, {
            "current_id": current_supplier_id,
            "alt_id": alternative_supplier_id,
            "category": product_category,
            "volume": annual_volume
        })
        
        row = await cursor.fetchone()
        
        if not row:
            return {"error": "Could not find pricing data for specified suppliers"}
        
        total_impact = sum(row)
        
        return {
            "unit_price_impact": float(row[0]),
            "shipping_impact": float(row[1]),
            "tariff_impact": float(row[2]),
            "quality_impact": float(row[3]),
            "lead_time_impact": float(row[4]),
            "total_annual_impact": float(total_impact),
            "recommendation": "favorable" if total_impact < 0 else "unfavorable"
        }

The MCP server exposes these functions as tools that the LangGraph agent can invoke. The agent does not need to understand Oracle-specific syntax or optimization hints. It simply describes what it needs, and the MCP server handles the translation.

Designing the Reward Function

The reward function determines what the agent learns to optimize. We spent considerable time getting this right because poorly designed rewards lead to agents that game the metric instead of solving the actual problem.

Our reward function has four components weighted by business priority.

def compute_reward(
    answer: str,
    expected_answer: str | None,
    expected_sources: list[str] | None,
    execution_state: dict
) -> float:
    
    reward = 0.0
    
    # Correctness (40% weight)
    # If we have a ground truth, measure semantic similarity
    if expected_answer:
        correctness = semantic_similarity(answer, expected_answer)
        reward += 0.4 * correctness
    else:
        # For production queries without ground truth,
        # check if the answer is well-formed and cites sources
        well_formed = check_answer_structure(answer)
        reward += 0.4 * well_formed
    
    # Source quality (25% weight)
    # Did the agent use authoritative sources?
    if expected_sources:
        source_overlap = compute_source_overlap(
            execution_state.get("retrieved_context", []),
            expected_sources
        )
        reward += 0.25 * source_overlap
    else:
        # Check that sources are from trusted systems
        source_trust = compute_source_trust_score(
            execution_state.get("retrieved_context", [])
        )
        reward += 0.25 * source_trust
    
    # Efficiency (20% weight)
    # Penalize unnecessary tool calls and retrieval loops
    plan_length = len(execution_state.get("plan", []))
    retrieval_count = len(execution_state.get("retrieved_context", []))
    
    efficiency = 1.0
    if plan_length > 5:
        efficiency -= 0.1 * (plan_length - 5)
    if retrieval_count > 20:
        efficiency -= 0.05 * (retrieval_count - 20)
    
    reward += 0.2 * max(0, efficiency)
    
    # Business rule compliance (15% weight)
    # Did the agent follow required constraints?
    validation_status = execution_state.get("validation_status", "")
    if validation_status == "passed":
        reward += 0.15
    elif validation_status == "passed_with_warnings":
        reward += 0.10
    
    return min(reward, 1.0)

We validated this reward function by having domain experts score a sample of agent outputs manually. The computed rewards correlated strongly with expert judgments, which gave us confidence that optimizing this metric would produce genuinely better agents.
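That validation is simple to mechanize. A dependency-free Spearman rank correlation (ignoring ties, which is enough for a sanity check) lets us confirm that computed rewards track expert scores:

```python
def rank_correlation(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation without tie handling, for checking
    that computed rewards move with expert judgments."""
    def ranks(vs: list[float]) -> list[float]:
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # Classic Spearman formula: rho = 1 - 6 * sum(d^2) / (n (n^2 - 1))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Running this over the expert-scored sample and the corresponding computed rewards gives a single number to track as the reward function evolves.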

Training on OCI

We configured Agent Lightning to use Automatic Prompt Optimization for the initial training phase. APO works by running batches of queries, collecting the rewards, generating a textual critique of failure patterns, and rewriting the prompt to address those patterns.

import os

from agentlightning import Trainer
from agentlightning.algorithms import APOAlgorithm
from agentlightning.store import MongoDBStore

# Connect to MongoDB running on OCI
store = MongoDBStore(
    connection_string=os.environ["MONGODB_URI"],
    database="supply_chain_agent",
    collection_prefix="training_v1"
)

# Configure APO
algorithm = APOAlgorithm(
    optimizer_model="gpt-4-turbo",
    batch_size=20,
    max_iterations=100,
    temperature=0.7,
    gradient_prompt_template="""
    Analyze these supply chain agent executions and identify improvement opportunities.
    
    SUCCESSFUL EXECUTIONS:
    {successes}
    
    FAILED EXECUTIONS:
    {failures}
    
    CURRENT PROMPT TEMPLATE:
    {current_prompt}
    
    Focus on:
    1. Patterns in queries that the agent handles poorly
    2. Missing instructions that would prevent common errors
    3. Ambiguities in the prompt that lead to inconsistent behavior
    
    Provide specific, actionable suggestions for prompt improvements.
    """
)

trainer = Trainer(
    algorithm=algorithm,
    store=store,
    checkpoint_dir="/mnt/checkpoints"
)

# Load training data
train_data = load_supply_chain_queries("data/train.json")
val_data = load_supply_chain_queries("data/validation.json")

# Run training
result = trainer.fit(
    agent=supply_chain_agent,
    train_dataset=train_data,
    val_dataset=val_data,
    max_epochs=10
)

print(f"Final validation reward: {result.final_val_reward:.3f}")
print(f"Improvement over baseline: {result.improvement_percentage:.1f}%")

The training job runs on OKE with GPU node pools. We use Kubernetes Jobs rather than Deployments since training is a batch process with a defined end state. The job specification requests GPU resources and mounts persistent volumes for checkpoints and training data.

Results and Observations

After 100 APO iterations, our agent showed a 47% improvement in the composite reward metric. Breaking that down by component:

Correctness improved from 0.58 to 0.81. The optimized prompts included more specific instructions about handling ambiguous queries and edge cases like suppliers with incomplete data.

Source quality improved from 0.62 to 0.79. The agent learned to prioritize official contract data over historical performance reports when both were relevant.

Efficiency improved from 0.71 to 0.88. The agent reduced unnecessary retrieval loops by better planning upfront.

Business rule compliance improved from 0.65 to 0.92. The prompts now include explicit reminders about regulatory constraints and approval thresholds.

Beyond the metrics, the qualitative feedback from procurement analysts was positive. They reported that the agent’s answers required less manual verification and included more relevant context without being verbose.

Lessons Learned

Several insights emerged from this project that will inform future agent development work.

First, the reward function matters more than the algorithm. We spent two weeks refining our reward function before running serious training experiments. That investment paid off. Early versions of the reward function inadvertently incentivized the agent to retrieve excessive context because more sources correlated with higher trust scores. We fixed this by adding the efficiency penalty.

Second, MCP servers provide clean abstraction boundaries. By exposing database capabilities through well-designed tools rather than raw SQL access, we gave the agent appropriate leverage without overwhelming it with implementation details. The agent reasons about what data it needs, and the MCP server handles how to get it efficiently.

Third, Oracle Database 26ai’s integrated vector search simplified our architecture significantly. Earlier versions of this system used a separate vector database that required synchronization with the relational data. Having everything in one database meant one source of truth and one transaction boundary.

Fourth, Agent Lightning’s framework-agnostic approach let us iterate quickly. We changed our LangGraph workflow several times during development without touching the training infrastructure. The @rollout decorator and emit functions isolated the training concerns from the agent logic.

What Comes Next

We are currently running reinforcement learning experiments using the VERL backend. APO optimizes prompts, but RL can optimize the model weights themselves. For a complex multi-step agent like ours, RL might discover strategies that no prompt could elicit from the base model.

We are also exploring continuous learning. The system already collects feedback from analysts who mark answers as correct or incorrect. We plan to feed this feedback into ongoing training runs so the agent improves based on production usage patterns.
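One way to sketch that conversion step, with illustrative field names since the production feedback schema is more involved:

```python
def feedback_to_tasks(records: list[dict]) -> list[dict]:
    """Convert analyst thumbs-up/down records into reward-labeled
    tasks for the next training run. Field names are hypothetical."""
    tasks = []
    for rec in records:
        if rec.get("verdict") not in ("correct", "incorrect"):
            continue  # skip answers no analyst has reviewed yet
        tasks.append({
            "question": rec["query"],
            # Accepted answers become positive exemplars; rejected ones
            # contribute low-reward trajectories without a ground truth.
            "expected_answer": rec["answer"] if rec["verdict"] == "correct" else None,
            "reward_override": 1.0 if rec["verdict"] == "correct" else 0.0,
        })
    return tasks
```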

Finally, we want to extend this architecture to other enterprise domains. The combination of Agent Lightning for training, OCI Agent Hub for deployment, MCP servers for integration, and Oracle Database 26ai for data provides a reusable pattern. Supply chain was our first use case, but the same approach applies to financial analysis, customer service, compliance monitoring, and many other enterprise applications.

The era of static, manually tuned agents is ending. Self-improving agents that learn from their performance are now practical to build and deploy. The infrastructure finally caught up with the vision.

Appendix

1. Core Workflow Diagram

This flowchart shows the main components and their interactions.

Fig 1

2. Runtime Sequence Diagram

This sequence diagram shows the temporal flow of a single query execution and the training feedback loop.


3. Training State Machine

This state diagram shows the APO training loop states and transitions.


4. Simplified Agent Flow

A minimal version showing just the core agent workflow for presentations.


5. Component Interaction Matrix

A simple diagram showing which components interact with which.


Sanjay Basu, PhD, is Senior Director of GPU and Gen AI Solutions at Oracle Cloud Engineering, where he leads strategic initiatives in AI infrastructure and enterprise agent development.
