Introduction.
Artificial Intelligence (AI) is moving beyond single-task chatbots and into a future where multiple smart agents work together—and learn from their experiences. This new wave of AI is powered by Multi-Agent Large Language Models (LLMs) and Reinforcement Learning (RL). Let’s break down what this means, and why it matters for everyone.
What Are Multi-Agent LLMs?
If you’ve ever chatted with an AI like ChatGPT or Google Gemini, you’ve experienced a single “agent” at work. But imagine if you had a whole team of AI experts—each with a different specialty—collaborating to answer your questions or solve your problems.
That’s what Multi-Agent LLMs are: several AI “personalities” (like a general doctor, a specialist, and a risk manager) working together. They can ask each other questions, give advice, and debate the best answer—just like a real-world panel of experts.
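To make that concrete, here's a tiny, mocked sketch of the idea (the helper and canned reply are placeholders, not a real API client): the same question is wrapped with a different system prompt per personality, and that alone turns one LLM into a panel of advisors. The tutorial further below does the same thing against a real chat API.
# Mocked sketch: one question, three role-conditioned "agents".
# mock_chat stands in for a real chat-completions call.
question = "Patient has fever and a cough. What should we do first?"
personalities = ["General doctor", "Specialist", "Risk manager"]

def mock_chat(messages):
    # A real client would send these messages to an LLM; here we return a canned reply.
    system_prompt = messages[0]["content"]
    return f"({system_prompt}) My one-sentence recommendation for this case..."

for p in personalities:
    messages = [
        {"role": "system", "content": f"You are a {p}. Answer in one sentence."},
        {"role": "user", "content": question},
    ]
    print(mock_chat(messages))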
What Is Reinforcement Learning?
Reinforcement Learning (RL) is how AI learns by doing. The AI agent tries actions, gets feedback (rewards for good decisions, penalties for mistakes), and gradually figures out the smartest way to act. It’s like how we learn new skills—trial and error, over many attempts.
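Here's a toy example of that trial-and-error loop, separate from the tutorial below: a simple "bandit" with made-up success probabilities, where the agent keeps a running value estimate per action and gradually prefers the one that pays off most. Everything in it (the probabilities, the 10% exploration rate) is illustrative.
import random

# Toy trial-and-error learner: three actions with hidden success probabilities.
true_success = [0.2, 0.5, 0.8]   # made-up probabilities for illustration
values = [0.0, 0.0, 0.0]         # the agent's estimate of each action's value
counts = [0, 0, 0]

for trial in range(1000):
    # Explore sometimes, otherwise exploit the current best estimate.
    if random.random() < 0.1:
        action = random.randrange(3)
    else:
        action = max(range(3), key=lambda a: values[a])
    reward = 1.0 if random.random() < true_success[action] else 0.0
    counts[action] += 1
    # Incremental average: nudge the estimate toward the observed reward.
    values[action] += (reward - values[action]) / counts[action]

print("Learned values:", [round(v, 2) for v in values])  # action 2 should look best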
Why Combine Them?
When you combine the “brainpower” of multiple LLM agents with RL’s ability to learn from experience, you get something powerful:
- The AI agent learns to use advice from different experts, not just rely on one.
- Over time, it gets better at making complex decisions, whether it's diagnosing patients, handling business workflows, or answering tough questions.
- The teamwork approach makes the system more robust, explainable, and safe.
A Real Example
In a recent AI project, we trained an agent to diagnose patient cases. It didn’t just rely on one answer—instead, it asked three LLM advisors (each playing a different medical role) for opinions, then decided what to do. As it learned from rewards and mistakes, its accuracy went up. That’s the magic of next-gen AI: collaborative, continuously learning, and smarter with every step.
Tutorial.
Code:
import numpy as np
import random
import keras
from keras import layers
from keras.optimizers import Adam
from keras.models import Model
from sentence_transformers import SentenceTransformer
import matplotlib.pyplot as plt
import requests
import time
import tensorflow as tf  # Keras 3 on the TensorFlow backend is assumed (tf.GradientTape is used below)
# -------- Groq API Config --------
USE_REAL_LLM = True # Set False for mock/test
GROQ_API_KEY = "Use your key"
ENDPOINT = "https://api.groq.com/openai/v1/chat/completions"
MODEL = "llama3-70b-8192" # Or "llama3-8b-8192"
N_EPISODES = 10 # Lower for demo, increase for more training
EMBED_DIM = 32
N_ADVISORS = 3
N_ACTIONS = 5
GAMMA = 0.99
# ---- Synthetic Patient Dataset ----
# Each case is (fever, cough, risk_factor, true_diagnosis)
patient_cases = [
    (1, 1, 1, 'flu'),
    (1, 0, 0, 'cold'),
    (0, 1, 0, 'cold'),
    (1, 1, 0, 'flu'),
    (1, 0, 1, 'flu'),
    (0, 1, 1, 'flu'),
    (0, 0, 0, 'cold'),
    (1, 0, 0, 'cold'),
    (0, 0, 1, 'cold'),
]
def sample_case():
    return random.choice(patient_cases)
# ---- Real Groq LLM API Adapter ----
def query_llm_groq(prompt, personality_name):
    if not USE_REAL_LLM:
        if personality_name == "Internist":
            return "Stepwise testing is safest; treat if strong evidence only."
        elif personality_name == "Specialist":
            return "Rule out severe cases, do broad diagnostics."
        elif personality_name == "Generalist":
            return "Prioritize patient comfort and minimal intervention."
        else:
            return "No specific advice."
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {GROQ_API_KEY}"
    }
    system_prompt = f"You are a {personality_name} medical advisor. Return a one-sentence actionable recommendation for the case."
    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ],
        "max_tokens": 50,
        "temperature": 0.3,
        "n": 1
    }
    for attempt in range(3):  # Retry on error
        try:
            response = requests.post(ENDPOINT, headers=headers, json=payload, timeout=20)
            if response.status_code == 200:
                out = response.json()
                return out['choices'][0]['message']['content'].strip()
            else:
                print(f"Groq LLM error code {response.status_code}, retrying...")
                time.sleep(2)
        except Exception as e:
            print(f"Groq Exception: {e}, retrying...")
            time.sleep(2)
    return "[LLM Error or Timeout]"
# --- LLM Advisors (Groq + Llama3, with negotiation) ---
def get_advisors(state, prev_advices=None):
    personalities = ["Internist", "Specialist", "Generalist"]
    advices = []
    for personality in personalities:
        prompt = f"Patient symptoms: fever={state[0]}, cough={state[1]}, risk factors={state[2]}."
        if prev_advices:
            prompt += f" Previous advisor opinions: {' | '.join(prev_advices)}"
            prompt += " Revise or comment if needed."
        advice = query_llm_groq(prompt, personality)
        advices.append(advice)
    return advices
# --- Patient Environment ---
class PatientEnv:
    def reset(self):
        fever, cough, risk, diag = sample_case()
        self.state = [fever, cough, risk]
        self.true_diagnosis = diag
        return np.array(self.state, dtype=np.float32), diag

    def step(self, action):
        reward = 0
        done = False
        if action == 0:      # order test
            reward = -2
        elif action == 1:    # diagnose cold
            if self.true_diagnosis == 'cold': reward = 10; done = True
            else: reward = -10; done = True
        elif action == 2:    # diagnose flu
            if self.true_diagnosis == 'flu': reward = 10; done = True
            else: reward = -10; done = True
        elif action == 3:    # prescribe
            reward = -2
        elif action == 4:    # refer
            reward = 0; done = True
        else:
            reward = -5
        return reward, done
# --- Embedding Model ---
embedder = SentenceTransformer('all-MiniLM-L6-v2')
def embed_sentences(sentences):
    arr = embedder.encode(sentences)
    # MiniLM returns 384-dim embeddings; truncate to EMBED_DIM to keep the demo network small
    if arr.shape[1] > EMBED_DIM:
        arr = arr[:, :EMBED_DIM]
    return arr
# --- Keras 3 RL Policy Network with Attention ---
def build_policy_network(state_dim, emb_dim, n_advisors, n_actions):
    state_in = keras.Input(shape=(state_dim,), name="state")
    advisor_emb_in = keras.Input(shape=(n_advisors, emb_dim), name="advisor_emb")
    # Score each advisor embedding, then soft-attend over advisors
    x = layers.TimeDistributed(layers.Dense(emb_dim, activation='relu'))(advisor_emb_in)
    attn_scores = layers.TimeDistributed(layers.Dense(1))(x)
    attn_scores_flat = layers.Flatten()(attn_scores)
    attn_weights = layers.Activation('softmax', name='attn_weights')(attn_scores_flat)
    attn_weights_exp = layers.Reshape((n_advisors, 1))(attn_weights)
    advisor_context = layers.Dot(axes=1)([attn_weights_exp, x])
    advisor_context = layers.Flatten()(advisor_context)
    concat = layers.Concatenate()([state_in, advisor_context])
    dense = layers.Dense(64, activation='relu')(concat)
    out = layers.Dense(n_actions, activation='softmax')(dense)
    model = keras.Model([state_in, advisor_emb_in], out)
    # Separate model for inspecting the attention weights
    attn_model = keras.Model([state_in, advisor_emb_in], attn_weights)
    return model, attn_model
# --- Training Loop: REINFORCE Policy Gradient ---
env = PatientEnv()
policy_net, attn_model = build_policy_network(3, EMBED_DIM, N_ADVISORS, N_ACTIONS)
optimizer = Adam(learning_rate=1e-3)
reward_history = []
for episode in range(N_EPISODES):
    state, diag = env.reset()
    episode_logprobs = []
    episode_rewards = []
    done = False
    step = 0
    while not done:
        # Advisors: two-round negotiation (initial opinions, then revision after seeing the others)
        advices = get_advisors(state)
        advices = get_advisors(state, advices)
        advisor_embs = embed_sentences(advices)
        advisor_embs = advisor_embs[np.newaxis, ...]
        state_batch = state[np.newaxis, ...]
        # Policy step: sample an action from the network's output probabilities
        probs = policy_net([state_batch, advisor_embs]).numpy()[0]
        action = np.random.choice(N_ACTIONS, p=probs)
        # Log-prob for policy gradient
        logprob = np.log(probs[action] + 1e-8)
        episode_logprobs.append(logprob)
        # Step in env
        reward, done = env.step(action)
        episode_rewards.append(reward)
        step += 1
    # --- Policy gradient update (REINFORCE) ---
    returns = []
    G = 0
    for r in reversed(episode_rewards):
        G = r + GAMMA * G
        returns.insert(0, G)
    returns = np.array(returns, dtype=np.float32)
    if len(returns) > 1:  # normalize; skipped for one-step episodes, where the std is zero
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Policy loss (last step only, for demo), reusing the final state/advice batch from the rollout
    with tf.GradientTape() as tape:
        probs = policy_net([state_batch, advisor_embs], training=True)[0]
        loss = -tf.math.log(probs[action] + 1e-8) * returns[-1]
    grads = tape.gradient(loss, policy_net.trainable_weights)
    optimizer.apply_gradients(zip(grads, policy_net.trainable_weights))
    reward_history.append(np.sum(episode_rewards))
    # Print logs
    if episode < 3 or episode % 5 == 0:
        action_names = ["Order test", "Diagnose cold", "Diagnose flu", "Prescribe", "Refer"]
        attn_vals = attn_model([state_batch, advisor_embs]).numpy()[0]
        top_advisor = np.argmax(attn_vals)
        print(f"\n--- Episode {episode} ---")
        print(f"Patient: fever={state[0]}, cough={state[1]}, risk={state[2]}, true_diag={diag}")
        for i, a in enumerate(advices):
            print(f"Advisor {i+1}: {a}")
        print(f"Agent chose: {action_names[action]} (Reward: {reward})")
        print(f"Attention: Advisor {top_advisor+1} most influential ({attn_vals[top_advisor]:.2f})")
# --- Plot reward vs. episode ---
plt.plot(reward_history)
plt.xlabel("Episode")
plt.ylabel("Reward")
plt.title("Keras 3 RL Agent + Groq Llama3 LLM Advisors: Reward vs. Episode")
plt.show()
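A few practical notes on running the script (these are assumptions about a typical setup, not requirements spelled out above): it needs Keras 3 on the TensorFlow backend, since the training step uses tf.GradientTape; the SentenceTransformer model is downloaded on first use; and a Groq API key is only needed when USE_REAL_LLM is True, otherwise the canned advisor replies are used. The package list below is inferred from the imports.
pip install numpy tensorflow keras sentence-transformers requests matplotlib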