
Training a Large Language Model (LLM) agent to master tool-use via Reinforcement Learning (RL) is often the "holy grail" of modern AI development. However, the road to convergence is usually paved with instability, timeouts, and multi-node errors.
At Tencent Youtu Lab, we tackled these challenges by polishing the Agent Lightning framework. Built on our modifications, we created a production-ready solution that lets you train your customized Youtu-Agent, together with its agent scaffold, end to end. It not only stabilizes RL training but also enables efficient, large-scale distributed learning across 128 GPUs. If you have struggled with entropy explosions or timeout loops when training agents on agent-lightning, this guide is for you. Here is how we transformed the training pipeline for our Youtu-Agent ecosystem to achieve 40% faster iterations and steady convergence.
We began by attempting to train 7B models on influential, high-complexity tasks like ReTool (Math/Code) and SearchR1 (local search). While using the official Agent Lightning (v0.2.2), we hit two critical walls that will be familiar to many ML engineers:
The standard setup struggled to scale. As we increased the number of agent runners, we ran into a "Timeout Nightmare": the framework's nested structure caused frequent timeouts at every level, from the agent instance up to the agl store, making high-concurrency training nearly impossible. On top of that, the LiteLLM proxy frequently crashed under load with "maximum connections reached" errors.
Even when the infrastructure held up, the math often didn't. We observed Catastrophic Divergence, where policy entropy would explode after just a few steps. The result? The agent's policy degenerated, producing repetitive, unreadable tokens and failing to call tools correctly.
To fix this, we didn't just patch the code; we re-architected the critical paths for production-level stability.
We turned the training backend into a resilient service capable of handling massive throughput.
● Production-Grade Server
The current implementation of agentlightning/verl/async_server.py does not build a real FastAPI/Uvicorn server and does not return full OpenAI-compatible responses. We therefore turn this minimal async wrapper into a production-grade service that is: 1) resilient to port collisions; 2) protected against context-length explosions (which the agent framework alone cannot handle properly); 3) configured for high concurrency; 4) compatible with both tool-enabled and tool-disabled scenarios; and 5) heavily instrumented with a debugging logger.
We replaced the minimal async wrapper with a full FastAPI/Uvicorn implementation, featuring dynamic port selection (preventing address collisions) and strict concurrency limits to avoid connection overloads.
First, we select a free port at run time instead of using a hard-coded one, which prevents "address already in use" errors in distributed Ray deployments.
Second, we add a lifespan context manager to the FastAPI server, so that Ray can restart a fresh actor if the process is killed.
Third, we expose the full Uvicorn configuration, with explicit concurrency and timeout settings that avoid "maximum number of open connections reached" errors under load, one of the most frequent problems in production-level server deployment.
Fourth, we modify the init_engine() function to follow the VeRL native implementation, so that tool-calling configs (e.g., tool_config_path and tool_parser) can be set and passed explicitly instead of relying on predefined placeholders.
Fifth, we address the overlong-rollout problem under the agent framework. The agent scaffold itself does not know the currently allowed maximum token length because of chat-template wrapping, so the natural choice is to enforce the length check on the chat-completion server side. We count tokens to infer the maximum number of newly generated tokens at each turn, which keeps all training jobs alive and avoids errors such as "input longer than model context" (a minimal sketch of this guard follows the server code below).
Last but not least, we add a logger and traceback printing to keep the server debugging-friendly.
# Key imports for this snippet: FastAPI/uvicorn plus the async lifespan helper.
import os
from contextlib import asynccontextmanager

import fastapi
import uvicorn

async def _start_fastapi_server(self):
    @asynccontextmanager
    async def lifespan(app: fastapi.FastAPI):
        # Signal that the OpenAI-compatible endpoint is up and accepting requests.
        print(f"FastAPI listen on {self.address}:{self.port}")
        self.server_ready.set()
        yield
        # Reaching this point means the server stopped (e.g., the port was taken);
        # exit hard so that Ray can restart a fresh actor.
        print("FastAPI shutdown, maybe address already in use, exit process immediately.")
        os._exit(-1)

    app = fastapi.FastAPI(lifespan=lifespan)
    app.router.add_api_route("/v1/chat/completions", self.chat_completion, methods=["POST"])

    # Pick a free port at run time instead of a hard-coded one to avoid
    # "address already in use" errors across Ray workers.
    self.port = _get_free_port()
    config = uvicorn.Config(
        app,
        host="0.0.0.0",  # bind on all interfaces so other Ray nodes can reach the server
        port=self.port,
        log_level="warning",
        # Concurrency and keep-alive limits are tunable via environment variables,
        # preventing "maximum number of open connections reached" under load.
        limit_concurrency=int(os.environ.get("UVICORN_LIMIT_CONCURRENCY", 200)),
        backlog=int(os.environ.get("UVICORN_BACKLOG", 2048)),
        timeout_keep_alive=int(os.environ.get("UVICORN_TIMEOUT_KEEP_ALIVE", 30)),
    )
    server = uvicorn.Server(config)
    await server.serve()

async def init_engine(self):
    ...
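To make the fifth point concrete, here is a rough sketch of the two helpers referenced above. It is illustrative rather than the exact code of our fork: _get_free_port is the standard bind-to-port-0 trick, and the context-length guard assumes a Hugging Face-style tokenizer plus a known max_model_len; the helper name clamp_max_tokens and the request fields are hypothetical.

import socket

def _get_free_port() -> int:
    # Ask the OS for an unused port by binding to port 0, then release it.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))
        return sock.getsockname()[1]

def clamp_max_tokens(request: dict, tokenizer, max_model_len: int) -> dict:
    # Count the prompt tokens after chat-template wrapping; the agent scaffold
    # cannot see this number, so the server enforces the token budget instead.
    prompt_ids = tokenizer.apply_chat_template(
        request["messages"], add_generation_prompt=True, tokenize=True
    )
    remaining = max_model_len - len(prompt_ids)
    if remaining <= 0:
        # No room left: fail this turn explicitly instead of crashing the rollout engine.
        raise ValueError("input longer than model context")
    requested = request.get("max_tokens") or remaining
    request["max_tokens"] = min(requested, remaining)
    return request

Applying this clamp inside the chat-completion handler, before the request reaches the rollout engine, is what keeps every training job alive when an agent turn would otherwise exceed the model context.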
● Hierarchical Timeout Logic
The original timeout settings ignore their hierarchical nature. This is tricky: with a small LLM, infrequent tool calls, or a fast environment, one can "luckily" stay under every timeout limit, yet the same limits will "definitely" fail as the number of tool calls, the model size, or the environment latency grows. We therefore set the timeouts deliberately, so that the retry mechanism can work without negative interference.
We implemented a "smart" timeout hierarchy (Agent < Task < LLM) so that long-running tasks in complex environments do not trigger false failures during rollout, and so that retries at one level no longer eat into the deadline of the level above it.
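For intuition, the sketch below shows one way to budget such a hierarchy bottom-up. Every name and default value here is hypothetical, not our exact configuration; the point is only the ordering and the fact that each outer deadline is derived from the retries and turns it must accommodate.

import os

# One LLM/tool call made by the agent: the smallest deadline, retried on failure.
AGENT_CALL_TIMEOUT = int(os.environ.get("AGENT_CALL_TIMEOUT", 120))
MAX_RETRIES = int(os.environ.get("AGENT_CALL_MAX_RETRIES", 3))
MAX_TURNS = int(os.environ.get("AGENT_MAX_TURNS", 16))  # tool-calling turns per task

# A task must outlive the worst case of all its agent calls plus retries...
TASK_TIMEOUT = MAX_TURNS * MAX_RETRIES * AGENT_CALL_TIMEOUT

# ...and the LLM-serving layer gets the most generous deadline, so it never
# drops a request that the layers above are still willing to wait for
# (the Agent < Task < LLM ordering described above).
LLM_SERVER_TIMEOUT = 2 * TASK_TIMEOUT

# e.g., the task level enforces its own deadline around the whole agent rollout:
# result = await asyncio.wait_for(run_agent(task), timeout=TASK_TIMEOUT)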