First published on October 13th, 2025, last updated on November 4th, 2025.
<aside>
TL;DR
In agentic LLM training, many trajectories share common prefixes that form a prefix tree. We propose Tree Training — a novel paradigm that computes each shared prefix only once and reuses its intermediate results across related branches during both the forward and backward passes, substantially improving computational efficiency in large-scale agentic training. This is achieved through:
- Tree Packing, which maximizes the reuse of shared prefixes under GPU memory constraints;
- a Gradient Scaler, which ensures each shared prefix contributes correctly to the gradients;
- custom kernels for shared-prefix mask attention and modified position embeddings over flattened tree-packed data.
Experiments on multiple open-source models demonstrate up to 3.9x reduction in total training time, enabling more efficient agentic LLM SFT and RL training.
Paper:
Tree Training: Accelerating Agentic LLMs via Shared Trajectory Reuse
arXiv:2511.00413 · November 2025
</aside>
In agentic LLM training, an agent's behavior is often highly diverse, so its interaction history is usually not a single linear trajectory like prompt1 → response1 → prompt2 → response2. Instead, a single task can produce multiple trajectories, many of which share common prefixes. These trajectories can be naturally organized into a prefix tree.

Figure 1. Left: some trajectories share common prefixes (e.g. all five trajectories), and smaller subsets of trajectories can share even longer prefixes (e.g. trajectories 1, 2, 3). Right: the trajectories merged into a prefix tree, where shared prefixes are organized hierarchically.
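To make the merging in Figure 1 concrete, here is a minimal Python sketch (illustrative only; the `TrieNode` structure and names are not from the paper) of how tokenized trajectories can be merged into a prefix tree:

```python
from dataclasses import dataclass, field

@dataclass
class TrieNode:
    """One token in the prefix tree; children are keyed by the next token id."""
    token: int | None = None
    children: dict[int, "TrieNode"] = field(default_factory=dict)
    count: int = 0  # number of trajectories passing through this node

def build_prefix_tree(trajectories: list[list[int]]) -> TrieNode:
    """Merge tokenized trajectories so each shared prefix is stored only once."""
    root = TrieNode()
    for traj in trajectories:
        node = root
        for tok in traj:
            node = node.children.setdefault(tok, TrieNode(token=tok))
            node.count += 1
    return root

# Example mirroring Figure 1: all five trajectories share [1, 2],
# while trajectories 1-3 share the longer prefix [1, 2, 3].
trajs = [
    [1, 2, 3, 4],
    [1, 2, 3, 5],
    [1, 2, 3, 6, 7],
    [1, 2, 8],
    [1, 2, 9],
]
tree = build_prefix_tree(trajs)
# Tokens 1 and 2 are stored once but are shared by all five trajectories.
assert tree.children[1].children[2].count == 5
```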
In this scenario, we realized that the computation of prefixes shared across multiple trajectories could be performed just once, which would significantly improve training throughput. This idea is similar to prefix caching in LLM inference. However, because training involves backpropagation, we cannot simply reuse cached results: doing so would ignore the gradient contributions from suffix tokens to prefix tokens, leading to incorrect computations.
Our goal is to compute each shared prefix only once during both the forward and backward passes, and thus improve agentic LLM training throughput.
To achieve this, we took the following steps:
Maximize shared prefix reuse through Tree Packing:
Since there are many training trajectories, even after merging them into a tree structure we usually can’t fit the entire tree into one batch due to GPU memory constraints. Using a combination of dynamic programming and greedy algorithms, we propose a practical approach that packs the tree under memory constraints while maximizing the reuse of shared prefixes.
Design a Gradient Scaler to ensure correct gradient computation:
During backpropagation, the gradient contribution of a shared prefix differs across trajectories. We implemented a tree-structured gradient scaler that mathematically ensures each shared prefix contributes correctly to the gradients (see the sketch after this list).
Develop custom kernels:
We implemented an efficient shared-prefix mask attention and a modified position embedding for flattened tree-packed data, ensuring both training correctness and high throughput.
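To make these components more concrete, here is a minimal PyTorch sketch (not the paper's implementation; the function name and parent-pointer encoding are illustrative) of the metadata a flattened, tree-packed sequence would need: depth-based position ids, a shared-prefix attention mask in which each token attends only to itself and its ancestors, and per-token gradient-scale weights. The scaling rule shown assumes the training objective sums per-trajectory token losses, so a prefix token shared by k trajectories is weighted by k; the paper's gradient scaler and custom kernels may differ in detail, and a real kernel would not materialize the dense mask.

```python
import torch

def tree_metadata(parents: list[int]) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """Compute position ids, attention mask, and gradient-scale weights for a
    flattened prefix tree.

    `parents[i]` is the index of token i's parent in the flattened sequence
    (-1 for roots). Nodes are assumed to appear after their parents, as in a
    pre-order flattening of the tree.
    """
    n = len(parents)

    # Position id = depth in the tree, so every trajectory sees the same
    # positions it would have seen as a standalone linear sequence.
    position_ids = torch.zeros(n, dtype=torch.long)
    for i, p in enumerate(parents):
        position_ids[i] = 0 if p < 0 else position_ids[p] + 1

    # Shared-prefix attention mask: token i may attend to j iff j is i itself
    # or one of i's ancestors. (A dense n x n mask is shown only for clarity.)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j >= 0:
            mask[i, j] = True
            j = parents[j]

    # Gradient scaler (one interpretation): if the objective sums per-trajectory
    # losses, a token lying on k root-to-leaf paths should contribute k copies
    # of its loss, but appears only once in the packed sequence.
    children = [0] * n
    for p in parents:
        if p >= 0:
            children[p] += 1
    leaf_counts = [1 if children[i] == 0 else 0 for i in range(n)]
    for i in range(n - 1, -1, -1):          # propagate leaf counts to ancestors
        if parents[i] >= 0:
            leaf_counts[parents[i]] += leaf_counts[i]
    grad_scale = torch.tensor(leaf_counts, dtype=torch.float)

    return position_ids, mask, grad_scale

# Two trajectories [A, B, C] and [A, B, D] packed as tokens 0..3:
pos, mask, scale = tree_metadata([-1, 0, 1, 1])
# A and B are shared by both trajectories, so their losses are scaled by 2.
assert scale.tolist() == [2.0, 2.0, 1.0, 1.0]
# D (index 3) attends to A, B, and itself, but not to its sibling C.
assert mask[3].tolist() == [True, True, False, True]
```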
When training large language models, we often find that many samples share the same beginning — for example, the same question prompt or a similar prefix of an answer.
Instead of processing these identical prefixes repeatedly, we can merge them into a tree structure, where shared parts naturally become internal nodes.
By reusing computations along these shared prefixes, we can avoid a large amount of redundant compute while keeping LLM training accurate.
However, GPU memory is limited. We can’t always fit the entire tree into one batch.
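As a rough illustration of the packing problem (a purely greedy sketch with made-up names, not the paper's dynamic-programming-plus-greedy Tree Packing algorithm), trajectories could be grouped so that each group's merged prefix tree stays within a per-batch token budget while trajectories with long shared prefixes land in the same group:

```python
def pack_trajectories(trajectories: list[list[int]], budget: int) -> list[list[list[int]]]:
    """Greedily group trajectories into packs whose merged prefix tree fits a
    token budget, preferring the pack where a trajectory adds the fewest new tokens."""
    packs: list[list[list[int]]] = []             # trajectories in each pack
    pack_tokens: list[set[tuple[int, ...]]] = []  # tree nodes per pack, keyed by prefix

    def nodes_of(traj: list[int]) -> set[tuple[int, ...]]:
        # Every prefix of the trajectory is one node in the prefix tree.
        return {tuple(traj[: i + 1]) for i in range(len(traj))}

    # Longer trajectories first tends to build the deep shared spine early.
    for traj in sorted(trajectories, key=len, reverse=True):
        new_nodes = nodes_of(traj)
        best, best_cost = None, None
        for idx, existing in enumerate(pack_tokens):
            cost = len(new_nodes - existing)       # tokens this pack must add
            if len(existing) + cost <= budget and (best_cost is None or cost < best_cost):
                best, best_cost = idx, cost
        if best is None:                           # no pack fits: open a new one
            packs.append([traj])
            pack_tokens.append(set(new_nodes))
        else:
            packs[best].append(traj)
            pack_tokens[best] |= new_nodes
    return packs

# With a budget of 6 packed tokens, the two [1, 2, ...] trajectories end up
# together, so the shared prefix [1, 2] is computed once; [9, 8, 7] gets its own pack.
print(pack_trajectories([[1, 2, 3, 4], [1, 2, 5], [9, 8, 7]], budget=6))
```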