Awex is a high-performance weight synchronization framework between training and inference engines that solves a core challenge in RL workflows: synchronizing trained weight parameters to inference models. It can exchange TB-scale parameters within seconds, significantly reducing RL training latency. Main features include:
Blazing synchronization speed: Fully synchronizes trillion-parameter models across thousand-GPU clusters within 6 seconds, industry-leading performance;
Unified model adaptation layer: Automatically handles differences in parallelism strategies and tensor formats/layouts between training and inference engines, and is compatible with multiple model architectures;
Zero-redundancy resharding transmission and in-place updates: Transfers only the necessary shards and updates inference-side memory in place, avoiding reallocation and copy overhead (see the sketch after this list);
Multi-mode transmission: Supports NCCL, RDMA, and shared-memory transports, fully leveraging NVLink/NVSwitch/RDMA bandwidth and reducing long-tail latency;
Heterogeneous deployment compatibility: Adapts to both co-located and disaggregated modes, supports synchronous and asynchronous RL training scenarios, and the RDMA transmission mode allows dynamic scaling of inference instances;
Flexible pluggable architecture: Supports customized weight-sharing and layout behavior for different models, and allows integrating new training and inference engines.
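To make the zero-redundancy resharding idea concrete, here is a minimal sketch, not Awex's actual API: for each inference-side parameter shard, only the slice that overlaps a training-side shard is transferred and written into preallocated inference memory in place. The `Shard`, `overlap`, and `apply_update` names are hypothetical and stand in for whatever the framework uses internally.

```python
# Hypothetical sketch of zero-redundancy resharding with in-place updates.
from dataclasses import dataclass
import torch


@dataclass
class Shard:
    """A [start, stop) slice of a flattened parameter owned by one rank."""
    start: int
    stop: int
    tensor: torch.Tensor  # flattened view of the locally held slice


def overlap(a: Shard, b: Shard) -> tuple[int, int] | None:
    """Return the [start, stop) range held by both shards, or None if disjoint."""
    start, stop = max(a.start, b.start), min(a.stop, b.stop)
    return (start, stop) if start < stop else None


def apply_update(train_shard: Shard, infer_shard: Shard) -> None:
    """Copy only the overlapping elements into inference memory, in place."""
    rng = overlap(train_shard, infer_shard)
    if rng is None:
        return  # nothing to transfer between this pair of ranks
    start, stop = rng
    src = train_shard.tensor[start - train_shard.start : stop - train_shard.start]
    dst = infer_shard.tensor[start - infer_shard.start : stop - infer_shard.start]
    dst.copy_(src)  # no reallocation on the inference side


# Example: a 10-element parameter split differently across training and inference.
train = Shard(0, 6, torch.arange(6, dtype=torch.float32))
infer = Shard(4, 10, torch.zeros(6))
apply_update(train, infer)  # only elements 4..5 are shared, so only they move
print(infer.tensor)         # tensor([4., 5., 0., 0., 0., 0.])
```

In a real deployment the copy would go over NCCL, RDMA, or shared memory rather than a local `copy_`, but the planning step is the same: compute the shard overlaps implied by the two parallelism layouts, then move only those slices.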
GitHub Repo: https://github.com/inclusionAI/asystem-awex