3.1 Release Note

The PaddlePaddle framework version 3.2 further enhances large model training and inference performance, hardware adaptation, and support for mainstream large models and high-performance acceleration libraries.

  • In terms of large model training, the PaddlePaddle framework has undergone upgrades in three aspects: computation, parallel strategy, and fault tolerance:

  • At the level of basic computation, FlashMask V3 is introduced: a sparse-mask attention computation that overlaps memory access with computation to maximize attention efficiency. In addition, an efficient FP8 mixed-precision training technique is implemented with no loss in training effect.

  • At the level of distributed parallel strategies, a dynamically adaptive memory-offloading strategy achieves an optimal balance between GPU memory and computation; combined with a newly designed memory-friendly pipeline-parallel schedule, it further reduces memory overhead.

  • The framework's native fault-tolerance capability has been enhanced: a large-scale cluster training fault-tolerance system can monitor silent data corruption and other hard-to-detect faults online without affecting training efficiency, and a highly available checkpoint disaster-recovery mechanism reduces the cost of recovering from interruptions.

  • In terms of hardware adaptation, the plug-in adaptation solution for CUDA-like chips has been comprehensively upgraded. Device resource management and scheduling, as well as the high-performance collective communication library, have received management-interface upgrades and communication enhancements for CUDA-like chips, with particular emphasis on distributed communication: XCCL is now aligned with the corresponding structures and functions of NCCL.

  • Added a registration mechanism for CUDA-like operators. Taking the Muxi adaptation as an example, an operator kernel can be registered with a single line of code by reusing the corresponding GPU kernel; statistics show a kernel reuse rate of up to 92%, significantly reducing hardware adaptation costs.

  • In terms of user experience, the focus is on compatibility, covering development interfaces compatible with industry practices, compatibility with the SafeTensors model format, and compatibility with third-party high-performance acceleration libraries.

  • New and modified development interfaces follow common industry practices: a series of new APIs and API aliases have been introduced, along with new parameter aliases and both function-specific and common parameters.

  • Fully compatible with the SafeTensors model format. The newly added FlexCheckpoint mechanism supports automatic parameter re-sharding across distributed strategies and model structures, significantly reducing the cost of weight conversion and thereby enhancing the end-to-end training and inference development efficiency of large models.

  • Interface compatibility and operator registration capabilities have been systematically enhanced, enabling one-click import of high-performance acceleration libraries, which can be reused directly in PaddlePaddle's model training and inference acceleration without code modifications.

1. User experience

New features

  • New APIs: paddle.msort, paddle.ravel, paddle.nn.functional.dropout1d, paddle.Tensor.type_as, paddle.Tensor.requires_grad, paddle.view_as_complex, paddle.view_as_real, paddle.nn.Parameter, paddle.broadcast_shapes, paddle.range, paddle.as_tensor, paddle.scatter_reduce/scatter_reduce_, paddle.scatter_add, paddle.tensor, paddle.softmax, paddle.Tensor.softmax, paddle.rand_like, paddle.is_autocast_enabled, paddle.get_autocast_gpu_dtype, paddle.Tensor.repeat, paddle.permute (usage sketched after this list). #74421, #74439, #74444, #74454, #74459, #74491, #74466, #74438, #74594, #74542, #74694, #74564, #74540, #74586, #74651, #74807, #74632, #74834, #74952, #74772, #74441, #74561, #74525

  • Added a series of APIs under paddle.compat.* to support common industry usage and facilitate code migration, including paddle.compat.median, paddle.compat.nanmedian, paddle.compat.softmax, paddle.compat.sort, paddle.compat.split, paddle.compat.min/max, and paddle.compat.Unfold (see the sketch after this list). #74865, #74874

  • Added a series of initialization APIs to support commonly used parameter initialization methods in the industry, including paddle.nn.init.kaiming_uniform_, paddle.nn.init.xavier_uniform_, paddle.nn.init.uniform_, paddle.nn.init.kaiming_normal_, paddle.nn.init.xavier_normal_, paddle.nn.init.normal_, paddle.nn.init.calculate_gain, paddle.nn.init.constant_, paddle.nn.init.dirac_, paddle.nn.init.eye_, paddle.nn.init.ones_, paddle.nn.init.orthogonal_, paddle.nn.init.trunc_normal_, and paddle.nn.init.zeros_ (usage sketched after this list). #74478

  • Added parameter aliases to a number of APIs, allowing more flexible keyword usage (e.g., either x or input), in functions including paddle.maximum, paddle.minimum, paddle.sqrt, paddle.topk, paddle.polar, paddle.stack, paddle.cos, paddle.floor, paddle.log, paddle.pow, paddle.rsqrt, paddle.sign, paddle.sin, paddle.multiply, and paddle.where (see the parameter sketch after this list). #74683, #74795, #74887, #74592

  • paddle.Tensor now supports multiple initialization methods, enabling flexible Tensor creation. #74619, #75022, #75065

  • Added function-specific parameters to enhance existing APIs, including paddle.nn.functional.gelu, paddle.divide/div/div_, paddle.add, paddle.Tensor.copy_, paddle.norm, paddle.linalg.norm, paddle.nn.functional.silu, and paddle.repeat_interleave. #74485, #74562, #74420, #74768, #74855, #74903, #74788, #74631, #74947

  • Added the common parameters out, device, dtype, requires_grad, pin_memory, and bias to enhance existing APIs, including paddle.zeros, paddle.zeros_like, paddle.ones, paddle.ones_like, paddle.arange, paddle.eye, paddle.empty, paddle.empty_like, paddle.full, paddle.full_like, paddle.randn, paddle.Tensor.new_full, paddle.Tensor.new_empty, paddle.Tensor.new_ones, paddle.Tensor.new_zeros, paddle.tril/triu, paddle.bmm, paddle.nn.Conv1D/Conv2D/Conv3D/Embedding, paddle.diff, paddle.cumsum, paddle.var, paddle.multinomial, and paddle.mean (see the parameter sketch after this list). #74477, #74526, #74711, #74582, #74624, #74849, #74612, #74875, #74641, #74949, #74918, #74914, #74934, #74920, #74955, #74226, #74946

  • Added aliases to APIs to support more calling conventions, including paddle.Tensor.mul_/mul, paddle.autograd.Function, paddle.argwhere, paddle.cat, paddle.clamp, paddle.ger, paddle.take_along_dim, paddle.linalg.matmul, paddle.special.logsumexp, paddle.concatenate, paddle.eq/gt, paddle.Tensor.take_along_dim, paddle.nn.Conv1d/Conv2d/Conv3d, etc. (see the alias sketch after this list). #74493, #74569, #74870
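
For a quick illustration of a few of the new tensor APIs listed above, here is a minimal sketch; it assumes PaddlePaddle 3.2 and that paddle.ravel, paddle.msort, and paddle.Tensor.repeat follow the common industry (PyTorch-like) semantics.

```python
import paddle

x = paddle.to_tensor([[3.0, 1.0], [2.0, 4.0]])

# Flatten to 1-D (assumed to mirror the common `ravel` semantics).
flat = paddle.ravel(x)

# Sort along the first dimension (assumed to mirror the common `msort` semantics).
sorted_rows = paddle.msort(x)

# Tile the tensor 2x along dim 0 and 1x along dim 1
# (assumed to mirror the common `Tensor.repeat` semantics).
tiled = x.repeat(2, 1)

print(flat.shape, sorted_rows.shape, tiled.shape)  # [4] [2, 2] [4, 2]
```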
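
A minimal sketch of the paddle.compat.* namespace, assuming paddle.compat.split takes a chunk size (rather than a number of sections) and that paddle.compat.sort returns a (values, indices) pair, in line with common industry conventions:

```python
import paddle

x = paddle.arange(10, dtype='float32')

# Assumed chunk-size semantics: pieces of size 4, with a smaller final chunk.
chunks = paddle.compat.split(x, 4)
print([tuple(c.shape) for c in chunks])  # e.g. [(4,), (4,), (2,)]

# Assumed (values, indices) return, matching the common convention.
values, indices = paddle.compat.sort(paddle.to_tensor([3.0, 1.0, 2.0]))
```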
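
The new initializers follow the familiar in-place convention; a minimal sketch, assuming the listed paddle.nn.init functions mirror their widely used counterparts:

```python
import math
import paddle

linear = paddle.nn.Linear(128, 256)

# In-place Kaiming (He) uniform initialization of the weight matrix
# (the `a` argument is assumed to match the common signature).
paddle.nn.init.kaiming_uniform_(linear.weight, a=math.sqrt(5))

# Zero the bias in place.
paddle.nn.init.zeros_(linear.bias)

# Recommended gain for a nonlinearity, fed into Xavier initialization.
gain = paddle.nn.init.calculate_gain('relu')
paddle.nn.init.xavier_uniform_(linear.weight, gain=gain)
```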
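
A sketch of the parameter aliases and the new common creation-op keywords; the keyword names (input, device, requires_grad, out) are taken from the lists above and assumed to behave like their industry counterparts:

```python
import paddle

t = paddle.to_tensor([1.0, 4.0, 9.0])

# The native keyword `x` and the industry-common alias `input`
# are assumed to be interchangeable after this change.
a = paddle.sqrt(x=t)
b = paddle.sqrt(input=t)
assert paddle.allclose(a, b)

# Common keywords on creation ops (names assumed per the list above).
w = paddle.zeros([4, 4], dtype='float32', device='cpu', requires_grad=True)

# An explicit output tensor via `out` (assumed to be filled in place and returned).
out = paddle.empty([4, 4], dtype='float32')
paddle.ones([4, 4], out=out)
```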
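
And a sketch of the API aliases, assuming they map one-to-one onto the existing Paddle names (paddle.cat onto paddle.concat, paddle.clamp onto paddle.clip, Tensor.mul onto Tensor.multiply):

```python
import paddle

a = paddle.to_tensor([[1.0, 2.0]])
b = paddle.to_tensor([[3.0, 4.0]])

# Alias of paddle.concat.
c = paddle.cat([a, b])

# Alias of paddle.clip.
d = paddle.clamp(c, min=1.5, max=3.5)

# Alias of paddle.Tensor.multiply.
e = a.mul(b)
```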

Bug fixes

  • Fixed a precision issue in paddle.nanmedian. #74263

  • Fixed an issue with paddle.distributed.fleet.utils.hybrid_parallel_util.fused_allreduce_gradients in 0-D tensor scenarios. #74957

  • Fixed an issue with paddle.matmul in distributed mode. #74989

Enhanced functionality

  • For APIs that return multiple Tensor objects, such as paddle.topk, the results are now wrapped in a Paddle data structure for a better user experience (see the sketch after this list). #74931

  • Creation-style APIs now support variadic size arguments for more flexible usage. #74494
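
A sketch of the wrapped multi-Tensor return; tuple-style unpacking is assumed to keep working, and the attribute names values/indices are an assumption here:

```python
import paddle

x = paddle.to_tensor([1.0, 5.0, 3.0, 2.0])
result = paddle.topk(x, k=2)

# Tuple-style unpacking continues to work.
values, indices = result

# Attribute-style access on the wrapping structure (field names assumed).
print(result.values, result.indices)
```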

Documents

Other

2. Basic execution architecture

New features

  • Support for dynamic graphs. #74484

  • Support for safetensors. #74642, #74609, #75049

  • Added offloader to optimize computation efficiency. #74837

  • Added API support for forward computation of conv_transpose. #74431

  • Inference deployment now supports w4afp8 quantized inference, including pure permutation of w4afp8 quantized weights and all2all communication. #74270

Bug fixes

Enhanced functionality

Deprecated

Other

  • Update patch version. #74940

3. Distributed & automatic parallelism

Parallel strategy

In version 3.2, we have made multiple enhancements to pipeline parallelism, including support for passing dictionary arguments and extending PipelineLayer and SharedLayerDesc to work without pipeline parallelism. We have also fixed several critical issues: IPC API exceptions for large tensors, problems with evaluation batches and non-compute-loss steps in pipeline parallelism, gradient release errors in MoE models, hangs caused by rebuilding NCCL communicators in PP scenarios, and event management errors in dual pipeline parallelism. In addition, we have carried out various performance optimizations, improving the computation-overlap efficiency of dual pipeline parallelism to enhance training performance, and upgraded the clear_param_storage method to support clearing and resetting multiple color collections in sharding mode.

New Features

  • Implement support for dictionary parameter passing in Pipeline Parallel. #74574, #74867

  • PipelineLayer and SharedLayerDesc support non-pipeline parallelism (non-pp parallel). #74573

Bug fixes

  • Fixed the IPC API issue with large-sized tensors. #74472

  • Fixed issues related to evaluation batch and non-compute_loss in pipeline parallelism. #74170

  • Fixed the gradient release issue on MoE model. #74972

  • Fixed the hang issue when rebuilding NCCL comm in the pp scenario. #73625

  • Fixed the event management error in dual pipeline parallelism (dual pp). #74158

Optimization and improvement

  • Optimize the computation-overlap efficiency of dual pipeline parallelism to enhance training performance. #74527

  • Upgrade the clear_param_storage method to support the clearing and resetting of multiple color collections under sharding. #74741

Automatic parallelism

Functional improvements

  • Support the default sharding derivation rule for the case where the same dimension of a distributed tensor is sharded across multiple mesh dimensions (see the sketch after this list). #74396

  • Improved the sharding derivation rule of the reshape operator to support scenarios where the same dimension of a distributed tensor is sharded across multiple mesh dimensions. #74352, #74579, #74565

  • Support changing the mesh of a tensor without altering the distributed tensor data. #74248
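
A minimal sketch of the case covered by the new derivation rules: the same tensor dimension sharded across two mesh dimensions. It assumes the dynamic-graph auto-parallel API (dist.ProcessMesh, dist.shard_tensor, dist.Shard) and is meant to be launched on four devices with paddle.distributed.launch.

```python
import paddle
import paddle.distributed as dist

# A 2x2 process mesh; launch with e.g.
#   python -m paddle.distributed.launch --devices 0,1,2,3 this_script.py
mesh = dist.ProcessMesh([[0, 1], [2, 3]], dim_names=["x", "y"])

x = paddle.rand([8, 16])

# Shard dimension 0 of `x` along both mesh dimensions; the default
# derivation rule now handles this multi-mesh-dim case.
dist_x = dist.shard_tensor(x, mesh, [dist.Shard(0), dist.Shard(0)])

print(dist_x.placements)
```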

Bug fixes

  • Fixed the bug of repeatedly creating communication groups when calling the get_group method of ProcessMesh. #73099

  • Fixed the bug in the get_local_slices method in the MoE scenario. #74705

  • Fixed the bug of gradient clipping in the MoE scenario. #74916

  • Fixed the bug where the stop_gradient parameter could not be passed between different stages in the pipeline parallel scenario. #73459

  • Fixed an accuracy bug in gradient clipping in pipeline parallel scenarios. #74409

  • Fixed the bug of generating redundant outputs in the dynamic graph pipeline parallel scenario. #74913

  • Fixed the bug that the operators moe_combine and moe_gate_dispatch did not work in the MoE scenario. #74645

Other

  • Support accuracy alignment of data loaders between manual and automatic parallelism. #73941

  • Optimize the dynamic graph pipeline parallel scheduling logic. #74720

Communication Library

In version 3.2, we fixed an error in DeepEP's support for sm90 compilation, added a pre-allocation option for the GPU memory requested by DeepEP, and upgraded its intranode and internode compute kernels, further improving performance and stability.

Bug fixes

  • Fixed a bug in DeepEP support for sm90 compilation. #74762

Functional improvements

  • Added pre-allocation function for the GPU memory allocation requested by DeepEP. #74465

  • Upgraded the intranode and internode computation kernels of DeepEP. #74284

4. Operator mechanism

New features

Bug fixes

Enhanced functionality

Performance optimization

  • FP8 computation optimization. #74471, #74684, #74911

  • Basic operator performance optimization. #74442, #74638

  • Support FA3 (FlashAttention-3) variable-length sequence backward computation and optimize the forward API. #73831

  • Added FlashMask V2 function. #74729

Documents

  • Fixed issues with English documentation and copyright year. #74737

Other

  • The WITH_XPU_FFT option is enabled by default on XPU hardware. #74699

5. Hardware adaptation

Improved CUDA-like hardware integration solution

  • The CUDA-like hardware access solution supports reuse of cuBLAS kernels. #74591

  • Fixed known issues in the CUDA-like hardware access solution. #74397, #74411, #74428, #74877, #74939

Main repository supports unit tests on multiple hardware backends

New Custom Device API Support

6. Installation environment

Bug fixes

  • Fixed a bug in the FlashAttention compilation cache. #74388

  • Fixed the bug where site.USER_SITE was None. #74373

  • Fixed the compilation bug of gtest in multi-architecture Linux systems. #74723

  • Fixed multiple compilation errors in DEBUG mode when WITH_GPU=ON. #74401

  • Fixed a compilation bug with CUDA 12.6 on Windows. #74990

  • Fixed bugs in the api-benchmark baseline pipeline. #74770, #74778, #74779, #74780, #74800, #74803

Other

  • Disable the test_custom_contiguous unit test. #74337

  • Support for timed triggering of baseline tasks in the slice pipeline. #74419

  • Support manually specifying the pr for adding slice recording baselines. #74445

  • Check if there are any issues in the code. #74460

  • Support CI PaddleX tasks on XPU. #74426

  • Support slice pipeline exemption mechanism. #74482

  • Updated the Paddle base image. #73423

  • Pinned the Ninja version to 1.11 for Windows. #74590

  • Added the ability to close PRs and cancel CI runs. #74604

  • Support for quickly skipping all CI. #74696

  • Add an api-benchmark baseline pipeline. #74690

  • Update the NCCL version. #74809

  • Update the RD list for the approve pipeline. #74838, #74902

  • Update safetensors to use the mirror source. #74904

  • Added a compilation flag for FlashAttention. #74959

  • Temporarily disable the win-inference pipeline. #74980

  • Support for compiling phi dynamic libraries on Windows. #74950

7. List of contributors

AIbin, Ayakouji, baiyue, baoqiwen, Chang Lu, Chen Zhiyang, co63oc, cyberslack_lee, cyy536, datutu-L, Deng Haodong, Difer, Eddie-Wang, enzodechine, fangfangssj, feri, fxyfxy777, ggggxm, GoldPancake, gouzil, Gu Shiwei, Haze188 灏喆, hohdiy, hong, HU Shenwei, huangjiyi, HydrogenSulfate, kjagsdq, LCStayingdullCircuit, Leo Guo, lightbrother, liufengwei0103, liuruyan, LiYuRio, LLSGYN, Lucas, Luckycheng222, lzy, Nana, Nyakku Shigure, ooo oo, Qianyue He, risemeup1, Ruibiao Chen, Ryan, Shuhao Liang, sneaxiy, Starrysea996, SUN Dong, Tao Luo, Tian, tianhaodongbd, tianshuo78520a, umiswing, waliwali777, wanghuancoder, Wenhao.Dai, wyw, XiaoguangHu, xiaoguoguo626807, xingmingyyj, Yichen Zhang, Yohanna, yongqiangma, Yuan Xiaolan, YUNSHEN XIE, Yuntao Nie, Yuqiang Ge, Yutian Rao, Zero Rains, Zhan Rongrui, Zhang Ting, zhanghonggeng, Zhaowu Pan, zhengshengning, ZhenxingLi, Zhou Xin, zhupengyang, zhwesky2010, Zichao, zty-king, Zx, zyfncg, zzm, 周周周, 正在学习, 苍天荒