3.1 Release Note

The PaddlePaddle framework version 3.2 further enhances large model training and inference performance, hardware adaptation, and support for mainstream large models and high-performance acceleration libraries.

  • In terms of large model training, the PaddlePaddle framework has undergone upgrades in three aspects: computation, parallel strategy, and fault tolerance:

  • At the level of basic computation, FlashMask V3 is introduced: a sparse-mask attention computation that overlaps memory access with computation to maximize attention efficiency. In addition, an efficient FP8 mixed-precision training technique is implemented with no loss in training effect.

  • At the level of distributed parallel strategies, a dynamically adaptive memory-offloading strategy achieves an optimal balance between GPU memory and computation; combined with a newly designed memory-friendly pipeline-parallel schedule, it further reduces memory overhead.

  • The framework's native fault-tolerance capability has been enhanced: a large-scale cluster training fault-tolerance system can monitor silent data corruption and other hard-to-detect faults online without affecting training efficiency, and a highly available checkpoint disaster-recovery mechanism reduces the cost of recovering from interruptions.

  • In terms of hardware adaptation, the plug-in adaptation solution for CUDA-like chips has been comprehensively upgraded. Device resource management and scheduling, as well as the high-performance collective communication library, have received management-interface upgrades and communication enhancements for CUDA-like chips, with particular emphasis on distributed communication: XCCL is now aligned with the corresponding structures and functions of NCCL.

  • Added a registration mechanism for CUDA-like operators. Taking the Muxi adaptation as an example, an operator kernel can be registered with a single line of code by reusing the corresponding GPU kernel; statistics show a kernel reuse rate of up to 92%, significantly reducing hardware adaptation costs.

  • In terms of user experience, the focus is on compatibility, covering development interfaces compatible with industry practices, compatibility with the SafeTensors model format, and compatibility with third-party high-performance acceleration libraries.

  • New and modified development interfaces follow common industry practices: a series of new APIs and API aliases have been introduced, along with new parameter aliases and both function-specific and common parameters.

  • Fully compatible with the SafeTensors model format. The newly added FlexCheckpoint mechanism supports automatic parameter re-sharding across distributed strategies and model structures, significantly reducing the cost of weight conversion and thereby enhancing the end-to-end training and inference development efficiency of large models.

  • Interface compatibility and operator registration capabilities have been systematically enhanced, enabling one-click import of high-performance acceleration libraries, which can be reused directly in PaddlePaddle's model training and inference acceleration without code modifications.

1. User experience

New features

  • New APIs: paddle.msort, paddle.ravel, paddle.nn.functional.dropout1d, paddle.Tensor.type_as, paddle.Tensor.requires_grad, paddle.view_as_complex, paddle.view_as_real, paddle.nn.Parameter, paddle.broadcast_shapes, paddle.range, paddle.as_tensor, paddle.scatter_reduce/scatter_reduce_, paddle.scatter_add, paddle.tensor, paddle.softmax, paddle.Tensor.softmax, paddle.rand_like, paddle.is_autocast_enabled, paddle.get_autocast_gpu_dtype, paddle.Tensor.repeat, paddle.permute (usage sketched after this list). #74421, #74439, #74444, #74454, #74459, #74491, #74466, #74438, #74594, #74542, #74694, #74564, #74540, #74586, #74651, #74807, #74632, #74834, #74952, #74772, #74441, #74561, #74525

  • Added a series of APIs under paddle.compat.* to support common industry usage and facilitate code migration, including paddle.compat.median, paddle.compat.nanmedian, paddle.compat.softmax, paddle.compat.sort, paddle.compat.split, paddle.compat.min/max, and paddle.compat.Unfold (see the sketch after this list). #74865, #74874

  • Added a series of initialization APIs to support commonly used parameter initialization methods in the industry, including paddle.nn.init.kaiming_uniform_, paddle.nn.init.xavier_uniform_, paddle.nn.init.uniform_, paddle.nn.init.kaiming_normal_, paddle.nn.init.xavier_normal_, paddle.nn.init.normal_, paddle.nn.init.calculate_gain, paddle.nn.init.constant_, paddle.nn.init.dirac_, paddle.nn.init.eye_, paddle.nn.init.ones_, paddle.nn.init.orthogonal_, paddle.nn.init.trunc_normal_, and paddle.nn.init.zeros_ (usage sketched after this list). #74478

  • Added parameter aliases to a number of APIs, allowing more flexible keyword usage (e.g., either x or input), in functions including paddle.maximum, paddle.minimum, paddle.sqrt, paddle.topk, paddle.polar, paddle.stack, paddle.cos, paddle.floor, paddle.log, paddle.pow, paddle.rsqrt, paddle.sign, paddle.sin, paddle.multiply, and paddle.where (see the parameter sketch after this list). #74683, #74795, #74887, #74592

  • paddle.Tensor now supports multiple initialization methods, enabling flexible Tensor creation. #74619, #75022, #75065

  • Added function-specific parameters to enhance existing APIs, including paddle.nn.functional.gelu, paddle.divide/div/div_, paddle.add, paddle.Tensor.copy_, paddle.norm, paddle.linalg.norm, paddle.nn.functional.silu, and paddle.repeat_interleave. #74485, #74562, #74420, #74768, #74855, #74903, #74788, #74631, #74947

  • Added the common parameters out, device, dtype, requires_grad, pin_memory, and bias to enhance existing APIs, including paddle.zeros, paddle.zeros_like, paddle.ones, paddle.ones_like, paddle.arange, paddle.eye, paddle.empty, paddle.empty_like, paddle.full, paddle.full_like, paddle.randn, paddle.Tensor.new_full, paddle.Tensor.new_empty, paddle.Tensor.new_ones, paddle.Tensor.new_zeros, paddle.tril/triu, paddle.bmm, paddle.nn.Conv1D/Conv2D/Conv3D/Embedding, paddle.diff, paddle.cumsum, paddle.var, paddle.multinomial, and paddle.mean (see the parameter sketch after this list). #74477, #74526, #74711, #74582, #74624, #74849, #74612, #74875, #74641, #74949, #74918, #74914, #74934, #74920, #74955, #74226, #74946

  • Added aliases to APIs to support more calling conventions, including paddle.Tensor.mul_/mul, paddle.autograd.Function, paddle.argwhere, paddle.cat, paddle.clamp, paddle.ger, paddle.take_along_dim, paddle.linalg.matmul, paddle.special.logsumexp, paddle.concatenate, paddle.eq/gt, paddle.Tensor.take_along_dim, paddle.nn.Conv1d/Conv2d/Conv3d, etc. (see the alias sketch after this list). #74493, #74569, #74870
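
For a quick illustration of a few of the new tensor APIs listed above, here is a minimal sketch; it assumes PaddlePaddle 3.2 and that paddle.ravel, paddle.msort, and paddle.Tensor.repeat follow the common industry (PyTorch-like) semantics.

```python
import paddle

x = paddle.to_tensor([[3.0, 1.0], [2.0, 4.0]])

# Flatten to 1-D (assumed to mirror the common `ravel` semantics).
flat = paddle.ravel(x)

# Sort along the first dimension (assumed to mirror the common `msort` semantics).
sorted_rows = paddle.msort(x)

# Tile the tensor 2x along dim 0 and 1x along dim 1
# (assumed to mirror the common `Tensor.repeat` semantics).
tiled = x.repeat(2, 1)

print(flat.shape, sorted_rows.shape, tiled.shape)  # [4] [2, 2] [4, 2]
```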
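
A minimal sketch of the paddle.compat.* namespace, assuming paddle.compat.split takes a chunk size (rather than a number of sections) and that paddle.compat.sort returns a (values, indices) pair, in line with common industry conventions:

```python
import paddle

x = paddle.arange(10, dtype='float32')

# Assumed chunk-size semantics: pieces of size 4, with a smaller final chunk.
chunks = paddle.compat.split(x, 4)
print([tuple(c.shape) for c in chunks])  # e.g. [(4,), (4,), (2,)]

# Assumed (values, indices) return, matching the common convention.
values, indices = paddle.compat.sort(paddle.to_tensor([3.0, 1.0, 2.0]))
```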
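
The new initializers follow the familiar in-place convention; a minimal sketch, assuming the listed paddle.nn.init functions mirror their widely used counterparts:

```python
import math
import paddle

linear = paddle.nn.Linear(128, 256)

# In-place Kaiming (He) uniform initialization of the weight matrix
# (the `a` argument is assumed to match the common signature).
paddle.nn.init.kaiming_uniform_(linear.weight, a=math.sqrt(5))

# Zero the bias in place.
paddle.nn.init.zeros_(linear.bias)

# Recommended gain for a nonlinearity, fed into Xavier initialization.
gain = paddle.nn.init.calculate_gain('relu')
paddle.nn.init.xavier_uniform_(linear.weight, gain=gain)
```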
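
A sketch of the parameter aliases and the new common creation-op keywords; the keyword names (input, device, requires_grad, out) are taken from the lists above and assumed to behave like their industry counterparts:

```python
import paddle

t = paddle.to_tensor([1.0, 4.0, 9.0])

# The native keyword `x` and the industry-common alias `input`
# are assumed to be interchangeable after this change.
a = paddle.sqrt(x=t)
b = paddle.sqrt(input=t)
assert paddle.allclose(a, b)

# Common keywords on creation ops (names assumed per the list above).
w = paddle.zeros([4, 4], dtype='float32', device='cpu', requires_grad=True)

# An explicit output tensor via `out` (assumed to be filled in place and returned).
out = paddle.empty([4, 4], dtype='float32')
paddle.ones([4, 4], out=out)
```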
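
And a sketch of the API aliases, assuming they map one-to-one onto the existing Paddle names (paddle.cat onto paddle.concat, paddle.clamp onto paddle.clip, Tensor.mul onto Tensor.multiply):

```python
import paddle

a = paddle.to_tensor([[1.0, 2.0]])
b = paddle.to_tensor([[3.0, 4.0]])

# Alias of paddle.concat.
c = paddle.cat([a, b])

# Alias of paddle.clip.
d = paddle.clamp(c, min=1.5, max=3.5)

# Alias of paddle.Tensor.multiply.
e = a.mul(b)
```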

Bug fixes

  • Fixed a precision issue in paddle.nanmedian. #74263

  • Fixed an issue with paddle.distributed.fleet.utils.hybrid_parallel_util.fused_allreduce_gradients in 0-D tensor scenarios. #74957

  • Fixed an issue with paddle.matmul in distributed mode. #74989

Enhanced functionality

  • For APIs that return multiple Tensor objects, such as paddle.topk, the results are now wrapped in a Paddle data structure for a better user experience (see the sketch after this list). #74931

  • Creation-style APIs now support variadic size arguments for more flexible usage. #74494
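
A sketch of the wrapped multi-Tensor return; tuple-style unpacking is assumed to keep working, and the attribute names values/indices are an assumption here:

```python
import paddle

x = paddle.to_tensor([1.0, 5.0, 3.0, 2.0])
result = paddle.topk(x, k=2)

# Tuple-style unpacking continues to work.
values, indices = result

# Attribute-style access on the wrapping structure (field names assumed).
print(result.values, result.indices)
```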

Documents

Other

2. Basic execution architecture

New features

  • Support for dynamic graphs. #74484

  • Support for safetensors. #74642, #74609, #75049

  • Added offloader to optimize computation efficiency. #74837

  • Added API support for forward computation of conv_transpose. #74431

  • Inference deployment now supports w4afp8 quantized inference, including pure permutation of w4afp8 quantized weights and all2all communication. #74270

Bug fixes

Enhanced functionality

Deprecated

Other

  • Update patch version. #74940

3. Distributed & automatic parallelism

Parallel strategy

In version 3.2, we have made multiple enhancements to pipeline parallelism, including support for passing dictionary arguments and extending PipelineLayer and SharedLayerDesc to work without pipeline parallelism. We have also fixed several critical issues: IPC API exceptions for large tensors, problems with evaluation batches and non-compute-loss steps in pipeline parallelism, gradient release errors in MoE models, hangs caused by rebuilding NCCL communicators in PP scenarios, and event management errors in dual pipeline parallelism. In addition, we have carried out various performance optimizations, improving the computation-overlap efficiency of dual pipeline parallelism to enhance training performance, and upgraded the clear_param_storage method to support clearing and resetting multiple color collections in sharding mode.

New Features

  • Implement support for dictionary parameter passing in Pipeline Parallel. #74574, #74867

  • PipelineLayer and SharedLayerDesc support non-pipeline parallelism (non-pp parallel). #74573

Bug fixes

  • Fixed the IPC API issue with large-sized tensors. #74472

  • Fixed issues related to evaluation batch and non-compute_loss in pipeline parallelism. #74170

  • Fixed the gradient release issue on MoE model. #74972

  • Fixed the hang issue when rebuilding NCCL comm in the pp scenario. #73625

  • Fixed the event management error in dual pipeline parallelism (dual pp). #74158

Optimization and improvement

  • Optimize the computation-overlap efficiency of dual pipeline parallelism to enhance training performance. #74527

  • Upgrade the clear_param_storage method to support the clearing and resetting of multiple color collections under sharding. #74741

Automatic parallelism

Functional improvements

  • Support the default sharding derivation rule for the case where the same dimension of a distributed tensor is sharded across multiple mesh dimensions (see the sketch after this list). #74396

  • Improved the sharding derivation rule of the reshape operator to support scenarios where the same dimension of a distributed tensor is sharded across multiple mesh dimensions. #74352, #74579, #74565

  • Support changing the mesh of a tensor without altering the distributed tensor data. #74248
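
A minimal sketch of the case covered by the new derivation rules: the same tensor dimension sharded across two mesh dimensions. It assumes the dynamic-graph auto-parallel API (dist.ProcessMesh, dist.shard_tensor, dist.Shard) and is meant to be launched on four devices with paddle.distributed.launch.

```python
import paddle
import paddle.distributed as dist

# A 2x2 process mesh; launch with e.g.
#   python -m paddle.distributed.launch --devices 0,1,2,3 this_script.py
mesh = dist.ProcessMesh([[0, 1], [2, 3]], dim_names=["x", "y"])

x = paddle.rand([8, 16])

# Shard dimension 0 of `x` along both mesh dimensions; the default
# derivation rule now handles this multi-mesh-dim case.
dist_x = dist.shard_tensor(x, mesh, [dist.Shard(0), dist.Shard(0)])

print(dist_x.placements)
```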

Bug fixes

  • Fixed the bug of repeatedly creating communication groups when calling the get_group method of ProcessMesh. #73099

  • Fixed the bug in the get_local_slices method in the MoE scenario. #74705

  • Fixed the bug of gradient clipping in the MoE scenario. #74916

  • Fixed the bug where the stop_gradient parameter could not be passed between different stages in the pipeline parallel scenario. #73459

  • Fixed an accuracy bug in gradient clipping in pipeline parallel scenarios. #74409

  • Fixed the bug of generating redundant outputs in the dynamic graph pipeline parallel scenario. #74913

  • Fixed the bug that the operators moe_combine and moe_gate_dispatch did not work in the MoE scenario. #74645

Other

  • Support accuracy alignment of data loaders between manual and automatic parallelism. #73941

  • Optimize the dynamic graph pipeline parallel scheduling logic. #74720

Communication Library

In version 3.2, we fixed an error in DeepEP's support for sm90 compilation, added a pre-allocation option for the GPU memory requested by DeepEP, and upgraded its intranode and internode compute kernels, further improving performance and stability.

Bug fixes

  • Fixed a bug in DeepEP support for sm90 compilation. #74762

Functional improvements

  • Added pre-allocation function for the GPU memory allocation requested by DeepEP. #74465

  • Upgraded the intranode and internode computation kernels of DeepEP. #74284

4. Operator mechanism

New features

Bug fixes

Enhanced functionality

Performance optimization

  • FP8 computation optimization. #74471, #74684, #74911

  • Basic operator performance optimization. #74442, #74638

  • Support FA3 (FlashAttention-3) variable-length sequence backward computation and optimize the forward API. #73831

  • Added FlashMask V2 function. #74729

Documents

  • Fixed issues with English documentation and copyright year. #74737

Other

  • The WITH_XPU_FFT option is enabled by default on XPU hardware. #74699

5. Hardware adaptation

Improved CUDA-like hardware integration solution

  • The CUDA-like hardware access solution supports reuse of cuBLAS kernels. #74591

  • Fixed known issues in the CUDA-like hardware access solution. #74397, #74411, #74428, #74877, #74939

Main repository supports unit tests on multiple hardware backends

New Custom Device API Support

6. Installation environment

Bug fixes

  • Fixed a bug in the FlashAttention compilation cache. #74388

  • Fixed the bug where site.USER_SITE was None. #74373

  • Fixed the compilation bug of gtest in multi-architecture Linux systems. #74723

  • Fixed multiple compilation errors in DEBUG mode when WITH_GPU=ON. #74401

  • Fixed a compilation bug with CUDA 12.6 on Windows. #74990

  • Fixed bugs in the api-benchmark baseline pipeline. #74770, #74778, #74779, #74780, #74800, #74803

Other

  • Disable the test_custom_contiguous unit test. #74337

  • Support for timed triggering of baseline tasks in the slice pipeline. #74419

  • Support manually specifying the pr for adding slice recording baselines. #74445

  • Check if there are any issues in the code. #74460

  • Support CI PaddleX tasks on XPU. #74426

  • Support slice pipeline exemption mechanism. #74482

  • Updated the Paddle base image. #73423

  • Pinned the Ninja version to 1.11 for Windows. #74590

  • Added the ability to close PRs and cancel CI runs. #74604

  • Support for quickly skipping all CI. #74696

  • Add an api-benchmark baseline pipeline. #74690

  • Update the NCCL version. #74809

  • Update the RD list for the approve pipeline. #74838, #74902

  • Update safetensors to use the mirror source. #74904

  • Added a compilation flag for FlashAttention. #74959

  • Temporarily disable the win-inference pipeline. #74980

  • Support for compiling phi dynamic libraries on Windows. #74950

7. List of contributors

AIbin, Ayakouji, baiyue, baoqiwen, Chang Lu, Chen Zhiyang, co63oc, cyberslack_lee, cyy536, datutu-L, Deng Haodong, Difer, Eddie-Wang, enzodechine, fangfangssj, feri, fxyfxy777, ggggxm, GoldPancake, gouzil, Gu Shiwei, Haze188 灏喆, hohdiy, hong, HU Shenwei, huangjiyi, HydrogenSulfate, kjagsdq, LCStayingdullCircuit, Leo Guo, lightbrother, liufengwei0103, liuruyan, LiYuRio, LLSGYN, Lucas, Luckycheng222, lzy, Nana, Nyakku Shigure, ooo oo, Qianyue He, risemeup1, Ruibiao Chen, Ryan, Shuhao Liang, sneaxiy, Starrysea996, SUN Dong, Tao Luo, Tian, tianhaodongbd, tianshuo78520a, umiswing, waliwali777, wanghuancoder, Wenhao.Dai, wyw, XiaoguangHu, xiaoguoguo626807, xingmingyyj, Yichen Zhang, Yohanna, yongqiangma, Yuan Xiaolan, YUNSHEN XIE, Yuntao Nie, Yuqiang Ge, Yutian Rao, Zero Rains, Zhan Rongrui, Zhang Ting, zhanghonggeng, Zhaowu Pan, zhengshengning, ZhenxingLi, Zhou Xin, zhupengyang, zhwesky2010, Zichao, zty-king, Zx, zyfncg, zzm, 周周周, 正在学习, 苍天荒