3.1 Release Note
PaddlePaddle framework 3.2 further improves performance in large model training and inference, extends hardware adaptation, and strengthens support for mainstream large models and high-performance acceleration libraries.
In terms of large model training, the PaddlePaddle framework has undergone upgrades in three aspects: computation, parallel strategy, and fault tolerance:
In terms of basic computational performance, FlashMask V3, a sparse-mask attention computation that overlaps storage and compute, is introduced to maximize the efficiency of attention (a reference sketch follows this list). In addition, an efficient, effectively lossless FP8 mixed-precision training technique is implemented.
At the level of distributed parallel strategy, a dynamically adaptive memory offloading strategy achieves an optimal balance between memory and computation; combined with an innovatively designed memory-friendly pipeline parallel schedule, it further reduces GPU memory overhead.
Enhanced the framework's native fault tolerance by implementing a fault tolerance system for large-scale cluster training, which monitors silent data corruption and other hard-to-detect faults online without affecting training efficiency, and by implementing a highly available checkpoint disaster recovery method that reduces the cost of recovering from interruptions.
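Returning to the attention upgrade above: the snippet below is a plain-operator reference for what sparse-mask attention computes. A FlashMask-style kernel produces the same result without materializing the dense mask, which is where the storage/compute overlap pays off. This is an illustrative sketch, not the FlashMask V3 kernel or its API.

```python
import paddle
import paddle.nn.functional as F

# Reference (unfused) masked attention with illustrative shapes:
# [batch, heads, seq_len, head_dim]. A FlashMask-style kernel computes the
# same output while keeping the mask in a compact form instead of this
# dense [seq_len, seq_len] tensor.
q = paddle.randn([2, 8, 128, 64])
k = paddle.randn([2, 8, 128, 64])
v = paddle.randn([2, 8, 128, 64])
mask = paddle.tril(paddle.ones([128, 128]))  # causal mask as one example

scores = paddle.matmul(q, k, transpose_y=True) / 64**0.5
scores = paddle.where(mask.astype("bool"), scores, paddle.full_like(scores, -1e9))
out = paddle.matmul(F.softmax(scores, axis=-1), v)
```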
In terms of hardware adaptation, the plug-in adaptation solution for CUDA-like chips has been comprehensively upgraded. For device resource management and scheduling as well as the high-performance collective communication library, management interfaces have been upgraded and communication capabilities enhanced for CUDA-like chips, with particular emphasis on distributed communication: XCCL is now aligned with the structures and functions of NCCL.
Added a registration mechanism for CUDA-like operators. Taking the Muxi adaptation as an example, an operator kernel can be registered with a single line of code that reuses the corresponding GPU kernel; measured kernel reuse reaches up to 92%, significantly reducing hardware adaptation costs. On the user experience side, the focus is on compatibility: development interfaces compatible with industry practices, support for the SafeTensors model format, and compatibility with third-party high-performance acceleration libraries.
Newly added and modified development interfaces follow industry practices: a series of new APIs and API aliases have been introduced, along with parameter aliases and both API-specific and common parameters.
Fully compatible with the SafeTensors model format (see the SafeTensors sketch after this list). The newly added FlexCheckpoint mechanism supports automatic parameter resharding across distributed strategies and model structures, significantly reducing the cost of weight conversion and thereby improving end-to-end training and inference development efficiency for large models.
Interface compatibility and operator registration capabilities have been systematically enhanced, enabling one-click integration of high-performance acceleration libraries, which can be reused directly in PaddlePaddle training and inference acceleration without code modifications.
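As a minimal illustration of the SafeTensors side of this (using the standalone safetensors package rather than the FlexCheckpoint interface itself; the file name and model are placeholders):

```python
import paddle
from safetensors.numpy import load_file, save_file

# Round-trip a model's weights through the SafeTensors format.
model = paddle.nn.Linear(128, 64)  # stand-in for a real model
save_file({k: v.numpy() for k, v in model.state_dict().items()},
          "model.safetensors")

state = load_file("model.safetensors")  # dict of numpy arrays
model.set_state_dict({k: paddle.to_tensor(v) for k, v in state.items()})
```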
1. User experience
New features
New APIs: paddle.msort, paddle.ravel, paddle.nn.functional.dropout1d, paddle.Tensor.type_as, paddle.Tensor.requires_grad, paddle.view_as_complex, paddle.view_as_real, paddle.nn.Parameter, paddle.broadcast_shapes, paddle.range, paddle.as_tensor, paddle.scatter_reduce/scatter_reduce_, paddle.scatter_add, paddle.tensor, paddle.softmax, paddle.Tensor.softmax, paddle.rand_like, paddle.is_autocast_enabled, paddle.get_autocast_gpu_dtype, paddle.Tensor.repeat, paddle.permute. #74421, #74439, #74444, #74454, #74459, #74491, #74466, #74438, #74594, #74542, #74694, #74564, #74540, #74586, #74651, #74807, #74632, #74834, #74952, #74772, #74441, #74561, #74525
Added a series of APIs under paddle.compat.* to support common industry usage and ease code migration (see the paddle.compat.split sketch after this list), including paddle.compat.median, paddle.compat.nanmedian, paddle.compat.softmax, paddle.compat.sort, paddle.compat.split, paddle.compat.min/max, and paddle.compat.Unfold. #74865, #74874
Added a series of initialization APIs covering commonly used parameter initialization methods in the industry (see the initializer sketch after this list), including paddle.nn.init.kaiming_uniform_, paddle.nn.init.xavier_uniform_, paddle.nn.init.uniform_, paddle.nn.init.kaiming_normal_, paddle.nn.init.xavier_normal_, paddle.nn.init.normal_, paddle.nn.init.calculate_gain, paddle.nn.init.constant_, paddle.nn.init.dirac_, paddle.nn.init.eye_, paddle.nn.init.ones_, paddle.nn.init.orthogonal_, paddle.nn.init.trunc_normal_, and paddle.nn.init.zeros_. #74478
Added parameter aliases to APIs, allowing more flexible argument naming such as x or input (see the alias sketch after this list). This covers paddle.maximum, paddle.minimum, paddle.sqrt, paddle.topk, paddle.polar, paddle.stack, paddle.cos, paddle.floor, paddle.log, paddle.pow, paddle.rsqrt, paddle.sign, paddle.sin, paddle.multiply, and paddle.where. #74683, #74795, #74887, #74592
paddle.Tensor now supports multiple initialization methods, enabling flexible Tensor creation. #74619, #75022, #75065
Added API-specific parameters to enhance existing functionality, including paddle.nn.functional.gelu, paddle.divide/div/div_, paddle.add, paddle.Tensor.copy_, paddle.norm, paddle.linalg.norm, paddle.nn.functional.silu, and paddle.repeat_interleave. #74485, #74562, #74420, #74768, #74855, #74903, #74788, #74631, #74947
Added the common parameters out, device, dtype, requires_grad, pin_memory, and bias to enhance existing functionality. These include paddle.zeros, paddle.zeros_like, paddle.ones, paddle.ones_like, paddle.arange, paddle.eye, paddle.empty, paddle.empty_like, paddle.full, paddle.full_like, paddle.randn, paddle.Tensor.new_full, paddle.Tensor.new_empty, paddle.Tensor.new_ones, paddle.Tensor.new_zeros, paddle.tril/triu, paddle.bmm, paddle.nn.Conv1D/Conv2D/Conv3D/Embedding, paddle.diff, paddle.cumsum, paddle.var, paddle.multinomial, and paddle.mean. #74477, #74526, #74711, #74582, #74624, #74849, #74612, #74875, #74641, #74949, #74918, #74914, #74934, #74920, #74955, #74226, #74946
Added aliases to APIs to support more calling conventions, including paddle.Tensor.mul_/mul, paddle.autograd.Function, paddle.argwhere, paddle.cat, paddle.clamp, paddle.ger, paddle.take_along_dim, paddle.linalg.matmul, paddle.special.logsumexp, paddle.concatenate, paddle.eq/gt, paddle.Tensor.take_along_dim, and paddle.nn.Conv1d/Conv2d/Conv3d, etc. #74493, #74569, #74870
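A migration-oriented sketch of paddle.compat.split, assuming it follows the convention common in other frameworks of treating an integer argument as the chunk size (paddle.split treats it as the number of sections); the dim keyword is likewise an assumed compatibility spelling:

```python
import paddle

x = paddle.arange(12).reshape([6, 2])

# Native semantics: 3 means "split into 3 equal sections" -> three [2, 2] tensors.
a, b, c = paddle.split(x, 3, axis=0)

# Assumed compat semantics: 3 means "each chunk has size 3" -> two [3, 2] tensors.
p, q = paddle.compat.split(x, 3, dim=0)
```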
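A sketch of the initializer APIs, assuming the trailing-underscore functions modify a parameter in place and accept the usual arguments of the convention they mirror:

```python
import math
import paddle

linear = paddle.nn.Linear(128, 64)

# In-place initialization of existing parameters (assumed in-place semantics).
paddle.nn.init.kaiming_uniform_(linear.weight, a=math.sqrt(5))
paddle.nn.init.zeros_(linear.bias)

# calculate_gain returns the recommended scaling factor for a nonlinearity;
# passing it as `gain` is also an assumption here.
gain = paddle.nn.init.calculate_gain("relu")
paddle.nn.init.xavier_uniform_(linear.weight, gain=gain)
```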
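The parameter aliases make either spelling of the first argument valid, for example:

```python
import paddle

t = paddle.to_tensor([4.0, 9.0])

out1 = paddle.sqrt(x=t)      # native parameter name
out2 = paddle.sqrt(input=t)  # newly accepted alias
```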
Bug fixes
Enhanced functionality
Other
Optimization related to code style. #74654,#74655,#74665,#74660,#74667,#74664,#74662,#74661,#74658,#74657,#74666,#74659,#74663,#74656,#74673,#74672,#74671,#74674,#74675,#74670,#74669,#74677,#74709,#74714,#74712,#74713,#74704,#74746,#74748,#74743,#74742,#74744,#74745,#74747,#74794,#74789,#74793,#74786,#74791,#74787,#74827,#74608,#74288,#74287,#74385,#74395,#74475,#74647
Optimization related to MKLDNN/ONEDNN. #74299,#74244,#74230,#74314,#74327,#74325,#74326,#74315,#74399,#74398,#74393,#74392,#74367,#74391,#74423,#74424,#74436,#74417,#74410,#74473,#74458,#74501,#74487,#74502,#74513,#74518,#74516,#74507,#74504,#74505,#74509,#74535,#74536,#74517,#74503,#74557,#74550,#74575,#74587,#74576,#74588,#74549,#74581,#74583,#74628,#74630,#74635,#74679,#74648,#74127,#74636,#74552,#74551,#74678,#74680,#74730,#74751,#74895,#74821,#74897,#74734
Optimizations related to code implementation, variable and file renaming. #74309, #74597, #74613, #74376, #74479, #74960, #74968, #74977
Optimizations and bug fixes for unit tests. #74595
Compilation-related optimizations and CI issue fixes. #74356, #74936
Optimized debugging and print information, and improved error messages. #74765, #74381, #74384, #74386, #74387, #74383, #74519, #74520, #74468
Optimizations related to custom operators. #74402
Distributed FlexCheckpoint support. #74966, #74593, #74785, #74814
2. Basic execution architecture
New features
Support for dynamic graphs. #74484
Added offloader to optimize computation efficiency. #74837
Added API support for forward computation of conv_transpose. #74431
Inference deployment now supports w4afp8 quantized inference, including pure permutation of w4afp8 quantized weights and all2all communication. #74270
Bug fixes
Fixes to the core framework and infrastructure. #74336, #74554, #74634
Fixes to computation accuracy and type handling. #74278, #74222, #74830
Fixed dynamic dimension check logic. #74633, #74650
Fixed printing of error/warning messages. #74474, #74533, #74685, #74721, #74754
Fixed the processing logic of the FlashMask API. #74928
Fixed the issue where splitting CudaGraph subgraphs did not take effect in dynamic-to-static mode. #74749
Enhanced functionality
Deprecated
3. Distributed & automatic parallelism
Parallel strategy
In version 3.2, we made multiple enhancements to pipeline parallelism, including support for passing dictionary parameters and extending PipelineLayer and SharedLayerDesc to work without pipeline parallelism. We also fixed several critical issues: IPC API exceptions for large tensors, handling of evaluation batches and non-computational losses in pipeline parallelism, gradient release errors in MoE models, hangs caused by rebuilding NCCL communicators in PP scenarios, and event management errors in dual pipeline parallelism. In addition, we made various performance optimizations, improving the computation overlap efficiency of dual pipeline parallelism to boost training performance, and upgraded the clear_param_storage method to support clearing and resetting multiple color collections in sharding mode.
New features
Bug fixes
Fixed the IPC API issue with large-sized tensors. #74472
Fixed issues related to evaluation batch and non-compute_loss in pipeline parallelism. #74170
Fixed the gradient release issue on MoE model. #74972
Fixed the hang issue when rebuilding NCCL comm in the pp scenario. #73625
Fixed the event management error in dual pipeline parallelism (dual pp). #74158
Automatic parallelism
Functional improvements
Added support for the default sharding derivation rule when the same dimension of a distributed tensor is split across multiple mesh dimensions (see the sketch after this list). #74396
Improved the sharding derivation rule of the reshape operator to support scenarios where the same dimension of a distributed tensor is sharded by multiple mesh dimensions. #74352, #74579, #74565
Supported changing the mesh of a distributed tensor without altering its data. #74248
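A minimal sketch of the multi-mesh-dimension case, assuming the auto-parallel primitives used here (ProcessMesh, shard_tensor, Shard) and a 4-process launch; placing Shard(0) on both mesh axes splits the same tensor dimension across two mesh dimensions, the scenario the new derivation rules cover:

```python
import paddle
import paddle.distributed as dist

# 2x2 process mesh; the dim_names are illustrative.
mesh = dist.ProcessMesh([[0, 1], [2, 3]], dim_names=["x", "y"])

w = paddle.randn([8, 4])
# Shard dim 0 of `w` along BOTH mesh dimensions (a 4-way split of dim 0).
dw = dist.shard_tensor(w, mesh, [dist.Shard(0), dist.Shard(0)])
```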
Bug fixes
Fixed the bug of repeatedly creating communication groups when calling the get_group method of ProcessMesh. #73099
Fixed the bug in the get_local_slices method in the MoE scenario. #74705
Fixed the bug of gradient clipping in the MoE scenario. #74916
Fixed the bug where the stop_gradient parameter could not be passed between different stages in the pipeline parallel scenario. #73459
Fixed the accuracy bug of gradient clipping in pipeline parallel scenarios. #74409
Fixed the bug of generating redundant outputs in the dynamic graph pipeline parallel scenario. #74913
Fixed the bug where the moe_combine and moe_gate_dispatch operators did not work in the MoE scenario. #74645
Communication Library
In version 3.2, we fixed an error in DeepEP's sm90 compilation support, added pre-allocation of the GPU memory requested by DeepEP, and upgraded its intranode and internode kernels, further improving performance and stability.
4. Operator mechanism
New features
Bug fixes
Major Tensor-related fixes. #74242, #74293, #74289, #74279, #74330, #74329, #74342, #74369, #74370, #74404, #74537, #74451, #74172, #74324, #74964, #74360, #74379, #74377, #74380, #74362, #74197
[Open Source Task] Investigate and resolve precision issues in Paddle CPU/GPU Kernels. #74149, #74598, #74719, #74625, #74555
Other important fixes. #74282, #74313, #74303, #74306, #74298, #74044, #74290, #74348, #74364, #74332, #74224, #74382, #74406, #74434, #74448, #74457, #74322, #74530, #74716, #74839, #74842, #74854, #74919, #74767, #75003
Enhanced functionality
Improved API compatibility. #74456, #74480, #74523, #74490, #74548, #74596, #74568, #74559, #74629, #74623, #74700, #74643, #74602, #74783, #74781, #74735, #74725, #74815, #74856, #74925, #74545, #74932, #74784
Slice/stride related optimizations. #74731, #74740, #74769, #74810, #74841, #74954, #74888, #74944, #74312, #74291, #74271, #74320, #74344, #74727, #74637
Operator optimization and CUDA support. #74693, #74922, #74967
Improved debugging information and compatibility enhancements. #74372, #74622
Operator function expansion and optimization. #74790, #74979
Performance optimization
5. Hardware adaptation
Improved CUDA-like hardware integration solution
The main repository now supports running unit tests on multiple hardware backends
6. Installation environment
Bug fixes
Fixed the FlashAttention compilation cache bug. #74388
Fixed the bug where site.USER_SITE was None. #74373
Fixed the gtest compilation bug on multi-architecture Linux systems. #74723
Fixed multiple compilation errors in DEBUG mode when WITH_GPU=ON. #74401
Fixed the CUDA 12.6 compilation bug on Windows. #74990
Fixed bugs in the api-benchmark baseline pipeline. #74770, #74778, #74779, #74780, #74800, #74803
Other
Disabled the test_custom_contiguous unit test. #74337
Supported timed triggering of baseline tasks in the slice pipeline. #74419
Supported manually specifying the PR for adding slice recording baselines. #74445
Added a check for issues in the code. #74460
Supported CI PaddleX tasks on XPU. #74426
Supported the slice pipeline exemption mechanism. #74482
Updated the Paddle base image. #73423
Pinned Ninja to version 1.11 on Windows. #74590
Added the ability to close PRs and cancel CI runs. #74604
Supported quickly skipping all CI. #74696
Added an api-benchmark baseline pipeline. #74690
Updated the NCCL version. #74809
Updated the RD list for the approve pipeline. #74838, #74902
Updated safetensors in the mirror. #74904
Added the compilation flag for FlashAttention. #74959
Temporarily disabled the win-inference pipeline. #74980
Supported compiling phi dynamic libraries on Windows. #74950
7. List of contributors
AIbin, Ayakouji, baiyue, baoqiwen, Chang Lu, Chen Zhiyang, co63oc, cyberslack_lee, cyy536, datutu-L, Deng Haodong, Difer, Eddie-Wang, enzodechine, fangfangssj, feri, fxyfxy777, ggggxm, GoldPancake, gouzil, Gu Shiwei, Haze188 灏喆, hohdiy, hong, HU Shenwei, huangjiyi, HydrogenSulfate, kjagsdq, LCStayingdullCircuit, Leo Guo, lightbrother, liufengwei0103, liuruyan, LiYuRio, LLSGYN, Lucas, Luckycheng222, lzy, Nana, Nyakku Shigure, ooo oo, Qianyue He, risemeup1, Ruibiao Chen, Ryan, Shuhao Liang, sneaxiy, Starrysea996, SUN Dong, Tao Luo, Tian, tianhaodongbd, tianshuo78520a, umiswing, waliwali777, wanghuancoder, Wenhao.Dai, wyw, XiaoguangHu, xiaoguoguo626807, xingmingyyj, Yichen Zhang, Yohanna, yongqiangma, Yuan Xiaolan, YUNSHEN XIE, Yuntao Nie, Yuqiang Ge, Yutian Rao, Zero Rains, Zhan Rongrui, Zhang Ting, zhanghonggeng, Zhaowu Pan, zhengshengning, ZhenxingLi, Zhou Xin, zhupengyang, zhwesky2010, Zichao, zty-king, Zx, zyfncg, zzm, 周周周, 正在学习, 苍天荒