moe_reduce¶
- paddle.incubate.nn.functional.moe_reduce ( ffn_out: Tensor, expert_scales_float: Tensor, permute_indices_per_token: Tensor, top_k_indices: Tensor, ffn2_bias: Tensor | None = None, norm_topk_prob: bool = False, routed_scaling_factor: float = 1.0 ) → Tensor [source]
-
Reduces the outputs from experts back to the original token order.
This function gathers the outputs from different experts and combines them according to the original token positions. It also applies scaling factors to the outputs.
- Parameters
-
ffn_out (Tensor) – The output tensor from experts’ FFN computation, with shape [total_tokens, d_model].
expert_scales_float (Tensor) – The scaling factors for each expert’s outputs, with shape [batch_size * seq_len, moe_topk, 1, 1].
permute_indices_per_token (Tensor) – The index mapping from expert outputs to original token positions.
top_k_indices (Tensor) – The indices of the selected experts for each token.
ffn2_bias (Optional[Tensor]) – The biases for the second FFN layer, with shape [num_experts, 1, d_model].
norm_topk_prob (bool) – Whether to normalize the top-k probabilities so they sum to 1 for each token. Default is False.
routed_scaling_factor (float) – The scaling factor applied to the routing probabilities. Default is 1.0.
- Returns
-
The final output tensor with shape [batch_size * seq_len, d_model].
- Return type
-
Tensor
Examples
>>> import paddle
>>> from paddle.incubate.nn.functional import moe_reduce
>>> ffn_out = paddle.randn([7680, 768])  # 7680 = bs * 128 * 6
>>> ffn2_bias = paddle.randn([48, 1, 768])
>>> expert_scales_float = paddle.rand([1280, 6, 1, 1])
>>> permute_indices_per_token = paddle.to_tensor([6, 1280], dtype='int32')
>>> top_k_indices = paddle.to_tensor([1280, 6], dtype='int32')
>>> norm_topk_prob = False
>>> output = moe_reduce(
...     ffn_out,
...     expert_scales_float,
...     permute_indices_per_token,
...     top_k_indices,
...     ffn2_bias,
...     norm_topk_prob,
... )
>>> print(output.shape)
# Output: [1280, 768]
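Conceptually, the reduction combines each token's top-k expert outputs into a single vector using the routing weights. The following NumPy sketch illustrates that weighted combination only; it is not the fused kernel, and all shapes and variable names here are illustrative assumptions, not part of the API.

```python
import numpy as np

# Illustrative shapes (assumptions, not the API's): 4 tokens, top-2 routing,
# model width 8.
num_tokens, top_k, d_model = 4, 2, 8

# Per-(token, k) expert FFN outputs, already gathered into token order.
expert_out = np.random.randn(num_tokens, top_k, d_model)
# Routing weight for each selected expert of each token.
scales = np.random.rand(num_tokens, top_k, 1)

norm_topk_prob = True
if norm_topk_prob:
    # Normalize the top-k probabilities so they sum to 1 per token.
    scales = scales / scales.sum(axis=1, keepdims=True)

routed_scaling_factor = 1.0
# Weighted sum over the top-k experts, then apply the routed scaling factor.
output = (expert_out * scales).sum(axis=1) * routed_scaling_factor
print(output.shape)  # (4, 8)
```

The final tensor has one row per token, matching the `[batch_size * seq_len, d_model]` shape documented above.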