scaled_dot_product_attention

paddle.nn.functional.scaled_dot_product_attention ( query: Tensor, key: Tensor, value: Tensor, attn_mask: Tensor | None = None, dropout_p: float = 0.0, is_causal: bool = False, training: bool = True, backend: str | None = None, scale: float | None = None, enable_gqa: bool = True, name: str | None = None ) → Tensor

The equation is:

\[result = softmax\left(\frac{Q K^T}{\sqrt{d}}\right) V\]

where Q, K, and V represent the three input tensors of the attention module. The three tensors have the same dimensions, and d represents the size of their last dimension (head_dim).
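As a sanity check, the formula above can be reproduced with basic tensor ops. The following is an illustrative sketch only (not part of this API), assuming a device that supports float16; the transposes adapt the [batch_size, seq_len, num_heads, head_dim] layout to per-head matrix multiplication, and the two results should agree up to dtype precision.

>>> import math
>>> import paddle
>>> import paddle.nn.functional as F
>>> q = paddle.rand((1, 128, 2, 16), dtype=paddle.float16)   # [batch, seq_len, num_heads, head_dim]
>>> out_api = F.scaled_dot_product_attention(q, q, q)
>>> # Manual reference: move num_heads in front of seq_len for per-head matmul.
>>> qt = paddle.transpose(q, [0, 2, 1, 3])                    # [batch, num_heads, seq_len, head_dim]
>>> scores = paddle.matmul(qt, qt, transpose_y=True) / math.sqrt(q.shape[-1])
>>> out_ref = paddle.matmul(F.softmax(scores, axis=-1), qt)
>>> out_ref = paddle.transpose(out_ref, [0, 2, 1, 3])         # back to [batch, seq_len, num_heads, head_dim]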

Warning

This API is only verified for inputs with dtype float16 and bfloat16; other dtypes may fall back to the math implementation, which is less optimized.

Warning

If is_causal is set to True, attn_mask should not be provided; any mask that is passed will be ignored.
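For illustration, a causal call passes no mask at all (a minimal sketch, assuming a float16-capable device):

>>> import paddle
>>> import paddle.nn.functional as F
>>> q = paddle.rand((1, 128, 2, 16), dtype=paddle.float16)
>>> # Causal mode: each position attends only to itself and earlier positions.
>>> out = F.scaled_dot_product_attention(q, q, q, attn_mask=None, is_causal=True)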

Note

This API differs from paddle.compat.nn.functional.scaled_dot_product_attention in that:
  1. The QKV layout of this API is [batch_size, seq_len, num_heads, head_dim] or [seq_len, num_heads, head_dim].


If you need num_heads before seq_len layout, please use paddle.compat.nn.functional.scaled_dot_product_attention.
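Alternatively, tensors that are already in the num_heads-first layout can be transposed before and after calling this API. This is a sketch for illustration only, not a recommendation over the compat API:

>>> import paddle
>>> import paddle.nn.functional as F
>>> x = paddle.rand((1, 2, 128, 16), dtype=paddle.float16)    # [batch, num_heads, seq_len, head_dim]
>>> x_sl = paddle.transpose(x, [0, 2, 1, 3])                  # [batch, seq_len, num_heads, head_dim]
>>> out = F.scaled_dot_product_attention(x_sl, x_sl, x_sl)
>>> out = paddle.transpose(out, [0, 2, 1, 3])                 # back to num_heads-first if needed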

Parameters
  • query (Tensor) – The query tensor in the Attention module. 4-D tensor with shape: [batch_size, seq_len_query, num_heads, head_dim]. 3-D tensor with shape: [seq_len_query, num_heads, head_dim]. The dtype can be float16 or bfloat16.

  • key (Tensor) – The key tensor in the Attention module. 4-D tensor with shape: [batch_size, seq_len_key, num_heads, head_dim]. 3-D tensor with shape: [seq_len_key, num_heads, head_dim]. The dtype can be float16 or bfloat16.

  • value (Tensor) – The value tensor in the Attention module. 4-D tensor with shape: [batch_size, seq_len_value, num_heads, head_dim]. 3-D tensor with shape: [seq_len_value, num_heads, head_dim]. The dtype can be float16 or bfloat16.

  • attn_mask (Tensor, optional) – The attention mask tensor. The shape should be broadcastable to [batch_size, num_heads, seq_len_query, seq_len_key]. The dtype can be bool or the same type as query. A bool mask indicates which positions take part in attention; a non-bool mask is added to the attention scores. See the sketch after this parameter list.

  • dropout_p (float, optional) – The dropout ratio.

  • is_causal (bool, optional) – Whether to enable causal mode. Default is False.

  • training (bool, optional) – Whether it is in the training phase.

  • backend (str, optional) – Specifies which backend is used to compute scaled dot product attention. Currently only “p2p” is supported, for distributed usage. Default is None.

  • scale (float, optional) – The scaling factor used in the calculation of attention weights. If None, scale = 1 / sqrt(head_dim).

  • enable_gqa (bool, optional) – Whether to enable GQA (Group Query Attention) mode, in which the key and value tensors may have fewer heads than the query tensor. Default is True.

  • name (str|None, optional) – The default value is None. Normally there is no need for the user to set this property. For more information, please refer to api_guide_Name.
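As a sketch of the attn_mask parameter (illustrative only; the shapes and dtype are assumptions consistent with the descriptions above), a boolean mask with a lower-triangular pattern could be built as follows:

>>> import paddle
>>> import paddle.nn.functional as F
>>> q = paddle.rand((1, 128, 2, 16), dtype=paddle.float16)
>>> # Boolean mask broadcastable to [batch_size, num_heads, seq_len_query, seq_len_key];
>>> # True marks positions that take part in attention.
>>> mask = paddle.tril(paddle.ones((128, 128))).astype(paddle.bool).reshape([1, 1, 128, 128])
>>> out = F.scaled_dot_product_attention(q, q, q, attn_mask=mask)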

Returns

The attention tensor.

4-D tensor with shape: [batch_size, seq_len, num_heads, head_dim]. 3-D tensor with shape: [seq_len, num_heads, head_dim]. The dtype can be float16 or bfloat16.

Return type

out (Tensor)

Examples

>>> # This example assumes a device that supports bfloat16.
>>> import paddle
>>> q = paddle.rand((1, 128, 2, 16), dtype=paddle.bfloat16)
>>> # Self-attention with dropout_p=0.9 and is_causal=False.
>>> output = paddle.nn.functional.scaled_dot_product_attention(q, q, q, None, 0.9, False)
>>> print(output)