scaled_dot_product_attention

paddle.nn.functional.scaled_dot_product_attention ( query: Tensor, key: Tensor, value: Tensor, attn_mask: Tensor | None = None, dropout_p: float = 0.0, is_causal: bool = False, training: bool = True, backend: str | None = None, scale: float | None = None, enable_gqa: bool = True, name: str | None = None ) → Tensor

The equation is:

\[result = softmax\left(\frac{Q K^T}{\sqrt{d}}\right) V\]

where Q, K, and V represent the three input tensors of the attention module. The three tensors have the same dimensions, and d represents the size of their last dimension (head_dim).
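As a sanity check, the formula above can be reproduced with basic tensor ops. The following is an illustrative sketch only (not part of this API), assuming a device that supports float16; the transposes adapt the [batch_size, seq_len, num_heads, head_dim] layout to per-head matrix multiplication, and the two results should agree up to dtype precision.

>>> import math
>>> import paddle
>>> import paddle.nn.functional as F
>>> q = paddle.rand((1, 128, 2, 16), dtype=paddle.float16)   # [batch, seq_len, num_heads, head_dim]
>>> out_api = F.scaled_dot_product_attention(q, q, q)
>>> # Manual reference: move num_heads in front of seq_len for per-head matmul.
>>> qt = paddle.transpose(q, [0, 2, 1, 3])                    # [batch, num_heads, seq_len, head_dim]
>>> scores = paddle.matmul(qt, qt, transpose_y=True) / math.sqrt(q.shape[-1])
>>> out_ref = paddle.matmul(F.softmax(scores, axis=-1), qt)
>>> out_ref = paddle.transpose(out_ref, [0, 2, 1, 3])         # back to [batch, seq_len, num_heads, head_dim]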

Warning

This API is only verified for inputs with dtype float16 and bfloat16; other dtypes may fall back to the math implementation, which is less optimized.

Warning

If is_causal is set to True, attn_mask should not be provided; any mask that is passed will be ignored.
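For illustration, a causal call passes no mask at all (a minimal sketch, assuming a float16-capable device):

>>> import paddle
>>> import paddle.nn.functional as F
>>> q = paddle.rand((1, 128, 2, 16), dtype=paddle.float16)
>>> # Causal mode: each position attends only to itself and earlier positions.
>>> out = F.scaled_dot_product_attention(q, q, q, attn_mask=None, is_causal=True)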

Note

This API differs from paddle.compat.nn.functional.scaled_dot_product_attention in that:
  1. The QKV layout of this API is [batch_size, seq_len, num_heads, head_dim] or [seq_len, num_heads, head_dim].


If you need num_heads before seq_len layout, please use paddle.compat.nn.functional.scaled_dot_product_attention.
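Alternatively, tensors that are already in the num_heads-first layout can be transposed before and after calling this API. This is a sketch for illustration only, not a recommendation over the compat API:

>>> import paddle
>>> import paddle.nn.functional as F
>>> x = paddle.rand((1, 2, 128, 16), dtype=paddle.float16)    # [batch, num_heads, seq_len, head_dim]
>>> x_sl = paddle.transpose(x, [0, 2, 1, 3])                  # [batch, seq_len, num_heads, head_dim]
>>> out = F.scaled_dot_product_attention(x_sl, x_sl, x_sl)
>>> out = paddle.transpose(out, [0, 2, 1, 3])                 # back to num_heads-first if needed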

Parameters
  • query (Tensor) – The query tensor in the Attention module. 4-D tensor with shape: [batch_size, seq_len_query, num_heads, head_dim]. 3-D tensor with shape: [seq_len_query, num_heads, head_dim]. The dtype can be float16 or bfloat16.

  • key (Tensor) – The key tensor in the Attention module. 4-D tensor with shape: [batch_size, seq_len_key, num_heads, head_dim]. 3-D tensor with shape: [seq_len_key, num_heads, head_dim]. The dtype can be float16 or bfloat16.

  • value (Tensor) – The value tensor in the Attention module. 4-D tensor with shape: [batch_size, seq_len_value, num_heads, head_dim]. 3-D tensor with shape: [seq_len_value, num_heads, head_dim]. The dtype can be float16 or bfloat16.

  • attn_mask (Tensor, optional) – The attention mask tensor. The shape should be broadcastable to [batch_size, num_heads, seq_len_query, seq_len_key]. The dtype can be bool or the same type as query. A bool mask indicates which positions take part in attention; a non-bool mask is added to the attention scores. See the sketch after this parameter list.

  • dropout_p (float, optional) – The dropout ratio.

  • is_causal (bool, optional) – Whether to enable causal mode. Default is False.

  • training (bool, optional) – Whether it is in the training phase.

  • backend (str, optional) – Specifies which backend is used to compute scaled dot product attention. Currently only “p2p” is supported, for distributed usage. Default is None.

  • scale (float, optional) – The scaling factor used in the calculation of attention weights. If None, scale = 1 / sqrt(head_dim).

  • enable_gqa (bool, optional) – Whether to enable GQA (Group Query Attention) mode, in which the key and value tensors may have fewer heads than the query tensor. Default is True.

  • name (str|None, optional) – The default value is None. Normally there is no need for the user to set this property. For more information, please refer to api_guide_Name.
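As a sketch of the attn_mask parameter (illustrative only; the shapes and dtype are assumptions consistent with the descriptions above), a boolean mask with a lower-triangular pattern could be built as follows:

>>> import paddle
>>> import paddle.nn.functional as F
>>> q = paddle.rand((1, 128, 2, 16), dtype=paddle.float16)
>>> # Boolean mask broadcastable to [batch_size, num_heads, seq_len_query, seq_len_key];
>>> # True marks positions that take part in attention.
>>> mask = paddle.tril(paddle.ones((128, 128))).astype(paddle.bool).reshape([1, 1, 128, 128])
>>> out = F.scaled_dot_product_attention(q, q, q, attn_mask=mask)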

Returns

The attention tensor.

4-D tensor with shape: [batch_size, seq_len, num_heads, head_dim]. 3-D tensor with shape: [seq_len, num_heads, head_dim]. The dtype can be float16 or bfloat16.

Return type

out (Tensor)

Examples

>>> # This example assumes a device that supports bfloat16.
>>> import paddle
>>> q = paddle.rand((1, 128, 2, 16), dtype=paddle.bfloat16)
>>> # Self-attention with dropout_p=0.9 and is_causal=False.
>>> output = paddle.nn.functional.scaled_dot_product_attention(q, q, q, None, 0.9, False)
>>> print(output)