Allows the model to jointly attend to information from different representation subspaces. See reference: Attention Is All You Need
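As defined in the referenced paper, each of the num_heads heads applies scaled dot-product attention to its own learned projection of the query, key and value, and the concatenated head outputs are projected back to the model dimension:

\[
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O,
\qquad
\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{(Q W_i^Q)(K W_i^K)^\top}{\sqrt{d_k}}\right) V W_i^V
\]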
nnf_multi_head_attention_forward(
  query,
  key,
  value,
  embed_dim_to_check,
  num_heads,
  in_proj_weight,
  in_proj_bias,
  bias_k,
  bias_v,
  add_zero_attn,
  dropout_p,
  out_proj_weight,
  out_proj_bias,
  training = TRUE,
  key_padding_mask = NULL,
  need_weights = TRUE,
  attn_mask = NULL,
  avg_weights = TRUE,
  use_separate_proj_weight = FALSE,
  q_proj_weight = NULL,
  k_proj_weight = NULL,
  v_proj_weight = NULL,
  static_k = NULL,
  static_v = NULL,
  batch_first = FALSE
)
query: \((L, N, E)\) where L is the target sequence length, N is the batch size, E is the embedding dimension. If batch_first is TRUE, the first two dimensions are transposed.
key: \((S, N, E)\) where S is the source sequence length, N is the batch size, E is the embedding dimension. If batch_first is TRUE, the first two dimensions are transposed.
value: \((S, N, E)\) where S is the source sequence length, N is the batch size, E is the embedding dimension. If batch_first is TRUE, the first two dimensions are transposed.
embed_dim_to_check: total dimension of the model.
num_heads: number of parallel attention heads.
in_proj_weight: input projection weight.
in_proj_bias: input projection bias.
bias_k: bias of the key sequence to be added at dim=0.
bias_v: bias of the value sequence to be added at dim=0.
add_zero_attn: add a new batch of zeros to the key and value sequences at dim=1.
dropout_p: probability of an element to be zeroed.
out_proj_weight: the output projection weight.
out_proj_bias: the output projection bias.
training: apply dropout if TRUE.
key_padding_mask: \((N, S)\) where N is the batch size, S is the source sequence length. If a ByteTensor is provided, the non-zero positions will be ignored while the zero positions will be unchanged. If a BoolTensor is provided, positions with the value TRUE will be ignored while positions with the value FALSE will be unchanged.
need_weights: if TRUE, additionally output attn_output_weights.
attn_mask: 2D mask \((L, S)\) where L is the target sequence length and S is the source sequence length, or 3D mask \((N*num\_heads, L, S)\) where N is the batch size. attn_mask ensures that position i is only allowed to attend to the unmasked positions. If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged. If a BoolTensor is provided, positions with the value TRUE are not allowed to attend while FALSE positions will be unchanged. If a FloatTensor is provided, it will be added to the attention weight. Construction of both mask arguments is sketched in the example at the end of this page.
avg_weights: Logical; whether to average attn_output_weights over the attention heads before outputting them. This doesn't change the returned value of attn_output; it only affects the returned attention weight matrix.
use_separate_proj_weight: the function accepts the projection weights for query, key, and value in different forms. If FALSE, in_proj_weight will be used, which is a combination of q_proj_weight, k_proj_weight, and v_proj_weight.
q_proj_weight: input projection weight for the query, used when use_separate_proj_weight is TRUE.
k_proj_weight: input projection weight for the key, used when use_separate_proj_weight is TRUE.
v_proj_weight: input projection weight for the value, used when use_separate_proj_weight is TRUE.
static_k: static key used for the attention operator.
static_v: static value used for the attention operator.
batch_first: Logical; whether to expect query, key, and value to have the batch size as their first dimension, and to return output with the batch dimension first.
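The following is a minimal usage sketch, not taken from the package's own examples: the dimension values, variable names, and randomly initialised projection weights are illustrative only and stand in for trained parameters, so the numeric output is meaningless; the sketch is meant to show the expected shapes and how the two mask arguments are built.

library(torch)

embed_dim <- 8   # E: total model dimension (embed_dim_to_check)
num_heads <- 2   # must divide embed_dim
L <- 5           # target sequence length
S <- 5           # source sequence length
N <- 3           # batch size

query <- torch_randn(L, N, embed_dim)
key   <- torch_randn(S, N, embed_dim)
value <- torch_randn(S, N, embed_dim)

# Packed input projection (query, key and value stacked along dim 1) and
# output projection; random values stand in for trained parameters.
in_proj_weight  <- torch_randn(3 * embed_dim, embed_dim)
in_proj_bias    <- torch_zeros(3 * embed_dim)
out_proj_weight <- torch_randn(embed_dim, embed_dim)
out_proj_bias   <- torch_zeros(embed_dim)

# Boolean attn_mask (L, S): TRUE positions are not allowed to attend.
# Here a causal mask that blocks attention to future positions.
attn_mask <- torch_triu(torch_ones(L, S), diagonal = 1)$to(dtype = torch_bool())

# Boolean key_padding_mask (N, S): TRUE marks padding keys to be ignored.
key_padding_mask <- torch_zeros(N, S, dtype = torch_bool())
key_padding_mask[1, S] <- TRUE   # treat the last key of the first sequence as padding

out <- nnf_multi_head_attention_forward(
  query, key, value,
  embed_dim_to_check = embed_dim,
  num_heads = num_heads,
  in_proj_weight = in_proj_weight,
  in_proj_bias = in_proj_bias,
  bias_k = NULL,              # no additive key/value bias in this sketch
  bias_v = NULL,
  add_zero_attn = FALSE,
  dropout_p = 0,
  out_proj_weight = out_proj_weight,
  out_proj_bias = out_proj_bias,
  training = FALSE,           # disables dropout
  key_padding_mask = key_padding_mask,
  attn_mask = attn_mask
)

# out is expected to hold the attention output of shape (L, N, E) and, because
# need_weights is TRUE by default, the averaged attention weights of shape (N, L, S).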