nnf_multi_head_attention_forward: Multi head attention forward

Description

Allows the model to jointly attend to information from different representation subspaces. See reference: Attention Is All You Need

Usage

nnf_multi_head_attention_forward(
  query,
  key,
  value,
  embed_dim_to_check,
  num_heads,
  in_proj_weight,
  in_proj_bias,
  bias_k,
  bias_v,
  add_zero_attn,
  dropout_p,
  out_proj_weight,
  out_proj_bias,
  training = TRUE,
  key_padding_mask = NULL,
  need_weights = TRUE,
  attn_mask = NULL,
  use_separate_proj_weight = FALSE,
  q_proj_weight = NULL,
  k_proj_weight = NULL,
  v_proj_weight = NULL,
  static_k = NULL,
  static_v = NULL
)

Arguments

query

\((L, N, E)\) where L is the target sequence length, N is the batch size, E is the embedding dimension.

key

\((S, N, E)\), where S is the source sequence length, N is the batch size, E is the embedding dimension.

value

\((S, N, E)\) where S is the source sequence length, N is the batch size, E is the embedding dimension.

embed_dim_to_check

total dimension of the model.

num_heads

parallel attention heads.

in_proj_weight

input projection weight and bias.

in_proj_bias

currently undocumented.

bias_k

bias of the key and value sequences to be added at dim=0.

bias_v

currently undocumented.

add_zero_attn

add a new batch of zeros to the key and value sequences at dim=1.

dropout_p

probability of an element to be zeroed.

out_proj_weight

the output projection weight and bias.

out_proj_bias

currently undocumented.

training

apply dropout if is TRUE.

key_padding_mask

\((N, S)\) where N is the batch size, S is the source sequence length. If a ByteTensor is provided, the non-zero positions will be ignored while the position with the zero positions will be unchanged. If a BoolTensor is provided, the positions with the value of True will be ignored while the position with the value of False will be unchanged.

need_weights

output attn_output_weights.

attn_mask

2D mask \((L, S)\) where L is the target sequence length, S is the source sequence length. 3D mask \((N*num_heads, L, S)\) where N is the batch size, L is the target sequence length, S is the source sequence length. attn_mask ensure that position i is allowed to attend the unmasked positions. If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged. If a BoolTensor is provided, positions with True is not allowed to attend while False values will be unchanged. If a FloatTensor is provided, it will be added to the attention weight.

use_separate_proj_weight

the function accept the proj. weights for query, key, and value in different forms. If false, in_proj_weight will be used, which is a combination of q_proj_weight, k_proj_weight, v_proj_weight.

q_proj_weight

input projection weight and bias.

k_proj_weight

currently undocumented.

v_proj_weight

currently undocumented.

static_k

static key and value used for attention operators.

static_v

currently undocumented.