Performs multi-headed attention from from_tensor to to_tensor.
This is an implementation of multi-headed attention based on "Attention Is All You Need". If from_tensor and to_tensor are the same, then this is self-attention. Each timestep in from_tensor attends to the corresponding sequence in to_tensor, and returns a fixed-width vector.

This function first projects from_tensor into a "query" tensor and to_tensor into "key" and "value" tensors. These are (effectively) a list of tensors of length num_attention_heads, where each tensor is of shape [batch_size, seq_length, size_per_head]. Then, the query and key tensors are dot-producted and scaled. These are softmaxed to obtain attention probabilities. The value tensors are then interpolated by these probabilities, then concatenated back to a single tensor and returned.
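To make the flow above concrete, here is a minimal sketch of the scaled dot-product attention core for a single head, written with the tensorflow R package. The tensor names and shapes below are illustrative assumptions, not the internals of attention_layer().

library(tensorflow)

batch_size <- 2L
seq_length <- 4L
size_per_head <- 8L

# Illustrative query/key/value tensors for a single attention head.
query <- tf$random$normal(shape(batch_size, seq_length, size_per_head))
key   <- tf$random$normal(shape(batch_size, seq_length, size_per_head))
value <- tf$random$normal(shape(batch_size, seq_length, size_per_head))

# Dot-product and scale: [batch_size, seq_length, seq_length].
scores <- tf$matmul(query, key, transpose_b = TRUE) / sqrt(size_per_head)

# Softmax over the "to" positions gives the attention probabilities.
probs <- tf$nn$softmax(scores, axis = -1L)

# Interpolate (weighted-sum) the values: [batch_size, seq_length, size_per_head].
context <- tf$matmul(probs, value)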
attention_layer(
from_tensor,
to_tensor,
attention_mask = NULL,
num_attention_heads = 1L,
size_per_head = 512L,
query_act = NULL,
key_act = NULL,
value_act = NULL,
attention_probs_dropout_prob = 0,
initializer_range = 0.02,
do_return_2d_tensor = FALSE,
batch_size = NULL,
from_seq_length = NULL,
to_seq_length = NULL
)
from_tensor: Float Tensor of shape [batch_size, from_seq_length, from_width].

to_tensor: Float Tensor of shape [batch_size, to_seq_length, to_width].

attention_mask: (Optional) Integer Tensor of shape [batch_size, from_seq_length, to_seq_length]. The values should be 1 or 0. The attention scores will effectively be set to -infinity for any positions in the mask that are 0, and will be unchanged for positions that are 1 (a sketch of this masking follows the argument descriptions).

num_attention_heads: Integer; number of attention heads.

size_per_head: Integer; size of each attention head.

query_act: (Optional) Activation function for the query transform.

key_act: (Optional) Activation function for the key transform.

value_act: (Optional) Activation function for the value transform.

attention_probs_dropout_prob: (Optional) Numeric; dropout probability of the attention probabilities.

initializer_range: Numeric; range of the weight initializer.

do_return_2d_tensor: Logical. If TRUE, the output will be of shape [batch_size * from_seq_length, num_attention_heads * size_per_head]. If FALSE, the output will be of shape [batch_size, from_seq_length, num_attention_heads * size_per_head].

batch_size: (Optional) Integer; if the input is 2D, this should be the batch size of the 3D version of from_tensor and to_tensor.

from_seq_length: (Optional) Integer; if the input is 2D, this should be the sequence length of the 3D version of from_tensor.

to_seq_length: (Optional) Integer; if the input is 2D, this should be the sequence length of the 3D version of to_tensor.
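As referenced in the attention_mask description, here is a rough sketch of how such a 0/1 mask is commonly applied, assuming the usual additive-mask approach (not a dump of this function's exact code): the mask is turned into a large negative term added to the raw scores, so masked positions receive essentially zero probability after the softmax.

library(tensorflow)

# Raw attention scores: [batch_size = 1, from_seq_length = 4, to_seq_length = 4].
scores <- tf$random$normal(shape(1L, 4L, 4L))

# Mask with 1 for positions to attend to and 0 for positions to ignore.
mask_values <- array(1, dim = c(1L, 4L, 4L))
mask_values[, , 4] <- 0  # ignore the last "to" position everywhere
mask <- tf$constant(mask_values, dtype = tf$float32)

# 0 where the mask is 1 and -10000 where it is 0; adding this before the
# softmax effectively sends masked positions to -infinity.
adder <- (1 - mask) * -10000
probs <- tf$nn$softmax(scores + adder, axis = -1L)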
Value: Float Tensor of shape [batch_size, from_seq_length, num_attention_heads * size_per_head]. If do_return_2d_tensor is TRUE, it will be flattened to shape [batch_size * from_seq_length, num_attention_heads * size_per_head].
In practice, the multi-headed attention computations are done with transposes and reshapes rather than with actual separate tensors.
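For instance, a single tensor holding every head can be split into per-head slices with one reshape followed by one transpose. The following is a minimal sketch under assumed shapes, not the function's exact code:

library(tensorflow)

batch_size <- 2L
seq_length <- 4L
num_attention_heads <- 3L
size_per_head <- 8L

# One tensor holds all heads: [batch_size, seq_length, num_heads * size_per_head].
projected <- tf$random$normal(
  shape(batch_size, seq_length, num_attention_heads * size_per_head)
)

# Reshape then transpose to [batch_size, num_heads, seq_length, size_per_head],
# so the per-head matrix multiplications can be batched in one call instead of
# being computed on separate tensors.
per_head <- tf$transpose(
  tf$reshape(projected,
             shape(batch_size, seq_length, num_attention_heads, size_per_head)),
  perm = c(0L, 2L, 1L, 3L)
)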
# Maybe add examples later. For now, this is only called from
# within transformer_model(), so refer to that function.
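The following is a minimal, hedged sketch of a direct call, using only the signature documented above. The shapes and hyperparameters are illustrative, and it assumes a working TensorFlow backend; in normal use this layer is created for you inside transformer_model().

library(tensorflow)
# (assumes the package providing attention_layer() is already loaded)

# Illustrative input: [batch_size = 2, seq_length = 12, width = 32].
from_tensor <- tf$random$normal(shape(2L, 12L, 32L))

attention_output <- attention_layer(
  from_tensor = from_tensor,
  to_tensor = from_tensor,      # same tensor, so this is self-attention
  num_attention_heads = 4L,
  size_per_head = 8L
)
# Per the return value above, attention_output has shape
# [2, 12, 4 * 8] = [2, 12, 32].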