Performs multi-headed attention from from_tensor to to_tensor.
This is an implementation of multi-headed attention based on "Attention Is All You Need". If from_tensor and to_tensor are the same, then this is self-attention. Each timestep in from_tensor attends to the corresponding sequence in to_tensor, and returns a fixed-width vector.

This function first projects from_tensor into a "query" tensor and to_tensor into "key" and "value" tensors. These are (effectively) a list of tensors of length num_attention_heads, where each tensor is of shape [batch_size, seq_length, size_per_head]. Then, the query and key tensors are dot-producted and scaled. These are softmaxed to obtain attention probabilities. The value tensors are then interpolated by these probabilities, then concatenated back to a single tensor and returned.
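To make the flow above concrete, here is a minimal sketch of the scaled dot-product attention core for a single head, written with the tensorflow R package. The tensor names and shapes below are illustrative assumptions, not the internals of attention_layer().

library(tensorflow)

batch_size <- 2L
seq_length <- 4L
size_per_head <- 8L

# Illustrative query/key/value tensors for a single attention head.
query <- tf$random$normal(shape(batch_size, seq_length, size_per_head))
key   <- tf$random$normal(shape(batch_size, seq_length, size_per_head))
value <- tf$random$normal(shape(batch_size, seq_length, size_per_head))

# Dot-product and scale: [batch_size, seq_length, seq_length].
scores <- tf$matmul(query, key, transpose_b = TRUE) / sqrt(size_per_head)

# Softmax over the "to" positions gives the attention probabilities.
probs <- tf$nn$softmax(scores, axis = -1L)

# Interpolate (weighted-sum) the values: [batch_size, seq_length, size_per_head].
context <- tf$matmul(probs, value)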
attention_layer(
from_tensor,
to_tensor,
attention_mask = NULL,
num_attention_heads = 1L,
size_per_head = 512L,
query_act = NULL,
key_act = NULL,
value_act = NULL,
attention_probs_dropout_prob = 0,
initializer_range = 0.02,
do_return_2d_tensor = FALSE,
batch_size = NULL,
from_seq_length = NULL,
to_seq_length = NULL
)
from_tensor: Float Tensor of shape [batch_size, from_seq_length, from_width].

to_tensor: Float Tensor of shape [batch_size, to_seq_length, to_width].

attention_mask: (Optional) Integer Tensor of shape [batch_size, from_seq_length, to_seq_length]. The values should be 1 or 0. The attention scores will effectively be set to -infinity for any positions in the mask that are 0, and will be unchanged for positions that are 1 (a sketch of this masking follows the argument descriptions).

num_attention_heads: Integer; number of attention heads.

size_per_head: Integer; size of each attention head.

query_act: (Optional) Activation function for the query transform.

key_act: (Optional) Activation function for the key transform.

value_act: (Optional) Activation function for the value transform.

attention_probs_dropout_prob: (Optional) Numeric; dropout probability of the attention probabilities.

initializer_range: Numeric; range of the weight initializer.

do_return_2d_tensor: Logical. If TRUE, the output will be of shape [batch_size * from_seq_length, num_attention_heads * size_per_head]. If FALSE, the output will be of shape [batch_size, from_seq_length, num_attention_heads * size_per_head].

batch_size: (Optional) Integer; if the input is 2D, this should be the batch size of the 3D version of from_tensor and to_tensor.

from_seq_length: (Optional) Integer; if the input is 2D, this should be the sequence length of the 3D version of from_tensor.

to_seq_length: (Optional) Integer; if the input is 2D, this should be the sequence length of the 3D version of to_tensor.
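As referenced in the attention_mask description, here is a rough sketch of how such a 0/1 mask is commonly applied, assuming the usual additive-mask approach (not a dump of this function's exact code): the mask is turned into a large negative term added to the raw scores, so masked positions receive essentially zero probability after the softmax.

library(tensorflow)

# Raw attention scores: [batch_size = 1, from_seq_length = 4, to_seq_length = 4].
scores <- tf$random$normal(shape(1L, 4L, 4L))

# Mask with 1 for positions to attend to and 0 for positions to ignore.
mask_values <- array(1, dim = c(1L, 4L, 4L))
mask_values[, , 4] <- 0  # ignore the last "to" position everywhere
mask <- tf$constant(mask_values, dtype = tf$float32)

# 0 where the mask is 1 and -10000 where it is 0; adding this before the
# softmax effectively sends masked positions to -infinity.
adder <- (1 - mask) * -10000
probs <- tf$nn$softmax(scores + adder, axis = -1L)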
Value: Float Tensor of shape [batch_size, from_seq_length, num_attention_heads * size_per_head]. If do_return_2d_tensor is TRUE, it will be flattened to shape [batch_size * from_seq_length, num_attention_heads * size_per_head].
In practice, the multi-headed attention computations are done with transposes and reshapes rather than with actual separate tensors.
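For instance, a single tensor holding every head can be split into per-head slices with one reshape followed by one transpose. The following is a minimal sketch under assumed shapes, not the function's exact code:

library(tensorflow)

batch_size <- 2L
seq_length <- 4L
num_attention_heads <- 3L
size_per_head <- 8L

# One tensor holds all heads: [batch_size, seq_length, num_heads * size_per_head].
projected <- tf$random$normal(
  shape(batch_size, seq_length, num_attention_heads * size_per_head)
)

# Reshape then transpose to [batch_size, num_heads, seq_length, size_per_head],
# so the per-head matrix multiplications can be batched in one call instead of
# being computed on separate tensors.
per_head <- tf$transpose(
  tf$reshape(projected,
             shape(batch_size, seq_length, num_attention_heads, size_per_head)),
  perm = c(0L, 2L, 1L, 3L)
)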
# Maybe add examples later. For now, this is only called from
# within transformer_model(), so refer to that function.
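The following is a minimal, hedged sketch of a direct call, using only the signature documented above. The shapes and hyperparameters are illustrative, and it assumes a working TensorFlow backend; in normal use this layer is created for you inside transformer_model().

library(tensorflow)
# (assumes the package providing attention_layer() is already loaded)

# Illustrative input: [batch_size = 2, seq_length = 12, width = 32].
from_tensor <- tf$random$normal(shape(2L, 12L, 32L))

attention_output <- attention_layer(
  from_tensor = from_tensor,
  to_tensor = from_tensor,      # same tensor, so this is self-attention
  num_attention_heads = 4L,
  size_per_head = 8L
)
# Per the return value above, attention_output has shape
# [2, 12, 4 * 8] = [2, 12, 32].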