Transformers Components (Apr 2026)
1. Input Embeddings and Positional Encoding
Embeddings: These convert discrete tokens (words or characters) into fixed-size vectors that capture initial semantic meaning.
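As a concrete illustration, here is a minimal NumPy sketch of the lookup; the vocabulary size, model dimension, and randomly initialized table are stand-ins for values that are learned during training.

```python
import numpy as np

vocab_size, d_model = 10_000, 512        # illustrative sizes
rng = np.random.default_rng(0)

# The embedding table is a learned matrix; random values stand in here.
embedding_table = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([42, 7, 999])       # a toy tokenized sequence
embeddings = embedding_table[token_ids]  # one fixed-size vector per token
print(embeddings.shape)                  # (3, 512)
```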
Positional Encoding: Vectors are added to the embeddings to provide information about the relative or absolute position of each token in the sequence.
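One common choice is the sinusoidal encoding from the original Transformer paper; the sketch below assumes that scheme (learned position embeddings are an alternative), with seq_len and d_model as illustrative parameters.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, None]       # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]      # even feature indices
    angles = positions / (10_000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dims: sine
    pe[:, 1::2] = np.cos(angles)                  # odd dims: cosine
    return pe

# The encoding is added to the token embeddings, not concatenated:
# x = embeddings + sinusoidal_positional_encoding(len(token_ids), d_model)
```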
2. The Multi-Head Attention Mechanism
Self-Attention: Calculates a "relevance score" between tokens, allowing the model to understand how much focus one word should have on another (e.g., relating "he" to "Tom").
The mechanism uses learned weight matrices (W_Q, W_K, W_V) to project input vectors into three spaces (see the sketch after this list):
Query (Q): What the token is looking for.
Key (K): What the token contains.
Value (V): The information the token provides once matched.
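The sketch below shows a single head of scaled dot-product attention; the random W_Q, W_K, W_V matrices stand in for learned weights, and the multi-head variant runs several such projections in parallel and concatenates the results.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 3, 512, 64

x = rng.normal(size=(seq_len, d_model))    # token representations
W_Q = rng.normal(size=(d_model, d_k))      # learned in practice
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = x @ W_Q, x @ W_K, x @ W_V
scores = Q @ K.T / np.sqrt(d_k)            # relevance of every token pair
weights = softmax(scores)                  # each row sums to 1
output = weights @ V                       # weighted mix of value vectors
```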
3. The Position-wise Feed-Forward Network
Following the attention layers, each position in the encoder and decoder is processed by a position-wise feed-forward network. This consists of two linear transformations with a non-linear activation (typically ReLU) in between.
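A minimal sketch of that computation; the inner width d_ff = 2048 follows the original paper's choice of 4 * d_model, and the random weights are placeholders for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048                  # d_ff = 4 * d_model is typical

W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def feed_forward(x: np.ndarray) -> np.ndarray:
    # Applied to each position independently: linear -> ReLU -> linear.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2
```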
4. Layer Normalization
Normalizes the vector features to keep activations at a consistent scale, which helps prevent vanishing or exploding gradients.
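A minimal sketch, normalizing over the feature dimension of each position; in a trained model, gamma and beta are learned scale and shift parameters, shown here at their identity initializations.

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    # Normalize each position's features to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    gamma, beta = 1.0, 0.0                 # learned scale/shift; identity here
    return gamma * x_hat + beta
```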
5. The Output Layer
In the final stage of the decoder, the output vectors are transformed into human-readable results: a linear projection maps each vector to scores over the vocabulary, and a softmax turns those scores into token probabilities.
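A sketch of this final step under that standard design; the unembedding matrix and decoder vector below are random stand-ins for trained values.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 512, 10_000              # illustrative sizes

W_out = rng.normal(size=(d_model, vocab_size)) # learned in practice

h = rng.normal(size=d_model)                   # final decoder vector (stand-in)
logits = h @ W_out                             # one score per vocabulary token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                           # softmax over the vocabulary
next_token_id = int(probs.argmax())            # greedy choice of next token
```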
