LLM Architecture

Understanding what happens inside the model helps you debug unexpected behavior.

The Transformer

Introduced in 2017 ("Attention is All You Need").

Encoders (BERT): Good at understanding context and classification.
Decoders (GPT): Good at generating text (next-token prediction).
Encoder-Decoder (T5, BART): Good at translation and summarization.

Attention Mechanism

Self-attention allows the model to weigh the importance of every token in the context window relative to the current token being processed.

Sparse MoE (Mixture of Experts)

Models like Mixtral 8x7B don't use all parameters for every token. They route tokens to specialized "expert" sub-networks, making inference much faster while maintaining high capability.

The Transformer​

Attention Mechanism​

Sparse MoE (Mixture of Experts)​

The Transformer

Attention Mechanism

Sparse MoE (Mixture of Experts)