DeepSeek V3 Bullet Points

This blog collects the bullet points I summarized while studying the DeepSeek V3 technical report and source code. It was a truly rewarding and eye-opening process.

Network

Multi-head Latent Attention (MLA)
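
A minimal PyTorch sketch of MLA's core idea, low-rank key-value compression: hidden states are down-projected into a small shared latent, and only that latent needs to be cached at inference time. The dimensions here are illustrative, and the decoupled RoPE branch and query compression of the real model are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedMLA(nn.Module):
    def __init__(self, d_model=1024, n_heads=8, d_head=64, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.w_q = nn.Linear(d_model, n_heads * d_head, bias=False)
        # Down-project hidden states into a small shared KV latent; at
        # inference time only this latent is cached, shrinking the KV cache.
        self.w_down_kv = nn.Linear(d_model, d_latent, bias=False)
        # Up-project the latent back to per-head keys and values.
        self.w_up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.w_up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.w_o = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(self, x):  # x: (batch, seq, d_model)
        b, s, _ = x.shape
        q = self.w_q(x).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        c_kv = self.w_down_kv(x)  # (b, s, d_latent) -> this is what gets cached
        k = self.w_up_k(c_kv).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_up_v(c_kv).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.w_o(out.transpose(1, 2).reshape(b, s, -1))
```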

Attention scaling
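
In the public DeepSeek modeling code, the softmax scale is not simply `head_dim ** -0.5`: when YaRN context extension is enabled, it is multiplied by the square of an mscale correction so the attention logits keep a sensible magnitude at long context. A sketch of that computation; the `yarn_get_mscale` formula follows the open-source code, while the config values below are illustrative:

```python
import math

def yarn_get_mscale(scale: float, mscale: float = 1.0) -> float:
    # Attention-entropy correction from YaRN: longer contexts flatten the
    # softmax, so logits are scaled up slightly as the rope factor grows.
    if scale <= 1:
        return 1.0
    return 0.1 * mscale * math.log(scale) + 1.0

head_dim = 192        # illustrative per-head query dimension
rope_factor = 40      # illustrative context-extension factor
mscale_all_dim = 1.0  # illustrative config value

softmax_scale = head_dim ** -0.5
m = yarn_get_mscale(rope_factor, mscale_all_dim)
softmax_scale = softmax_scale * m * m  # applied to the q @ k^T logits
```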

Mixture of Experts (MoE)
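
As a concrete reference, here is a minimal sketch of the routed-expert gating in a DeepSeekMoE-style layer: token-to-expert affinities go through a sigmoid, the top-k experts are selected, and the selected gates are renormalized. Expert counts and dimensions are illustrative, and the always-active shared experts of the full layer are omitted.

```python
import torch
import torch.nn as nn

class SimplifiedMoEGate(nn.Module):
    """Select top-k routed experts per token (shared experts omitted)."""
    def __init__(self, d_model=1024, n_routed_experts=64, top_k=6):
        super().__init__()
        self.top_k = top_k
        # One learnable centroid per routed expert.
        self.centroids = nn.Parameter(
            torch.empty(n_routed_experts, d_model).normal_(std=0.02))

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = torch.sigmoid(x @ self.centroids.t())     # token-expert affinity
        weights, expert_idx = scores.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize selected gates
        return weights, expert_idx
```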

Load balancing
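
The report's auxiliary-loss-free strategy keeps experts balanced without a balancing loss: a per-expert bias is added to the affinity scores only for top-k selection, and after each step the bias is nudged down for overloaded experts and up for underloaded ones. A sketch, with `gamma` standing in for the report's bias update speed:

```python
import torch

def route_with_bias(scores, bias, top_k):
    # The bias influences only which experts are selected; the combining
    # weights still come from the original, unbiased affinity scores.
    _, expert_idx = (scores + bias).topk(top_k, dim=-1)
    weights = scores.gather(-1, expert_idx)
    weights = weights / weights.sum(-1, keepdim=True)
    return weights, expert_idx

def update_bias(bias, expert_load, gamma=1e-3):
    # After each training step: overloaded experts get their bias decreased,
    # underloaded experts get it increased, by a fixed speed gamma.
    return bias - gamma * torch.sign(expert_load - expert_load.mean())

# e.g. expert_load = torch.bincount(expert_idx.flatten(),
#                                   minlength=bias.numel()).float()
```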

SwiGLU for linear layers
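
For reference, a minimal sketch of the SwiGLU feed-forward block (illustrative dimensions): the gate path goes through SiLU and multiplies the up-projection elementwise before the down-projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    # SwiGLU feed-forward: silu(x W_gate) * (x W_up), then project back down.
    def __init__(self, d_model=1024, d_hidden=2816):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```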

Network params

Infrastructure

Parallelism

Mixed Precision

Below is an example of mixed-precision training applied to a Linear operator. Some key points are:

How to handle overflow and underflow
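
A simulated sketch of the report's fine-grained FP8 scheme for a Linear GEMM, which addresses exactly this point: each tile gets its own scaling factor (the report uses 1x128 tiles for activations and 128x128 blocks for weights; this sketch uses 1x128 for both for brevity), so large values are mapped into range rather than overflowing, and small tiles are scaled up rather than underflowing to zero. The quantize-dequantize round trip below stands in for the real FP8 kernels, which accumulate in higher precision.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest magnitude representable in E4M3

def quantize_blockwise(x, block=128):
    # Fake-quantize to FP8 with one scale per 1 x `block` tile. Mapping each
    # tile's max |value| onto the format maximum prevents overflow by
    # construction, and it scales small-magnitude tiles *up* so they do not
    # underflow to zero.
    orig_shape = x.shape
    tiles = x.reshape(-1, block)
    amax = tiles.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4)
    scale = FP8_E4M3_MAX / amax
    q = (tiles * scale).to(torch.float8_e4m3fn)  # quantize
    deq = q.to(torch.float32) / scale            # dequantize (simulation only)
    return deq.reshape(orig_shape)

def fp8_linear(x, w):
    # Simulated FP8 GEMM: both operands are tile-quantized, and the product
    # is accumulated in FP32, mirroring higher-precision accumulation on
    # tensor cores.
    return quantize_blockwise(x) @ quantize_blockwise(w).t()

x = torch.randn(16, 1024)
w = torch.randn(2048, 1024)  # weight of a Linear(1024 -> 2048)
y = fp8_linear(x, w)         # (16, 2048)
```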

Pre-training

Tokenizer

Training data

Fill-in-Middle (FIM) strategy
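
Per the report, FIM is applied at the document level with the Prefix-Suffix-Middle (PSM) framework at a rate of 0.1. A sketch of sample construction; the sentinel token strings follow the report, while the character-level split here is a simplification:

```python
import random

FIM_RATE = 0.1  # fraction of documents restructured with FIM, per the report

def maybe_apply_fim(doc: str, rng: random.Random) -> str:
    # PSM framework: cut the document at two random points and move the
    # middle span to the end behind sentinel tokens, so the model learns to
    # infill a hole from both the prefix and the suffix.
    if rng.random() >= FIM_RATE or len(doc) < 2:
        return doc
    i, j = sorted(rng.sample(range(len(doc)), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"<|fim_begin|>{prefix}<|fim_hole|>{suffix}<|fim_end|>{middle}"
```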

Learning rate scheduling
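
The report's schedule is a linear warmup, a long constant phase, and then a cosine decay toward a fraction of the peak (followed by short constant tail stages, omitted here). A sketch of that shape; the step counts below are placeholders rather than the real token budgets:

```python
import math

def lr_at(step, peak_lr=2.2e-4, warmup=2000,
          constant_until=60000, decay_until=90000):
    # Warmup -> constant -> cosine decay to 0.1x the peak.
    min_lr = 0.1 * peak_lr
    if step < warmup:
        return peak_lr * step / warmup
    if step < constant_until:
        return peak_lr
    if step < decay_until:
        t = (step - constant_until) / (decay_until - constant_until)
        return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))
    return min_lr
```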

Long context extension

Post-training

More post-training techniques are included in DeepSeek-R1.

Supervised finetuning

Reinforcement learning
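
Per the report, the RL stage uses GRPO, which replaces the critic with a group-relative baseline: sample G responses per prompt, then normalize their rewards within the group. A minimal sketch of the advantage computation, assuming `rewards` holds the scalar rewards of one group:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Group-relative advantage: each response's reward is normalized against
    # the mean and std of the G responses sampled for the same prompt,
    # removing the need for a learned value model.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

advantages = grpo_advantages(torch.tensor([1.0, 0.0, 0.5, 1.0]))
```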