Model Distillation
Model Distillation is a model compression technique where a large, high-performing model (the 'Teacher' model) is used to train a smaller, simpler model (the 'Student' model). Instead of being trained only on the ground-truth labels, the Student is also trained to mimic the output probabilities (the 'soft targets') generated by the Teacher model.
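The difference between a hard label and the Teacher's soft targets can be sketched in a few lines of Python. The class names and logit values below are purely illustrative, not from any real model:

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical Teacher logits over the classes [Cat, Dog, Bird]
teacher_logits = [4.0, 1.5, 0.2]

# Soft targets: a full distribution that preserves class similarity
soft_targets = softmax(teacher_logits)   # roughly [0.905, 0.074, 0.020]

# Hard target: just the argmax, discarding that similarity information
hard_target = soft_targets.index(max(soft_targets))  # 0, i.e. 'Cat'
```

The Student trained on `soft_targets` learns not only that the answer is 'Cat' but also that 'Dog' is a far more plausible confusion than 'Bird'.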
In modern AI, state-of-the-art models are often massive, demanding computational resources that translate into high inference latency and a large memory footprint. This makes deployment challenging on resource-constrained devices such as mobile phones and IoT sensors, or in real-time edge computing environments. Distillation allows organizations to retain much of the Teacher's complex knowledge while drastically reducing the Student's size and inference time.
The core mechanism involves transferring 'dark knowledge.' The Teacher model produces not just a hard prediction (e.g., 'Cat'), but a probability distribution over all possible classes (e.g., 90% Cat, 8% Dog, 2% Bird). This distribution contains nuanced information about the model's uncertainty and relationships between classes. The Student model is then trained using a combined loss function: one component minimizes the difference between its predictions and the true labels (hard targets), and a second component minimizes the difference between its predictions and the Teacher's soft targets.
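The combined loss described above can be sketched as follows. The `alpha` weighting between the two components is an assumed hyperparameter, and the logits are illustrative; practical recipes also typically soften both distributions with a temperature, which is omitted here for brevity:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_idx):
    """Hard-target component: -log p(true class)."""
    return -math.log(probs[true_idx])

def kl_divergence(p, q):
    """Soft-target component: KL(teacher || student)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, true_idx, alpha=0.5):
    """Weighted sum of the hard-label loss and the soft-target loss."""
    student_probs = softmax(student_logits)
    teacher_probs = softmax(teacher_logits)
    hard = cross_entropy(student_probs, true_idx)          # match true labels
    soft = kl_divergence(teacher_probs, student_probs)     # match Teacher
    return alpha * hard + (1 - alpha) * soft

# Illustrative logits over [Cat, Dog, Bird]; the true class is 'Cat' (index 0)
loss = distillation_loss(student_logits=[2.0, 1.0, 0.1],
                         teacher_logits=[4.0, 1.5, 0.2],
                         true_idx=0)
```

Setting `alpha` closer to 1 emphasizes the ground-truth labels; closer to 0, it emphasizes imitating the Teacher's distribution.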