Member-only story

Hacked by Design: Why AI Models Cheat Their Own Teachers & How to Stop It

U.V.

4 min readFeb 11, 2025

Introduction

Understanding Knowledge Distillation

Knowledge distillation (KD) is a widely used technique in artificial intelligence (AI), where a smaller student model learns from a larger teacher model to improve efficiency while maintaining performance. This is essential in developing computationally efficient models for deployment on edge devices and resource-constrained environments.

The Problem: Teacher Hacking

A key challenge that arises in KD is teacher hacking — a phenomenon where the student model exploits flaws in the teacher model rather than learning true generalizable knowledge. This is analogous to reward hacking in Reinforcement Learning with Human Feedback (RLHF), where a model optimizes for a proxy reward rather than the intended goal.

In this article, we will break down:

The concept of teacher hacking
Experimental findings from controlled setups
Methods to detect and mitigate teacher hacking
Real-world implications and use cases

Background & Context

Knowledge Distillation Basics

Knowledge distillation involves training a student model to mimic a teacher model, using methods such as: