Knowledge Distillation in LLMs

A distillation framework based on Jensen-Shannon divergence, focused on training stability and stronger compression.

F1 0.9125

12.5% variance reduction

20% faster convergence

Overview

This project explores how Jensen-Shannon divergence can outperform KL divergence for LLM distillation through smoother optimization, stronger dark knowledge transfer, and better downstream metrics.

Language DNA

Python, PyTorch

The project is centered on experimentation and model behavior, so the case study stays research-forward, with Python and PyTorch as the dominant technical layer.

Research-driven model compression

The project studies knowledge distillation through a research-first lens, with Jensen-Shannon divergence used as the central optimization idea.

It treats Jensen-Shannon divergence as an alternative to KL divergence for LLM distillation, focusing on optimization stability and downstream quality rather than only compression ratio.

The implementation uses PyTorch to distill a 90M-parameter student from a larger SmolLM2-135M teacher, tracking F1, ROUGE, convergence speed, and final loss.
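As a minimal sketch of what a Jensen-Shannon distillation objective can look like in PyTorch (not the project's actual code), the function below compares temperature-softened teacher and student distributions. The name `js_divergence_loss`, the default `temperature`, and the choice to average over all positions are assumptions made for illustration.

```python
import math

import torch
import torch.nn.functional as F


def js_divergence_loss(student_logits, teacher_logits, temperature=2.0):
    """Jensen-Shannon divergence between softened teacher and student outputs.

    Both logit tensors have shape (..., vocab_size). The default temperature
    is a hypothetical value, not one taken from the project.
    """
    # Soften both distributions with the same temperature.
    log_p = F.log_softmax(teacher_logits / temperature, dim=-1)  # teacher
    log_q = F.log_softmax(student_logits / temperature, dim=-1)  # student

    # Log of the mixture M = (P + Q) / 2, computed stably in log space.
    log_m = torch.logsumexp(torch.stack([log_p, log_q]), dim=0) - math.log(2.0)

    # KL(P || M) and KL(Q || M), each summed over the vocabulary axis.
    kl_pm = (log_p.exp() * (log_p - log_m)).sum(dim=-1)
    kl_qm = (log_q.exp() * (log_q - log_m)).sum(dim=-1)

    # Average the per-position divergence over every remaining axis.
    return (0.5 * (kl_pm + kl_qm)).mean()
```

Because each distribution is compared against the mixture rather than directly against the other, the loss stays finite even when one model assigns near-zero probability to a token the other favors.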

Research focus

Tests a more symmetric divergence objective for more stable knowledge transfer.

Tracks empirical metrics rather than stopping at theoretical motivation.

Treats distillation quality as a function of optimization behavior and dark knowledge preservation.
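The symmetry point can be checked on a small example. The sketch below uses only the standard library and two made-up next-token distributions (the `teacher` and `student` values are hypothetical): KL divergence changes when its arguments are swapped, while Jensen-Shannon divergence does not and never exceeds ln 2.

```python
import math


def kl(p, q):
    """KL(p || q) in nats for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical next-token distributions over a 3-token vocabulary.
teacher = [0.90, 0.09, 0.01]
student = [0.40, 0.30, 0.30]

print(f"KL(teacher || student) = {kl(teacher, student):.4f}")
print(f"KL(student || teacher) = {kl(student, teacher):.4f}")
print(f"JS(teacher, student)   = {js(teacher, student):.4f}")
print(f"JS(student, teacher)   = {js(student, teacher):.4f}")
```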

System design

The project's value comes from measured outcomes: stronger F1 (0.9125), better ROUGE, lower variance (a 12.5% reduction), and faster convergence (20%) than the KL-based comparison setup.

Product capabilities

Distilled SmolLM2-135M down to 90M parameters using Jensen-Shannon divergence.

Observed stronger optimization stability than KL divergence baselines.

Improved F1, ROUGE, convergence speed, and final loss compared with KL-based distillation.

Workflow

Experimental flow

Train teacher-student distillation with the Jensen-Shannon objective.

Compare training dynamics with KL-based baselines.

Evaluate student quality through downstream metrics and convergence behavior.
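The three steps above can be sketched as a toy comparison loop. This is an illustration under loud assumptions, not the project's pipeline: the teacher and student are tiny random networks standing in for SmolLM2 models, the inputs are random tensors, and `distill_loss`, `run`, and every hyperparameter are hypothetical. The numbers it prints say nothing about the reported results; it only shows how the two objectives can be swapped inside one loop and how loss level and variance can be logged.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


def distill_loss(student_logits, teacher_logits, objective):
    """Return the KL or JS distillation loss for one batch of logits."""
    log_p = F.log_softmax(teacher_logits, dim=-1)  # teacher
    log_q = F.log_softmax(student_logits, dim=-1)  # student
    if objective == "kl":
        # Forward KL(teacher || student), the conventional baseline.
        return F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
    # JS: mean KL of each distribution to the mixture M = (P + Q) / 2.
    log_m = torch.logsumexp(torch.stack([log_p, log_q]), dim=0) - math.log(2.0)
    kl_pm = F.kl_div(log_m, log_p, log_target=True, reduction="batchmean")
    kl_qm = F.kl_div(log_m, log_q, log_target=True, reduction="batchmean")
    return 0.5 * (kl_pm + kl_qm)


def run(objective, steps=200, seed=0):
    """Train a toy student against a frozen toy teacher; return loss history."""
    torch.manual_seed(seed)
    teacher = nn.Sequential(nn.Linear(16, 64), nn.Tanh(), nn.Linear(64, 32)).eval()
    student = nn.Linear(16, 32)  # smaller stand-in model
    optimizer = torch.optim.Adam(student.parameters(), lr=1e-2)
    history = []
    for _ in range(steps):
        x = torch.randn(64, 16)  # random stand-in for a batch of inputs
        with torch.no_grad():
            teacher_logits = teacher(x)  # teacher stays frozen
        loss = distill_loss(student(x), teacher_logits, objective)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        history.append(loss.item())
    return history


for name in ("kl", "js"):
    losses = torch.tensor(run(name))
    tail = losses[-50:]  # late-training window used to summarize stability
    print(f"{name}: mean tail loss {tail.mean().item():.4f}, "
          f"tail variance {tail.var().item():.6f}")
```

The raw KL and JS values are on different scales, so a fair comparison would rely on shared downstream metrics such as F1 and ROUGE rather than on the loss values themselves.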

Execution model

The project reads best as an empirical compression study focused on stability, convergence, and quality preservation.

Case study

Research angle

The project focuses on one targeted research question: whether Jensen-Shannon divergence can produce a more stable and better-performing distillation process than the conventional KL-based setup. Rather than making broad compression claims, the work is framed around measurable behavior such as convergence quality, variance, and downstream task performance. That keeps the case study grounded in model behavior rather than abstraction alone.

Key takeaway

A clear takeaway from the project is that compression quality depends on more than model size reduction. Training stability and the preservation of dark knowledge have a visible effect on how well the student model retains useful behavior. That makes the choice of distillation objective a product-level decision in any workflow where small models still need to perform credibly.