Sequence Modeling
2024

Neural Machine Translation

A Transformer-based Spanish-to-English translation system trained on Europarl sentence pairs.

500K sentence pairs
BLEU 12.41
16K BPE vocabulary
Overview

This project implements a full NMT pipeline with a Transformer architecture, subword tokenization, and beam search decoding for practical translation experiments.

Language DNA

Python

Python runs every stage of this deep-learning pipeline: preprocessing, training, evaluation, and inference. The focus is the end-to-end translation workflow rather than any single script.

PyTorch, spaCy

Full NMT pipeline
1. The project is built as a full Transformer-based NMT system rather than only a model script.

2. The pipeline includes linguistic normalization with spaCy, subword tokenization with SentencePiece, Transformer training, BLEU evaluation, and inference via greedy or beam search.

3. The codebase is organized into modular units for preprocessing, architecture, training, testing, and translation.

Modeling and inference

Uses canonical multi-head self-attention with an encoder-decoder Transformer structure.

Supports greedy search for speed and beam search for stronger final translation quality.

Tracks loss, perplexity, and BLEU rather than relying on a single metric.
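The trade-off between greedy and beam search can be illustrated with a minimal sketch. This is not the project's decoder: `toy_step` is a hypothetical stand-in for the Transformer decoder that returns next-token log-probabilities from a hand-written table, and the sketch applies no length normalization, which many beam search implementations add.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "</s>"
# A step function maps a token prefix to next-token log-probabilities.
StepFn = Callable[[Tuple[str, ...]], Dict[str, float]]


def toy_step(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical toy model; unknown prefixes always end the sentence."""
    table = {
        (): {"the": 0.6, "a": 0.4},
        ("the",): {"cat": 0.5, "dog": 0.5},
        ("a",): {"cat": 0.95, "dog": 0.05},
    }
    probs = table.get(prefix, {EOS: 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


def greedy_decode(step: StepFn, max_len: int = 10) -> Tuple[List[str], float]:
    """Pick the single most likely token at every step."""
    prefix: Tuple[str, ...] = ()
    score = 0.0
    for _ in range(max_len):
        token, lp = max(step(prefix).items(), key=lambda kv: kv[1])
        prefix += (token,)
        score += lp
        if token == EOS:
            break
    return list(prefix), score


def beam_decode(step: StepFn, beam_size: int = 3, max_len: int = 10) -> Tuple[List[str], float]:
    """Keep the `beam_size` highest-scoring partial hypotheses at every step."""
    beams: List[Tuple[Tuple[str, ...], float]] = [((), 0.0)]
    finished: List[Tuple[Tuple[str, ...], float]] = []
    for _ in range(max_len):
        candidates = [
            (prefix + (token,), score + lp)
            for prefix, score in beams
            for token, lp in step(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            # Hypotheses that emit EOS leave the beam and wait for final ranking.
            (finished if prefix[-1] == EOS else beams).append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off by max_len
    best = max(finished, key=lambda c: c[1])
    return list(best[0]), best[1]


greedy_tokens, greedy_score = greedy_decode(toy_step)
beam_tokens, beam_score = beam_decode(toy_step, beam_size=2)
```

On this toy table, greedy search commits to "the" because it is the best first token, while a beam of size 2 keeps "a" alive long enough to find the higher-probability sentence. Greedy search makes one model call per step and beam search makes one per hypothesis, which is the speed cost the project accepts for final translations.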

Data and training

Applies spaCy lemmatization to normalize text before tokenization.

Uses SentencePiece BPE to reduce unknown-word issues and support open-vocabulary handling.

Supports resumable training with GPU and mixed-precision acceleration.
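To show why BPE reduces unknown-word issues, here is a toy byte-pair-encoding learner in plain Python. It is a teaching sketch of the classic merge procedure, not SentencePiece's implementation or the project's tokenizer code; the word counts and the `</w>` end-of-word marker are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

Symbols = Tuple[str, ...]


def merge_pair(symbols: Symbols, pair: Tuple[str, str]) -> Symbols:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs: Dict[str, int], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent pair."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        pair_counts: Counter = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Split a word into subwords by replaying the learned merges in order."""
    symbols: Symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Illustrative counts; "lowest" never appears in this training data.
merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=5)
pieces = segment("lowest", merges)  # ['low', 'est</w>']
```

The unseen word "lowest" is still representable because it decomposes into subwords learned from "low", "newest", and "widest". That is the open-vocabulary behavior the 16K BPE vocabulary is meant to provide at scale.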

System design

Each pipeline stage is separated cleanly, which makes retraining, evaluation, and inference much easier to reason about.

Beam search and mixed precision both make the system practical: beam search improves final translation quality over greedy decoding, and mixed precision accelerates training on the GPU.

Product capabilities

Built the model around multi-head self-attention and sinusoidal positional encodings.

Combined spaCy lemmatization with SentencePiece BPE tokenization for stronger handling of out-of-vocabulary (OOV) words.

Trained and evaluated on Europarl with beam search decoding and tracked perplexity and BLEU.
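For reference, sinusoidal positional encodings follow the standard Transformer formulation: even dimensions use a sine and odd dimensions a cosine, with wavelengths that grow geometrically across the embedding. The sketch below computes the table in plain Python for clarity; the function name is hypothetical, and a PyTorch project would normally build the same table as a tensor.

```python
import math
from typing import List


def sinusoidal_positional_encoding(max_len: int, d_model: int) -> List[List[float]]:
    """Return a max_len x d_model table of sinusoidal position encodings."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even so sin/cos dimensions pair up")
    table = []
    for pos in range(max_len):
        row = [0.0] * d_model
        for i in range(0, d_model, 2):
            # i is the even dimension index, so the exponent is i / d_model.
            angle = pos / (10000 ** (i / d_model))
            row[i] = math.sin(angle)      # even dimension
            row[i + 1] = math.cos(angle)  # odd dimension
        table.append(row)
    return table


pe = sinusoidal_positional_encoding(max_len=3, d_model=4)
# Position 0 encodes as [0.0, 1.0, 0.0, 1.0] because sin(0) = 0 and cos(0) = 1.
```

Because the table depends only on position and dimension, it adds no trainable parameters and can be computed once and added to the token embeddings.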

Workflow

Training flow

1. Preprocess and normalize the Europarl corpus, then train the SentencePiece tokenizer.

2. Train the Transformer and persist the best checkpoint by validation performance.

3. Evaluate with BLEU and then run beam-search inference on new sentences.
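The bookkeeping behind "persist the best checkpoint" and resumable training can be sketched without any deep-learning library. Everything here is an assumption for illustration: validation loss stands in for "validation performance", a JSON file stands in for a real checkpoint, and the list of losses stands in for actual training and validation passes.

```python
import json
import math
import os
import tempfile
from typing import Dict, List


def perplexity(mean_nll: float) -> float:
    """Perplexity is exp of the mean per-token negative log-likelihood (nats)."""
    return math.exp(mean_nll)


def load_state(path: str) -> Dict:
    """Load saved progress if present, otherwise start fresh."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "best_val_loss": None, "best_epoch": None}


def run_epochs(val_losses: List[float], state_path: str) -> Dict:
    """Continue from the saved epoch and record the best validation loss so far."""
    state = load_state(state_path)
    for epoch in range(state["epoch"], len(val_losses)):
        val_loss = val_losses[epoch]  # placeholder for one train + validate pass
        state["epoch"] = epoch + 1
        best = state["best_val_loss"]
        if best is None or val_loss < best:
            state["best_val_loss"] = val_loss
            state["best_epoch"] = epoch + 1
            # A real run would also write model and optimizer weights here.
        # Progress is written every epoch so an interrupted run can resume.
        with open(state_path, "w") as f:
            json.dump(state, f)
    return state


# Usage: a first run of three epochs, then a resumed run that adds a fourth.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_state.json")
    first = run_epochs([3.0, 2.5, 2.7], path)
    resumed = run_epochs([3.0, 2.5, 2.7, 2.1], path)
```

The resumed run skips the three completed epochs and only processes the fourth. Reporting `perplexity(val_loss)` alongside the loss gives the second tracked metric directly, since one is the exponential of the other.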

Execution model

The project is strongest when viewed as a full training and inference pipeline, not just a model experiment.

Covering the full path from preprocessing to inference gives it more weight than a notebook-only translation prototype.

Case study

Pipeline depth

The translation system was built as an end-to-end pipeline, not just a Transformer implementation in isolation. Preprocessing, tokenization, architecture design, training, decoding, and evaluation all shape the final quality of the model, so the project treats them as one connected workflow. That approach makes the case study stronger because the final performance is tied to the full system rather than a single modeling component.

Why it matters

What gives the project weight is the full path from raw bilingual data to measurable translation output. By covering normalization, subword tokenization, beam search, and BLEU-based evaluation, the system reflects the actual engineering work required to make sequence models usable. It shows the difference between training a model and delivering a complete translation pipeline.