Tanul Singh

ML Engineer · 5+ years in NLP & LLMs

# trained from scratch. no pre-trained weights.

Initialized with a Mechanical Engineering degree, pre-trained on curiosity, and fine-tuned by an unreasonable number of late nights — 5+ years of gradient descent through NLP, LLMs, and generative AI. Currently inference-serving at Apple. I write about ML here so you don't have to train from scratch too.

Senior ML Engineer

Apple — multi-agentic systems & LLM research

Kaggle Grandmaster

Notebooks GM, Competitions Master

US Patent Holder

Dynamic intent detection system

Self-Taught

ME degree → ML through sheer will

model.fit(life, epochs=∞, lr=persistence)

My Training Curve

Each milestone is a neuron. Experiences flow forward. Lessons backpropagate. The loss is still decreasing.

[Interactive graph: each milestone is a neuron. Nodes: The Spark · The Hardest Year · Found Kaggle · Notebooks GM · MLE at LevelAI · Senior MLE · Patent & Paper · Lead MLE · Apple. Legend: forward pass (life moving forward), backprop (lessons learned), turning points.]

training_loss.plot()

[Interactive plot: loss vs. epochs (time); x-axis '17 through '25 plus a trailing "?", y-axis from 1.0 down to 0.0.]
“You don’t need a low initial loss. You need a good learning rate and the patience to keep training.”

— the philosophy that took me from Mechanical Engineering to Apple

latest

Paper Explanations & Articles

Transformers · NLP · Architecture

Beyond Attention: Anatomy of a Modern Transformer

Attention gets all the press, but the rest of the transformer matters just as much. This post traces how feed-forward networks, normalization, residual connections, activations, and embeddings evolved from the original 2017 design to the modern LLM recipe — and why seemingly small changes (removing a bias term, swapping an activation function) compound into models that train faster, scale further, and perform better.

May 15, 2026 · 41 min read
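A taste of the "seemingly small changes" the post traces: a minimal sketch of a modern feed-forward block with SwiGLU gating and bias terms removed. Illustrative only, not code from the article; the dimensions are made-up examples.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Modern-style transformer feed-forward block: gated SwiGLU
    activation, no bias terms. Hypothetical sketch, not the article's code."""
    def __init__(self, d_model: int = 512, d_hidden: int = 1365):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)  # gate branch
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)    # value branch
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)  # project back down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(gate(x)) multiplied elementwise with up(x)
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```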
Transformers · NLP · Attention

Attention Part 4 — Flash Attention: Making GPUs Actually Work

The complete story of Flash Attention: why naive attention is memory-bound despite being 'just matrix multiplies', the GPU memory hierarchy that explains everything, the online softmax trick that makes tiling possible, and how it all composes into the most impactful systems optimization in modern transformers.

May 13, 2026 · 33 min read
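The online-softmax trick mentioned above, in miniature: keep a running max and a rescaled running sum so the full score vector never has to sit in fast memory at once, which is what makes tiling possible. A minimal NumPy sketch of the idea, not the post's code:

```python
import numpy as np

def online_softmax(scores):
    """One-pass softmax over a stream of scores: track a running max m and
    a running sum s of exp(x - m), rescaling s whenever m grows."""
    m, s = -np.inf, 0.0
    for x in scores:
        m_new = max(m, x)
        s = s * np.exp(m - m_new) + np.exp(x - m_new)  # rescale old sum, add new term
        m = m_new
    return np.exp(np.asarray(scores) - m) / s  # normalized weights

# Agrees with the usual two-pass softmax on e.g. online_softmax([1.0, 2.0, 3.0])
```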
Papers · Gemma · Google DeepMind

Dissecting Gemma 4: Architecture from the Ground Up

A complete architectural dissection of Gemma 4: how hybrid attention (sliding window + global layers), Mixture of Experts, and careful design choices compose into a model that handles 256K tokens efficiently.

May 13, 2026 · 2 min read
Papers · Sparse Attention · DeepSeek

Native Sparse Attention: Hardware-Aligned Learned Sparsity

A ground-up dissection of NSA: how compressed attention, content-aware routing, and sliding windows compose into one mechanism. From the paper's math to working PyTorch, with hardware-alignment insights.

May 13, 2026 · 3 min read

Want to know more?

Watch my conversations with people who shaped the Indian ML community: