DeepSeekMath introduces a 7B-parameter model that achieves state-of-the-art mathematical reasoning by continuing pre-training on 120B math-related tokens from Common Crawl and introducing Group Relative Policy Optimization (GRPO).
27-01-2025 · mathematical-reasoning · reinforcement-learning · grpo
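The core of GRPO is replacing a learned value-function baseline with a group-relative one: for each prompt, several outputs are sampled and each output's reward is normalized against its own group. A minimal sketch of that advantage computation (the epsilon and the 0/1 correctness rewards are illustrative assumptions, not from the paper's exact setup):

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each sampled output's reward
    against the mean and std of its own group, so no separate
    value network is needed as a baseline."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)  # eps avoids division by zero

# Example: 4 sampled answers to one math prompt, scored 0/1 for correctness.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Outputs scoring above the group mean get positive advantages and are reinforced; those below are suppressed.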
BERT (Bidirectional Encoder Representations from Transformers) pre-trains deep bidirectional transformer encoders using Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) to learn contextual word representations that can be fine-tuned for various NLP tasks, achieving state-of-the-art results without task-specific architectures.
25-01-2025 · transformer · nlp · language-modeling · bert · pre-training · fine-tuning · masked-language-model · bidirectional
LoRA (Low-Rank Adaptation) enables parameter-efficient fine-tuning of large language models by decomposing weight updates into low-rank matrices, dramatically reducing trainable parameters while preserving pre-trained knowledge and allowing zero-inference-latency deployment through weight merging.
25-01-2025 · fine-tuning · low-rank · transfer-learning
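The low-rank decomposition and the zero-latency merge can both be shown in a few lines. A NumPy sketch (dimensions, rank, and scaling are illustrative; in the paper B is zero-initialized so the adapter starts as a no-op):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 8                 # frozen weight is d x k; adapter rank r << min(d, k)
alpha = 16                          # LoRA scaling hyperparameter

W = rng.normal(size=(d, k))         # frozen pre-trained weight
A = rng.normal(size=(r, k)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero-initialized

def lora_forward(x):
    # h = Wx + (alpha/r) * BAx; only A and B receive gradients.
    # Trainable params: r*(d+k) instead of d*k.
    return W @ x + (alpha / r) * (B @ (A @ x))

def merged_weight():
    # Fold the update into W after training: zero extra inference latency.
    return W + (alpha / r) * (B @ A)

x = rng.normal(size=(k,))
```

Because the adapter is additive, `lora_forward(x)` and `merged_weight() @ x` are exactly equal, which is what makes the merge lossless.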
A deep dive into Rotary Position Embedding (RoPE), an elegant solution for encoding positional information in transformers that enables better length extrapolation and relative position modeling.
25-01-2025 · transformers · position-encoding · attention · nlp
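RoPE's relative-position property is easy to verify numerically: rotate query/key dimension pairs by a position-dependent angle, and dot products depend only on the positional offset. A minimal sketch using the split-half pairing convention (the base frequency follows the common 10000 default; this is one of several equivalent pairing layouts):

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embedding to x of shape (seq_len, dim).
    Each (x1[i], x2[i]) pair of dimensions is rotated by an angle
    proportional to the token's position."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-2.0 * np.arange(half) / dim)   # per-pair frequencies
    angles = np.outer(np.arange(seq_len), freqs)     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

# Same query/key vector placed at every position:
rng = np.random.default_rng(0)
Q = rope(np.tile(rng.normal(size=8), (4, 1)))
K = rope(np.tile(rng.normal(size=8), (4, 1)))
```

Since rotation is norm-preserving, `Q[m] @ K[n]` depends only on the offset `m - n`, which is the relative-position behavior the post discusses.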
Adam (Adaptive Moment Estimation) is an optimization algorithm that adaptively adjusts learning rates per parameter by combining exponential moving averages of gradients (first moment) and squared gradients (second moment), with bias correction to achieve efficient stochastic optimization.
22-01-2025 · optimization · gradient-descent · adaptive-learning-rate · momentum · adam · stochastic-optimization
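The update described above fits in one function: exponential moving averages of the gradient and squared gradient, bias-corrected because both start at zero. A sketch with the paper's default hyperparameters, applied here to the toy objective f(x) = x² (the toy problem and the larger learning rate are illustrative choices):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update. m: EMA of gradients (first moment),
    v: EMA of squared gradients (second moment), t: step count from 1."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)        # bias correction for zero init
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(x) = x^2, so grad = 2x.
theta, m, v = 3.0, 0.0, 0.0
for t in range(1, 501):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t, lr=0.1)
```

Note the effective step size is roughly bounded by `lr` regardless of gradient scale, since `m_hat / sqrt(v_hat)` is approximately ±1 when gradients keep their sign.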
Generative Adversarial Networks (GANs) train two competing neural networks (a generator that creates synthetic samples and a discriminator that distinguishes real from fake) in a minimax game to learn data distributions and generate realistic samples without explicit density modeling.
22-01-2025 · generative-models · adversarial-training · deep-learning · neural-networks · gan
Deep Q-Network (DQN) combines Q-learning with convolutional neural networks to learn control policies directly from raw pixel inputs in Atari games, using experience replay to stabilize training and achieve human-level performance.
22-01-2025 · reinforcement-learning · deep-learning · q-learning · dqn · cnn
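The experience replay mechanism that stabilizes DQN training is a simple data structure: store transitions as the agent acts, then train on random minibatches drawn from the whole history. A minimal sketch (capacity and the integer dummy transitions are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay: sampling random past transitions breaks the
    temporal correlation between consecutive frames that otherwise
    destabilizes Q-learning with a neural network."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return tuple(zip(*batch))  # columns: states, actions, rewards, ...

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer()
for i in range(100):                      # dummy transitions for illustration
    buf.push(i, 0, 1.0, i + 1, False)
states, actions, rewards, next_states, dones = buf.sample(32)
```

In the full algorithm each sampled minibatch supplies targets `r + γ max_a' Q(s', a')` for the network update.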
YouTube's deep neural network-based recommendation system using a two-stage architecture with candidate generation and ranking, incorporating negative sampling, importance weighting, and features like watch history and search queries to provide personalized video recommendations.
21-01-2025 · recsys · deep-learning · neural-networks · candidate-generation · ranking
Vision Transformer (ViT) replaces convolutions with a pure Transformer encoder over image patches, using a learnable [class] token and minimal inductive bias to achieve strong image recognition performance at scale.
20-01-2025 · transformer · attention
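The "Transformer over image patches" idea reduces to one reshape: split the image into non-overlapping patches and flatten each into a token. A NumPy sketch of that patchify step (ViT-Base's 16×16 patches on a 224×224 input; the linear projection and [class] token that follow are omitted):

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches,
    producing the (num_patches, patch*patch*C) token sequence a ViT
    encoder consumes before linear projection."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    rows, cols = h // patch, w // patch
    x = image.reshape(rows, patch, cols, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)           # group the two patch-grid axes
    return x.reshape(rows * cols, patch * patch * c)

tokens = patchify(np.zeros((224, 224, 3)))   # 224/16 = 14, so 14*14 patches
```

Each 16×16×3 patch flattens to 768 values, giving a 196-token sequence; the learnable [class] token is then prepended to make 197.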