Machine Learning Interview Guide: Complete Preparation for Research Scientists and ML Engineers

Abstract
Landing a Research Scientist or Machine Learning Engineer role at a top AI lab is one of the most demanding hiring processes in technology. This guide consolidates field-tested preparation strategies covering the complete interview lifecycle: securing interviews through publications and targeted outreach; navigating every technical round - algorithms coding, ML coding, deep learning theory, transformer and LLM questions, reinforcement learning, ML system design; managing process logistics and timing; negotiating compensation including the mechanics of RSUs versus stock options; and making the final offer decision. A preparation checklist opens the guide; a comprehensive topic reference of 90+ ML concepts and an expanded FAQ close it.
1. Introduction: What This Machine Learning Interview Guide Covers
There is almost no practical, candid information available on what it actually takes to secure a Research Scientist or Machine Learning Engineer position at a top AI lab. This guide aims to fill that gap - written from firsthand experience running a full interview process across multiple frontier AI labs and AI-focused companies, combined with structured research into primary-source papers and publicly available preparation material.
A few honest context notes upfront. Some processes conclude after an offer has already been signed elsewhere - that is normal and does not mean those companies were uninterested. A rejection at one track (e.g. Research Scientist) sometimes comes with a redirect to a different track (e.g. an Engineering or Applied role) at the same company. And applications to several strong-fit companies will not progress past the initial screen at all, even with a strong profile - résumé and applicant-tracking-system screening is noisy at high-volume employers, and research from Harvard Business School and Accenture on hiring funnels documents how qualified candidates are routinely filtered out for reasons that have little to do with their actual ability.
This guide covers machine learning interview preparation end-to-end: how to get the first call, how to prepare for every technical round, which topics to study (deep learning, transformers, LLMs, reinforcement learning, distributed training, ML system design), how to handle logistics and negotiation, and how to make the final decision. Specific question banks for each interview type are included throughout.
If you need to first build the underlying knowledge before tackling interview prep, our Complete ML Learning Roadmap covers each subject area in depth across a structured 12–18 month path from beginner to interview-ready.
2. Machine Learning Interview Preparation Checklist
Use this checklist to track readiness across every dimension of the ML interview process. Candidates who arrive at first-round interviews without completing the coding and implementation items are consistently underprepared, regardless of research depth.
2.1 Application and Pipeline Readiness
- CV updated with accurate, specific descriptions of research contributions
- Target companies identified and prioritised (at least 8–12 applications)
- Cold email templates drafted for key hiring managers
- LinkedIn and Google Scholar profiles current and consistent
- Spreadsheet set up to track company, stage, deadlines, and contacts
- At least one warm referral activated per target company (where possible)
2.2 Algorithms Coding Readiness
- Blind 75 completed with solutions understood (not just memorised)
- NeetCode 150 Mediums completed (or equivalent ~150 Medium problems)
- Core patterns fluent: DFS, BFS, graphs, dynamic programming, backtracking, binary search, two pointers, sliding window
- Able to solve a LeetCode Medium in under 20 minutes consistently
- Optimal time complexity known for each core pattern (not just a working solution)
2.3 ML Coding and Implementation Readiness
- Scaled dot-product attention implemented from scratch (PyTorch or JAX)
- Multi-head attention implemented from scratch with correct head split/merge
- Full transformer block implemented (encoder, decoder, or decoder-only)
- Flash Attention algorithm understood at implementation level
- Attention backward pass derived from first principles
- MLP forward pass and manual backpropagation implemented
- Training loop with gradient clipping written without reference
- Common debugging scenarios practised: gradient explosion, NaN losses, shape mismatches
- LoRA implementation understood (low-rank decomposition mechanics)
2.4 ML Theory and Knowledge Readiness
- Flashcards written (not downloaded) for all 90+ topics in the reference list
- RLHF pipeline understood end-to-end (reward model, PPO, GRPO)
- Diffusion model forward and reverse processes explained without notes
- Scaling laws, MoE, and positional encoding variants (RoPE, sinusoidal) reviewed
- Distributed training paradigms covered: DDP, FSDP, tensor parallelism, pipeline parallelism
- At least one ML system design question practised end-to-end
- Papers listed on your CV re-read, and opinions prepared on at least 3 recent papers
2.5 Logistics and Process Readiness
- Interview schedule sequenced: lower-priority companies first
- All active processes disclosed to each company (standard practice)
- On-demand coding tests held back until most target companies have scheduled interviews
- Competing offers understood well enough to negotiate with specifics
- RSU vs stock option mechanics understood before any offer arrives
- Pre-interview routine established (sleep, exercise, cognitive reset)
3. How to Get Machine Learning Interviews at Top AI Labs
Getting past the application stage is its own challenge. The main levers are well-known: more papers, trendier research topics, and stronger internship experience. A rough benchmark that holds across top labs: 3 or more first-author papers published at ICLR, NeurIPS, or ICML, plus at least one internship or prior industry role, is typically the threshold for consistent callbacks. Work in high-demand areas - large language models, RLHF, diffusion models, and multimodal AI - generates substantially more traction than equally strong work in less fashionable areas.
3.1 Application Channels That Actually Work
LinkedIn and X (Twitter): Many labs - especially for internships - advertise roles that require filling out a linked Google Form to be considered. Simply clicking "Apply" on LinkedIn Jobs itself is often not sufficient. Following researchers and hiring managers at target companies is the most reliable way to catch these posts before they fill.
Referrals: Helpful, but not necessary. At competitive labs, referred and unreferred candidates both receive interview invitations - referrals can accelerate timelines and increase visibility, but their absence should not prevent an application. If a connection exists at a target company and no progress has been made after applying directly, asking for a referral is worth attempting.
Cold emails: Emailing a hiring manager or team member directly is frequently appreciated. The email should not simply restate a CV - it should explain specifically why the candidate is a strong fit for that team and what genuinely excites them about the work. A well-targeted cold email to a frontier AI lab can and does get a direct, personal reply from a hiring manager often enough to be worth the effort. Even in cases where the email goes unanswered, interviews sometimes still proceed - but the outreach creates a useful secondary signal of genuine interest.
Cover letters: Rarely required, but worth doing properly when they are. Cover letters generated wholesale by an AI model are easy to identify and make a poor impression. The better approach: write the letter authentically, then use a tool to polish the prose. Personality and genuine excitement are what differentiate these letters.
3.2 Research Scientist at a Startup vs a Large AI Lab
| Factor | Large AI Lab (DeepMind, Anthropic, Meta AI, OpenAI) | AI Startup (seed to Series C) |
|---|---|---|
| Discoverability | Roles are public and well-known; competition is extremely high | Harder to find - word of mouth is the primary channel; lower competition |
| Research Work | Consistently high-quality; access to large compute resources | Can be world-class or engineering-focused; agenda may shift frequently |
| Growth & Visibility | One of many researchers; slower path to ownership | High visibility; faster growth and influence over research direction |
| Compensation | Salary plus RSUs with relatively clear liquidity path | Salary plus stock options - upside possible but should be heavily discounted |
| CV Signal | Immediately recognised globally | Requires explanation; less portable unless company achieves prominence |
| Compute Access | Large-scale GPU/TPU clusters for pretraining and experiments | Variable - may be constrained; cloud credits common at early stage |
4. Machine Learning Interview Structure: What to Expect at Each Stage
Most top AI labs and ML-focused tech companies follow a broadly similar interview structure, though the weight and difficulty given to each stage varies considerably. Expect 3–8 technical interviews depending on the lab. The rounds below represent the full spectrum - not every company uses all of them.
| # | Round | Duration | What Is Tested |
|---|---|---|---|
| 01 | Recruiter Screen | 30–45 min | Background, motivation, role fit, basic CV probing |
| 02 | Algorithms Coding | 45–60 min | LeetCode-style Mediums: DFS/BFS, DP, binary search, two pointers |
| 03 | ML Coding & Debugging | 45–60 min | Implementing ML components from scratch; debugging existing code |
| 04 | ML Theory & Knowledge | 45–60 min | Deep learning, transformers, LLMs, RL, distributed training concepts |
| 05 | Behavioural / Research Discussion | 30–60 min | STAR stories, research interests, opinions on frontier work |
| 06 | ML System Design | 45–60 min | Open-ended design of ML pipelines, infrastructure, and serving systems |
5. Algorithms Coding Interview
The algorithms coding round tests LeetCode-style problem solving at Medium difficulty. This round is separate from ML coding - it evaluates pure data structures and algorithms fluency and is a prerequisite gate at most top labs.
5.1 Core Patterns to Master
- Graph Traversal (DFS / BFS): Connected components, shortest paths, cycle detection
- Dynamic Programming: 1D and 2D DP, memoisation vs tabulation, common templates
- Binary Search: Search on answer, rotated arrays, finding boundaries
- Two Pointers: Sorted arrays, palindromes, container problems
- Sliding Window: Variable and fixed windows, frequency counting
- Backtracking: Permutations, combinations, N-queens variants
- Heap / Priority Queue: Top-k problems, merge k sorted lists
- Trees: Traversals, LCA, diameter, serialisation
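As an illustration of the first pattern in the list above, the following is a minimal sketch of breadth-first search for the fewest-edges path in an unweighted graph. The function name `bfs_shortest_path` and the example graph are hypothetical, chosen only for this sketch; interview problems usually wrap the same skeleton in a grid or an implicit state space.

```python
from collections import deque


def bfs_shortest_path(graph, start, goal):
    """Return the fewest-edges path from start to goal, or None if unreachable."""
    if start == goal:
        return [start]
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour in parent:
                continue  # already discovered via a path that is no longer
            parent[neighbour] = node
            if neighbour == goal:
                # Walk parent pointers back to the start, then reverse
                path = [goal]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


# Hypothetical example graph as adjacency lists
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
path = bfs_shortest_path(graph, "a", "e")
```

Because BFS explores nodes in order of distance from the start, the first time the goal is discovered is guaranteed to be along a shortest path, and the whole search runs in O(V + E).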
5.2 Most Common ML Interview Coding Questions
The questions below are conceptual rather than coding problems, but they appear across virtually every ML interview process at top labs. Treat each as a starting point for deeper exploration.
Machine Learning Fundamentals Questions
- Explain the bias-variance tradeoff. How does it relate to model complexity?
- What is the difference between L1 and L2 regularisation? When would you use each?
- Explain backpropagation from first principles. Derive the gradient for a two-layer network.
- What is the vanishing gradient problem, and how do residual connections address it?
- Explain batch normalisation, layer normalisation, and RMSNorm. What are the differences?
- What is the difference between MLE and MAP estimation? Give a concrete example.
- Explain precision, recall, F1, and AUC-ROC. When does each metric matter most?
- What is KL divergence? How is it used in VAEs?
- Explain the Adam optimiser. What does it do that SGD with momentum does not?
- What is the curse of dimensionality? How does it affect nearest-neighbour search?
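For the metrics question in the list above, it can help to compute the quantities by hand once. The sketch below assumes binary labels encoded as 0 and 1; the function name and the toy labels are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many are right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of true positives, how many were found
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean
    return precision, recall, f1


# Toy example: 3 positives, 5 negatives; the model finds 2 positives and raises 1 false alarm
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here accuracy would be 6/8, while precision and recall are both 2/3, which is the kind of gap that makes accuracy misleading on imbalanced data.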
Research Scientist Interview Questions
- Walk me through your most significant research contribution. What made it non-obvious?
- What open problem in machine learning do you find most interesting right now, and why?
- You have a negative result in your experiments. How do you decide whether to change the hypothesis or the method?
- Describe a time you disagreed with a collaborator on a research direction. How was it resolved?
- How do you approach reproducing results from a paper that lacks sufficient implementation detail?
- What is your opinion on the current direction of scaling language models? Is scaling sufficient for AGI?
- Pick a recent paper you found compelling. What would you do next to extend it?
5.3 Preparation Strategy
Complete Blind 75 first - these cover the essential patterns. Then work through approximately 75 additional NeetCode Mediums with a focus on Dynamic Programming and Graphs, which appear most frequently in ML lab interviews. Target 20 minutes or fewer per Medium problem. If stuck beyond 15 minutes, look up the solution, understand the pattern, flag it for review, and move on. Breadth across patterns matters more than deep mastery of any single type.
6. ML Coding and Implementation Interview
The ML coding round tests two distinct skills: implementing ML components from scratch with correct logic and tensor shapes, and debugging existing ML code efficiently. Both require deep familiarity with PyTorch or JAX at the level of understanding what every line does - not just using high-level APIs.
6.1 Implementation Questions
- Implement scaled dot-product attention from scratch. Include the masking logic for causal attention.
- Implement multi-head attention. Correctly handle the split and merge of heads.
- Implement a full transformer decoder block (self-attention + FFN + residuals + layer norm).
- Implement Flash Attention at an algorithmic level. Explain why it is memory-efficient.
- Implement the backward pass for softmax from first principles.
- Implement a simple training loop with gradient clipping in PyTorch. Include a learning rate scheduler.
- Implement LoRA (Low-Rank Adaptation). Explain how it reduces trainable parameters.
- Implement k-means clustering from scratch. Discuss convergence and initialisation.
- Implement dropout correctly during training and inference.
- Implement sinusoidal and RoPE positional encodings.
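The softmax backward pass from the list above is a common stumbling point, so a small framework-free sketch may be useful. It uses plain Python lists instead of tensors purely to keep the maths visible; the names `softmax` and `softmax_backward` and the example vectors are illustrative assumptions.

```python
import math


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_backward(s, grad_out):
    """Given s = softmax(z) and dL/ds, return dL/dz.

    From ds_i/dz_j = s_i * (delta_ij - s_j) it follows that
    dL/dz_i = s_i * (dL/ds_i - sum_j dL/ds_j * s_j).
    """
    dot = sum(g * si for g, si in zip(grad_out, s))
    return [si * (g - dot) for si, g in zip(s, grad_out)]


# Example: loss L = sum_i w_i * softmax(z)_i, so dL/ds = w
z = [1.0, 2.0, 0.5]
w = [0.3, -1.2, 2.0]
s = softmax(z)
dz = softmax_backward(s, w)
```

A quick sanity check in an interview is that the entries of `dz` sum to zero, since adding a constant to every logit leaves the softmax output unchanged.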
6.2 Reference Implementations
```python
# Scaled dot-product attention - from scratch
import torch
import torch.nn.functional as F


def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Args:
        Q, K, V: (batch, heads, seq_len, d_k)
        mask: optional causal or padding mask (bool tensor)
    Returns:
        output: (batch, heads, seq_len, d_k)
        weights: (batch, heads, seq_len, seq_len)
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (B, H, T, T)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights  # (B, H, T, d_v)
```
```python
# LoRA: Low-Rank Adaptation of a linear layer
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=4, alpha=1.0):
        super().__init__()
        # Base weight stands in for a pretrained matrix and is frozen;
        # only the low-rank factors A and B receive gradients.
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features), requires_grad=False
        )
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        # B starts at zero so the adapter is a no-op at initialisation
        self.B = nn.Parameter(torch.zeros(out_features, rank))
        self.alpha = alpha
        self.rank = rank

    def forward(self, x):
        base = x @ self.weight.T
        lora = x @ self.A.T @ self.B.T
        return base + (self.alpha / self.rank) * lora
```
```python
# Multi-head attention - complete module
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        # Project and reshape to (B, H, T, d_k)
        Q = self.W_q(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = F.softmax(scores, dim=-1)
        out = attn @ V  # (B, H, T, d_k)
        # Merge heads and project
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.W_o(out)
```
6.3 Debugging Questions
- This training loop is producing NaN losses after 200 steps. What are the most likely causes and how do you diagnose them?
- Training loss is decreasing but validation loss is flat. What is happening and what would you change?
- This attention implementation produces shapes (B, H, T, T) but the output is wrong. Find the bug.
- Gradient norms are exploding after 50 steps. What would you check first?
- A PyTorch model trains correctly on CPU but produces wrong outputs on GPU. What could cause this?
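For the NaN and exploding-gradient questions above, a first diagnostic step is usually to check the global gradient norm and whether any gradient is already non-finite. The sketch below shows global-norm clipping on plain nested lists. It is similar in spirit to PyTorch's `torch.nn.utils.clip_grad_norm_`, but the function `clip_by_global_norm` and its error behaviour are assumptions made for this illustration, not a copy of any library's implementation.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients so their combined L2 norm is at most max_norm.

    Returns (clipped_grads, original_norm). Raises if any value is NaN or inf,
    because clipping cannot repair a gradient that is already non-finite.
    """
    flat = [g for group in grads for g in group]
    if not all(math.isfinite(g) for g in flat):
        raise ValueError("non-finite gradient detected; inspect the loss and inputs")
    total_norm = math.sqrt(sum(g * g for g in flat))
    # Small constant guards against division by zero when all gradients are 0
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [[g * scale for g in group] for group in grads], total_norm


# Two hypothetical parameter groups; global norm = sqrt(9 + 16 + 144) = 13
grads = [[3.0, 4.0], [12.0]]
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
```

Logging the returned pre-clip norm at every step is often more informative than the clipping itself: a norm that climbs steadily before the loss turns NaN points toward learning rate or initialisation, while a sudden spike points toward a bad batch or a numerical issue such as a log of zero.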
6.4 ML Implementation Checklist
- Full transformer architecture from scratch - reference nanoGPT
- Causal self-attention, cross-attention, and multi-head attention
- Flash Attention - algorithm-level understanding and implementation rationale
- Attention backward pass from first principles
- MLP forward pass and manual backpropagation
- LoRA implementation (low-rank decomposition of weight updates)
- Training loop with gradient clipping in PyTorch or JAX
- Debugging: gradient explosion, NaN losses, shape mismatches, training loop bugs
- RLHF pipeline: reward model training, PPO loop, KL penalty
- Diffusion model forward process (noise schedule) and reverse process (DDPM/DDIM)
7. Deep Learning Theory Interview Questions
Deep learning questions test both conceptual understanding and the ability to reason from first principles. Interviewers at top AI labs expect derivations, not just definitions.
7.1 Architecture and Training
- Explain the residual connection in ResNets. Why does adding the identity shortcut help training?
- Derive the backpropagation update rule for a two-layer neural network with cross-entropy loss.
- Explain why weight initialisation matters. Describe Xavier and He initialisation and when each is used.
- What is the difference between batch normalisation and layer normalisation? Why do transformers use layer norm?
- Explain how dropout acts as a regulariser. What is it approximating at test time?
- What causes the exploding and vanishing gradient problems? How do gradient clipping and residual connections address them?
- Explain the difference between RNNs, LSTMs, and GRUs. What limitation of RNNs do LSTMs solve?
- Explain convolutional neural networks. What is the inductive bias that makes them effective for image data?
- What is the difference between online, mini-batch, and full-batch gradient descent? What are the tradeoffs?
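The dropout question above is easiest to answer with the "inverted dropout" formulation in mind. The following is a minimal sketch on a plain list of activations, assuming a drop probability `p` strictly below 1; the function name and the explicit `rng` argument are illustrative choices.

```python
import random


def dropout(x, p, training, rng):
    """Inverted dropout on a list of activations; p is the drop probability."""
    assert 0.0 <= p < 1.0
    if not training or p == 0.0:
        return list(x)  # identity at inference time
    keep = 1.0 - p
    # Zero each unit with probability p and scale survivors by 1/keep,
    # so the expected activation matches the no-dropout value.
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]


rng = random.Random(0)
x = [1.0] * 10000
train_out = dropout(x, p=0.5, training=True, rng=rng)
eval_out = dropout(x, p=0.5, training=False, rng=rng)
```

Because the rescaling happens during training, the inference path needs no correction, which is why forgetting to switch a model into evaluation mode is a classic source of noisy validation metrics.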
7.2 Optimisation
- Derive the Adam update rule. What do the first and second moment estimates represent?
- What is the learning rate warmup and why is it important for transformer training?
- Explain cosine annealing with warm restarts. When is it preferred over a fixed learning rate?
- What is gradient accumulation? When and why would you use it?
- Explain mixed precision training (BF16 vs FP16 vs FP32). What are the stability differences?
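To make the Adam derivation above concrete, here is a sketch of the update for a single scalar parameter, using the commonly quoted default hyperparameters. The function `adam_step` and the toy objective f(x) = x^2 are assumptions made for this example.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad         # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Minimise f(x) = x^2 starting from x = 1.0
x, m, v = 1.0, 0.0, 0.0
first_step = None
for t in range(1, 2001):
    grad = 2 * x
    new_x, m, v = adam_step(x, grad, m, v, t, lr=0.01)
    if t == 1:
        first_step = x - new_x  # approximately equal to lr, whatever the gradient scale
    x = new_x
```

The first step illustrates the property interviewers often probe: after bias correction the update is roughly `lr * sign(grad)`, so the step size is decoupled from the raw gradient magnitude, which plain SGD with momentum does not provide.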
7.3 Generative Modelling
- Explain the VAE objective (ELBO). What is the role of the KL term and the reconstruction term?
- Explain how GANs work. What is the minimax objective? What training instabilities can arise?
- Explain diffusion models. What happens in the forward process? What does the model learn to predict in the reverse process?
- What is DDIM? How does it differ from DDPM in sampling?
- Explain flow matching. What is the ODE it solves and how does it relate to diffusion?
- What is classifier-free guidance and why does it improve sample quality?
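For the diffusion questions above, the forward process is simple enough to write out directly. The sketch below assumes a DDPM-style linear beta schedule from 1e-4 to 0.02 over 1000 steps and operates on a plain list instead of an image tensor; all function names are illustrative.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha_bar(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I) in one shot."""
    signal = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal * xi + noise_scale * e for xi, e in zip(x0, eps)]
    return x_t, eps  # eps is the regression target for a noise-prediction model


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alpha_bar(betas)
x_t, eps = q_sample([1.0, -1.0, 0.5], t=999, alpha_bars=alpha_bars, rng=random.Random(0))
```

By the final step `alpha_bars[-1]` is close to zero, so `x_t` is almost pure Gaussian noise. The closed-form jump to any timestep is also what makes training efficient, since no step-by-step simulation of the forward chain is needed.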
8. Transformer and LLM Interview Questions
Transformer architecture questions are among the most heavily tested in ML interviews at research-focused labs. Interviewers expect candidates to understand transformers at implementation level - not just at the level of calling nn.MultiheadAttention. See our companion Transformers Explained article for textbook-level derivations of each component.
8.1 Architecture Questions
- Explain the scaled dot-product attention mechanism. Why is the scaling by √d_k necessary?
- What is multi-head attention? What does each head learn to attend to?
- Explain the difference between encoder-only, decoder-only, and encoder-decoder transformer architectures. Give examples of each.
- What is causal (masked) self-attention? Why is it necessary for autoregressive language models?
- Explain cross-attention. In what architectures is it used?
- What is the role of the feed-forward sublayer in a transformer block? What is the typical expansion ratio?
- Explain sinusoidal positional encodings. Why are they used rather than learned embeddings?
- Explain RoPE (Rotary Position Embedding). What advantage does it have for handling longer context?
- What is Flash Attention? Explain the tiling algorithm and why it reduces memory usage from O(n²) to O(n).
- What is Transformer-XL? How does it handle long-range dependencies beyond a fixed context window?
- What is the Griffin architecture? How does it combine recurrence and attention?
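The sinusoidal encoding asked about above can be written in a few lines. This sketch follows the original transformer formulation, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), and assumes an even `d_model`; the function name is illustrative.

```python
import math


def sinusoidal_positional_encoding(num_positions, d_model, base=10000.0):
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, d_model, 2):  # i plays the role of 2i in the formula
            angle = pos / (base ** (i / d_model))
            row.append(math.sin(angle))  # even dimension
            row.append(math.cos(angle))  # odd dimension
        table.append(row)
    return table


pe = sinusoidal_positional_encoding(num_positions=50, d_model=8)
```

Each (sin, cos) pair traces a circle at its own frequency, so every row has the same norm and a fixed offset between positions corresponds to a fixed rotation per pair. That same observation is the starting point for RoPE, which applies the rotation to queries and keys instead of adding it to the embeddings.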
8.2 LLM Training and Post-Training Alignment
- Explain RLHF (Reinforcement Learning from Human Feedback) end-to-end. What are the three training stages?
- What is the reward model in RLHF? How is it trained? What are its failure modes?
- Explain PPO as used in RLHF fine-tuning. What does the KL penalty term do?
- What is DPO (Direct Preference Optimisation)? How does it differ from PPO-based RLHF?
- What is GRPO? How does it differ from PPO in the context of reasoning models?
- Explain instruction tuning (SFT). What data format is used and what does it teach the model?
- What is the Chinchilla scaling law? What was its key finding about compute-optimal training?
- Explain KV cache. Why is it used during inference and what memory tradeoffs does it create?
- What is speculative decoding? How does it improve inference throughput?
- What is quantisation (INT8, INT4)? What accuracy tradeoffs does it introduce?
- Explain context length extension techniques: RoPE/position interpolation, ALiBi, and YaRN.
- What are hallucinations in LLMs? What are the known mechanisms and mitigations?
- Explain retrieval-augmented generation (RAG). What problems does it solve and what are its limitations?
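The KV cache memory tradeoff mentioned above is largely arithmetic, and it is worth being able to do it on a whiteboard. The sketch below assumes standard multi-head attention with one key and one value vector cached per head, per layer, per token; the 32-layer, 32-head, 128-dimension configuration is a hypothetical 7B-class model, not a specific released one.

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Total KV cache size in bytes, assuming a 16-bit cache by default."""
    # Leading factor of 2: one tensor for keys and one for values in every layer
    return 2 * num_layers * num_heads * head_dim * seq_len * batch_size * bytes_per_value


GIB = 2 ** 30
# Hypothetical 7B-class configuration with a 4096-token context
single = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
batched = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=16)
single_gib = single / GIB    # 2.0
batched_gib = batched / GIB  # 32.0
```

The cache grows linearly in both sequence length and batch size, which is why serving throughput is often limited by cache memory and not by the model weights.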
8.3 Training and Efficiency
- What are scaling laws for transformer language models? What do they predict about the optimal balance of compute, data, and parameters?
- Explain Mixture of Experts (MoE). How does conditional computation reduce cost per token?
- What is LoRA? Derive why it reduces the number of trainable parameters during fine-tuning.
- Explain tokenisation. What is BPE and why is it used for LLM vocabularies?
- What are the standard decoding strategies for language models? Compare greedy, beam search, top-k, and nucleus sampling.
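For the LoRA question above, the parameter saving follows from counting the entries of the two low-rank factors. The sketch assumes a single weight matrix of shape (d_out, d_in) adapted with factors A of shape (rank, d_in) and B of shape (d_out, rank); the 4096-wide layer and rank 8 are illustrative numbers.

```python
def lora_param_counts(d_in, d_out, rank):
    full = d_in * d_out           # trainable entries in a dense update to W
    lora = rank * (d_in + d_out)  # entries in A (rank x d_in) plus B (d_out x rank)
    return full, lora


full, lora = lora_param_counts(d_in=4096, d_out=4096, rank=8)
reduction = full / lora  # 256.0
```

The ratio simplifies to `d_in * d_out / (rank * (d_in + d_out))`, so for square layers the saving scales as `d / (2 * rank)`: wider layers and smaller ranks give larger reductions.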
9. Reinforcement Learning Interview Questions
Reinforcement learning questions are heavily tested at labs whose work intersects with alignment, reasoning, robotics, and game-playing AI. Candidates with RL research backgrounds should expect depth; those from other areas should ensure solid coverage of fundamentals and the RLHF pipeline.
9.1 Foundations
- Define a Markov Decision Process (MDP). What are the components (S, A, R, P, γ)?
- What is the Bellman equation? Derive the Bellman optimality equation for Q-values.
- Explain the difference between on-policy and off-policy learning. Give an example of each.
- What is temporal difference (TD) learning? How does it differ from Monte Carlo methods?
- Explain Q-learning. Why is it off-policy? What is its convergence guarantee?
- Explain the policy gradient theorem. Derive the REINFORCE update rule.
- What is the credit assignment problem? How do algorithms like GAE address it?
- Explain the exploration vs exploitation tradeoff. What strategies exist beyond ε-greedy?
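To ground the Q-learning question above, here is a tabular sketch on a tiny deterministic environment. The 4-state chain, the reward of 1 at the right end, and the purely random behaviour policy are all assumptions made for illustration.

```python
import random


def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.9):
    # Off-policy target: bootstrap from the greedy action in s_next,
    # regardless of which action the behaviour policy takes next.
    target = r if done else r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])


def step(s, a):
    """Hypothetical chain: action 1 moves right, action 0 moves left; state 3 is terminal."""
    s_next = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    done = s_next == 3
    return s_next, (1.0 if done else 0.0), done


rng = random.Random(0)
Q = [[0.0, 0.0] for _ in range(4)]
for _ in range(2000):
    s, done = 0, False
    while not done:
        a = rng.randrange(2)  # purely random behaviour policy
        s_next, r, done = step(s, a)
        q_learning_update(Q, s, a, r, s_next, done)
        s = s_next
```

Even though the agent never acts greedily, the learned values approach the optimal ones (for example `Q[2][1]` near 1.0 and `Q[0][1]` near 0.9 squared), which is what "off-policy" means in practice.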
9.2 Modern Algorithms
- Explain PPO (Proximal Policy Optimisation). What is the clipped surrogate objective and why is clipping necessary?
- Explain GAE (Generalised Advantage Estimation). What does the λ parameter control?
- Explain Soft Actor-Critic (SAC). What is the maximum entropy objective and what problem does it solve?
- What is GRPO? How is it used in reinforcement learning for reasoning in LLMs?
- Explain actor-critic methods. How do they combine policy gradient and value function learning?
- What is importance sampling? How is it used in off-policy policy gradient methods?
- Explain model-based RL. What are the tradeoffs compared to model-free approaches?
- Explain MuZero. How does it perform planning without access to the true environment dynamics?
- What is curriculum learning in RL? How is it used to train agents on complex tasks?
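The GAE question above has a compact recursive form: with the one-step TD error delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), the advantage is A_t = delta_t + gamma * lambda * A_{t+1}. The sketch below computes it for a single trajectory; the function name and the toy rewards are illustrative.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation for one trajectory.

    values has one more entry than rewards: the final entry bootstraps the
    value of the last state (use 0.0 if the episode terminated).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]
td_only = compute_gae(rewards, values, gamma=1.0, lam=0.0)      # [1.0, 1.0, 0.5]
monte_carlo = compute_gae(rewards, values, gamma=1.0, lam=1.0)  # [2.5, 1.5, 0.5]
```

The two extremes show what lambda controls: at 0 the estimate is the low-variance, higher-bias TD error, and at 1 it becomes the Monte Carlo return minus the value baseline, which has lower bias but higher variance.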
10. ML System Design Interview
ML system design rounds are more common in ML Engineer and Applied Scientist interviews than in pure Research Scientist tracks, but they appear across all roles at scale-focused labs. The format is open-ended: a problem is stated, and the candidate is expected to drive the design. Communication, reasoning about tradeoffs, and awareness of real-world constraints matter as much as technical depth.
10.1 Common ML System Design Questions
- Design a system to train a large language model across hundreds of GPUs. What parallelism strategies would you use?
- Design a recommendation system for a platform with 100M users and 10M items. How do you handle cold start?
- Design an online learning system that updates a model in real time from user feedback. What are the key failure modes?
- Design a retrieval-augmented generation (RAG) system for a legal document assistant. How do you handle retrieval latency?
- Design a model serving infrastructure for a 70B parameter LLM with a P95 latency target of 500ms. What optimisations would you apply?
- Design a distributed training pipeline that minimises communication overhead between nodes. Compare DDP, FSDP, and tensor parallelism.
- You are training a diffusion model and GPU memory is the bottleneck. What techniques would you apply?
- Design a data pipeline for pretraining a 7B parameter language model. How do you handle data quality, deduplication, and filtering at scale?
10.2 Key Distributed Training Concepts
| Strategy | What Is Parallelised | Memory Savings | Communication Overhead | Best Used When |
|---|---|---|---|---|
| Data Parallel (DDP) | Data batches across GPUs | None (model replicated) | Gradient AllReduce each step | Model fits on one GPU |
| Fully Sharded (FSDP) | Parameters, gradients, optimiser states | High - linear in world size | AllGather + ReduceScatter | Model too large for one GPU |
| Tensor Parallel | Individual weight matrices | High - linear in TP degree | AllReduce within each layer | Very large single layers (attention, FFN) |
| Pipeline Parallel | Model layers across stages | Moderate | P2P activation passing | Very deep models with many layers |
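The memory column in the table above can be turned into a rough estimate. The sketch assumes mixed-precision Adam with about 16 bytes of model state per parameter (16-bit weights and gradients plus fp32 master weights and two Adam moments), ignores activations and buffers entirely, and treats sharding as perfectly even. Each of these is a simplifying assumption.

```python
def training_state_gib(num_params, num_gpus=1, sharded=False):
    """Approximate per-GPU memory for weights, gradients and optimiser state."""
    # Assumed accounting per parameter:
    # 2 bytes (16-bit weights) + 2 bytes (16-bit gradients)
    # + 12 bytes (fp32 master weights, Adam first and second moments)
    bytes_per_param = 2 + 2 + 12
    total = num_params * bytes_per_param
    per_gpu = total / num_gpus if sharded else total  # DDP replicates everything
    return per_gpu / 2 ** 30


# Hypothetical 7B-parameter model on 8 GPUs
ddp_gib = training_state_gib(7e9, num_gpus=8, sharded=False)
fsdp_gib = training_state_gib(7e9, num_gpus=8, sharded=True)
```

Under these assumptions the replicated (DDP) footprint is a little over 100 GiB per GPU, which exceeds an 80 GB accelerator before any activations are stored, while full sharding across 8 GPUs brings it to roughly 13 GiB. This is the quantitative version of "model too large for one GPU".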
11. Behavioural and Research Discussion Round
Two types of round appear here: classic behavioural (STAR-format conflict, feedback, and collaboration questions) and research discussion (interests, field vision, and opinions on recent work). Both are underestimated compared to technical rounds - prepare for each with the same rigour.
11.1 Behavioural Questions (STAR Format)
- Tell me about a time you disagreed with a colleague or manager on a technical decision. How did you handle it?
- Describe a situation where you had to deliver critical feedback to a peer. How did you approach it?
- Tell me about a project that failed. What happened and what did you take away?
- Describe a time you had to learn a new technical area quickly. How did you approach it?
- Tell me about a time you had competing priorities. How did you decide what to focus on?
- Describe your most significant collaborative research contribution. What was your specific role?
11.2 Research Discussion Questions
- What open problem in ML do you find most interesting right now, and why?
- What is your opinion on the current direction of scaling language models?
- Walk me through your most significant research contribution. What made it non-obvious?
- Pick a recent paper you found compelling. What would you do next to extend it?
- How do you approach reproducing results from a paper that lacks sufficient implementation detail?
- You have a negative result in your experiments. How do you decide whether to change the hypothesis or the method?
12. Understanding Compensation: RSUs vs Stock Options
Equity compensation is frequently misunderstood at the point of offer evaluation. The distinction between RSUs and stock options matters considerably - particularly under UK law and taxation - and candidates who do not understand the mechanics are consistently at a disadvantage during negotiation.
| | RSUs | Stock Options |
|---|---|---|
| What you get | Actual shares, on vesting | The right to buy shares at a fixed price |
| Typical at | Large, established companies | Early- to mid-stage AI startups |
| Value if stock falls | Still worth something | Can be worth nothing |
| Tax trigger | At vesting (ordinary income) | At exercise, then again at sale |
| Cash needed upfront | None | Often yes - to exercise and pay tax |
RSUs (Restricted Stock Units) - typical at large tech companies such as Google DeepMind, Meta, and Microsoft - grant actual shares in the company on a vesting schedule. When they vest, shares can be sold or held. A portion - often around half - is automatically sold to cover income tax, because vested RSUs are treated as ordinary income.
Stock options - typical at AI startups - grant the right to purchase shares at a fixed strike price X, regardless of the market price Y at exercise. If Y > X, exercising and selling generates a profit. If Y < X, the options are worthless. In the UK, options granted under a qualifying Enterprise Management Incentive (EMI) scheme carry meaningfully better tax treatment than unapproved options - confirm with the offering company which type is being granted before modelling the numbers below.
To make the mechanics concrete, consider a UK stock option scenario:
| Item | Value |
|---|---|
| Strike price (X) | £10 per share |
| Current valuation price (Y) | £50 per share |
| Options granted | 10,000 |
| Cost to exercise | £100,000 |
| Paper gain subject to income tax | £400,000 |
| Income tax owed (45% rate) | £180,000 |
| Total cash outlay before seeing a penny | £280,000 |
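The figures in the table above can be reproduced with a few lines of arithmetic, which makes it easy to substitute the numbers from a real offer. The sketch assumes the simplest case shown in the table, where the full spread between strike and current price is taxed as income at exercise; it does not model EMI treatment, National Insurance, or capital gains at sale, and the function name is illustrative.

```python
def option_exercise_outlay(strike, market_price, num_options, tax_rate):
    """Cash needed to exercise when the spread is taxed as income at exercise (GBP)."""
    exercise_cost = strike * num_options
    paper_gain = max(market_price - strike, 0) * num_options  # underwater options have no gain
    tax_owed = paper_gain * tax_rate
    return exercise_cost, paper_gain, tax_owed, exercise_cost + tax_owed


# Inputs taken from the table above
cost, gain, tax, outlay = option_exercise_outlay(
    strike=10, market_price=50, num_options=10_000, tax_rate=0.45
)
```

Rerunning this with a lower `market_price` shows how quickly the outlay can exceed any realistic sale value if the valuation falls between exercise and a liquidity event.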
When a recruiter quotes a total compensation figure that includes startup equity, apply a significant mental discount to that number. Cashless exercise options and liquidity events can reduce this burden, but each new funding round dilutes existing shares, and liquidity events typically value stock below the official company valuation.
13. How to Negotiate a Machine Learning Job Offer
Companies consistently have more room to improve initial offers than candidates typically expect. Negotiating is always worth attempting - the downside is essentially zero, and material increases of 10–30% on total compensation are common when handled correctly.
The "blind auction" approach - not revealing competing offers to preserve leverage - frequently does not work in practice at top AI labs. Many companies explicitly request proof of competing offers before authorising increases, and some will verify the details. Sharing offers openly, when asked, is usually the more effective strategy.
"Recruiters are skilled at reading genuine preferences from small signals - how often a candidate mentions a company, the tone they use when discussing it. If a recruiter knows their company is already the top choice, negotiating leverage is significantly reduced."
13.1 Practical Negotiation Notes
- Offer deadlines typically run one to two weeks, occasionally longer, and are not always flexible - though some companies will extend for the right candidate.
- Companies maintain historical data on candidate decisions. A competing offer from a peer lab (OpenAI, Anthropic, Google DeepMind, Meta AI) carries genuine weight because the data shows it is a real alternative.
- An offer from a company that candidates rarely choose over the target lab does not create meaningful leverage, regardless of the headline number.
- Negotiate base salary, signing bonus, and RSU grant separately - each has different flexibility.
- Tell every company about other active processes. This is standard practice, keeps timelines clear, and prompts companies to move faster when they are interested.
14. How to Choose Between Machine Learning Job Offers
14.1 Factors Worth Weighting Heavily
- Research agenda alignment: Can you see yourself excited by this team's direction in two years, not just today?
- Team quality: Speak with at least two team members who are not the hiring manager before deciding.
- Compute access: For research roles, GPU and TPU access is a direct constraint on what research is possible.
- Equity mechanics: Understand whether equity is RSUs or options before signing - see Compensation for the mechanics.
- Publication norms: Some labs publish aggressively; others are more closed. Know which environment suits you.
- Location and life: Career decisions that work against life outside of work compound negatively over time.
15. Preparing Mentally for the ML Interview Process
The machine learning interview process at top AI labs is genuinely demanding, and its cumulative effect on mental resilience should not be underestimated - even for candidates who handle pressure well in other contexts.
15.1 Practical Strategies for Sustaining Performance
- Sleep: Sleep disruption the night before high-stakes interviews is common and compounds significantly across multiple same-week interviews. Prioritise a consistent sleep routine in the weeks before the process begins - not under pressure on the eve of an interview.
- Physical preparation: Exercise before interviews - particularly cardiovascular activity - reduces nervous energy and resets cognitive state. Keep intensity easy and eat sufficient carbohydrates beforehand.
- Social connection: Isolation during an intense interview period is counterproductive. Structure at least some social time on evenings without early-morning interviews.
- Pre-interview ritual: A consistent pre-interview routine provides a reliable anchor against unpredictable anxiety. The specific content matters less than the consistency.
15.2 Recommended Reading on Mindset and Performance
| Book | Why It Helps |
|---|---|
| The Now Habit - Neil Fiore | Practical approach to procrastination and performance anxiety rooted in psychological research. Particularly useful for candidates who procrastinate on interview preparation. |
| Mindset - Carol Dweck | The foundational text on growth versus fixed mindset. Directly applicable to navigating rejection during an extended interview process. |
| The Gifts of Imperfection - Brené Brown | Useful for the underlying work of decoupling self-worth from professional outcomes - which becomes important when facing multiple simultaneous rejections. |
16. Full Machine Learning Interview Topic Reference
The topics below represent a comprehensive review list compiled across a complete interview process at top AI labs. Nearly every topic on this list appeared in at least one interview in some form.
| Category | Topics |
|---|---|
| Reinforcement Learning | Q-Learning / TD Learning, Bellman Equations, PPO, GRPO, GAE, Variance Reduction, DPO, Policy Gradient Theorem, On-Policy vs Off-Policy, Exploration vs Exploitation, Credit Assignment, MuZero / World Models, AlphaGo / AlphaZero, SAC, Model-Based vs Model-Free, MDP, Monte Carlo vs TD, Actor-Critic, SARSA, Importance Sampling, Curriculum Learning |
| Large Language Models | Flash Attention, LoRA, TransformerXL, Griffin, Perceiver, Scaling Laws & Chinchilla, Mixture of Experts, RoPE, Sinusoidal Embeddings, Relative Positional Embeddings, LLM vs RNN vs S4 vs Mamba, Tokenisation (BPE), Pretraining & SFT, RLHF, Decoding Techniques, Causal Attention, Cross Attention, KV Cache, Speculative Decoding, Quantisation |
| Generative Modelling | GANs, VAEs and ELBO, Score Function / Score Matching, Diffusion Forward Process, Diffusion Reverse Process (DDIM / DDPM), Diffusion SDE, Flow Matching ODE, Classifier-Free Guidance |
| Distributed Systems | Tensor Parallelism, FSDP, DDP, Pipeline Parallelism, AllReduce / AllGather, Mixed Precision (BF16), Gradient Checkpointing, Gradient Accumulation & Clipping, Numerical Precision, JIT Compiling, JAX / PyTorch / TensorFlow |
| General ML | Curse of Dimensionality, S4 / CNNs / RNNs / LSTMs, Autoencoders, Gumbel-Softmax, MLE vs MAP, Newton's Method, Linear Regression, Activation & Loss Functions, No Free Lunch Theorem, BatchNorm / LayerNorm / RMSNorm, Adam / AdamW / Adagrad, Bias-Variance Tradeoff, Backpropagation, Regularisation (L1, L2, Dropout), Clustering, KNN, SVMs, Boosting, Decision Trees, Bayes Theorem, Precision / Recall / F1 / AUC-ROC, KL / JS Divergence, Xavier / He Init, Overfitting / Underfitting, Transfer Learning, Few-Shot / Zero-Shot |
| Linear Algebra | PSD Matrices, Jacobian & Hessian, Eigenvectors / Eigenvalues, Matrix Inverse, Dot Product, Null Space / Image Space, Orthogonality, Linear Independence, Singular Matrices, Rank / Span, Determinant, SVD |
17. What to Do Differently: Lessons from a Completed ML Interview Process
Even in a successful process - one that concluded with multiple competing offers from frontier AI labs and AI-focused companies - certain approaches would have produced better outcomes. These are the highest-value changes in retrospect.
- Track everything in a spreadsheet from day one. Missing applications to genuinely interesting companies because of tracking failures is avoidable and frustrating. Use Notion or a simple Google Sheet with columns for company, current stage, deadlines, key contacts, and notes from each round.
- Prepare emotionally before the process starts, not during it. The interview process has a way of feeling like a verdict on years of research. Developing a healthier relationship with failure and professional setbacks in advance is significantly more effective than attempting this work mid-process.
- Be proactive about companies that go silent. If an application has not resulted in a reply and the company is genuinely interesting, a cold email to someone on the team is a more effective response than passive waiting.
- Start ML implementation practice earlier. The gap between knowing how attention works conceptually and being able to implement it cleanly under time pressure is larger than most researchers expect. Four weeks of implementation practice is the minimum; eight weeks is better.
- Sequence interviews by preference. Start with lower-priority companies to build calibration and confidence before processes that actually matter reach their final stages.
- Manage timing deliberately. The goal is to have multiple offers arrive in the same window. If Company A offers an on-demand test, hold it until Company B's first interview is scheduled. Process timing is difficult to control precisely, but deliberate management is possible and meaningfully changes negotiation leverage.
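As a concrete target for the implementation practice mentioned above, here is a minimal sketch of single-head scaled dot-product attention with a causal mask. It is written in plain Python, with no NumPy or PyTorch, so that every step is explicit; in an interview the same logic would normally be expressed with batched tensor operations. The function name `causal_attention` and the list-of-lists layout are assumptions made for illustration.

```python
import math


def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K: lists of n vectors of length d_k. V: list of n vectors of length d_v.
    Returns a list of n output vectors of length d_v.
    """
    n, d_k, d_v = len(Q), len(Q[0]), len(V[0])
    outputs = []
    for i in range(n):
        # Causal mask: position i may only attend to positions 0..i.
        # Dividing by sqrt(d_k) keeps the logits at roughly unit scale,
        # so the softmax does not saturate as d_k grows.
        scores = [
            sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        # Numerically stable softmax: subtract the max before exponentiating.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Output is the attention-weighted sum of the visible value vectors.
        outputs.append(
            [sum(w * V[j][c] for j, w in enumerate(weights)) for c in range(d_v)]
        )
    return outputs


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = causal_attention(Q, K, V)
print(out[0])  # [1.0, 2.0]: the first token can only attend to itself
```

A useful self-check when practising is the causality property: changing the keys or values at a later position must leave the outputs at all earlier positions unchanged.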
Recommended Resources for ML Interview Preparation
| Type | Resource | What It Covers |
|---|---|---|
| Book | Designing Machine Learning Systems - Chip Huyen | Applied ML, system design, and ML fundamentals at interview depth. Highlight as you read - it doubles as a flashcard source. |
| Book (Online) | The JAX Scaling Book | Distributed training, parallelism strategies (DDP, FSDP, tensor parallelism, pipeline parallelism), and large-scale ML systems. |
| Book | Reinforcement Learning - Sutton & Barto | Only necessary if new to RL. Skim chapters 1–6 and focus on policy gradient methods for interview coverage. |
| Practice | NeetCode 150 | Structured LeetCode practice covering all core algorithmic patterns. Start here before any lab interview. |
| Practice | DeepML | ML coding practice problems including attention, backpropagation, and training loop exercises. |
| Code Reference | nanoGPT - Karpathy | Clean, minimal transformer implementation in PyTorch. Read and re-implement from scratch as a preparation exercise. |
| Course | OpenAI Spinning Up | Practical deep RL introduction covering policy gradient, PPO, SAC, and TRPO with working implementations. |
Frequently Asked Questions
- Most candidates need 6–12 weeks of structured preparation. Spend the first 2 weeks on algorithms coding (LeetCode Mediums), the next 3–4 weeks on ML implementation and theory flashcards, and the final 2–3 weeks on company-specific mock interviews and ML system design. Candidates with weaker coding foundations should add 2–4 weeks of pure LeetCode practice first. Starting preparation before sending any applications is strongly recommended.
- Transformer architecture and attention mechanisms - scaled dot-product attention, multi-head attention, causal masking, Flash Attention - are tested at virtually every lab. LLM training and alignment (RLHF, SFT, DPO, PPO, GRPO) is the second most consistently covered area. Deep learning fundamentals (backpropagation derivations, optimisers, normalisation schemes) and distributed training (DDP, FSDP, tensor parallelism) round out the most heavily tested themes.
- A PhD is not strictly required, but it is the default expectation at most top labs (DeepMind, Anthropic, OpenAI, Meta AI) for Research Scientist roles. A strong publication record - 3 or more first-author papers at ICLR, NeurIPS, or ICML - can substitute in practice. ML Engineer roles are more accessible without a PhD but still demand strong implementation skills. Senior RS roles almost universally expect a PhD or equivalent research output.
- The ML coding round is consistently identified as the most difficult by candidates who have not practised implementation. Writing a full transformer decoder from scratch - with correct tensor shapes throughout, causal masking, residual connections, and layer normalisation - under 45-minute time pressure is substantially harder than understanding the architecture conceptually. The gap between knowing how attention works and implementing it cleanly under time pressure is real and requires dedicated practice to close.
- Complete at least the Blind 75 plus approximately 75 additional NeetCode problems, for a total of around 150 problems, most of them Mediums. Hard problems appear occasionally but are rare; fluency with every Medium pattern is more valuable than sporadic Hard attempts. The core patterns tested are DFS/BFS, dynamic programming, binary search, two pointers, sliding window, and backtracking. Target 20 minutes or fewer per Medium as your benchmark.
- The most commonly asked transformer questions are: explain scaled dot-product attention and why scaling by √d_k is necessary; what does multi-head attention provide that single-head cannot; explain causal masking for autoregressive language models; compare sinusoidal vs learned vs RoPE positional encodings; explain the Flash Attention tiling algorithm and its O(n) memory guarantee; explain Mixture of Experts and conditional computation; and describe the encoder-only / decoder-only / encoder-decoder architectures with concrete examples.
- RLHF (Reinforcement Learning from Human Feedback) is the three-stage post-training alignment pipeline: (1) supervised fine-tuning on high-quality demonstrations, (2) reward model training on human preference comparison pairs, and (3) RL fine-tuning - typically PPO - against the reward model with a KL divergence penalty to prevent the policy from drifting too far from the SFT baseline. Interviewers expect end-to-end explanations of each stage, awareness of reward hacking failure modes, and the ability to compare PPO-based RLHF with DPO.
- Yes, negotiate. Companies consistently have more room to improve initial offers than candidates expect - particularly on base salary, signing bonus, and RSU grant. The downside risk of negotiating respectfully is essentially zero. Increases of 10–30% on total compensation are achievable when you hold competing offers and communicate genuine interest clearly. Never accept the first number without at least asking whether there is flexibility.
- Research Scientists at top labs are expected to independently identify research problems, design experiments, and contribute publishable findings. The interview focuses heavily on research depth, publication record, and intellectual curiosity. ML Engineers focus on building, scaling, and productionising ML systems. The interview is more implementation and systems-heavy, with greater emphasis on ML system design, distributed training, and software engineering. Many labs hire both tracks and the distinction narrows at senior levels.
- ML system design rounds are open-ended: a problem is stated (e.g. "design a recommendation system for 100M users") and the candidate drives the discussion. The interviewer evaluates problem decomposition, awareness of real-world constraints, ability to reason about tradeoffs between approaches, and communication quality. Common questions include: designing training infrastructure for large models, recommendation and retrieval systems, RAG pipelines, online learning systems, and model serving for large LLMs with latency SLAs.
- Flash Attention is an IO-aware implementation of exact scaled dot-product attention. It tiles the computation so that blocks of Q, K, and V are processed in on-chip GPU SRAM, and it never materialises the full N×N attention matrix in HBM (the GPU's main memory). Standard attention needs O(N²) memory for that matrix; Flash Attention needs only O(N) extra memory because it computes the softmax incrementally, keeping a running maximum and normaliser per row, and fuses it with the output accumulation in passes over tiles. The result is typically a 2–4× speedup and memory that grows linearly rather than quadratically with sequence length. It is not an approximation: outputs match standard attention up to floating-point rounding. Expect to be asked about it, since it appears on almost every ML coding and theory question list at research labs.
- LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that freezes the pretrained weight matrix W and adds a low-rank update ΔW = BA, where B ∈ R^(d×r) and A ∈ R^(r×k) with rank r ≪ min(d, k). During fine-tuning, only A and B are trained, reducing the number of trainable parameters from d×k to r×(d+k). For a square projection with d = k = 4096 and r = 8, that is 16,777,216 versus 65,536 trainable parameters, a 256× reduction. At inference, the update can be merged back into W (W' = W + BA) with no added latency.
- PPO (Proximal Policy Optimisation) is an online RL algorithm that trains the language model policy against a separately trained reward model, using a clipped surrogate objective and a KL penalty to prevent instability. DPO (Direct Preference Optimisation) re-parameterises the RLHF objective to eliminate the need for an explicit reward model and RL loop: it directly optimises the policy on preference pairs using a binary cross-entropy loss. DPO is simpler to implement and more stable, but it is offline and may be less effective when the preference data distribution differs from the model's current distribution.
- The four primary distributed training paradigms are: Distributed Data Parallel (DDP) - each GPU holds a full model copy and gradients are AllReduced; Fully Sharded Data Parallel (FSDP) - parameters, gradients, and optimiser states are sharded across GPUs for large-model training; Tensor Parallelism - individual weight matrices are split across GPUs (requiring AllReduce within each layer); and Pipeline Parallelism - model layers are split across stages, with micro-batching used to shrink the pipeline bubble (the idle time while stages wait for work). Know when to use each and the communication primitives each relies on (AllReduce, AllGather, ReduceScatter, P2P). See the ML System Design section for a side-by-side comparison table.
- A general software engineering (SE) interview at most tech companies is almost entirely algorithms coding plus a system design round. An ML research scientist interview adds three rounds that SE loops do not have: ML coding (implementing model components from scratch, not calling library functions), ML theory (derivations, not definitions), and a research discussion round where a senior researcher probes the depth and originality of past work. The algorithms coding bar is usually similar or slightly lower than at FAANG-style SE interviews, but the combined technical surface area is larger. ML Engineer interviews sit in between - heavier on ML system design and production concerns, lighter on open-ended research discussion.
- Write down what went wrong while it is fresh - the specific question, where the explanation broke down, and what the correct approach was - then close the laptop and stop reviewing it for the rest of the day. Rehashing a single round rarely improves the next one and reliably damages confidence going into it. A single weak round is not a reliable predictor of the final decision: interview panels typically weight the round average and look for consistent strengths, not a single low score in isolation. If a debrief or feedback call is offered, take it - specific, dated feedback is one of the few high-signal artifacts available during the process.
- Take-home assignments are less common at the largest labs (DeepMind, OpenAI, Anthropic, Meta AI) and more common at mid-size AI companies and startups, where they often replace a live ML coding round entirely. Typical formats: a small modelling task with a provided dataset and a write-up, or a focused implementation exercise (e.g., implement and train a small transformer on a toy dataset) with a time budget of a few hours to a few days. Treat the write-up with the same care as the code - reviewers read it as a proxy for how the candidate would communicate results to a research team. Clarify the time budget and evaluation criteria before starting if they are not specified.
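The LoRA parameter arithmetic quoted in the answers above, and the claim that the update can be merged into W at inference, can both be checked with a small sketch. It uses plain Python lists rather than a tensor library, and it omits the scaling factor that practical LoRA implementations usually apply to the update, since the formula above does not include one. The helper names `lora_param_counts`, `matmul`, and `add` are illustrative.

```python
def lora_param_counts(d, k, r):
    """Trainable parameters for full fine-tuning vs. a rank-r LoRA update."""
    full = d * k          # every entry of W is trainable
    lora = r * (d + k)    # B is d x r, A is r x k
    return full, lora, full / lora


def matmul(X, Y):
    """Matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


print(lora_param_counts(4096, 4096, 8))  # (16777216, 65536, 256.0)

# Tiny merge check with d = k = 2 and r = 1 (values chosen arbitrarily).
W = [[1.0, 2.0], [3.0, 4.0]]   # frozen pretrained weights, d x k
B = [[1.0], [0.0]]             # trainable, d x r
A = [[0.5, 0.25]]              # trainable, r x k
x = [[1.0], [1.0]]             # input column vector, k x 1

unmerged = add(matmul(W, x), matmul(B, matmul(A, x)))   # W x + B (A x)
merged = matmul(add(W, matmul(B, A)), x)                # (W + B A) x
print(unmerged == merged)  # True
```

The unmerged form is what runs during fine-tuning, when A and B receive gradients and W does not; the merged form is why serving a LoRA-adapted model costs the same as serving the base model.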