Machine Learning Interview Guide: Complete Preparation for Research Scientists and ML Engineers

Abstract

Landing a Research Scientist or Machine Learning Engineer role at a top AI lab is one of the most demanding hiring processes in technology. This guide consolidates field-tested preparation strategies covering the complete interview lifecycle: securing interviews through publications and targeted outreach; navigating every technical round - algorithms coding, ML coding, deep learning theory, transformer and LLM questions, reinforcement learning, ML system design; managing process logistics and timing; negotiating compensation including the mechanics of RSUs versus stock options; and making the final offer decision. A comprehensive topic reference of 90+ ML concepts, a preparation checklist, and an expanded FAQ close the guide.

1. Introduction: What This Machine Learning Interview Guide Covers

There is almost no practical, candid information available on what it actually takes to secure a Research Scientist or Machine Learning Engineer position at a top AI lab. This guide aims to fill that gap - written from firsthand experience running a full interview process across multiple frontier AI labs and AI-focused companies, combined with structured research into primary-source papers and publicly available preparation material.

A few honest context notes upfront. Some processes conclude after an offer has already been signed elsewhere - that is normal and does not mean those companies were uninterested. A rejection on one track (e.g. Research Scientist) sometimes comes with a redirect to a different track (e.g. an Engineering or Applied role) at the same company. And applications to several strong-fit companies may not progress past the initial screen at all, even with a strong profile - résumé and applicant-tracking-system screening is noisy at high-volume employers, and research from Harvard Business School and Accenture on hiring funnels documents how qualified candidates are routinely filtered out for reasons that have little to do with their actual ability.

This guide covers machine learning interview preparation end-to-end: how to get the first call, how to prepare for every technical round, which topics to study (deep learning, transformers, LLMs, reinforcement learning, distributed training, ML system design), how to handle logistics and negotiation, and how to make the final decision. Specific question banks for each interview type are included throughout.

Key Insight: The process is inherently stochastic. Offers have come from companies where interviews felt rocky, and rejections from companies where interviews felt excellent. Calibrate your emotional response accordingly - any individual outcome is only weakly predictive of ability.

If you need to first build the underlying knowledge before tackling interview prep, our Complete ML Learning Roadmap covers each subject area in depth across a structured 12–18 month path from beginner to interview-ready.

2. Machine Learning Interview Preparation Checklist

Use this checklist to track readiness across every dimension of the ML interview process. Candidates who arrive at first-round interviews without completing the coding and implementation items are consistently underprepared, regardless of research depth.

2.1 Application and Pipeline Readiness

  • CV updated with accurate, specific descriptions of research contributions
  • Target companies identified and prioritised (at least 8–12 applications)
  • Cold email templates drafted for key hiring managers
  • LinkedIn and Google Scholar profiles current and consistent
  • Spreadsheet set up to track company, stage, deadlines, and contacts
  • At least one warm referral activated per target company (where possible)

2.2 Algorithms Coding Readiness

  • Blind 75 completed with solutions understood (not just memorised)
  • NeetCode 150 worked through, prioritising its Medium problems (or an equivalent set of Medium problems beyond Blind 75)
  • Core patterns fluent: DFS, BFS, graphs, dynamic programming, backtracking, binary search, two pointers, sliding window
  • Able to solve a LeetCode Medium in under 20 minutes consistently
  • Optimal time complexity known for each core pattern (not just a working solution)

2.3 ML Coding and Implementation Readiness

  • Scaled dot-product attention implemented from scratch (PyTorch or JAX)
  • Multi-head attention implemented from scratch with correct head split/merge
  • Full transformer block implemented (encoder, decoder, or decoder-only)
  • Flash Attention algorithm understood at implementation level
  • Attention backward pass derived from first principles
  • MLP forward pass and manual backpropagation implemented
  • Training loop with gradient clipping written without reference
  • Common debugging scenarios practised: gradient explosion, NaN losses, shape mismatches
  • LoRA implementation understood (low-rank decomposition mechanics)

2.4 ML Theory and Knowledge Readiness

  • Flashcards written (not downloaded) for all 90+ topics in the reference list
  • RLHF pipeline understood end-to-end (reward model, PPO, GRPO)
  • Diffusion model forward and reverse processes explained without notes
  • Scaling laws, MoE, and positional encoding variants (RoPE, sinusoidal) reviewed
  • Distributed training paradigms covered: DDP, FSDP, tensor parallelism, pipeline parallelism
  • At least one ML system design question practised end-to-end
  • Research papers on CV read and opinions prepared (at least 3 recent papers)

2.5 Logistics and Process Readiness

  • Interview schedule sequenced: lower-priority companies first
  • All active processes disclosed to each company (standard practice)
  • On-demand coding tests held until most target companies have scheduled
  • Competing offers understood well enough to negotiate with specifics
  • RSU vs stock option mechanics understood before any offer arrives
  • Pre-interview routine established (sleep, exercise, cognitive reset)

3. How to Get Machine Learning Interviews at Top AI Labs

Getting past the application stage is its own challenge. The main levers are well-known: more papers, trendier research topics, and stronger internship experience. A rough benchmark that holds across top labs: 3 or more first-author papers published at ICLR, NeurIPS, or ICML, plus at least one internship or prior industry role, is typically the threshold for consistent callbacks. Work in high-demand areas - large language models, RLHF, diffusion models, and multimodal AI - generates substantially more traction than equally strong work in less fashionable areas.

The pivot that matters most: Once interviews are secured, more papers provide no additional advantage - interviewers frequently do not read a candidate's CV in detail. The preparation focus should shift entirely to interview skills, starting immediately rather than after the next paper is submitted. See Algorithms Coding and ML Coding & Implementation for where that time is best spent first.

3.1 Application Channels That Actually Work

LinkedIn and X (Twitter): Many labs - especially for internships - advertise roles that require filling out a linked Google Form to be considered. Simply clicking "Apply" on LinkedIn Jobs itself is often not sufficient. Following researchers and hiring managers at target companies is the most reliable way to catch these posts before they fill.

Referrals: Helpful, but not necessary. At competitive labs, referred and unreferred candidates both receive interview invitations - referrals can accelerate timelines and increase visibility, but their absence should not prevent an application. If a connection exists at a target company and no progress has been made after applying directly, asking for a referral is worth attempting.

Cold emails: Emailing a hiring manager or team member directly is frequently appreciated. The email should not simply restate a CV - it should explain specifically why the candidate is a strong fit for that team and what genuinely excites them about the work. A well-targeted cold email to a frontier AI lab can and does get a direct, personal reply from a hiring manager often enough to be worth the effort. Even in cases where the email goes unanswered, interviews sometimes still proceed - but the outreach creates a useful secondary signal of genuine interest.

Cover letters: Rarely required, but worth doing properly when they are. Cover letters generated wholesale by an AI model are easy to identify and make a poor impression. The better approach: write the letter authentically, then use a tool to polish the prose. Personality and genuine excitement are what differentiate these letters.

3.2 Research Scientist at a Startup vs a Large AI Lab

Factor | Large AI Lab (DeepMind, Anthropic, Meta AI, OpenAI) | AI Startup (seed to Series C)
Discoverability | Roles are public and well-known; competition is extremely high | Harder to find - word of mouth is the primary channel; lower competition
Research Work | Consistently high-quality; access to large compute resources | Can be world-class or engineering-focused; agenda may shift frequently
Growth & Visibility | One of many researchers; slower path to ownership | High visibility; faster growth and influence over research direction
Compensation | Salary plus RSUs with relatively clear liquidity path | Salary plus stock options - upside possible but should be heavily discounted
CV Signal | Immediately recognised globally | Requires explanation; less portable unless company achieves prominence
Compute Access | Large-scale GPU/TPU clusters for pretraining and experiments | Variable - may be constrained; cloud credits common at early stage

4. Machine Learning Interview Structure: What to Expect at Each Stage

Most top AI labs and ML-focused tech companies follow a broadly similar interview structure, though the weight and difficulty given to each stage varies considerably. Expect 3–8 technical interviews depending on the lab. The rounds below represent the full spectrum - not every company uses all of them.

# | Round | Duration | What Is Tested
01 | Recruiter Screen | 30–45 min | Background, motivation, role fit, basic CV probing
02 | Algorithms Coding | 45–60 min | LeetCode-style Mediums: DFS/BFS, DP, binary search, two pointers
03 | ML Coding & Debugging | 45–60 min | Implementing ML components from scratch; debugging existing code
04 | ML Theory & Knowledge | 45–60 min | Deep learning, transformers, LLMs, RL, distributed training concepts
05 | Behavioural / Research Discussion | 30–60 min | STAR stories, research interests, opinions on frontier work
06 | ML System Design | 45–60 min | Open-ended design of ML pipelines, infrastructure, and serving systems

Underestimated round: Behavioural and research discussion rounds feel casual compared to technical interviews. Underestimating them is a common and avoidable mistake. Prepare concrete STAR stories and genuine research perspectives in advance. Interviewers at top labs are often senior researchers - shallow opinions on frontier topics are immediately apparent.

[Figure: the seven-stage machine learning interview pipeline. 01 Recruiter Screen (30–45 min) → 02 Algorithms Coding (45–60 min) → 03 ML Coding & Debugging (45–60 min) → 04 ML Theory & Knowledge (45–60 min) → 05 Behavioural & Research (30–60 min) → 06 ML System Design (45–60 min) → 07 Offer & Negotiation.]
Figure 1. The end-to-end ML interview pipeline. Each box links to the matching section of this guide.

5. Algorithms Coding Interview

The algorithms coding round tests LeetCode-style problem solving at Medium difficulty. This round is separate from ML coding - it evaluates pure data structures and algorithms fluency and is a prerequisite gate at most top labs.

5.1 Core Patterns to Master

  • Graph Traversal (DFS / BFS): Connected components, shortest paths, cycle detection
  • Dynamic Programming: 1D and 2D DP, memoisation vs tabulation, common templates
  • Binary Search: Search on answer, rotated arrays, finding boundaries
  • Two Pointers: Sorted arrays, palindromes, container problems
  • Sliding Window: Variable and fixed windows, frequency counting
  • Backtracking: Permutations, combinations, N-queens variants
  • Heap / Priority Queue: Top-k problems, merge k sorted lists
  • Trees: Traversals, LCA, diameter, serialisation
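
As one illustration of the graph-traversal pattern listed above, here is a minimal sketch of breadth-first search for the shortest path length in an unweighted graph. The function name, the adjacency-dict representation, and the example graph are assumptions made for this sketch, not taken from any particular question bank.

```python
from collections import deque

def bfs_shortest_path(graph, start, goal):
    """Return the number of edges on a shortest path, or -1 if unreachable.

    `graph` is assumed to be a dict mapping each node to a list of neighbours.
    """
    if start == goal:
        return 0
    visited = {start}
    queue = deque([(start, 0)])          # (node, distance from start)
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt == goal:
                return dist + 1
            if nxt not in visited:       # mark on enqueue to avoid duplicates
                visited.add(nxt)
                queue.append((nxt, dist + 1))
    return -1

# Hypothetical example graph
example_graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
print(bfs_shortest_path(example_graph, "a", "e"))
```

Marking nodes as visited when they are enqueued, rather than when they are dequeued, keeps the queue bounded by the number of nodes; the overall cost is O(V + E).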

5.2 Most Common ML Interview Coding Questions

The questions below are conceptual rather than algorithmic, and they recur across ML interview processes at top labs. Treat each as a starting point for deeper exploration.

Machine Learning Fundamentals Questions

  • Explain the bias-variance tradeoff. How does it relate to model complexity?
  • What is the difference between L1 and L2 regularisation? When would you use each?
  • Explain backpropagation from first principles. Derive the gradient for a two-layer network.
  • What is the vanishing gradient problem, and how do residual connections address it?
  • Explain batch normalisation, layer normalisation, and RMSNorm. What are the differences?
  • What is the difference between MLE and MAP estimation? Give a concrete example.
  • Explain precision, recall, F1, and AUC-ROC. When does each metric matter most?
  • What is KL divergence? How is it used in VAEs?
  • Explain the Adam optimiser. What does it do that SGD with momentum does not?
  • What is the curse of dimensionality? How does it affect nearest-neighbour search?
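
For the metrics question above, it can help to have the definitions at your fingertips. The following is a small sketch computing precision, recall, and F1 from confusion-matrix counts; the function name and the zero-division convention (returning 0.0) are assumptions of this sketch.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # of predicted positives, how many were right
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # of actual positives, how many were found
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives
print(precision_recall_f1(8, 2, 8))
```

Because F1 is a harmonic mean, it is pulled towards the weaker of the two metrics, which is why it is often preferred over accuracy on imbalanced data.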

Research Scientist Interview Questions

  • Walk me through your most significant research contribution. What made it non-obvious?
  • What open problem in machine learning do you find most interesting right now, and why?
  • You have a negative result in your experiments. How do you decide whether to change the hypothesis or the method?
  • Describe a time you disagreed with a collaborator on a research direction. How was it resolved?
  • How do you approach reproducing results from a paper that lacks sufficient implementation detail?
  • What is your opinion on the current direction of scaling language models? Is scaling sufficient for AGI?
  • Pick a recent paper you found compelling. What would you do next to extend it?

5.3 Preparation Strategy

Complete Blind 75 first - these cover the essential patterns. Then work through approximately 75 additional NeetCode Mediums with a focus on Dynamic Programming and Graphs, which appear most frequently in ML lab interviews. Target 20 minutes or fewer per Medium problem. If stuck beyond 15 minutes, look up the solution, understand the pattern, flag it for review, and move on. Breadth across patterns matters more than deep mastery of any single type.

6. ML Coding and Implementation Interview

The ML coding round tests two distinct skills: implementing ML components from scratch with correct logic and tensor shapes, and debugging existing ML code efficiently. Both require deep familiarity with PyTorch or JAX at the level of understanding what every line does - not just using high-level APIs.

6.1 Implementation Questions

  • Implement scaled dot-product attention from scratch. Include the masking logic for causal attention.
  • Implement multi-head attention. Correctly handle the split and merge of heads.
  • Implement a full transformer decoder block (self-attention + FFN + residuals + layer norm).
  • Implement Flash Attention at an algorithmic level. Explain why it is memory-efficient.
  • Implement the backward pass for softmax from first principles.
  • Implement a simple training loop with gradient clipping in PyTorch. Include a learning rate scheduler.
  • Implement LoRA (Low-Rank Adaptation). Explain how it reduces trainable parameters.
  • Implement k-means clustering from scratch. Discuss convergence and initialisation.
  • Implement dropout correctly during training and inference.
  • Implement sinusoidal and RoPE positional encodings.
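
The softmax backward pass from the list above can be sketched in plain Python. Given s = softmax(z) and an upstream gradient g = dL/ds, the Jacobian-vector product simplifies to dL/dz_i = s_i (g_i - Σ_j g_j s_j). The function names are illustrative, and a real answer would normally be vectorised in PyTorch or JAX.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_backward(s, grad_out):
    """Gradient w.r.t. the logits, given softmax outputs s and upstream gradient dL/ds."""
    dot = sum(g * p for g, p in zip(grad_out, s))      # sum_j g_j * s_j
    return [p * (g - dot) for p, g in zip(s, grad_out)]

# Hypothetical inputs
z = [0.2, -1.0, 0.5]
upstream = [1.0, 2.0, -0.5]
print(softmax_backward(softmax(z), upstream))
```

A useful sanity check in an interview is that the resulting gradient sums to zero, since adding a constant to every logit leaves the softmax output unchanged.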

6.2 Reference Implementations

# Scaled dot-product attention - from scratch
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Args:
        Q, K, V: (batch, heads, seq_len, d_k)
        mask:    optional causal or padding mask, broadcastable to
                 (batch, heads, seq_len, seq_len); nonzero/True = attend
    Returns:
        output:  (batch, heads, seq_len, d_k)
        weights: (batch, heads, seq_len, seq_len)
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (B, H, T, T)
    if mask is not None:
        # Positions where mask == 0 get -inf, so softmax gives them zero weight
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights                       # (B, H, T, d_k), (B, H, T, T)

# LoRA: Low-Rank Adaptation of a linear layer
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=4, alpha=1.0):
        super().__init__()
        # Stand-in for the pretrained weight; frozen so only A and B are trained
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        # B starts at zero, so the adapter is a no-op at initialisation
        self.B = nn.Parameter(torch.zeros(out_features, rank))
        self.alpha = alpha
        self.rank  = rank

    def forward(self, x):
        base = x @ self.weight.T                     # frozen path
        lora = x @ self.A.T @ self.B.T               # low-rank update path
        return base + (self.alpha / self.rank) * lora

# Multi-head attention - complete module
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_k     = d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        # Project and reshape to (B, H, T, d_k)
        Q = self.W_q(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn   = F.softmax(scores, dim=-1)
        out    = attn @ V                              # (B, H, T, d_k)
        # Merge heads and project
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.W_o(out)
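
To make the LoRA saving concrete, the shapes used in the LoRALinear reference above (A of shape rank × in_features, B of shape out_features × rank) can be turned into simple arithmetic. This is a sketch only; the 4096 × 4096 layer and rank 8 are hypothetical numbers chosen for illustration.

```python
def lora_param_counts(in_features, out_features, rank):
    """Return (full fine-tuning params, LoRA trainable params) for one linear layer."""
    full = in_features * out_features                   # every entry of W is trainable
    lora = rank * in_features + out_features * rank     # entries of A plus entries of B
    return full, lora

# Hypothetical layer: 4096 x 4096 projection adapted with rank 8
full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, full // lora)
```

For a square d × d layer the ratio is d / (2r), so the saving grows with layer width and shrinks as the rank increases.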

6.3 Debugging Questions

  • This training loop is producing NaN losses after 200 steps. What are the most likely causes and how do you diagnose them?
  • Training loss is decreasing but validation loss is flat. What is happening and what would you change?
  • This attention implementation produces shapes (B, H, T, T) but the output is wrong. Find the bug.
  • Gradient norms are exploding after 50 steps. What would you check first?
  • A PyTorch model trains correctly on CPU but produces wrong outputs on GPU. What could cause this?
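
One frequent source of NaN losses is exponentiating large logits. The sketch below contrasts a naive log-softmax with the max-subtraction (log-sum-exp) form in plain Python. In pure Python the naive version raises OverflowError, whereas in a tensor framework the same computation would typically produce inf and then NaN. The function names are illustrative.

```python
import math

def naive_log_softmax(logits):
    """Direct translation of the formula; overflows for large logits."""
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [math.log(e / total) for e in exps]

def stable_log_softmax(logits):
    """Subtract the max before exponentiating so every exponent is <= 0."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))  # log-sum-exp
    return [v - lse for v in logits]

# Hypothetical logits large enough to overflow exp()
big_logits = [1000.0, 999.0, 0.0]
print(stable_log_softmax(big_logits))
```

When diagnosing a NaN in practice, checking the magnitude of logits and activations just before the failing step is usually a quick first test of this hypothesis.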

6.4 ML Implementation Checklist

  • Full transformer architecture from scratch - reference nanoGPT
  • Causal self-attention, cross-attention, and multi-head attention
  • Flash Attention - algorithm-level understanding and implementation rationale
  • Attention backward pass from first principles
  • MLP forward pass and manual backpropagation
  • LoRA implementation (low-rank decomposition of weight updates)
  • Training loop with gradient clipping in PyTorch or JAX
  • Debugging: gradient explosion, NaN losses, shape mismatches, training loop bugs
  • RLHF pipeline: reward model training, PPO loop, KL penalty
  • Diffusion model forward process (noise schedule) and reverse process (DDPM/DDIM)
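
The gradient-clipping item in the checklist above refers to clipping by global norm. A framework-free sketch of the idea follows; it is similar in spirit to what torch.nn.utils.clip_grad_norm_ does, but operates on a flat list of floats, and the function name and eps default are assumptions of this sketch.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients together so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scale gradients up
    return [g * scale for g in grads], total_norm

# Hypothetical gradient with norm 5, clipped to norm 1
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(clipped, norm_before)
```

Scaling every gradient by the same factor preserves the update direction, which is the usual argument for global-norm clipping over per-element clipping.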

7. Deep Learning Theory Interview Questions

Deep learning questions test both conceptual understanding and the ability to reason from first principles. Interviewers at top AI labs expect derivations, not just definitions.

7.1 Architecture and Training

  • Explain the residual connection in ResNets. Why does adding the identity shortcut help training?
  • Derive the backpropagation update rule for a two-layer neural network with cross-entropy loss.
  • Explain why weight initialisation matters. Describe Xavier and He initialisation and when each is used.
  • What is the difference between batch normalisation and layer normalisation? Why do transformers use layer norm?
  • Explain how dropout acts as a regulariser. What is it approximating at test time?
  • What causes the exploding and vanishing gradient problems? How do gradient clipping and residual connections address them?
  • Explain the difference between RNNs, LSTMs, and GRUs. What limitation of RNNs do LSTMs solve?
  • Explain convolutional neural networks. What is the inductive bias that makes them effective for image data?
  • What is the difference between online, mini-batch, and full-batch gradient descent? What are the tradeoffs?
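
For the dropout question above, the train/inference asymmetry is easiest to see in code. This is a minimal sketch of inverted dropout on a list of activations; the function signature and the rng argument are assumptions of this sketch.

```python
import random

def dropout(values, p, training, rng=random):
    """Inverted dropout: zero each value with probability p and rescale the survivors."""
    if not training or p == 0.0:
        return list(values)                  # inference: identity, no rescaling needed
    keep = 1.0 - p
    # Dividing by `keep` at train time keeps the expected activation unchanged
    return [v / keep if rng.random() < keep else 0.0 for v in values]

# Hypothetical usage with a seeded generator for repeatability
rng = random.Random(0)
train_out = dropout([1.0] * 10000, p=0.5, training=True, rng=rng)
print(sum(train_out) / len(train_out))
```

Because the rescaling happens during training, the inference path is a plain identity, which approximates averaging over the ensemble of thinned networks.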

7.2 Optimisation

  • Derive the Adam update rule. What do the first and second moment estimates represent?
  • What is the learning rate warmup and why is it important for transformer training?
  • Explain cosine annealing with warm restarts. When is it preferred over a fixed learning rate?
  • What is gradient accumulation? When and why would you use it?
  • Explain mixed precision training (BF16 vs FP16 vs FP32). What are the stability differences?
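
The Adam question above is often followed by "write one update step". A scalar sketch with the commonly used default hyperparameters is shown below; the function name and the scalar (rather than tensor) form are simplifications for illustration.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad           # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad    # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for zero initialisation
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Hypothetical first step with a very large and a very small gradient
p_big, _, _ = adam_step(1.0, grad=100.0, m=0.0, v=0.0, t=1)
p_small, _, _ = adam_step(1.0, grad=0.01, m=0.0, v=0.0, t=1)
print(1.0 - p_big, 1.0 - p_small)
```

Both first steps move the parameter by roughly lr, regardless of gradient scale. This per-parameter normalisation is the main thing Adam adds over SGD with momentum.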

7.3 Generative Modelling

  • Explain the VAE objective (ELBO). What is the role of the KL term and the reconstruction term?
  • Explain how GANs work. What is the minimax objective? What training instabilities can arise?
  • Explain diffusion models. What happens in the forward process? What does the model learn to predict in the reverse process?
  • What is DDIM? How does it differ from DDPM in sampling?
  • Explain flow matching. What is the ODE it solves and how does it relate to diffusion?
  • What is classifier-free guidance and why does it improve sample quality?
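
For the diffusion questions above, the standard DDPM formulation can be summarised in a few lines. The notation below (noise schedule β_t, cumulative product ᾱ_t, noise-prediction network ε_θ) follows the usual convention and is an assumption of this sketch rather than something defined elsewhere in this guide.

```latex
\begin{aligned}
q(x_t \mid x_{t-1}) &= \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)
  && \text{(one forward noising step)} \\
\bar{\alpha}_t &= \prod_{s=1}^{t} (1-\beta_s), \qquad
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
  && \text{(closed form for any } t\text{)} \\
L_{\text{simple}} &= \mathbb{E}_{t,\,x_0,\,\epsilon}\!\left[\,\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\,\right]
  && \text{(noise-prediction training objective)}
\end{aligned}
```

The closed form in the second line is what makes training efficient: any timestep can be sampled directly from x_0 without simulating the chain step by step.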

8. Transformer and LLM Interview Questions

Transformer architecture questions are among the most heavily tested in ML interviews at research-focused labs. Interviewers expect candidates to understand transformers at implementation level - not just at the level of calling nn.MultiheadAttention. See our companion Transformers Explained article for textbook-level derivations of each component.

8.1 Architecture Questions

  • Explain the scaled dot-product attention mechanism. Why is the scaling by √d_k necessary?
  • What is multi-head attention? What does each head learn to attend to?
  • Explain the difference between encoder-only, decoder-only, and encoder-decoder transformer architectures. Give examples of each.
  • What is causal (masked) self-attention? Why is it necessary for autoregressive language models?
  • Explain cross-attention. In what architectures is it used?
  • What is the role of the feed-forward sublayer in a transformer block? What is the typical expansion ratio?
  • Explain sinusoidal positional encodings. Why are they used rather than learned embeddings?
  • Explain RoPE (Rotary Position Embedding). What advantage does it have for handling longer context?
  • What is Flash Attention? Explain the tiling algorithm and why it reduces memory usage from O(n²) to O(n).
  • What is Transformer-XL? How does it handle long-range dependencies beyond a fixed context window?
  • What is the Griffin architecture? How does it combine recurrence and attention?

[Figure: decoder-only transformer block, bottom to top: Token Embeddings + Positional Encoding (RoPE) → Masked Multi-Head Self-Attention (causal mask) → Add & LayerNorm → Feed-Forward Network (Linear → GELU → Linear, 4× expand) → Add & LayerNorm → Block Output (to next block / output head). A residual connection skips around each sublayer.]
Figure 2. A single decoder-only transformer block - the unit interviewers most often ask candidates to implement from scratch (see ML Coding & Implementation).
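
The sinusoidal positional encoding asked about in the architecture questions above can be sketched without any framework. The code follows the usual formulation, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos of the same angle; the function name and the list-of-lists output are choices made for this sketch.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    assert d_model % 2 == 0, "this sketch assumes an even model dimension"
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):               # i is the even dimension index 2k
            angle = pos / (10000 ** (i / d_model))   # wavelength grows geometrically with i
            row.append(math.sin(angle))              # even dimension
            row.append(math.cos(angle))              # odd dimension
        table.append(row)
    return table

pe = sinusoidal_positional_encoding(seq_len=4, d_model=4)
print(pe[0])
```

Each sin/cos pair behaves like a clock hand rotating at its own frequency, so the encoding at pos + k is a fixed linear transform (a rotation) of the encoding at pos. RoPE applies a closely related idea, rotating queries and keys directly.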

8.2 LLM Training and Post-Training Alignment

  • Explain RLHF (Reinforcement Learning from Human Feedback) end-to-end. What are the three training stages?
  • What is the reward model in RLHF? How is it trained? What are its failure modes?
  • Explain PPO as used in RLHF fine-tuning. What does the KL penalty term do?
  • What is DPO (Direct Preference Optimisation)? How does it differ from PPO-based RLHF?
  • What is GRPO? How does it differ from PPO in the context of reasoning models?
  • Explain instruction tuning (SFT). What data format is used and what does it teach the model?
  • What is the Chinchilla scaling law? What was its key finding about compute-optimal training?
  • Explain KV cache. Why is it used during inference and what memory tradeoffs does it create?
  • What is speculative decoding? How does it improve inference throughput?
  • What is quantisation (INT8, INT4)? What accuracy tradeoffs does it introduce?
  • Explain context length extension techniques: RoPE/position interpolation, ALiBi, and YaRN.
  • What are hallucinations in LLMs? What are the known mechanisms and mitigations?
  • Explain retrieval-augmented generation (RAG). What problems does it solve and what are its limitations?
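
The memory tradeoff in the KV cache question above is mostly arithmetic. The sketch below estimates cache size for a hypothetical model configuration; the configuration numbers are illustrative assumptions and not the specification of any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Estimate KV cache size: keys and values stored for every layer, head and position."""
    # Leading factor of 2: one tensor for keys and one for values in each layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Hypothetical config: 32 layers, 32 KV heads of dim 128, 4096-token context, 16-bit values
full_mha = kv_cache_bytes(32, 32, 128, 4096, batch_size=1)
# Same model with 8 KV heads shared across query heads (grouped-query attention)
gqa = kv_cache_bytes(32, 8, 128, 4096, batch_size=1)
print(full_mha / 2**30, "GiB vs", gqa / 2**30, "GiB")
```

The cache grows linearly with both sequence length and batch size, which is why it, rather than the weights, often limits serving throughput at long context.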

8.3 Training and Efficiency

  • What are scaling laws for transformer language models? What do they predict about the optimal balance of compute, data, and parameters?
  • Explain Mixture of Experts (MoE). How does conditional computation reduce cost per token?
  • What is LoRA? Derive why it reduces the number of trainable parameters during fine-tuning.
  • Explain tokenisation. What is BPE and why is it used for LLM vocabularies?
  • What are the standard decoding strategies for language models? Compare greedy, beam search, top-k, and nucleus sampling.
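
For the decoding-strategies question above, top-k and nucleus (top-p) sampling differ only in how the candidate set is truncated before sampling. A minimal sketch of both filters over a plain probability list follows; the function names and the dict-based return format are assumptions of this sketch.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise. Returns {token_index: prob}."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return {i: probs[i] / total for i in top}

def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:          # include the token that crosses the threshold
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Hypothetical next-token distribution over four tokens
probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(probs, k=2))
print(nucleus_filter(probs, p=0.9))
```

Greedy decoding is the special case k = 1. Nucleus sampling adapts the candidate set size to how peaked the distribution is, whereas top-k always keeps a fixed number of tokens.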

9. Reinforcement Learning Interview Questions

Reinforcement learning questions are heavily tested at labs whose work intersects with alignment, reasoning, robotics, and game-playing AI. Candidates with RL research backgrounds should expect depth; those from other areas should ensure solid coverage of fundamentals and the RLHF pipeline.

9.1 Foundations

  • Define a Markov Decision Process (MDP). What are the components (S, A, R, P, γ)?
  • What is the Bellman equation? Derive the Bellman optimality equation for Q-values.
  • Explain the difference between on-policy and off-policy learning. Give an example of each.
  • What is temporal difference (TD) learning? How does it differ from Monte Carlo methods?
  • Explain Q-learning. Why is it off-policy? What is its convergence guarantee?
  • Explain the policy gradient theorem. Derive the REINFORCE update rule.
  • What is the credit assignment problem? How do algorithms like GAE address it?
  • Explain the exploration vs exploitation tradeoff. What strategies exist beyond ε-greedy?
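
The Q-learning question above comes down to one update rule: Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]. A tabular sketch is given below; the dict-of-dicts table layout, the state and action names, and the hyperparameter values are illustrative assumptions.

```python
def q_learning_update(Q, state, action, reward, next_state, done, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the new Q(s, a)."""
    # The max over next actions is what makes this off-policy: the target assumes
    # greedy behaviour regardless of which action the agent actually takes next.
    best_next = 0.0 if done else max(Q[next_state].values())
    td_target = reward + gamma * best_next
    Q[state][action] += alpha * (td_target - Q[state][action])
    return Q[state][action]

# Hypothetical two-state table
Q = {"s0": {"left": 0.0, "right": 0.0}, "s1": {"left": 1.0, "right": 2.0}}
print(q_learning_update(Q, "s0", "right", reward=1.0, next_state="s1", done=False,
                        alpha=0.5, gamma=0.9))
```

Replacing the max with the value of the action actually taken in the next state turns this into SARSA, the on-policy counterpart.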

9.2 Modern Algorithms

  • Explain PPO (Proximal Policy Optimisation). What is the clipped surrogate objective and why is clipping necessary?
  • Explain GAE (Generalised Advantage Estimation). What does the λ parameter control?
  • Explain Soft Actor-Critic (SAC). What is the maximum entropy objective and what problem does it solve?
  • What is GRPO? How is it used in reinforcement learning for reasoning in LLMs?
  • Explain actor-critic methods. How do they combine policy gradient and value function learning?
  • What is importance sampling? How is it used in off-policy policy gradient methods?
  • Explain model-based RL. What are the tradeoffs compared to model-free approaches?
  • Explain MuZero. How does it perform planning without access to the true environment dynamics?
  • What is curriculum learning in RL? How is it used to train agents on complex tasks?
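
Two of the questions above, GAE and the PPO clipped objective, are short enough to sketch directly. The code assumes a single trajectory with no episode termination inside it, and computes the per-sample objective only (no batching, value loss, or entropy bonus); function names and defaults are illustrative.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation over one trajectory, computed backwards in time."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value                       # bootstrap value after the final step
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # one-step TD error
        gae = delta + gamma * lam * gae           # exponentially weighted sum of TD errors
        advantages[t] = gae
        next_value = values[t]
    return advantages

def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO surrogate: min of the unclipped and clipped terms (to be maximised)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Hypothetical trajectory of three steps with reward 1 and a zero value baseline
print(compute_gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], last_value=0.0, gamma=1.0, lam=1.0))
print(ppo_clipped_objective(ratio=1.5, advantage=1.0))
```

Setting lam = 0 reduces GAE to the one-step TD error (low variance, more bias), while lam = 1 recovers the Monte Carlo return minus the baseline (high variance, less bias). The clip removes any incentive to push the probability ratio beyond 1 ± clip_eps in the direction that improves the objective.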

10. ML System Design Interview

ML system design rounds are more common in ML Engineer and Applied Scientist interviews than in pure Research Scientist tracks, but they appear across all roles at scale-focused labs. The format is open-ended: a problem is stated, and the candidate is expected to drive the design. Communication, reasoning about tradeoffs, and awareness of real-world constraints matter as much as technical depth.

10.1 Common ML System Design Questions

  • Design a system to train a large language model across hundreds of GPUs. What parallelism strategies would you use?
  • Design a recommendation system for a platform with 100M users and 10M items. How do you handle cold start?
  • Design an online learning system that updates a model in real time from user feedback. What are the key failure modes?
  • Design a retrieval-augmented generation (RAG) system for a legal document assistant. How do you handle retrieval latency?
  • Design a model serving infrastructure for a 70B parameter LLM with a P95 latency target of 500ms. What optimisations would you apply?
  • Design a distributed training pipeline that minimises communication overhead between nodes. Compare DDP, FSDP, and tensor parallelism.
  • You are training a diffusion model and GPU memory is the bottleneck. What techniques would you apply?
  • Design a data pipeline for pretraining a 7B parameter language model. How do you handle data quality, deduplication, and filtering at scale?

10.2 Key Distributed Training Concepts

[Figure: four distributed training strategies across four GPUs (GPU0–GPU3). Data Parallel (DDP): full model replicated on every GPU, gradients AllReduced each step. Fully Sharded (FSDP): each GPU permanently stores 1/4 of the parameters, gradients and optimiser state; the rest is AllGathered on demand. Tensor Parallel: one large weight matrix W split column-wise across GPUs, AllReduce after every layer. Pipeline Parallel: layer groups L1–3, L4–6, L7–9, L10–12 on GPU0–GPU3, activations passed forward via P2P, micro-batched.]
Figure 3. How each parallelism strategy partitions a model across four GPUs. Full mechanics and tradeoffs are in the table below.

Strategy | What Is Parallelised | Memory Savings | Communication Overhead | Best Used When
Data Parallel (DDP) | Data batches across GPUs | None (model replicated) | Gradient AllReduce each step | Model fits on one GPU
Fully Sharded (FSDP) | Parameters, gradients, optimiser states | High - linear in world size | AllGather + ReduceScatter | Model too large for one GPU
Tensor Parallel | Individual weight matrices | High - linear in TP degree | AllReduce within each layer | Very large single layers (attention, FFN)
Pipeline Parallel | Model layers across stages | Moderate | P2P activation passing | Very deep models with many layers
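
The "Memory Savings" column above can be made concrete with back-of-envelope arithmetic. The sketch below uses the commonly cited accounting for mixed-precision Adam of about 16 bytes per parameter (16-bit weights and gradients, plus 32-bit master weights, momentum and variance). It ignores activations, buffers and fragmentation, and the 7B model and 8-GPU world size are hypothetical.

```python
# 2 (16-bit weights) + 2 (16-bit grads) + 12 (32-bit master weights, momentum, variance)
BYTES_PER_PARAM_MIXED_ADAM = 2 + 2 + 12

def model_state_gb(n_params, world_size=1, sharded=False):
    """Approximate per-GPU memory (GB) for weights, gradients and optimiser state."""
    total_bytes = n_params * BYTES_PER_PARAM_MIXED_ADAM
    # DDP replicates everything on each GPU; full sharding divides it by the world size
    per_gpu = total_bytes / world_size if sharded else total_bytes
    return per_gpu / 1e9

ddp_gb = model_state_gb(7e9)                               # replicated on every GPU
fsdp_gb = model_state_gb(7e9, world_size=8, sharded=True)  # fully sharded over 8 GPUs
print(ddp_gb, fsdp_gb)
```

Under these assumptions, a 7B-parameter model needs roughly 112 GB of model state per GPU when replicated, but about 14 GB per GPU when fully sharded across eight devices. This is the "linear in world size" saving listed in the table.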

11. Behavioural and Research Discussion Round

Two types of round appear here: classic behavioural (STAR-format conflict, feedback, and collaboration questions) and research discussion (interests, field vision, and opinions on recent work). Both are underestimated compared to technical rounds - prepare for each with the same rigour.

11.1 Behavioural Questions (STAR Format)

  • Tell me about a time you disagreed with a colleague or manager on a technical decision. How did you handle it?
  • Describe a situation where you had to deliver critical feedback to a peer. How did you approach it?
  • Tell me about a project that failed. What happened and what did you take away?
  • Describe a time you had to learn a new technical area quickly. How did you approach it?
  • Tell me about a time you had competing priorities. How did you decide what to focus on?
  • Describe your most significant collaborative research contribution. What was your specific role?

11.2 Research Discussion Questions

  • What open problem in ML do you find most interesting right now, and why?
  • What is your opinion on the current direction of scaling language models?
  • Walk me through your most significant research contribution. What made it non-obvious?
  • Pick a recent paper you found compelling. What would you do next to extend it?
  • How do you approach reproducing results from a paper that lacks sufficient implementation detail?
  • You have a negative result in your experiments. How do you decide whether to change the hypothesis or the method?

12. Understanding Compensation: RSUs vs Stock Options

Equity compensation is frequently misunderstood at the point of offer evaluation. The distinction between RSUs and stock options matters considerably - particularly under UK law and taxation - and candidates who do not understand the mechanics are consistently at a disadvantage during negotiation.

At a glance
| | RSUs | Stock Options |
| --- | --- | --- |
| What you get | Actual shares, on vesting | The right to buy shares at a fixed price |
| Typical at | Large, established companies | Early- to mid-stage AI startups |
| Value if stock falls | Still worth something | Can be worth nothing |
| Tax trigger | At vesting (ordinary income) | At exercise, then again at sale |
| Cash needed upfront | None | Often yes - to exercise and pay tax |

RSUs (Restricted Stock Units) - typical at large tech companies such as Google DeepMind, Meta, and Microsoft - grant actual shares in the company on a vesting schedule. When they vest, shares can be sold or held. A portion - often around half - is automatically sold to cover income tax, because vested RSUs are treated as ordinary income.

Stock options - typical at AI startups - grant the right to purchase shares at a fixed strike price X, regardless of the market price Y at exercise. If Y > X, exercising and selling generates a profit. If Y < X, the options are worthless. In the UK, options granted under a qualifying Enterprise Management Incentive (EMI) scheme carry meaningfully better tax treatment than unapproved options - confirm with the offering company which type is being granted before modelling the numbers below.

Warning: Vested stock options typically must be exercised within 90 days of leaving the company, after which they lapse. If the company is not yet public, the shares cannot be sold after exercising - meaning the full purchase cost must be paid in cash, with income tax owed on the paper gain immediately, before any cash has been received.

To make the mechanics concrete, consider a UK stock option scenario:

| Item | Value |
| --- | --- |
| Strike price (X) | £10 per share |
| Current valuation price (Y) | £50 per share |
| Options granted | 10,000 |
| Cost to exercise | £100,000 |
| Paper gain subject to income tax | £400,000 |
| Income tax owed (45% rate) | £180,000 |
| Total cash outlay before seeing a penny | £280,000 |
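
The arithmetic in the scenario above can be reproduced with a few lines of Python, which makes it easy to plug in the numbers from a real offer. This is a minimal sketch using the same simplification as the table: a single flat income tax rate applied to the paper gain at exercise. It does not model National Insurance, EMI treatment, or tax at sale, and the `OptionGrant` and `exercise_outlay` names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class OptionGrant:
    strike: float     # strike price X, per share
    valuation: float  # current valuation price Y, per share
    options: int      # number of options exercised
    tax_rate: float   # flat income tax rate applied to the paper gain


def exercise_outlay(grant: OptionGrant) -> dict:
    """Cash needed to exercise when the shares cannot yet be sold."""
    cost = grant.strike * grant.options
    # An underwater option (Y < X) has no gain and would not be exercised.
    gain = max(grant.valuation - grant.strike, 0.0) * grant.options
    tax = gain * grant.tax_rate
    return {
        "exercise_cost": cost,
        "paper_gain": gain,
        "income_tax": tax,
        "total_cash": cost + tax,
    }


if __name__ == "__main__":
    example = OptionGrant(strike=10.0, valuation=50.0, options=10_000, tax_rate=0.45)
    for item, value in exercise_outlay(example).items():
        print(f"{item}: £{value:,.0f}")
```

Changing `valuation` shows how quickly the tax bill scales with the paper gain, even though no cash has been received.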

When a recruiter quotes a total compensation figure that includes startup equity, apply a significant mental discount to that number. Cashless exercise arrangements and liquidity events can reduce the upfront cash burden, but each new funding round dilutes existing shares, and liquidity events typically value stock below the official company valuation.

13. How to Negotiate a Machine Learning Job Offer

Companies consistently have more room to improve initial offers than candidates typically expect. Negotiating is always worth attempting - the downside is essentially zero, and material increases of 10–30% on total compensation are common when handled correctly.

The "blind auction" approach - not revealing competing offers to preserve leverage - frequently does not work in practice at top AI labs. Many companies explicitly request proof of competing offers before authorising increases, and some will verify the details. Sharing offers openly, when asked, is usually the more effective strategy.

"Recruiters are skilled at reading genuine preferences from small signals - how often a candidate mentions a company, the tone they use when discussing it. If a recruiter knows their company is already the top choice, negotiating leverage is significantly reduced."

13.1 Practical Negotiation Notes

  • Offer deadlines typically run one to two weeks, occasionally longer, and are not always flexible - though some companies will extend for the right candidate.
  • Companies maintain historical data on candidate decisions. A competing offer from a peer lab (OpenAI, Anthropic, Google DeepMind, Meta AI) carries genuine weight because the data shows it is a real alternative.
  • An offer from a company that candidates rarely choose over the target lab does not create meaningful leverage, regardless of the headline number.
  • Negotiate base salary, signing bonus, and RSU grant separately - each has different flexibility.
  • Tell every company about other active processes. This is standard practice, keeps timelines clear, and prompts companies to move faster when they are interested.

14. How to Choose Between Machine Learning Job Offers

Watch for this trap: The temptation to accept an early offer out of fear that nothing better will materialise is a predictable psychological pattern - particularly for candidates who have not been through the process before. Better offers do exist further along the process, and accepting prematurely closes those options. See Negotiation for how to buy time on a deadline without burning goodwill.

14.1 Factors Worth Weighting Heavily

  • Research agenda alignment: Can you see yourself excited by this team's direction in two years, not just today?
  • Team quality: Speak with at least two team members who are not the hiring manager before deciding.
  • Compute access: For research roles, GPU and TPU access is a direct constraint on what research is possible.
  • Equity mechanics: Understand whether equity is RSUs or options before signing - see Compensation for the mechanics.
  • Publication norms: Some labs publish aggressively; others are more closed. Know which environment suits you.
  • Location and life: Career decisions that work against life outside of work compound negatively over time.
Whose opinion to weight most: The most reliable input at the final stage is the perspective of people who know the candidate well - not the people at the companies being evaluated, who are predictably biased toward their own employers. Talking through the decision with trusted individuals who understand your actual values and goals tends to surface clearer answers than further research on the companies themselves.

15. Preparing Mentally for the ML Interview Process

The machine learning interview process at top AI labs is genuinely demanding, and its cumulative effect on mental resilience should not be underestimated - even for candidates who handle pressure well in other contexts.

15.1 Practical Strategies for Sustaining Performance

  • Sleep: Sleep disruption the night before high-stakes interviews is common and compounds significantly across multiple same-week interviews. Prioritise a consistent sleep routine in the weeks before the process begins - not under pressure on the eve of an interview.
  • Physical preparation: Exercise before interviews - particularly cardiovascular activity - reduces nervous energy and resets cognitive state. Keep intensity easy and eat sufficient carbohydrates beforehand.
  • Social connection: Isolation during an intense interview period is counterproductive. Structure at least some social time on evenings without early-morning interviews.
  • Pre-interview ritual: A consistent pre-interview routine provides a reliable anchor against unpredictable anxiety. The specific content matters less than the consistency.
Keep this in view: A candidate's worth as a researcher is not determined by these interviews. The process contains enough randomness that a single poor round - even on a well-known topic - does not constitute meaningful evidence of ability (see the Key Insight in the introduction). Emotional preparation is most effective when done before the process begins rather than discovered under fire.

15.2 Recommended Reading on Mindset and Performance

| Book | Why It Helps |
| --- | --- |
| The Now Habit - Neil Fiore | Practical approach to procrastination and performance anxiety rooted in psychological research. Particularly useful for candidates who procrastinate on interview preparation. |
| Mindset - Carol Dweck | The foundational text on growth versus fixed mindset. Directly applicable to navigating rejection during an extended interview process. |
| The Gifts of Imperfection - Brené Brown | Useful for the underlying work of decoupling self-worth from professional outcomes - which becomes important when facing multiple simultaneous rejections. |

16. Full Machine Learning Interview Topic Reference

The topics below represent a comprehensive review list compiled across a complete interview process at top AI labs. Nearly every topic on this list appeared in at least one interview in some form.

| Category | Topics |
| --- | --- |
| Reinforcement Learning | Q-Learning / TD Learning, Bellman Equations, PPO, GRPO, GAE, Variance Reduction, DPO, Policy Gradient Theorem, On-Policy vs Off-Policy, Exploration vs Exploitation, Credit Assignment, MuZero / World Models, AlphaGo / AlphaZero, SAC, Model-Based vs Model-Free, MDP, Monte Carlo vs TD, Actor-Critic, SARSA, Importance Sampling, Curriculum Learning |
| Large Language Models | Flash Attention, LoRA, TransformerXL, Griffin, Perceiver, Scaling Laws & Chinchilla, Mixture of Experts, RoPE, Sinusoidal Embeddings, Relative Positional Embeddings, LLM vs RNN vs S4 vs Mamba, Tokenisation (BPE), Pretraining & SFT, RLHF, Decoding Techniques, Causal Attention, Cross Attention, KV Cache, Speculative Decoding, Quantisation |
| Generative Modelling | GANs, VAEs and ELBO, Score Function / Score Matching, Diffusion Forward Process, Diffusion Reverse Process (DDIM / DDPM), Diffusion SDE, Flow Matching ODE, Classifier-Free Guidance |
| Distributed Systems | Tensor Parallelism, FSDP, DDP, Pipeline Parallelism, AllReduce / AllGather, Mixed Precision (BF16), Gradient Checkpointing, Gradient Accumulation & Clipping, Numerical Precision, JIT Compiling, JAX / PyTorch / TensorFlow |
| General ML | Curse of Dimensionality, S4 / CNNs / RNNs / LSTMs, Autoencoders, Gumbel-Softmax, MLE vs MAP, Newton's Method, Linear Regression, Activation & Loss Functions, No Free Lunch Theorem, BatchNorm / LayerNorm / RMSNorm, Adam / AdamW / Adagrad, Bias-Variance Tradeoff, Backpropagation, Regularisation (L1, L2, Dropout), Clustering, KNN, SVMs, Boosting, Decision Trees, Bayes Theorem, Precision / Recall / F1 / AUC-ROC, KL / JS Divergence, Xavier / He Init, Overfitting / Underfitting, Transfer Learning, Few-Shot / Zero-Shot |
| Linear Algebra | PSD Matrices, Jacobian & Hessian, Eigenvectors / Eigenvalues, Matrix Inverse, Dot Product, Null Space / Image Space, Orthogonality, Linear Independence, Singular Matrices, Rank / Span, Determinant, SVD |

17. What to Do Differently: Lessons from a Completed ML Interview Process

Even in a successful process - one that concluded with multiple competing offers from frontier AI labs and AI-focused companies - certain approaches would have produced better outcomes. These are the highest-value changes in retrospect.

  • Track everything in a spreadsheet from day one. Missing applications to genuinely interesting companies because of tracking failures is avoidable and frustrating. Use Notion or a simple Google Sheet with columns for company, current stage, deadlines, key contacts, and notes from each round.
  • Prepare emotionally before the process starts, not during it. The interview process has a way of feeling like a verdict on years of research. Developing a healthier relationship with failure and professional setbacks in advance is significantly more effective than attempting this work mid-process.
  • Be proactive about companies that go silent. If an application has not resulted in a reply and the company is genuinely interesting, a cold email to someone on the team is a more effective response than passive waiting.
  • Start ML implementation practice earlier. The gap between knowing how attention works conceptually and being able to implement it cleanly under time pressure is larger than most researchers expect. Four weeks of implementation practice is the minimum; eight weeks is better.
  • Sequence interviews by preference. Start with lower-priority companies to build calibration and confidence before processes that actually matter reach their final stages.
  • Manage timing deliberately. The goal is to have multiple offers arrive in the same window. If Company A offers an on-demand test, hold it until Company B's first interview is scheduled. Process timing is difficult to control precisely, but deliberate management is possible and meaningfully changes negotiation leverage.

Recommended Resources for ML Interview Preparation

| Type | Resource | What It Covers |
| --- | --- | --- |
| Book | Designing ML Systems - Chip Huyen | Applied ML, system design, and ML fundamentals at interview depth. Highlight as you read - it doubles as a flashcard source. |
| Book (Online) | The JAX Scaling Book | Distributed training, parallelism strategies (DDP, FSDP, tensor parallelism, pipeline parallelism), and large-scale ML systems. |
| Book | Reinforcement Learning - Sutton & Barto | Only necessary if new to RL. Skim chapters 1–6 and focus on policy gradient methods for interview coverage. |
| Practice | NeetCode 150 | Structured LeetCode practice covering all core algorithmic patterns. Start here before any lab interview. |
| Practice | DeepML | ML coding practice problems including attention, backpropagation, and training loop exercises. |
| Code Reference | nanoGPT - Karpathy | Clean, minimal transformer implementation in PyTorch. Read and re-implement from scratch as a preparation exercise. |
| Course | OpenAI Spinning Up | Practical deep RL introduction covering policy gradient, PPO, SAC, and TRPO with working implementations. |

Frequently Asked Questions

  • Most candidates need 6–12 weeks of structured preparation. Spend the first 2 weeks on algorithms coding (LeetCode Mediums), the next 3–4 weeks on ML implementation and theory flashcards, and the final 2–3 weeks on company-specific mock interviews and ML system design. Candidates with weaker coding foundations should add 2–4 weeks of pure LeetCode practice first. Starting preparation before sending any applications is strongly recommended.
  • Transformer architecture and attention mechanisms - scaled dot-product attention, multi-head attention, causal masking, Flash Attention - are tested at virtually every lab. LLM training and alignment (RLHF, SFT, DPO, PPO, GRPO) is the second most consistently covered area. Deep learning fundamentals (backpropagation derivations, optimisers, normalisation schemes) and distributed training (DDP, FSDP, tensor parallelism) round out the most heavily tested themes.
  • A PhD is not strictly required, but it is the default expectation at most top labs (DeepMind, Anthropic, OpenAI, Meta AI) for Research Scientist roles. A strong publication record - 3 or more first-author papers at ICLR, NeurIPS, or ICML - can substitute in practice. ML Engineer roles are more accessible without a PhD but still demand strong implementation skills. Senior RS roles almost universally expect a PhD or equivalent research output.
  • The ML coding round is consistently identified as the most difficult by candidates who have not practised implementation. Writing a full transformer decoder from scratch - with correct tensor shapes throughout, causal masking, residual connections, and layer normalisation - under 45-minute time pressure is substantially harder than understanding the architecture conceptually. The gap between knowing how attention works and implementing it cleanly under time pressure is real and requires dedicated practice to close.
  • For algorithms coding, complete at least Blind 75 and approximately 75 additional NeetCode Mediums, for a total of around 150 problems, most of them Mediums. Hard problems appear occasionally but are rare; fluency with every Medium pattern is more valuable than sporadic Hard attempts. The core patterns tested are DFS/BFS, dynamic programming, binary search, two pointers, sliding window, and backtracking. Target 20 minutes or fewer per Medium as your benchmark.
  • The most commonly asked transformer questions are: explain scaled dot-product attention and why scaling by √d_k is necessary; what does multi-head attention provide that single-head cannot; explain causal masking for autoregressive language models; compare sinusoidal vs learned vs RoPE positional encodings; explain the Flash Attention tiling algorithm and its O(n) memory guarantee; explain Mixture of Experts and conditional computation; and describe the encoder-only / decoder-only / encoder-decoder architectures with concrete examples.
  • RLHF (Reinforcement Learning from Human Feedback) is the three-stage post-training alignment pipeline: (1) supervised fine-tuning on high-quality demonstrations, (2) reward model training on human preference comparison pairs, and (3) RL fine-tuning - typically PPO - against the reward model with a KL divergence penalty to prevent the policy from drifting too far from the SFT baseline. Interviewers expect end-to-end explanations of each stage, awareness of reward hacking failure modes, and the ability to compare PPO-based RLHF with DPO.
  • Yes, negotiating is worth it. Companies consistently have more room to improve initial offers than candidates expect - particularly on base salary, signing bonus, and RSU grant. The downside risk of negotiating respectfully is essentially zero. Increases of 10–30% on total compensation are achievable when you hold competing offers and communicate genuine interest clearly. Never accept the first number without at least asking whether there is flexibility.
  • Research Scientists at top labs are expected to independently identify research problems, design experiments, and contribute publishable findings. The interview focuses heavily on research depth, publication record, and intellectual curiosity. ML Engineers focus on building, scaling, and productionising ML systems. The interview is more implementation and systems-heavy, with greater emphasis on ML system design, distributed training, and software engineering. Many labs hire both tracks and the distinction narrows at senior levels.
  • ML system design rounds are open-ended: a problem is stated (e.g. "design a recommendation system for 100M users") and the candidate drives the discussion. The interviewer evaluates problem decomposition, awareness of real-world constraints, ability to reason about tradeoffs between approaches, and communication quality. Common questions include: designing training infrastructure for large models, recommendation and retrieval systems, RAG pipelines, online learning systems, and LLM serving under latency SLAs.
  • Flash Attention is an IO-aware implementation of scaled dot-product attention that tiles the computation so each block fits in on-chip GPU SRAM, avoiding materialising the full N×N attention matrix in HBM (the GPU's main memory). Standard attention needs O(N²) memory for that matrix; Flash Attention reduces the extra memory to O(N) by computing the softmax incrementally (an online softmax that tracks a running maximum and normaliser) and fusing it with the output computation in a single pass over tiles. The reported result is a 2–4× speedup and memory that grows linearly rather than quadratically with sequence length, while still computing exact attention rather than an approximation. It matters in interviews because it appears on almost every ML coding and theory question list at research labs.
  • LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that freezes the pretrained weight matrix W ∈ R^(d×k) and adds a low-rank update ΔW = BA, where B ∈ R^(d×r) and A ∈ R^(r×k) with rank r ≪ min(d, k). During fine-tuning, only A and B are trained, reducing the number of trainable parameters from d×k to r×(d+k). For a square projection with d = k = 4096 and r = 8, that is roughly 16.8M versus 65.5K trainable parameters, a ~256× reduction. At inference, the update can be merged back into W with no added latency.
  • PPO (Proximal Policy Optimisation) is an online RL algorithm that trains the language model policy against a separately trained reward model, using a clipped surrogate objective and a KL penalty to prevent instability. DPO (Direct Preference Optimisation) re-parameterises the RLHF objective to eliminate the need for an explicit reward model and RL loop: it directly optimises the policy on preference pairs using a binary cross-entropy loss. DPO is simpler to implement and more stable, but it is offline and may be less effective when the preference data distribution differs from the model's current distribution.
  • The four primary distributed training paradigms are: Data Parallel (DDP) - each GPU holds a full model copy and gradients are AllReduced; Fully Sharded Data Parallel (FSDP) - parameters, gradients, and optimiser states are sharded across GPUs for large-model training; Tensor Parallelism - individual weight matrices are split across GPUs (requiring AllReduce within each layer); and Pipeline Parallelism - model layers are split across stages, with micro-batching used to shrink the idle time ("bubble") between stages. Know when to use each and the communication primitives each relies on (AllReduce, AllGather, ReduceScatter, P2P). See the ML System Design section for a side-by-side comparison table.
  • A general SE interview at most tech companies is almost entirely algorithms coding plus a system design round. An ML research scientist interview adds three rounds that SE loops do not have: ML coding (implementing model components from scratch, not calling library functions), ML theory (derivations, not definitions), and a research discussion round where a senior researcher probes the depth and originality of past work. The algorithms coding bar is usually similar or slightly lower than at FAANG-style SE interviews, but the combined technical surface area is larger. ML Engineer interviews sit in between - heavier on ML system design and production concerns, lighter on open-ended research discussion.
  • After a round that goes badly, write down what went wrong while it is fresh - the specific question, where the explanation broke down, and what the correct approach was - then close the laptop and stop reviewing it for the rest of the day. Rehashing a single round rarely improves the next one and reliably damages confidence going into it. A single weak round is not a reliable predictor of the final decision: interview panels typically weight the round average and look for consistent strengths, not a single low score in isolation. If a debrief or feedback call is offered, take it - specific, dated feedback is one of the few high-signal artifacts available during the process.
  • Take-home assignments are less common at the largest labs (DeepMind, OpenAI, Anthropic, Meta AI) and more common at mid-size AI companies and startups, where they often replace a live ML coding round entirely. Typical formats: a small modelling task with a provided dataset and a write-up, or a focused implementation exercise (e.g., implement and train a small transformer on a toy dataset) with a time budget of a few hours to a few days. Treat the write-up with the same care as the code - reviewers read it as a proxy for how the candidate would communicate results to a research team. Clarify the time budget and evaluation criteria before starting if they are not specified.
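
Several answers above refer to scaled dot-product attention, the √d_k scaling, and causal masking. As a warm-up for the ML coding round, the sketch below implements single-head causal attention using only the Python standard library, so every step is explicit. It is a teaching sketch rather than an efficient implementation: a real interview answer would normally use PyTorch or JAX tensors, and the function names here are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Single-head attention: softmax(Q K^T / sqrt(d_k) + causal mask) V.

    q, k: lists of n vectors of size d_k; v: list of n vectors of size d_v.
    """
    d_k = len(q[0])
    scale = 1.0 / math.sqrt(d_k)
    out = []
    for i, q_i in enumerate(q):
        # Position i may only attend to positions 0..i. Future positions
        # get a score of -inf, so their softmax weight is exactly zero.
        scores = [
            scale * sum(a * b for a, b in zip(q_i, k_j)) if j <= i else -math.inf
            for j, k_j in enumerate(k)
        ]
        weights = softmax(scores)
        out.append([
            sum(w * v_j[c] for w, v_j in zip(weights, v))
            for c in range(len(v[0]))
        ])
    return out


if __name__ == "__main__":
    q = k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[1.0, 2.0], [3.0, 4.0]]
    for row in causal_attention(q, k, v):
        print(row)
```

The first output row equals the first value vector, because the first position can only attend to itself. That is a quick sanity check worth running on any causal attention implementation.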
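
The LoRA parameter arithmetic in the answers above is easy to verify directly. The sketch below counts trainable parameters for full fine-tuning versus a rank-r update and shows the merge step W + BA on a tiny example. It assumes a plain d×k weight matrix and leaves out the α/r scaling factor that LoRA implementations commonly apply to the update. The helper names are illustrative.

```python
def lora_param_counts(d, k, r):
    """Trainable parameters: full fine-tuning vs a rank-r LoRA update."""
    full = d * k        # every entry of W is trained
    lora = r * (d + k)  # B is d x r, A is r x k
    return full, lora


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def merge_lora(w, b, a):
    """Fold the low-rank update into the frozen weight: W' = W + B A."""
    delta = matmul(b, a)
    return [[w_ij + d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


if __name__ == "__main__":
    full, lora = lora_param_counts(d=4096, k=4096, r=8)
    print(f"full: {full:,}  lora: {lora:,}  reduction: {full // lora}x")

    # Tiny rank-1 example: W is 2x2, B is 2x1, A is 1x2.
    w = [[1.0, 0.0], [0.0, 1.0]]
    b = [[1.0], [2.0]]
    a = [[3.0, 4.0]]
    print(merge_lora(w, b, a))
```

Because the merged matrix has the same shape as W, the fine-tuned layer costs the same at inference as the original one.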
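
To make the PPO versus DPO comparison above more concrete, the sketch below computes the DPO loss for a single preference pair: the negative log-sigmoid of β times the difference between the policy-to-reference log-ratio of the chosen response and that of the rejected response. The inputs are summed sequence log-probabilities, which a real implementation would obtain from the policy and a frozen reference model. The default β = 0.1 and the function names are illustrative assumptions.

```python
import math


def softplus(x):
    """log(1 + exp(x)), computed without overflow for large |x|."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair."""
    # Log-ratio of policy to reference for each response.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-logits)


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy has shifted probability toward the chosen response.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

No reward model or sampling loop appears anywhere in this computation, which is the simplification DPO offers over PPO-based RLHF.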