How long should I prepare for a machine learning interview?

Most candidates need 6–12 weeks of structured preparation. Spend the first 2 weeks on algorithms coding (LeetCode Mediums), the next 3–4 weeks on ML implementation and theory flashcards, and the final 2–3 weeks on company-specific mock interviews and ML system design. Candidates with weaker coding foundations should add 2–4 weeks of pure LeetCode practice first. Starting preparation before sending any applications is strongly recommended.

What topics are most heavily tested in ML research scientist interviews?

Transformer architecture and attention mechanisms - scaled dot-product attention, multi-head attention, causal masking, Flash Attention - are tested at virtually every lab. LLM training and alignment (RLHF, SFT, DPO, PPO, GRPO) is the second most consistently covered area. Deep learning fundamentals (backpropagation derivations, optimisers, normalisation schemes) and distributed training (DDP, FSDP, tensor parallelism) round out the most heavily tested themes.

Is a PhD required to become a Research Scientist at an AI lab?

A PhD is not strictly required, but it is the default expectation at most top labs (DeepMind, Anthropic, OpenAI, Meta AI) for Research Scientist roles. A strong publication record - 3 or more first-author papers at ICLR, NeurIPS, or ICML - can substitute in practice. ML Engineer roles are more accessible without a PhD but still demand strong implementation skills. Senior RS roles almost universally expect a PhD or equivalent research output.

What is the hardest part of an ML interview?

The ML coding round is consistently identified as the most difficult by candidates who have not practised implementation. Writing a full transformer decoder from scratch - with correct tensor shapes throughout, causal masking, residual connections, and layer normalisation - under 45-minute time pressure is substantially harder than understanding the architecture conceptually. The gap between knowing how attention works and implementing it cleanly under time pressure is real and requires dedicated practice to close.

How many LeetCode problems do I need for ML interviews?

Complete at least Blind 75 and approximately 75 additional NeetCode Mediums, for a total of around 150 problems, most of them Mediums. Hard problems appear occasionally but are rare; fluency with every Medium pattern is more valuable than sporadic Hard attempts. The core patterns tested are DFS/BFS, dynamic programming, binary search, two pointers, sliding window, and backtracking. Target 20 minutes or fewer per Medium as your benchmark.

What transformer questions are asked in ML interviews?

The most commonly asked transformer questions are: explain scaled dot-product attention and why scaling by √d_k is necessary; what does multi-head attention provide that single-head cannot; explain causal masking for autoregressive language models; compare sinusoidal vs learned vs RoPE positional encodings; explain the Flash Attention tiling algorithm and its O(n) memory guarantee; explain Mixture of Experts and conditional computation; and describe the encoder-only / decoder-only / encoder-decoder architectures with concrete examples.

What is RLHF and how is it tested in AI interviews?

RLHF (Reinforcement Learning from Human Feedback) is the three-stage post-training alignment pipeline: (1) supervised fine-tuning on high-quality demonstrations, (2) reward model training on human preference comparison pairs, and (3) RL fine-tuning - typically PPO - against the reward model with a KL divergence penalty to prevent the policy from drifting too far from the SFT baseline. Interviewers expect end-to-end explanations of each stage, awareness of reward hacking failure modes, and the ability to compare PPO-based RLHF with DPO.
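As a rough illustration of what stages (2) and (3) optimise, the sketch below writes out the two scalar quantities in plain Python. It assumes the common Bradley-Terry pairwise formulation for the reward model loss and a per-sample KL estimate of log π(y|x) − log π_SFT(y|x); the function names and the default `beta` are illustrative and not taken from any particular library.

```python
import math

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise (Bradley-Terry) loss: -log sigmoid(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def kl_shaped_reward(rm_score: float, logp_policy: float, logp_sft: float,
                     beta: float = 0.1) -> float:
    """Reward seen by the RL stage: reward-model score minus a KL penalty."""
    kl_estimate = logp_policy - logp_sft   # per-sample estimate, assumed here
    return rm_score - beta * kl_estimate

# The loss is log(2) when the reward model cannot tell the pair apart,
# and shrinks as the chosen response is scored further above the rejected one.
print(reward_model_loss(0.0, 0.0), reward_model_loss(3.0, 0.0))
```

A larger `beta` keeps the policy closer to the SFT baseline at the cost of less reward improvement, which is the tradeoff interviewers usually probe when they ask about reward hacking.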

Should I always negotiate a machine learning job offer?

Yes. Companies consistently have more room to improve initial offers than candidates expect - particularly on base salary, signing bonus, and RSU grant. The downside risk of negotiating respectfully is essentially zero. Increases of 10–30% on total compensation are achievable when you hold competing offers and communicate genuine interest clearly. Never accept the first number without at least asking whether there is flexibility.

What is the difference between a Research Scientist and an ML Engineer?

Research Scientists at top labs are expected to independently identify research problems, design experiments, and contribute publishable findings. The interview focuses heavily on research depth, publication record, and intellectual curiosity. ML Engineers focus on building, scaling, and productionising ML systems. The interview is more implementation and systems-heavy, with greater emphasis on ML system design, distributed training, and software engineering. Many labs hire both tracks and the distinction narrows at senior levels.

What does a machine learning system design interview look like?

ML system design rounds are open-ended: a problem is stated (e.g. "design a recommendation system for 100M users") and the candidate drives the discussion. The interviewer evaluates problem decomposition, awareness of real-world constraints, ability to reason about tradeoffs between approaches, and communication quality. Common questions include: designing training infrastructure for large models, recommendation and retrieval systems, RAG pipelines, online learning systems, and model serving for large LLMs with latency SLAs.

What is Flash Attention and why is it important for ML interviews?

Flash Attention is an IO-aware implementation of scaled dot-product attention that tiles the computation across GPU SRAM to avoid materialising the full N×N attention matrix in HBM (the GPU's main memory). Standard attention has O(N²) memory complexity; Flash Attention reduces this to O(N) by computing softmax and the output in fused passes over tiles, rescaling partial results with a running maximum and running sum. The result is a 2–4× speedup and memory use that grows linearly rather than quadratically with sequence length, while still computing exact attention (outputs match standard attention up to floating-point rounding). It is important in interviews because it appears on almost every ML coding and theory question list at research labs.
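The core trick can be sketched for a single query row in plain Python. This is only the online-softmax recurrence, not a GPU kernel: it assumes one query vector, lists of key and value vectors, and a hypothetical `tile_size` parameter, and it shows that visiting keys tile by tile gives the same output as the standard computation.

```python
import math

def attention_row_reference(q, keys, values):
    """Standard attention for one query: materialises all scores at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    d_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(d_v)]

def attention_row_tiled(q, keys, values, tile_size=2):
    """Same result, visiting keys/values one tile at a time."""
    scale = 1.0 / math.sqrt(len(q))
    d_v = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * d_v                      # unnormalised output accumulator
    for start in range(0, len(keys), tile_size):
        k_tile = keys[start:start + tile_size]
        v_tile = values[start:start + tile_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new running maximum
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Only `running_max`, `running_sum`, and `acc` persist between tiles, which is why the per-row state is independent of sequence length.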

What is LoRA and how does it work?

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that freezes the pretrained weight matrix W and adds a low-rank update ΔW = BA, where B ∈ R^(d×r) and A ∈ R^(r×k) with rank r ≪ min(d, k). During fine-tuning, only A and B are trained, reducing the number of trainable parameters from d×k to r×(d+k). For a square 4096×4096 projection (d = k = 4096) with r=8, LoRA reduces trainable parameters by ~256× (from about 16.8M to about 65.5K). At inference, the update can be merged back into W with no added latency.
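The parameter arithmetic above is easy to check directly. The helper below is a small illustrative function (its name is not from any library) that evaluates d×k against r×(d+k) for one weight matrix.

```python
def lora_param_counts(d: int, k: int, r: int):
    """Trainable parameters for full fine-tuning vs LoRA on one d x k matrix."""
    full = d * k            # every entry of W is trainable
    lora = r * (d + k)      # B is d x r, A is r x k
    return full, lora, full / lora

full, lora, ratio = lora_param_counts(d=4096, k=4096, r=8)
print(full, lora, ratio)  # 16777216 65536 256.0
```

The ratio simplifies to d·k / (r·(d+k)), so for square matrices it is d / (2r): doubling the rank halves the saving.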

How do you explain the difference between PPO and DPO?

PPO (Proximal Policy Optimisation) is an online RL algorithm that trains the language model policy against a separately trained reward model, using a clipped surrogate objective and a KL penalty to prevent instability. DPO (Direct Preference Optimisation) re-parameterises the RLHF objective to eliminate the need for an explicit reward model and RL loop: it directly optimises the policy on preference pairs using a binary cross-entropy loss. DPO is simpler to implement and more stable, but it is offline and may be less effective when the preference data distribution differs from the model's current distribution.
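To make the "binary cross-entropy on preference pairs" description concrete, here is a minimal sketch of the DPO loss for a single pair, assuming the sequence log-probabilities under the policy and the frozen reference model have already been computed. The argument names and the default `beta` are illustrative.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = logp_chosen - ref_logp_chosen        # log pi/pi_ref on y_w
    rejected_logratio = logp_rejected - ref_logp_rejected  # log pi/pi_ref on y_l
    logits = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable softplus(-logits)
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# When the policy equals the reference, both log-ratios are zero and the loss is log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
```

No reward model or sampling loop appears anywhere in this computation, which is the practical simplification over PPO; the cost is that the preference pairs are fixed in advance.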

What distributed training concepts should I know for ML interviews?

The four primary paradigms are: Data Parallel (DDP) - each GPU holds a full model copy and gradients are AllReduced; Fully Sharded Data Parallel (FSDP) - parameters, gradients, and optimiser states are sharded across GPUs for large-model training; Tensor Parallelism - individual weight matrices are split across GPUs (requiring AllReduce within each layer); and Pipeline Parallelism - model layers are split across stages, with micro-batching used to reduce pipeline-bubble (idle) time. Know when to use each and the communication primitives each relies on (AllReduce, AllGather, ReduceScatter, P2P). See the ML System Design section for a side-by-side comparison table.
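A back-of-envelope calculation helps explain when DDP stops being viable. The sketch below assumes mixed-precision Adam with roughly 16 bytes of model state per parameter (2 for half-precision weights, 2 for gradients, 12 for FP32 master weights plus the two Adam moments) and ignores activations and temporary communication buffers; the function name and the 7B/8-GPU example are illustrative.

```python
def model_state_gib(n_params: int, n_gpus: int, sharded: bool) -> float:
    """Approximate per-GPU memory for weights, gradients, and optimiser state."""
    bytes_per_param = 2 + 2 + 12          # assumption: mixed-precision Adam
    total_bytes = n_params * bytes_per_param
    # DDP replicates all model state on every GPU; FSDP shards it across GPUs
    per_gpu_bytes = total_bytes / n_gpus if sharded else total_bytes
    return per_gpu_bytes / 2**30

ddp = model_state_gib(7_000_000_000, n_gpus=8, sharded=False)
fsdp = model_state_gib(7_000_000_000, n_gpus=8, sharded=True)
print(f"DDP: {ddp:.1f} GiB per GPU, FSDP: {fsdp:.1f} GiB per GPU")
```

Under these assumptions a 7B-parameter model needs over 100 GiB of model state per GPU with DDP, more than a single 80 GB accelerator holds, while full sharding across 8 GPUs brings it down by a factor of 8.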

How is an ML research scientist interview different from a general software engineering interview?

A general SE interview at most tech companies is almost entirely algorithms coding plus a system design round. An ML research scientist interview adds three rounds that SE loops do not have: ML coding (implementing model components from scratch, not calling library functions), ML theory (derivations, not definitions), and a research discussion round where a senior researcher probes the depth and originality of past work. The algorithms coding bar is usually similar or slightly lower than at FAANG-style SE interviews, but the combined technical surface area is larger. ML Engineer interviews sit in between - heavier on ML system design and production concerns, lighter on open-ended research discussion.

What should I do immediately after a bad interview round?

Write down what went wrong while it is fresh - the specific question, where the explanation broke down, and what the correct approach was - then close the laptop and stop reviewing it for the rest of the day. Rehashing a single round rarely improves the next one and reliably damages confidence going into it. A single weak round is not a reliable predictor of the final decision: interview panels typically weight the round average and look for consistent strengths, not a single low score in isolation. If a debrief or feedback call is offered, take it - specific, dated feedback is one of the few high-signal artifacts available during the process.

Are take-home assignments common in ML interviews?

Less common at the largest labs (DeepMind, OpenAI, Anthropic, Meta AI), more common at mid-size AI companies and startups, where they often replace a live ML coding round entirely. Typical formats: a small modelling task with a provided dataset and a write-up, or a focused implementation exercise (e.g., implement and train a small transformer on a toy dataset) with a time budget of a few hours to a few days. Treat the write-up with the same care as the code - reviewers read it as a proxy for how the candidate would communicate results to a research team. Clarify the time budget and evaluation criteria before starting if they are not specified.

Machine Learning Interview Guide: Complete Preparation for Research Scientists and ML Engineers

By Suchibrata Patra

June 2026

Abstract

Landing a Research Scientist or Machine Learning Engineer role at a top AI lab is one of the most demanding hiring processes in technology. This guide consolidates field-tested preparation strategies covering the complete interview lifecycle: securing interviews through publications and targeted outreach; navigating every technical round - algorithms coding, ML coding, deep learning theory, transformer and LLM questions, reinforcement learning, ML system design; managing process logistics and timing; negotiating compensation including the mechanics of RSUs versus stock options; and making the final offer decision. A comprehensive topic reference of 90+ ML concepts, a preparation checklist, and an expanded FAQ close the guide.

1. Introduction: What This Machine Learning Interview Guide Covers

There is almost no practical, candid information available on what it actually takes to secure a Research Scientist or Machine Learning Engineer position at a top AI lab. This guide aims to fill that gap - written from firsthand experience running a full interview process across multiple frontier AI labs and AI-focused companies, combined with structured research into primary-source papers and publicly available preparation material.

A few honest context notes upfront. Some processes conclude after an offer has already been signed elsewhere - that is normal and does not mean those companies were uninterested. A rejection at one track (e.g. Research Scientist) sometimes comes with a redirect to a different track (e.g. an Engineering or Applied role) at the same company. And applications to several strong-fit companies will not progress past the initial screen at all, even with a strong profile - résumé and applicant-tracking-system screening is noisy at high-volume employers, and research from Harvard Business School and Accenture on hiring funnels documents how qualified candidates are routinely filtered out for reasons that have little to do with their actual ability.

This guide covers machine learning interview preparation end-to-end: how to get the first call, how to prepare for every technical round, which topics to study (deep learning, transformers, LLMs, reinforcement learning, distributed training, ML system design), how to handle logistics and negotiation, and how to make the final decision. Specific question banks for each interview type are included throughout.

Key Insight: The process is inherently stochastic. Offers have come from companies where interviews felt rocky, and rejections from companies where interviews felt excellent. Calibrate your emotional response accordingly - any individual outcome is only weakly predictive of ability.

If you need to first build the underlying knowledge before tackling interview prep, our Complete ML Learning Roadmap covers each subject area in depth across a structured 12–18 month path from beginner to interview-ready.

2. Machine Learning Interview Preparation Checklist

Use this checklist to track readiness across every dimension of the ML interview process. Candidates who arrive at first-round interviews without completing the coding and implementation items are consistently underprepared, regardless of research depth.

2.1 Application and Pipeline Readiness

CV updated with accurate, specific descriptions of research contributions
Target companies identified and prioritised (at least 8–12 applications)
Cold email templates drafted for key hiring managers
LinkedIn and Google Scholar profiles current and consistent
Spreadsheet set up to track company, stage, deadlines, and contacts
At least one warm referral activated per target company (where possible)

2.2 Algorithms Coding Readiness

Blind 75 completed with solutions understood (not just memorised)
Approximately 75 additional NeetCode Mediums completed (around 150 problems in total with Blind 75)
Core patterns fluent: DFS, BFS, graphs, dynamic programming, backtracking, binary search, two pointers, sliding window
Able to solve a LeetCode Medium in under 20 minutes consistently
Optimal time complexity known for each core pattern (not just a working solution)

2.3 ML Coding and Implementation Readiness

Scaled dot-product attention implemented from scratch (PyTorch or JAX)
Multi-head attention implemented from scratch with correct head split/merge
Full transformer block implemented (encoder, decoder, or decoder-only)
Flash Attention algorithm understood at implementation level
Attention backward pass derived from first principles
MLP forward pass and manual backpropagation implemented
Training loop with gradient clipping written without reference
Common debugging scenarios practised: gradient explosion, NaN losses, shape mismatches
LoRA implementation understood (low-rank decomposition mechanics)

2.4 ML Theory and Knowledge Readiness

Flashcards written (not downloaded) for all 90+ topics in the reference list
RLHF pipeline understood end-to-end (reward model, PPO, GRPO)
Diffusion model forward and reverse processes explained without notes
Scaling laws, MoE, and positional encoding variants (RoPE, sinusoidal) reviewed
Distributed training paradigms covered: DDP, FSDP, tensor parallelism, pipeline parallelism
At least one ML system design question practised end-to-end
Research papers listed on your CV re-read and opinions prepared (at least 3 recent papers)

2.5 Logistics and Process Readiness

Interview schedule sequenced: lower-priority companies first
All active processes disclosed to each company (standard practice)
On-demand coding tests held until most target companies have scheduled
Competing offers understood well enough to negotiate with specifics
RSU vs stock option mechanics understood before any offer arrives
Pre-interview routine established (sleep, exercise, cognitive reset)

3. How to Get Machine Learning Interviews at Top AI Labs

Getting past the application stage is its own challenge. The main levers are well-known: more papers, trendier research topics, and stronger internship experience. A rough benchmark that holds across top labs: 3 or more first-author papers published at ICLR, NeurIPS, or ICML, plus at least one internship or prior industry role, is typically the threshold for consistent callbacks. Work in high-demand areas - large language models, RLHF, diffusion models, and multimodal AI - generates substantially more traction than equally strong work in less fashionable areas.

The pivot that matters most: Once interviews are secured, more papers provide no additional advantage - interviewers frequently do not read a candidate's CV in detail. The preparation focus should shift entirely to interview skills, starting immediately rather than after the next paper is submitted. See Algorithms Coding and ML Coding & Implementation for where that time is best spent first.

3.1 Application Channels That Actually Work

LinkedIn and X (Twitter): Many labs - especially for internships - advertise roles that require filling out a linked Google Form to be considered. Simply clicking "Apply" on LinkedIn Jobs itself is often not sufficient. Following researchers and hiring managers at target companies is the most reliable way to catch these posts before they fill.

Referrals: Helpful, but not necessary. At competitive labs, referred and unreferred candidates both receive interview invitations - referrals can accelerate timelines and increase visibility, but their absence should not prevent an application. If a connection exists at a target company and no progress has been made after applying directly, asking for a referral is worth attempting.

Cold emails: Emailing a hiring manager or team member directly is frequently appreciated. The email should not simply restate a CV - it should explain specifically why the candidate is a strong fit for that team and what genuinely excites them about the work. A well-targeted cold email to a frontier AI lab can and does get a direct, personal reply from a hiring manager often enough to be worth the effort. Even in cases where the email goes unanswered, interviews sometimes still proceed - but the outreach creates a useful secondary signal of genuine interest.

Cover letters: Rarely required, but worth doing properly when they are. Cover letters generated wholesale by an AI model are easy to identify and make a poor impression. The better approach: write the letter authentically, then use a tool to polish the prose. Personality and genuine excitement are what differentiate these letters.

3.2 Research Scientist at a Startup vs a Large AI Lab

Factor	Large AI Lab (DeepMind, Anthropic, Meta AI, OpenAI)	AI Startup (seed to Series C)
Discoverability	Roles are public and well-known; competition is extremely high	Harder to find - word of mouth is the primary channel; lower competition
Research Work	Consistently high-quality; access to large compute resources	Can be world-class or engineering-focused; agenda may shift frequently
Growth & Visibility	One of many researchers; slower path to ownership	High visibility; faster growth and influence over research direction
Compensation	Salary plus RSUs with relatively clear liquidity path	Salary plus stock options - upside possible but should be heavily discounted
CV Signal	Immediately recognised globally	Requires explanation; less portable unless company achieves prominence
Compute Access	Large-scale GPU/TPU clusters for pretraining and experiments	Variable - may be constrained; cloud credits common at early stage

4. Machine Learning Interview Structure: What to Expect at Each Stage

Most top AI labs and ML-focused tech companies follow a broadly similar interview structure, though the weight and difficulty given to each stage varies considerably. Expect 3–8 technical interviews depending on the lab. The rounds below represent the full spectrum - not every company uses all of them.

#	Round	Duration	What Is Tested
01	Recruiter Screen	30–45 min	Background, motivation, role fit, basic CV probing
02	Algorithms Coding	45–60 min	LeetCode-style Mediums: DFS/BFS, DP, binary search, two pointers
03	ML Coding & Debugging	45–60 min	Implementing ML components from scratch; debugging existing code
04	ML Theory & Knowledge	45–60 min	Deep learning, transformers, LLMs, RL, distributed training concepts
05	Behavioural / Research Discussion	30–60 min	STAR stories, research interests, opinions on frontier work
06	ML System Design	45–60 min	Open-ended design of ML pipelines, infrastructure, and serving systems

Underestimated round: Behavioural and research discussion rounds feel casual compared to technical interviews. Underestimating them is a common and avoidable mistake. Prepare concrete STAR stories and genuine research perspectives in advance. Interviewers at top labs are often senior researchers - shallow opinions on frontier topics are immediately apparent.

Figure 1. The end-to-end ML interview pipeline. Each box links to the matching section of this guide.

5. Algorithms Coding Interview

The algorithms coding round tests LeetCode-style problem solving at Medium difficulty. This round is separate from ML coding - it evaluates pure data structures and algorithms fluency and is a prerequisite gate at most top labs.

5.1 Core Patterns to Master

Graph Traversal (DFS / BFS): Connected components, shortest paths, cycle detection
Dynamic Programming: 1D and 2D DP, memoisation vs tabulation, common templates
Binary Search: Search on answer, rotated arrays, finding boundaries
Two Pointers: Sorted arrays, palindromes, container problems
Sliding Window: Variable and fixed windows, frequency counting
Backtracking: Permutations, combinations, N-queens variants
Heap / Priority Queue: Top-k problems, merge k sorted lists
Trees: Traversals, LCA, diameter, serialisation
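As one concrete instance of the patterns above, here is a short sketch of a variable-size sliding window, using the classic "longest substring without repeating characters" problem as an assumed example. The same two-index structure carries over to most window problems.

```python
def longest_unique_substring(s: str) -> int:
    """Length of the longest substring with no repeated character, in O(n)."""
    last_seen = {}   # character -> most recent index
    left = 0         # left edge of the current window
    best = 0
    for right, ch in enumerate(s):
        # If ch already occurs inside the window, move the left edge past it
        if ch in last_seen and last_seen[ch] >= left:
            left = last_seen[ch] + 1
        last_seen[ch] = right
        best = max(best, right - left + 1)
    return best

print(longest_unique_substring("abcabcbb"))  # 3
```

The check `last_seen[ch] >= left` is the detail most often missed under time pressure: a stale entry from before the window must not move `left` backwards.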

5.2 Most Common ML Interview Coding Questions

The questions below appear across virtually every ML interview process at top labs. Treat each as a starting point for deeper pattern exploration.

Machine Learning Fundamentals Questions

Explain the bias-variance tradeoff. How does it relate to model complexity?
What is the difference between L1 and L2 regularisation? When would you use each?
Explain backpropagation from first principles. Derive the gradient for a two-layer network.
What is the vanishing gradient problem, and how do residual connections address it?
Explain batch normalisation, layer normalisation, and RMSNorm. What are the differences?
What is the difference between MLE and MAP estimation? Give a concrete example.
Explain precision, recall, F1, and AUC-ROC. When does each metric matter most?
What is KL divergence? How is it used in VAEs?
Explain the Adam optimiser. What does it do that SGD with momentum does not?
What is the curse of dimensionality? How does it affect nearest-neighbour search?
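For the metrics question in particular, it helps to be able to compute the quantities by hand. The sketch below assumes binary labels encoded as 0 and 1 and returns 0.0 where a metric is undefined; the function name is illustrative.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# 3 positives, 5 negatives: 2 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(precision_recall_f1(y_true, y_pred))  # precision = recall = F1 = 2/3
```

A classifier that predicts all zeros on this data would score 62.5% accuracy with zero recall, which is the usual argument for reporting precision and recall on imbalanced problems.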

Research Scientist Interview Questions

Walk me through your most significant research contribution. What made it non-obvious?
What open problem in machine learning do you find most interesting right now, and why?
You have a negative result in your experiments. How do you decide whether to change the hypothesis or the method?
Describe a time you disagreed with a collaborator on a research direction. How was it resolved?
How do you approach reproducing results from a paper that lacks sufficient implementation detail?
What is your opinion on the current direction of scaling language models? Is scaling sufficient for AGI?
Pick a recent paper you found compelling. What would you do next to extend it?

5.3 Preparation Strategy

Complete Blind 75 first - these cover the essential patterns. Then work through approximately 75 additional NeetCode Mediums with a focus on Dynamic Programming and Graphs, which appear most frequently in ML lab interviews. Target 20 minutes or fewer per Medium problem. If stuck beyond 15 minutes, look up the solution, understand the pattern, flag it for review, and move on. Breadth across patterns matters more than deep mastery of any single type.

6. ML Coding and Implementation Interview

The ML coding round tests two distinct skills: implementing ML components from scratch with correct logic and tensor shapes, and debugging existing ML code efficiently. Both require deep familiarity with PyTorch or JAX at the level of understanding what every line does - not just using high-level APIs.

6.1 Implementation Questions

Implement scaled dot-product attention from scratch. Include the masking logic for causal attention.
Implement multi-head attention. Correctly handle the split and merge of heads.
Implement a full transformer decoder block (self-attention + FFN + residuals + layer norm).
Implement Flash Attention at an algorithmic level. Explain why it is memory-efficient.
Implement the backward pass for softmax from first principles.
Implement a simple training loop with gradient clipping in PyTorch. Include a learning rate scheduler.
Implement LoRA (Low-Rank Adaptation). Explain how it reduces trainable parameters.
Implement k-means clustering from scratch. Discuss convergence and initialisation.
Implement dropout correctly during training and inference.
Implement sinusoidal and RoPE positional encodings.
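For the softmax backward question, a minimal dependency-free sketch is shown here. It uses the Jacobian ∂s_i/∂z_j = s_i(δ_ij − s_j), which collapses the vector-Jacobian product to dL/dz = s ⊙ (g − ⟨g, s⟩), where g = dL/ds. The function names are illustrative; in an interview the same formula would typically be written with tensors along the last dimension.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_backward(z, grad_out):
    """Given grad_out = dL/ds for s = softmax(z), return dL/dz."""
    s = softmax(z)
    dot = sum(g * si for g, si in zip(grad_out, s))      # <g, s>
    return [si * (g - dot) for si, g in zip(s, grad_out)]

z = [0.5, -1.0, 2.0]
g = [1.0, -2.0, 0.5]
print(softmax_backward(z, g))
```

Two quick sanity checks are worth stating aloud: the result sums to zero (softmax is invariant to adding a constant to z), and it matches a finite-difference estimate.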

6.2 Reference Implementations

# Scaled dot-product attention - from scratch
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Args:
        Q, K:    (batch, heads, seq_len, d_k)
        V:       (batch, heads, seq_len, d_v)
        mask:    optional causal or padding mask, broadcastable to
                 (batch, heads, seq_len, seq_len); positions where mask == 0 are blocked
    Returns:
        output:  (batch, heads, seq_len, d_v)
        weights: (batch, heads, seq_len, seq_len)
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (B, H, T, T)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights                       # (B, H, T, d_v)

# LoRA: Low-Rank Adaptation of a linear layer
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=4, alpha=1.0):
        super().__init__()
        # Stand-in for the pretrained weight W; frozen, so only A and B are trained
        self.weight = nn.Parameter(torch.randn(out_features, in_features),
                                   requires_grad=False)
        # A starts small and random, B starts at zero, so the update B @ A is zero at init
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))
        self.alpha = alpha
        self.rank  = rank

    def forward(self, x):
        base = x @ self.weight.T                      # frozen path: x W^T
        lora = x @ self.A.T @ self.B.T                # low-rank path: x A^T B^T
        return base + (self.alpha / self.rank) * lora

# Multi-head attention - complete module
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_k     = d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        # Project and reshape to (B, H, T, d_k)
        Q = self.W_q(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn   = F.softmax(scores, dim=-1)
        out    = attn @ V                              # (B, H, T, d_k)
        # Merge heads and project
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.W_o(out)

6.3 Debugging Questions

This training loop is producing NaN losses after 200 steps. What are the most likely causes and how do you diagnose them?
Training loss is decreasing but validation loss is flat. What is happening and what would you change?
This attention implementation produces shapes (B, H, T, T) but the output is wrong. Find the bug.
Gradient norms are exploding after 50 steps. What would you check first?
A PyTorch model trains correctly on CPU but produces wrong outputs on GPU. What could cause this?

6.4 ML Implementation Checklist

Full transformer architecture from scratch - reference nanoGPT
Causal self-attention, cross-attention, and multi-head attention
Flash Attention - algorithm-level understanding and implementation rationale
Attention backward pass from first principles
MLP forward pass and manual backpropagation
LoRA implementation (low-rank decomposition of weight updates)
Training loop with gradient clipping in PyTorch or JAX
Debugging: gradient explosion, NaN losses, shape mismatches, training loop bugs
RLHF pipeline: reward model training, PPO loop, KL penalty
Diffusion model forward process (noise schedule) and reverse process (DDPM/DDIM)

7. Deep Learning Theory Interview Questions

Deep learning questions test both conceptual understanding and the ability to reason from first principles. Interviewers at top AI labs expect derivations, not just definitions.

7.1 Architecture and Training

Explain the residual connection in ResNets. Why does adding the identity shortcut help training?
Derive the backpropagation update rule for a two-layer neural network with cross-entropy loss.
Explain why weight initialisation matters. Describe Xavier and He initialisation and when each is used.
What is the difference between batch normalisation and layer normalisation? Why do transformers use layer norm?
Explain how dropout acts as a regulariser. What is it approximating at test time?
What causes the exploding and vanishing gradient problems? How do gradient clipping and residual connections address them?
Explain the difference between RNNs, LSTMs, and GRUs. What limitation of RNNs do LSTMs solve?
Explain convolutional neural networks. What is the inductive bias that makes them effective for image data?
What is the difference between online, mini-batch, and full-batch gradient descent? What are the tradeoffs?

7.2 Optimisation

Derive the Adam update rule. What do the first and second moment estimates represent?
What is the learning rate warmup and why is it important for transformer training?
Explain cosine annealing with warm restarts. When is it preferred over a fixed learning rate?
What is gradient accumulation? When and why would you use it?
Explain mixed precision training (BF16 vs FP16 vs FP32). What are the stability differences?
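For the Adam question, a scalar version of the update is often enough to anchor the derivation. The sketch below uses the default hyperparameters from the original Adam formulation and treats a single parameter; the function name and the calling convention (state passed in and returned) are illustrative choices.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for zero initialisation
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# On the first step the update has magnitude close to lr regardless of gradient scale
p, m, v = adam_step(param=1.0, grad=10.0, m=0.0, v=0.0, t=1)
print(p)
```

That first-step behaviour is the property SGD with momentum lacks: the step size is normalised per parameter by the gradient's recent magnitude.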

7.3 Generative Modelling

Explain the VAE objective (ELBO). What is the role of the KL term and the reconstruction term?
Explain how GANs work. What is the minimax objective? What training instabilities can arise?
Explain diffusion models. What happens in the forward process? What does the model learn to predict in the reverse process?
What is DDIM? How does it differ from DDPM in sampling?
Explain flow matching. What is the ODE it solves and how does it relate to diffusion?
What is classifier-free guidance and why does it improve sample quality?
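For the diffusion questions, the forward process has a closed form that is worth being able to write down: x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε with ε ~ N(0, I). The sketch below assumes the linear β schedule from the DDPM paper (1e-4 to 0.02 over 1000 steps); the helper names are illustrative.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def alpha_bar(betas):
    """Cumulative product of (1 - beta_t): how much signal survives to step t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t from q(x_t | x_0) in one shot, without iterating over steps."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return x_t, noise   # the noise is the usual regression target for the model

alpha_bars = alpha_bar(linear_beta_schedule(1000))
x_t, eps = q_sample([1.0, -2.0, 0.5], t=500, alpha_bars=alpha_bars, rng=random.Random(0))
```

By the final step ᾱ_t is close to zero, so x_T is almost pure Gaussian noise, which is what lets the reverse process start from a standard normal sample.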

8. Transformer and LLM Interview Questions

Transformer architecture questions are among the most heavily tested in ML interviews at research-focused labs. Interviewers expect candidates to understand transformers at implementation level - not just at the level of calling nn.MultiheadAttention. See our companion Transformers Explained article for textbook-level derivations of each component.

8.1 Architecture Questions

Explain the scaled dot-product attention mechanism. Why is the scaling by √d_k necessary?
What is multi-head attention? What does each head learn to attend to?
Explain the difference between encoder-only, decoder-only, and encoder-decoder transformer architectures. Give examples of each.
What is causal (masked) self-attention? Why is it necessary for autoregressive language models?
Explain cross-attention. In what architectures is it used?
What is the role of the feed-forward sublayer in a transformer block? What is the typical expansion ratio?
Explain sinusoidal positional encodings. Why are they used rather than learned embeddings?
Explain RoPE (Rotary Position Embedding). What advantage does it have for handling longer context?
What is Flash Attention? Explain the tiling algorithm and why it reduces memory usage from O(n²) to O(n).
What is Transformer-XL? How does it handle long-range dependencies beyond a fixed context window?
What is the Griffin architecture? How does it combine recurrence and attention?

Figure 2. A single decoder-only transformer block - the unit interviewers most often ask candidates to implement from scratch (see ML Coding & Implementation).

8.2 LLM Training and Post-Training Alignment

Explain RLHF (Reinforcement Learning from Human Feedback) end-to-end. What are the three training stages?
What is the reward model in RLHF? How is it trained? What are its failure modes?
Explain PPO as used in RLHF fine-tuning. What does the KL penalty term do?
What is DPO (Direct Preference Optimisation)? How does it differ from PPO-based RLHF?
What is GRPO? How does it differ from PPO in the context of reasoning models?
Explain instruction tuning (SFT). What data format is used and what does it teach the model?
What is the Chinchilla scaling law? What was its key finding about compute-optimal training?
Explain KV cache. Why is it used during inference and what memory tradeoffs does it create?
What is speculative decoding? How does it improve inference throughput?
What is quantisation (INT8, INT4)? What accuracy tradeoffs does it introduce?
Explain context length extension techniques: RoPE/position interpolation, ALiBi, and YaRN.
What are hallucinations in LLMs? What are the known mechanisms and mitigations?
Explain retrieval-augmented generation (RAG). What problems does it solve and what are its limitations?

8.3 Training and Efficiency

What are scaling laws for transformer language models? What do they predict about the optimal balance of compute, data, and parameters?
Explain Mixture of Experts (MoE). How does conditional computation reduce cost per token?
What is LoRA? Derive why it reduces the number of trainable parameters during fine-tuning.
Explain tokenisation. What is BPE and why is it used for LLM vocabularies?
What are the standard decoding strategies for language models? Compare greedy, beam search, top-k, and nucleus sampling.

9. Reinforcement Learning Interview Questions

Reinforcement learning questions are heavily tested at labs whose work intersects with alignment, reasoning, robotics, and game-playing AI. Candidates with RL research backgrounds should expect depth; those from other areas should ensure solid coverage of fundamentals and the RLHF pipeline.

9.1 Foundations

Define a Markov Decision Process (MDP). What are the components (S, A, R, P, γ)?
What is the Bellman equation? Derive the Bellman optimality equation for Q-values.
Explain the difference between on-policy and off-policy learning. Give an example of each.
What is temporal difference (TD) learning? How does it differ from Monte Carlo methods?
Explain Q-learning. Why is it off-policy? What is its convergence guarantee?
Explain the policy gradient theorem. Derive the REINFORCE update rule.
What is the credit assignment problem? How do algorithms like GAE address it?
Explain the exploration vs exploitation tradeoff. What strategies exist beyond ε-greedy?

9.2 Modern Algorithms

Explain PPO (Proximal Policy Optimisation). What is the clipped surrogate objective and why is clipping necessary?
Explain GAE (Generalised Advantage Estimation). What does the λ parameter control?
Explain Soft Actor-Critic (SAC). What is the maximum entropy objective and what problem does it solve?
What is GRPO? How is it used in reinforcement learning for reasoning in LLMs?
Explain actor-critic methods. How do they combine policy gradient and value function learning?
What is importance sampling? How is it used in off-policy policy gradient methods?
Explain model-based RL. What are the tradeoffs compared to model-free approaches?
Explain MuZero. How does it perform planning without access to the true environment dynamics?
What is curriculum learning in RL? How is it used to train agents on complex tasks?

10. ML System Design Interview

ML system design rounds are more common in ML Engineer and Applied Scientist interviews than in pure Research Scientist tracks, but they appear across all roles at scale-focused labs. The format is open-ended: a problem is stated, and the candidate is expected to drive the design. Communication, reasoning about tradeoffs, and awareness of real-world constraints matter as much as technical depth.

10.1 Common ML System Design Questions

Design a system to train a large language model across hundreds of GPUs. What parallelism strategies would you use?
Design a recommendation system for a platform with 100M users and 10M items. How do you handle cold start?
Design an online learning system that updates a model in real time from user feedback. What are the key failure modes?
Design a retrieval-augmented generation (RAG) system for a legal document assistant. How do you handle retrieval latency?
Design a model serving infrastructure for a 70B parameter LLM with a P95 latency target of 500ms. What optimisations would you apply?
Design a distributed training pipeline that minimises communication overhead between nodes. Compare DDP, FSDP, and tensor parallelism.
You are training a diffusion model and GPU memory is the bottleneck. What techniques would you apply?
Design a data pipeline for pretraining a 7B parameter language model. How do you handle data quality, deduplication, and filtering at scale?

10.2 Key Distributed Training Concepts

Figure 3. How each parallelism strategy partitions a model across four GPUs. Full mechanics and tradeoffs are in the table below.

Strategy	What Is Parallelised	Memory Savings	Communication Overhead	Best Used When
Data Parallel (DDP)	Data batches across GPUs	None (model replicated)	Gradient AllReduce each step	Model fits on one GPU
Fully Sharded (FSDP)	Parameters, gradients, optimiser states	High - linear in world size	AllGather + ReduceScatter	Model too large for one GPU
Tensor Parallel	Individual weight matrices	High - linear in TP degree	AllReduce within each layer	Very large single layers (attention, FFN)
Pipeline Parallel	Model layers across stages	Moderate	P2P activation passing	Very deep models with many layers
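
The "Memory Savings" column can be made concrete with a back-of-envelope estimate. The sketch below is illustrative only: the function name is hypothetical, and it assumes mixed-precision Adam at roughly 16 bytes of model state per parameter (2 for BF16 weights, 2 for BF16 gradients, 12 for FP32 master weights and the two Adam moments). It ignores activations, buffers, and the parameters FSDP temporarily gathers during a layer's forward and backward pass.

```python
def model_state_gb(n_params: float, world_size: int, strategy: str) -> float:
    """Approximate per-GPU memory (GB) for weights, gradients, and optimiser state.

    Assumption: mixed-precision Adam at ~16 bytes per parameter
    (2 BF16 weights + 2 BF16 grads + 12 FP32 master weights and moments).
    Activations and communication buffers are NOT included.
    """
    bytes_per_param = 16
    total_bytes = n_params * bytes_per_param
    if strategy == "ddp":
        # Every rank holds a full replica of the model state.
        per_gpu_bytes = total_bytes
    elif strategy == "fsdp":
        # Full sharding: each rank holds 1/world_size of the model state.
        per_gpu_bytes = total_bytes / world_size
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return per_gpu_bytes / 1e9


if __name__ == "__main__":
    for strategy in ("ddp", "fsdp"):
        gb = model_state_gb(7e9, world_size=8, strategy=strategy)
        print(f"7B params, 8 GPUs, {strategy}: ~{gb:.0f} GB of model state per GPU")
```

Under these assumptions a 7B-parameter model needs about 112 GB of model state per GPU with DDP, more than a single 80 GB accelerator holds, and about 14 GB per GPU when fully sharded across eight devices, which is the kind of reasoning the "Best Used When" column summarises.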

11. Behavioural and Research Discussion Round

Two types of round appear here: classic behavioural (STAR-format conflict, feedback, and collaboration questions) and research discussion (interests, field vision, and opinions on recent work). Both are underestimated compared to technical rounds - prepare for each with the same rigour.

11.1 Behavioural Questions (STAR Format)

Tell me about a time you disagreed with a colleague or manager on a technical decision. How did you handle it?
Describe a situation where you had to deliver critical feedback to a peer. How did you approach it?
Tell me about a project that failed. What happened and what did you take away?
Describe a time you had to learn a new technical area quickly. How did you approach it?
Tell me about a time you had competing priorities. How did you decide what to focus on?
Describe your most significant collaborative research contribution. What was your specific role?

11.2 Research Discussion Questions

What open problem in ML do you find most interesting right now, and why?
What is your opinion on the current direction of scaling language models?
Walk me through your most significant research contribution. What made it non-obvious?
Pick a recent paper you found compelling. What would you do next to extend it?
How do you approach reproducing results from a paper that lacks sufficient implementation detail?
You have a negative result in your experiments. How do you decide whether to change the hypothesis or the method?

12. Understanding Compensation: RSUs vs Stock Options

Equity compensation is frequently misunderstood at the point of offer evaluation. The distinction between RSUs and stock options matters considerably - particularly under UK law and taxation - and candidates who do not understand the mechanics are consistently at a disadvantage during negotiation.

At a glance
	RSUs	Stock Options
What you get	Actual shares, on vesting	The right to buy shares at a fixed price
Typical at	Large, established companies	Early- to mid-stage AI startups
Value if stock falls	Still worth something	Can be worth nothing
Tax trigger	At vesting (ordinary income)	At exercise, then again at sale
Cash needed upfront	None	Often yes - to exercise and pay tax

RSUs (Restricted Stock Units) - typical at large tech companies such as Google DeepMind, Meta, and Microsoft - grant actual shares in the company on a vesting schedule. When they vest, shares can be sold or held. A portion - often around half - is automatically sold to cover income tax, because vested RSUs are treated as ordinary income.

Stock options - typical at AI startups - grant the right to purchase shares at a fixed strike price X, regardless of the market price Y at exercise. If Y > X, exercising and selling generates a profit. If Y < X, the options are worthless. In the UK, options granted under a qualifying Enterprise Management Incentive (EMI) scheme carry meaningfully better tax treatment than unapproved options - confirm with the offering company which type is being granted before modelling the numbers below.

Warning: Stock options typically expire 90 days after leaving the company. If the company is not yet public, shares cannot be sold after exercising - meaning the full purchase cost must be paid in cash, with income tax owed on the paper gain immediately, before any cash has been received.

To make the mechanics concrete, consider a UK stock option scenario:

Item	Value
Strike price (X)	£10 per share
Current valuation price (Y)	£50 per share
Options granted	10,000
Cost to exercise	£100,000
Paper gain subject to income tax	£400,000
Income tax owed (45% rate)	£180,000
Total cash outlay before seeing a penny	£280,000
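
The arithmetic in the table can be reproduced with a few lines of Python, which makes it easy to plug in the numbers from a real offer. This is a simplified sketch that mirrors the table only: the function name is hypothetical, and it assumes the whole paper gain is taxed as income at a single flat rate, ignoring National Insurance, EMI relief, and any other adjustments.

```python
def exercise_cash_outlay(strike: float, valuation: float,
                         n_options: int, income_tax_rate: float) -> dict:
    """Cash needed to exercise options and pay tax on the paper gain.

    Assumption: the full paper gain is taxed at one flat income tax rate,
    as in the table above. Real tax treatment depends on the scheme.
    """
    exercise_cost = strike * n_options
    # No taxable gain if the options are underwater (valuation below strike).
    paper_gain = max(valuation - strike, 0) * n_options
    tax_owed = paper_gain * income_tax_rate
    return {
        "exercise_cost": exercise_cost,
        "paper_gain": paper_gain,
        "tax_owed": tax_owed,
        "total_cash_outlay": exercise_cost + tax_owed,
    }


if __name__ == "__main__":
    result = exercise_cash_outlay(strike=10, valuation=50,
                                  n_options=10_000, income_tax_rate=0.45)
    for item, value in result.items():
        print(f"{item}: £{value:,.0f}")
```

With the table's inputs this gives £100,000 to exercise, a £400,000 paper gain, £180,000 of tax, and £280,000 of total cash outlay.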

When a recruiter quotes a total compensation figure that includes startup equity, apply a significant mental discount to that number. Cashless exercise options and liquidity events can reduce this burden, but each new funding round dilutes existing shares, and liquidity events typically value stock below the official company valuation.

13. How to Negotiate a Machine Learning Job Offer

Companies consistently have more room to improve initial offers than candidates typically expect. Negotiating is always worth attempting - the downside is essentially zero, and material increases of 10–30% on total compensation are common when handled correctly.

The "blind auction" approach - not revealing competing offers to preserve leverage - frequently does not work in practice at top AI labs. Many companies explicitly request proof of competing offers before authorising increases, and some will verify the details. Sharing offers openly, when asked, is usually the more effective strategy.

"Recruiters are skilled at reading genuine preferences from small signals - how often a candidate mentions a company, the tone they use when discussing it. If a recruiter knows their company is already the top choice, negotiating leverage is significantly reduced."

13.1 Practical Negotiation Notes

Offer deadlines typically run from one to two weeks, sometimes longer, and are not always flexible - though some companies will extend for the right candidate.
Companies maintain historical data on candidate decisions. A competing offer from a peer lab (OpenAI, Anthropic, Google DeepMind, Meta AI) carries genuine weight because the data shows it is a real alternative.
An offer from a company that candidates rarely choose over the target lab does not create meaningful leverage, regardless of the headline number.
Negotiate base salary, signing bonus, and RSU grant separately - each has different flexibility.
Tell every company about other active processes. This is standard practice, keeps timelines clear, and prompts companies to move faster when they are interested.

14. How to Choose Between Machine Learning Job Offers

Watch for this trap: The temptation to accept an early offer out of fear that nothing better will materialise is a predictable psychological pattern - particularly for candidates who have not been through the process before. Better offers do exist further along the process, and accepting prematurely closes those options. See Negotiation for how to buy time on a deadline without burning goodwill.

14.1 Factors Worth Weighting Heavily

Research agenda alignment: Can you see yourself excited by this team's direction in two years, not just today?
Team quality: Speak with at least two team members who are not the hiring manager before deciding.
Compute access: For research roles, GPU and TPU access is a direct constraint on what research is possible.
Equity mechanics: Understand whether equity is RSUs or options before signing - see Compensation for the mechanics.
Publication norms: Some labs publish aggressively; others are more closed. Know which environment suits you.
Location and life: Career decisions that work against life outside of work compound negatively over time.

Whose opinion to weight most: The most reliable input at the final stage is the perspective of people who know the candidate well - not the people at the companies being evaluated, who are predictably biased toward their own employers. Talking through the decision with trusted individuals who understand your actual values and goals tends to surface clearer answers than further research on the companies themselves.

15. Preparing Mentally for the ML Interview Process

The machine learning interview process at top AI labs is genuinely demanding, and its cumulative effect on mental resilience should not be underestimated - even for candidates who handle pressure well in other contexts.

15.1 Practical Strategies for Sustaining Performance

Sleep: Sleep disruption the night before high-stakes interviews is common and compounds significantly across multiple same-week interviews. Prioritise a consistent sleep routine in the weeks before the process begins - not under pressure on the eve of an interview.
Physical preparation: Exercise before interviews - particularly cardiovascular activity - reduces nervous energy and resets cognitive state. Keep intensity easy and eat sufficient carbohydrates beforehand.
Social connection: Isolation during an intense interview period is counterproductive. Structure at least some social time on evenings without early-morning interviews.
Pre-interview ritual: A consistent pre-interview routine provides a reliable anchor against unpredictable anxiety. The specific content matters less than the consistency.

Keep this in view: A candidate's worth as a researcher is not determined by these interviews. The process contains enough randomness that a single poor round - even on a well-known topic - does not constitute meaningful evidence of ability (see the Key Insight in the introduction). Emotional preparation is most effective when done before the process begins rather than discovered under fire.

15.2 Recommended Reading on Mindset and Performance

Book	Why It Helps
The Now Habit - Neil Fiore	Practical approach to procrastination and performance anxiety rooted in psychological research. Particularly useful for candidates who procrastinate on interview preparation.
Mindset - Carol Dweck	The foundational text on growth versus fixed mindset. Directly applicable to navigating rejection during an extended interview process.
The Gifts of Imperfection - Brené Brown	Useful for the underlying work of decoupling self-worth from professional outcomes - which becomes important when facing multiple simultaneous rejections.

16. Full Machine Learning Interview Topic Reference

The topics below represent a comprehensive review list compiled across a complete interview process at top AI labs. Nearly every topic on this list appeared in at least one interview in some form.

Category	Topics
Reinforcement Learning	Q-Learning / TD Learning, Bellman Equations, PPO, GRPO, GAE, Variance Reduction, DPO, Policy Gradient Theorem, On-Policy vs Off-Policy, Exploration vs Exploitation, Credit Assignment, MuZero / World Models, AlphaGo / AlphaZero, SAC, Model-Based vs Model-Free, MDP, Monte Carlo vs TD, Actor-Critic, SARSA, Importance Sampling, Curriculum Learning
Large Language Models	Flash Attention, LoRA, Transformer-XL, Griffin, Perceiver, Scaling Laws & Chinchilla, Mixture of Experts, RoPE, Sinusoidal Embeddings, Relative Positional Embeddings, LLM vs RNN vs S4 vs Mamba, Tokenisation (BPE), Pretraining & SFT, RLHF, Decoding Techniques, Causal Attention, Cross Attention, KV Cache, Speculative Decoding, Quantisation
Generative Modelling	GANs, VAEs and ELBO, Score Function / Score Matching, Diffusion Forward Process, Diffusion Reverse Process (DDIM / DDPM), Diffusion SDE, Flow Matching ODE, Classifier-Free Guidance
Distributed Systems	Tensor Parallelism, FSDP, DDP, Pipeline Parallelism, AllReduce / AllGather, Mixed Precision (BF16), Gradient Checkpointing, Gradient Accumulation & Clipping, Numerical Precision, JIT Compiling, JAX / PyTorch / TensorFlow
General ML	Curse of Dimensionality, S4 / CNNs / RNNs / LSTMs, Autoencoders, Gumbel-Softmax, MLE vs MAP, Newton's Method, Linear Regression, Activation & Loss Functions, No Free Lunch Theorem, BatchNorm / LayerNorm / RMSNorm, Adam / AdamW / Adagrad, Bias-Variance Tradeoff, Backpropagation, Regularisation (L1, L2, Dropout), Clustering, KNN, SVMs, Boosting, Decision Trees, Bayes Theorem, Precision / Recall / F1 / AUC-ROC, KL / JS Divergence, Xavier / He Init, Overfitting / Underfitting, Transfer Learning, Few-Shot / Zero-Shot
Linear Algebra	PSD Matrices, Jacobian & Hessian, Eigenvectors / Eigenvalues, Matrix Inverse, Dot Product, Null Space / Image Space, Orthogonality, Linear Independence, Singular Matrices, Rank / Span, Determinant, SVD

17. What to Do Differently: Lessons from a Completed ML Interview Process

Even in a successful process - one that concluded with multiple competing offers from frontier AI labs and AI-focused companies - certain approaches would have produced better outcomes. These are the highest-value changes in retrospect.

Track everything in a spreadsheet from day one. Missing applications to genuinely interesting companies because of tracking failures is avoidable and frustrating. Use Notion or a simple Google Sheet with columns for company, current stage, deadlines, key contacts, and notes from each round.
Prepare emotionally before the process starts, not during it. The interview process has a way of feeling like a verdict on years of research. Developing a healthier relationship with failure and professional setbacks in advance is significantly more effective than attempting this work mid-process.
Be proactive about companies that go silent. If an application has not resulted in a reply and the company is genuinely interesting, a cold email to someone on the team is a more effective response than passive waiting.
Start ML implementation practice earlier. The gap between knowing how attention works conceptually and being able to implement it cleanly under time pressure is larger than most researchers expect. Four weeks of implementation practice is the minimum; eight weeks is better.
Sequence interviews by preference. Start with lower-priority companies to build calibration and confidence before processes that actually matter reach their final stages.
Manage timing deliberately. The goal is to have multiple offers arrive in the same window. If Company A offers an on-demand test, hold it until Company B's first interview is scheduled. Process timing is difficult to control precisely, but deliberate management is possible and meaningfully changes negotiation leverage.

Recommended Resources for ML Interview Preparation

Type	Resource	What It Covers
Book	Designing ML Systems - Chip Huyen	Applied ML, system design, and ML fundamentals at interview depth. Highlight as you read - it doubles as a flashcard source.
Book (Online)	The JAX Scaling Book	Distributed training, parallelism strategies (DDP, FSDP, tensor parallelism, pipeline parallelism), and large-scale ML systems.
Book	Reinforcement Learning - Sutton & Barto	Only necessary if new to RL. Skim chapters 1–6 and focus on policy gradient methods for interview coverage.
Practice	NeetCode 150	Structured LeetCode practice covering all core algorithmic patterns. Start here before any lab interview.
Practice	DeepML	ML coding practice problems including attention, backpropagation, and training loop exercises.
Code Reference	nanoGPT - Karpathy	Clean, minimal transformer implementation in PyTorch. Read and re-implement from scratch as a preparation exercise.
Course	OpenAI Spinning Up	Practical deep RL introduction covering policy gradient, PPO, SAC, and TRPO with working implementations.

Frequently Asked Questions

Most candidates need 6–12 weeks of structured preparation. Spend the first 2 weeks on algorithms coding (LeetCode Mediums), the next 3–4 weeks on ML implementation and theory flashcards, and the final 2–3 weeks on company-specific mock interviews and ML system design. Candidates with weaker coding foundations should add 2–4 weeks of pure LeetCode practice first. Starting preparation before sending any applications is strongly recommended.
Transformer architecture and attention mechanisms - scaled dot-product attention, multi-head attention, causal masking, Flash Attention - are tested at virtually every lab. LLM training and alignment (RLHF, SFT, DPO, PPO, GRPO) is the second most consistently covered area. Deep learning fundamentals (backpropagation derivations, optimisers, normalisation schemes) and distributed training (DDP, FSDP, tensor parallelism) round out the most heavily tested themes.
A PhD is not strictly required, but it is the default expectation at most top labs (DeepMind, Anthropic, OpenAI, Meta AI) for Research Scientist roles. A strong publication record - 3 or more first-author papers at ICLR, NeurIPS, or ICML - can substitute in practice. ML Engineer roles are more accessible without a PhD but still demand strong implementation skills. Senior RS roles almost universally expect a PhD or equivalent research output.
The ML coding round is consistently identified as the most difficult by candidates who have not practised implementation. Writing a full transformer decoder from scratch - with correct tensor shapes throughout, causal masking, residual connections, and layer normalisation - under 45-minute time pressure is substantially harder than understanding the architecture conceptually. The gap between knowing how attention works and implementing it cleanly under time pressure is real and requires dedicated practice to close.
Complete at least Blind 75 and approximately 75 additional NeetCode Mediums, for a total of around 150 Medium problems. Hard problems appear occasionally but are rare; fluency with every Medium pattern is more valuable than sporadic Hard attempts. The core patterns tested are DFS/BFS, dynamic programming, binary search, two pointers, sliding window, and backtracking. Target 20 minutes or fewer per Medium as your benchmark.
The most commonly asked transformer questions are: explain scaled dot-product attention and why scaling by √d_k is necessary; what does multi-head attention provide that single-head cannot; explain causal masking for autoregressive language models; compare sinusoidal vs learned vs RoPE positional encodings; explain the Flash Attention tiling algorithm and its O(n) memory guarantee; explain Mixture of Experts and conditional computation; and describe the encoder-only / decoder-only / encoder-decoder architectures with concrete examples.
RLHF (Reinforcement Learning from Human Feedback) is the three-stage post-training alignment pipeline: (1) supervised fine-tuning on high-quality demonstrations, (2) reward model training on human preference comparison pairs, and (3) RL fine-tuning - typically PPO - against the reward model with a KL divergence penalty to prevent the policy from drifting too far from the SFT baseline. Interviewers expect end-to-end explanations of each stage, awareness of reward hacking failure modes, and the ability to compare PPO-based RLHF with DPO.
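
The KL penalty in stage (3) can be illustrated with a small sketch. This shows one common formulation, not the exact objective used by any particular lab: the function name and the default beta are hypothetical, and the KL term is a single-sample estimate built from the log-probabilities the policy and the frozen SFT reference assign to the sampled response tokens.

```python
def kl_shaped_reward(reward_score: float,
                     policy_logprobs: list[float],
                     ref_logprobs: list[float],
                     beta: float = 0.05) -> float:
    """Reward-model score minus a KL penalty against the SFT reference.

    Assumption: beta and the sequence-level summation are illustrative
    choices; implementations often apply the penalty per token instead.
    """
    # Sum of per-token log-ratios: a single-sample estimate of
    # KL(policy || reference) for this sampled response.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward_score - beta * kl_estimate


if __name__ == "__main__":
    # The policy assigns higher probability to its own sample than the
    # reference does, so the KL estimate is positive and the reward shrinks.
    shaped = kl_shaped_reward(1.0, [-0.5, -0.2], [-1.0, -0.7])
    print(f"shaped reward: {shaped:.3f}")
```

When the policy matches the reference the penalty is zero and the reward-model score passes through unchanged. The further the policy drifts towards responses the reference considers unlikely, the more reward it gives up, which is what discourages reward hacking far from the SFT distribution.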
Yes. Companies consistently have more room to improve initial offers than candidates expect - particularly on base salary, signing bonus, and RSU grant. The downside risk of negotiating respectfully is essentially zero. Increases of 10–30% on total compensation are achievable when you hold competing offers and communicate genuine interest clearly. Never accept the first number without at least asking whether there is flexibility.
Research Scientists at top labs are expected to independently identify research problems, design experiments, and contribute publishable findings. The interview focuses heavily on research depth, publication record, and intellectual curiosity. ML Engineers focus on building, scaling, and productionising ML systems. The interview is more implementation and systems-heavy, with greater emphasis on ML system design, distributed training, and software engineering. Many labs hire both tracks and the distinction narrows at senior levels.
ML system design rounds are open-ended: a problem is stated (e.g. "design a recommendation system for 100M users") and the candidate drives the discussion. The interviewer evaluates problem decomposition, awareness of real-world constraints, ability to reason about tradeoffs between approaches, and communication quality. Common questions include: designing training infrastructure for large models, recommendation and retrieval systems, RAG pipelines, online learning systems, and model serving for large LLMs with latency SLAs.
Flash Attention is an IO-aware implementation of scaled dot-product attention that processes the computation in tiles small enough to fit in on-chip GPU SRAM, so the full N×N attention matrix is never materialised in HBM (GPU main memory). Standard attention needs O(N²) memory to store that matrix; Flash Attention reduces the additional memory to O(N) by computing the softmax incrementally, keeping a running maximum and running sum per query row, and fusing it with the output computation. The result is a 2–4× speedup and memory that grows linearly rather than quadratically with sequence length, with outputs that are exact rather than approximate (up to floating-point rounding). It is important in interviews because it appears on almost every ML coding and theory question list at research labs.
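
The idea that makes tiling possible is the online softmax: the output for a query can be accumulated one tile of keys at a time, rescaling earlier partial results whenever a larger score appears. The pure-Python sketch below illustrates only that rescaling step for a single query row. The function names and tile size are hypothetical, and it is not the fused GPU kernel, which also tiles over queries and manages SRAM explicitly.

```python
import math


def attention_row(q, keys, values):
    """Reference attention for one query: materialises every score at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(dim)]


def attention_row_tiled(q, keys, values, tile=2):
    """Same result, but only `tile` scores exist at any one time."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator relative to running_max
    acc = [0.0] * dim         # unnormalised weighted sum of values
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        # On the first tile this is exp(-inf) = 0.0, which is harmless.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]


if __name__ == "__main__":
    q = [0.1, -0.3, 0.7, 0.2]
    keys = [[0.5, 0.1, -0.2, 0.3], [1.0, -1.0, 0.0, 0.4],
            [-0.6, 0.2, 0.9, -0.1], [0.3, 0.3, 0.3, 0.3],
            [2.0, -0.5, 1.5, 0.0]]
    values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 2.0], [3.0, -3.0]]
    full = attention_row(q, keys, values)
    tiled = attention_row_tiled(q, keys, values, tile=2)
    print("max abs difference:", max(abs(a - b) for a, b in zip(full, tiled)))
```

The two functions agree up to floating-point rounding for any tile size, which is the sense in which the tiled computation is exact.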
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that freezes the pretrained weight matrix W ∈ R^(d×k) and adds a low-rank update ΔW = BA, where B ∈ R^(d×r) and A ∈ R^(r×k) with rank r ≪ min(d, k). During fine-tuning, only A and B are trained, reducing the number of trainable parameters from d×k to r×(d+k). For a square projection with d = k = 4096 and r = 8, that is 16,777,216 parameters reduced to 65,536, a ~256× reduction. At inference, the update can be merged back into W (as W + BA) with no added latency.
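
The parameter count above is simple enough to verify directly, and interviewers sometimes ask for exactly this calculation. A minimal sketch, with a hypothetical function name:

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int, float]:
    """Trainable parameters for full fine-tuning vs LoRA on one d x k matrix."""
    full = d * k          # every entry of W is trainable
    lora = r * (d + k)    # B is d x r, A is r x k
    return full, lora, full / lora


if __name__ == "__main__":
    full, lora, ratio = lora_param_counts(d=4096, k=4096, r=8)
    print(f"full: {full:,}  lora: {lora:,}  reduction: {ratio:.0f}x")
```

For d = k = 4096 and r = 8 this gives 16,777,216 against 65,536, a 256× reduction. The ratio simplifies to d·k / (r·(d+k)), so for square matrices it is d / (2r) and doubling the rank halves the saving.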
PPO (Proximal Policy Optimisation) is an online RL algorithm that trains the language model policy against a separately trained reward model, using a clipped surrogate objective and a KL penalty to prevent instability. DPO (Direct Preference Optimisation) re-parameterises the RLHF objective to eliminate the need for an explicit reward model and RL loop: it directly optimises the policy on preference pairs using a binary cross-entropy loss. DPO is simpler to implement and more stable, but it is offline and may be less effective when the preference data distribution differs from the model's current distribution.
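
The DPO loss for a single preference pair is short enough to write out, which helps when explaining why no reward model or RL loop is needed. The sketch below assumes the four inputs are summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model. The function names and the default beta are illustrative.

```python
import math


def neg_log_sigmoid(x: float) -> float:
    """Numerically stable -log(sigmoid(x)), i.e. softplus(-x)."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Binary cross-entropy on the implicit reward margin for one pair.

    Assumption: beta = 0.1 is only an illustrative default.
    """
    # Implicit rewards are beta-scaled log-ratios against the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    return neg_log_sigmoid(margin)


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0 and the loss is log(2).
    print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
    # Policy has shifted probability towards the chosen response: lower loss.
    print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

Minimising this loss raises the policy's log-probability of the chosen response relative to the reference and lowers it for the rejected one. The log-ratios play the role that the reward model's scores play in PPO-based RLHF.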
The four primary paradigms are: Data Parallel (DDP) - each GPU holds a full model copy and gradients are AllReduced; Fully Sharded Data Parallel (FSDP) - parameters, gradients, and optimiser states are sharded across GPUs for large-model training; Tensor Parallelism - individual weight matrices are split across GPUs (requiring AllReduce within each layer); and Pipeline Parallelism - model layers are split across stages with micro-batch interleaving to hide bubble overhead. Know when to use each and the communication primitives each relies on (AllReduce, AllGather, ReduceScatter, P2P). See the ML System Design section for a side-by-side comparison table.
A general SE interview at most tech companies is almost entirely algorithms coding plus a system design round. An ML research scientist interview adds three rounds that SE loops do not have: ML coding (implementing model components from scratch, not calling library functions), ML theory (derivations, not definitions), and a research discussion round where a senior researcher probes the depth and originality of past work. The algorithms coding bar is usually similar or slightly lower than at FAANG-style SE interviews, but the combined technical surface area is larger. ML Engineer interviews sit in between - heavier on ML system design and production concerns, lighter on open-ended research discussion.
Write down what went wrong while it is fresh - the specific question, where the explanation broke down, and what the correct approach was - then close the laptop and stop reviewing it for the rest of the day. Rehashing a single round rarely improves the next one and reliably damages confidence going into it. A single weak round is not a reliable predictor of the final decision: interview panels typically weight the round average and look for consistent strengths, not a single low score in isolation. If a debrief or feedback call is offered, take it - specific, dated feedback is one of the few high-signal artifacts available during the process.
Take-home assignments are less common at the largest labs (DeepMind, OpenAI, Anthropic, Meta AI) and more common at mid-size AI companies and startups, where they often replace a live ML coding round entirely. Typical formats are a small modelling task with a provided dataset and a write-up, or a focused implementation exercise (e.g., implement and train a small transformer on a toy dataset) with a time budget of a few hours to a few days. Treat the write-up with the same care as the code - reviewers read it as a proxy for how the candidate would communicate results to a research team. Clarify the time budget and evaluation criteria before starting if they are not specified.