AI Engineering & Prompt Engineering Interview Questions
Reviewed by Mark Dickie · Last updated
Prompt engineering is the practice of designing and structuring inputs to large language models (LLMs) to reliably produce correct, safe, and useful outputs. Interviews in this area test whether you can reason about model behavior, not just write clever prompts. Expect questions on prompting techniques like chain-of-thought and few-shot learning, on how to build and evaluate LLM-backed systems, and on failure modes such as hallucination, prompt injection, and context-window constraints. A solid candidate knows both why a technique works and when to reach for a different one.
Core prompting techniques you should know
The table below maps each technique to its primary use case and a key trade-off interviewers often probe.
| Technique | When to use it | Key trade-off | |---|---|---| | Zero-shot prompting | Quick tasks where the model generalizes well | No examples to anchor the format; output can drift | | Few-shot prompting | Tasks needing a specific output format or tone | Token cost rises with each example added | | Chain-of-thought (CoT) | Multi-step reasoning: math, logic, planning | Slower and more expensive per call; can still hallucinate steps | | Retrieval-Augmented Generation (RAG) | Grounding answers in current or private data | Adds retrieval latency; retrieval quality sets a ceiling on answer quality | | System prompt / role prompting | Setting persistent behavior and persona constraints | Model may ignore or "leak" the system prompt under adversarial input | | Self-consistency | High-stakes reasoning where one sample is risky | Requires multiple completions; multiplies cost and latency |
How to structure your preparation
Work through these areas in order — each builds on the one before it.
- Model fundamentals. Know how autoregressive generation works, what temperature and top-p actually control, and why token limits matter architecturally. You cannot debug a pipeline you do not understand at this level.
- Prompting patterns. Be able to write and critique zero-shot, few-shot, and chain-of-thought prompts. Practice explaining why a reformulation improves results, not just showing that it does.
- Pipeline design. Understand how RAG, tool use, and multi-step agent loops fit together. Interviewers at AI-engineering-focused companies often ask you to whiteboard a retrieval pipeline or an agent with memory.
- Evaluation and safety. Know at least two ways to measure LLM output quality (e.g., LLM-as-judge vs. human eval), and be ready to discuss prompt injection and jailbreaking as real engineering risks, not abstract concerns.
- Cost and latency trade-offs. Be prepared to compare approaches on token usage and response time — this separates candidates who have shipped production systems from those who have only experimented in notebooks.
Difficulty in this quiz ranges from entry-level definitions (level 1) up to system-design and failure-mode analysis (level 5), so the questions ahead will stretch across all five tiers.
At a glance
| Questions | 15 |
|---|---|
| Difficulty | 2–5 of 5 |
| Formats | Multiple choice, Multiple answer, True / false |
What you'll review
- prompt engineering
- sampling temperature
- embeddings
- rag basics
- structured output
- tool calling
- prompt injection
- retrieval quality
- latency cost
- hallucination
- mcp
- tokens context
- llm caching
Practice questions
AI Engineering/prompting/prompt-engineering
Adding the phrase "Think step by step" (or an equivalent chain-of-thought cue) to a prompt most reliably helps with which class of task?
Options
- Retrieving a memorized fact, such as the capital of a country
- Multi-step reasoning tasks — arithmetic, logic, and compositional problems that require several intermediate inferences
- Generating a short, creative product name with no reasoning required
- Transcribing speech to text accurately
Show answer
Chain-of-thought prompting most reliably helps with multi-step reasoning tasks — arithmetic, logic, and compositional problems that require several intermediate inferences. By making the model externalize its reasoning before committing to an answer, it gains scratch-pad space to decompose problems where a wrong intermediate step would otherwise cascade into a wrong answer. It adds nothing to pure fact recall, creative naming with no reasoning steps, or speech transcription.
Chain-of-thought (CoT) prompting works by making the model externalize its reasoning tokens before committing to an answer, which gives it "scratch-pad" space to decompose problems that require multiple inference steps — math, logic puzzles, multi-hop questions (b). It shows the largest gains on tasks where a wrong intermediate step would otherwise cascade silently into a wrong final answer. On pure fact recall (a) the answer either lives in the model's weights or it doesn't; CoT adds tokens without changing what is remembered. Creative naming (c) does not have intermediate logical steps to unroll. Speech transcription (d) is not a text-in/text-out reasoning task at all.
AI Engineering/prompting/prompt-engineering
In the messages API (system / user / assistant roles), what is the primary purpose of the system prompt vs. the user turn?
Options
- The system prompt is encrypted and never seen by the model; the user turn is where all instructions must go
- The system prompt sets persistent persona, constraints, and task framing that apply to the whole session; the user turn carries the specific per-request input from the end user
- They are identical in semantics; the split is only cosmetic so the UI can display them in different colors
- The system prompt is processed after the user turn so it can override anything the user says
Show answer
The system prompt sets persistent persona, constraints, and task framing that apply to the whole session; the user turn carries the specific per-request input from the end user. The developer controls the system prompt to set the model's role, tone, hard constraints, and background context that frames every message. The model sees both — the system prompt is not hidden, encrypted, or merely cosmetic, and it precedes the user turn rather than overriding it by being processed last.
The system prompt is a privileged, persistent context that the developer controls: it sets the model's role, tone, hard constraints, tool definitions, and any background context that should frame every user message in the session (b). The user turn carries what the end user typed — the per-request input. The model sees both; the system prompt is not hidden (a). They are semantically distinct: system content is treated as developer intent and carries higher implicit trust in most architectures (c is wrong). Ordering is fixed — system precedes user in the token stream; it does not override by being processed last (d).
AI Engineering/prompting/prompt-engineering
Research on few-shot prompting shows models are sensitive to the order and recency of examples in the context. Which bias is most consistently documented?
Options
- Primacy bias: the first example in the list always dominates, regardless of how many follow
- Recency bias: the model disproportionately reflects the style or label of the examples nearest to the query (the last ones)
- Middle bias: examples placed in the middle of a long list carry the most weight because they are farthest from both boundaries
- No positional effect: models weight each example identically regardless of where it appears
Show answer
The most consistently documented effect is recency bias: the model disproportionately reflects the style or label of the examples nearest to the query, the last ones. This is why few-shot example order matters and why practitioners often place the most representative example last. Primacy bias describes a different multi-choice phenomenon, the middle of a long context is often where models attend least (lost-in-the-middle), and positional invariance is empirically false for current models.
Multiple studies on large-language-model in-context learning have found a recency (or end-of-context) bias: examples positioned closest to the query tend to have an outsized influence on the output's label or style (b). This is why the order of few-shot examples matters and why practitioners often put the most representative or desired-format example last. Primacy bias (a) describes an effect in multi-choice settings where the first listed option gets over-selected — a different phenomenon, not specific to few-shot example ordering. The "middle" of a long context is often where models attend least (sometimes called the lost-in-the-middle effect), making (c) the opposite of the documented finding. Positional invariance (d) is empirically false for current models.
AI Engineering/llm-foundations/sampling-temperature
You are building a feature that extracts structured fields from invoices and must return the same output for the same input every time. Which sampling change moves you toward that goal?
Options
- Raise
temperaturetoward 1.0 to give the model more flexibility - Set
temperatureto 0 (and keep the prompt fixed) - Increase
top_pto 0.99 so more tokens are considered - Raise the
frequency_penaltyso tokens are not repeated
Show answer
Set temperature to 0 and keep the prompt fixed. At temperature 0 the model becomes near-greedy, picking the highest-probability token at each step, which is what you want for reproducible extraction. Raising temperature, widening top_p, or adding a frequency_penalty all increase variability instead of removing it. Note one honest caveat: GPU floating-point quirks and model updates can still cause small variation, so treat this as low variance, not a hard guarantee.
temperature scales the logits before sampling; at 0 the model becomes (near) greedy — it picks the highest-probability token at each step — which is what you want for deterministic, reproducible extraction. Note the honest caveat: even at temperature 0 you can still see small variation in practice from non-deterministic floating-point reduction order on GPUs, batching, and model updates, so treat it as low variance, not a hard guarantee. Raising temperature (a) does the opposite — it flattens the distribution and increases randomness. Raising top_p (c) widens the nucleus of candidate tokens, again adding variability rather than removing it. frequency_penalty (d) only discourages token repetition; it changes what is generated but does nothing for determinism.
AI Engineering/llm-foundations/embeddings
In a semantic search system, you embed a query and compare it to document embeddings with cosine similarity. What does cosine similarity actually measure?
Options
- The straight-line (Euclidean) distance between the two vectors
- The angle between the two vectors, ignoring their magnitudes
- The number of dimensions the two vectors share exactly
- The token overlap between the two original texts
Show answer
Cosine similarity measures the angle between two vectors, ignoring their magnitudes. It is the dot product divided by the product of the magnitudes, so it depends only on direction, not length. That is why it suits embeddings: two passages on the same topic point the same way regardless of how long each text is. It is not Euclidean distance, shared-dimension counting, or raw token overlap.
Cosine similarity is the cosine of the angle between two vectors: it is the dot product divided by the product of the magnitudes, so it depends only on direction, not length. That is why it works well for embeddings — two passages about the same topic point the same way regardless of how long each text is. Euclidean distance (a) is a different metric that is sensitive to magnitude (though on length-normalized vectors the two ranking orders coincide). "Shared dimensions" (c) is not a real similarity measure for dense embeddings, whose dimensions are not independently interpretable. Token overlap (d) describes lexical/keyword matching (e.g. BM25), which is exactly what dense embeddings are meant to go beyond.
AI Engineering/rag/rag-basics
Your support bot must answer from a knowledge base that changes daily and must cite the source document for each answer. Which approach is the most appropriate primary strategy?
Options
- Fine-tune the base model nightly on the latest knowledge base
- Retrieval-augmented generation (RAG): retrieve relevant docs at query time and pass them in context
- Put the entire knowledge base into the system prompt for every request
- Train a LoRA adapter once on a snapshot and reuse it indefinitely
Show answer
Use retrieval-augmented generation (RAG): retrieve the relevant documents at query time and pass them into the context. RAG is the right default when knowledge is large, changes frequently, and answers need attribution, because updates are just re-indexing and you can cite the chunks you retrieved. Nightly fine-tuning and one-time LoRA adapters bake facts into weights that go stale and cannot cite sources; stuffing the whole knowledge base into the prompt blows the context window and dilutes quality.
RAG is the right default when knowledge is large, changes frequently, and answers need attribution: you index the documents, retrieve the relevant chunks per query, and the model answers from them — so updates are just re-indexing, and you can cite the chunks you retrieved. Nightly fine-tuning (a) is slow, expensive, hard to attribute (the model can't reliably cite which training example produced an answer), and it bakes facts into weights where they go stale between runs. Stuffing the whole KB into the system prompt (c) blows the context window and cost, and degrades quality as irrelevant text dilutes the signal. A one-time LoRA on a snapshot (d) is immediately stale for daily-changing data and still can't cite sources. Fine-tuning earns its place for behavior/format/tone, not volatile facts.
AI Engineering/prompting/structured-output
You need the model to return JSON that always conforms to a specific schema (exact keys, types, and enums) so your downstream parser never crashes. Which mechanism gives the strongest guarantee?
Options
- Add "Respond only in JSON" to the prompt and parse the result
- Use the provider's constrained/structured-output feature that enforces a supplied JSON Schema (grammar-constrained decoding)
- Lower the temperature to 0 so the JSON is always identical
- Ask for JSON and retry up to three times on a parse error
Show answer
Use the provider's constrained or structured-output feature that enforces a supplied JSON Schema through grammar-constrained decoding. It restricts the token sampler at each step to tokens that keep the output valid against the schema, so the result is guaranteed parseable and schema-conformant by construction. A free-text instruction is best-effort, temperature 0 only makes a malformed output deterministically malformed, and retry-on-error is a reasonable fallback but offers no hard guarantee on any single attempt.
Provider structured-output features (e.g. JSON Schema-constrained decoding) restrict the token sampler at each step to tokens that keep the output valid against the schema, so the result is guaranteed-parseable and schema-conformant by construction. A free-text instruction (a) is best-effort: the model can still emit prose, trailing commas, or extra keys. Temperature 0 (c) only reduces variability — a deterministic output can be deterministically malformed. Retry-on-error (d) is a reasonable fallback and improves reliability, but it adds latency and cost and still has no hard guarantee on any single attempt; constrained decoding removes the failure mode at the source.
AI Engineering/agents/tool-calling
When an LLM "calls a tool" (function calling), what does the model itself actually produce in the API response?
Options
- The model executes the function on the provider's servers and returns the result
- A structured request naming the tool and its arguments; your application runs it and feeds the result back
- Raw Python that your runtime must
evalto get the answer - A natural-language description of the function it wishes existed
Show answer
The model produces a structured request naming the tool and its arguments; your application runs the tool and feeds the result back. Function calling is a protocol, not remote execution: given your tool definitions, the model decides whether to call one and emits structured arguments, then your code executes it and returns the result as a tool message for the next turn. The provider never runs your code, and the call is structured data, not raw code to eval.
Function/tool calling is a protocol, not remote execution: given your tool definitions (name, description, JSON-Schema parameters), the model decides whether to call one and emits a structured tool-call with arguments. Your code executes the tool, then sends the result back as a tool/function message so the model can use it in its next turn. The provider does not run your code (a) — it has no access to your database or APIs. The model returns structured arguments, not executable code to eval (c); eval-ing model output would be a serious injection risk. And it is a concrete, parseable call, not a vague wish (d).
AI Engineering/evaluation-safety/prompt-injection
Your agent reads untrusted web content and has tools that can read files and call internal APIs. Which of the following are genuine, meaningful mitigations against prompt injection?
Options
- Apply least privilege to tools and require human approval (or hard policy checks) for high-impact actions
- Clearly delimit untrusted content and instruct the model to treat it as data, not instructions
- Raise the temperature so the model is less predictable to attackers
- Validate and constrain tool outputs/arguments before executing them (allow-lists, schemas, sandboxing)
- Trust the system prompt to always win because it appears first
Show answer
The genuine mitigations are applying least privilege to tools with human approval for high-impact actions, clearly delimiting untrusted content and labeling it as data not instructions, and validating or sandboxing tool outputs and arguments before executing them. These are layered and architectural. Raising the temperature does nothing for safety, and trusting the system prompt to always win is false: there is no hard precedence boundary in the token stream, so crafted injections routinely override prepended instructions.
Effective defenses are layered and architectural. Least privilege plus human-in-the-loop for dangerous actions (a) limits blast radius even when an injection succeeds. Delimiting untrusted text and labeling it as data (b) helps the model resist hijacking — it is necessary but not sufficient on its own. Validating/sandboxing what tools receive and do (d) stops a hijacked call from causing real damage. Raising temperature (c) does nothing for safety; it just adds randomness and can make the system less reliable. "The system prompt always wins" (e) is false — there is no hard precedence boundary in the token stream, and crafted injections routinely override prepended instructions, which is exactly why you cannot rely on prompt ordering alone.
AI Engineering/rag/retrieval-quality
A RAG system returns confident but wrong answers because the retrieved chunks are often irrelevant. Which changes are legitimate levers to improve retrieval quality?
Options
- Add a reranker (e.g. a cross-encoder) over the top-k candidates before passing them to the model
- Tune chunk size and overlap so chunks are semantically coherent and self-contained
- Combine dense (vector) retrieval with sparse keyword search (hybrid retrieval)
- Raise the generation model's
max_tokensso it can write a longer answer - Switch to a stronger embedding model better matched to your domain
Show answer
The legitimate levers are adding a reranker such as a cross-encoder over the top-k candidates, tuning chunk size and overlap so chunks are coherent and self-contained, combining dense vector retrieval with sparse keyword search (hybrid retrieval), and switching to a stronger embedding model matched to your domain. Each improves which chunks reach the context. Raising the generation model's max_tokens only changes how long the answer may be — it does nothing about which documents were retrieved.
Retrieval quality is about getting the right chunks into context. A reranker (a) reorders the initial candidate set with a more expensive, more accurate cross-encoder so the best passages float to the top. Chunking strategy (b) directly affects whether a chunk contains a complete, embeddable idea rather than a fragment. Hybrid retrieval (c) catches cases where exact terms/IDs matter that dense vectors miss, and vice versa. A better-matched embedding model (e) improves the similarity signal at the source. Raising max_tokens (d) only changes how long the answer may be — it does nothing about which documents were retrieved, so it cannot fix retrieving the wrong material.
AI Engineering/ai-production/latency-cost
A chat feature feels slow and is expensive at scale. Which techniques are valid ways to reduce latency and/or cost in a production LLM application?
Options
- Stream tokens to the client to lower perceived latency (time-to-first-token)
- Route easy requests to a smaller/cheaper model and reserve the large model for hard ones
- Cache responses (or prompt prefixes) for repeated or near-identical requests
- Always pad every prompt with extra few-shot examples to be safe
- Trim unnecessary context and cap
max_tokensto what the task needs
Show answer
Valid levers are streaming tokens to lower perceived latency, routing easy requests to a smaller cheaper model while reserving the large model for hard ones, caching responses or prompt prefixes for repeated requests, and trimming unnecessary context while capping max_tokens to what the task needs. Each cuts cost, latency, or both. Padding every prompt with extra few-shot examples does the opposite: it inflates input tokens on every call and beyond a point adds no accuracy.
Streaming (a) doesn't change total compute but dramatically improves perceived speed by showing the first tokens immediately. Model routing / cascading (b) sends the bulk of easy traffic to a cheaper model, cutting average cost and latency while preserving quality on the hard tail. Caching (c) avoids paying for work you've already done. Trimming context and bounding output length (e) reduces both input and output tokens, which is where the bill and the time go. Indiscriminately padding every prompt with more few-shot examples (d) does the opposite — it inflates input tokens (cost and latency) on every call, and beyond a point adds no accuracy, so it is a regression, not an optimization.
AI Engineering/evaluation-safety/hallucination
You need to reduce hallucinations in a factual Q&A assistant. Which of the following meaningfully reduce or detect ungrounded answers?
Options
- Ground answers in retrieved sources and instruct the model to answer only from them
- Allow the model to respond "I don't know" when the context lacks the answer
- Require citations and verify that cited claims actually appear in the retrieved text
- Increase
temperatureto encourage more creative, detailed answers - Tell the model in the prompt to "never hallucinate and always be 100% accurate"
Show answer
What meaningfully reduces or detects ungrounded answers is grounding answers in retrieved sources with an instruction to answer only from them, allowing the model to say it does not know when the context lacks the answer, and requiring citations whose claims you verify against the retrieved text. These anchor the model to checkable evidence. Raising temperature makes confident fabrication more likely, and a bare instruction to never hallucinate is unenforceable wishful thinking.
Hallucination drops when the model is anchored to real evidence: retrieval-grounding with an instruction to answer only from context (a) constrains it to supported claims. An explicit "I don't know" escape hatch (b) gives the model a correct option other than fabricating, which it otherwise tends to avoid. Citation + verification (c) turns grounding into something you can check, catching claims that aren't actually supported. Raising temperature (d) increases randomness and makes confident fabrication more likely, not less. A bare instruction to "never hallucinate" (e) is wishful — the model has no reliable internal signal of its own factuality, so an unenforceable command doesn't change behavior in a measurable way; grounding and verification do.
AI Engineering/agents/mcp
Your team is evaluating the Model Context Protocol (MCP) for connecting LLM applications to tools and data. Which statements about MCP are accurate?
Options
- MCP is an open protocol that standardizes how applications expose tools, resources, and prompts to LLM clients
- It lets you build a tool/data server once and reuse it across any MCP-compatible client (host)
- An MCP server still runs with the privileges you grant it, so connecting an untrusted server is a real security and prompt-injection risk
- MCP replaces the need for the model to do function/tool calling at all
- MCP guarantees the LLM cannot be misled by malicious content returned from a server
Show answer
The accurate statements are that MCP is an open protocol standardizing how applications expose tools, resources, and prompts to LLM clients, that you build a tool or data server once and reuse it across any MCP-compatible host, and that an MCP server still runs with the privileges you grant it, so an untrusted server is a real security and injection risk. MCP does not replace function calling — it is the transport for it — and it gives no guarantee against malicious server content misleading the model.
MCP is an open, client-server protocol that gives a uniform way to expose tools, resources, and prompts to LLM hosts (a), and its central payoff is write-once/reuse-everywhere interoperability across compatible clients (b). It does not remove the security burden: a server runs with whatever access you give it and the data it returns enters the model's context, so an untrusted or compromised server is a genuine injection/exfiltration risk (c) — least privilege and review still apply. MCP does not replace function calling (d); it is the transport/standard through which tool definitions and calls flow — the model still decides which tool to invoke. And it offers no guarantee against malicious server output misleading the model (e); content returned over MCP is untrusted data like any other.
AI Engineering/llm-foundations/tokens-context
A model's context window is shared between the input (prompt) tokens and the generated output tokens — they draw from the same budget.
Show answer
True. The context window bounds the total tokens the model attends to — system prompt, retrieved context, conversation history, and the tokens it generates all draw from the same budget. If you stuff the input close to the limit, you starve the output and risk truncated completions. Budget explicitly: reserve headroom for max_tokens of output, since long outputs cost both latency and window space.
The context window bounds the total tokens the model attends to: system prompt + retrieved context + conversation history + the tokens it generates. If you stuff the input close to the limit, you starve the output and risk truncated completions. Budget explicitly — reserve headroom for max_tokens of output, and remember that long outputs cost both latency and window space.
AI Engineering/ai-production/llm-caching
With provider prompt caching, putting a large, stable system prompt and document context at the start of the request (with the variable user input last) can substantially cut per-call cost and latency on repeated requests.
Show answer
True. Prompt caching keys on a prefix of the request, so a long unchanging prefix — system instructions, few-shot examples, a fixed knowledge block — can be cached once and reused, with cache reads billed at a steep discount and skipping recomputation, which lowers latency. The catch is ordering: anything that varies must come after the stable prefix, otherwise it breaks the cacheable prefix and you pay full price. Structure prompts stable-first, variable-last.
Prompt caching keys on a prefix of the request, so a long, unchanging prefix (system instructions, few-shot examples, a fixed knowledge block) can be cached once and reused — cache reads are billed at a steep discount and skip recomputation, lowering latency. The catch: ordering matters. Anything that varies must come after the stable prefix, otherwise it breaks the cacheable prefix and you pay full price every call. Structure prompts stable-first, variable-last to maximize hits.
Related interview questions
Job market
See ai-engineering salaries and hiring demand from live job postings.
Practice this for real
CodePrep turns your target job description into an adaptive quiz from a bank of tagged questions, scores your answers, and resurfaces the topics you miss.