OpenAI & LLM APIs Interview Questions
40 interview questions on OpenAI APIs, function calling, streaming, chat completions, vision, TTS, DALL-E, and multi-modal capabilities.
Production LLM applications rely heavily on API integration. These 40 questions cover OpenAI's Chat Completions API, function calling, streaming, vision, embeddings, rate limiting, error handling, and API design patterns for Anthropic, Google, Cohere, and open-source providers.
1. What is the OpenAI Chat Completions API?
The Chat Completions API (/v1/chat/completions) is OpenAI's primary endpoint for conversational AI. It accepts an array of messages (system, user, assistant) and optional parameters (model, temperature, max_tokens, tools). Returns a completion object with choices, usage, and finish reason.
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain REST APIs briefly."}
    ],
    temperature=0.7,
    max_tokens=200
)
print(response.choices[0].message.content)
2. What is the difference between gpt-4o and gpt-4-turbo?
| Feature | gpt-4o | gpt-4-turbo |
|---|---|---|
| Speed | Much faster (~2×) | Slower |
| Cost | Cheaper | More expensive |
| Vision | Native multimodal | Image input supported, not natively multimodal |
| Context | 128K | 128K |
| Audio | Native I/O | Not native |
| Knowledge cutoff | More recent | Earlier |
3. What is Function Calling (Tool Use)?
Function calling allows the model to request external function execution. The model doesn't execute functions — it returns structured JSON with function name and arguments. Your code executes the function and returns results to the model.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"]
        }
    }
}]
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in London?"}],
    tools=tools,
    tool_choice="auto"
)
# Check response.choices[0].message.tool_calls
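The second half of the loop, executing the requested function and packaging its result, can be sketched as follows. This is a minimal illustration, not the SDK's own mechanism: it uses a plain dict shaped like the API's JSON tool call instead of the SDK object, and `get_weather` and `TOOL_REGISTRY` are hypothetical names with stubbed data.

```python
import json

def get_weather(city: str) -> dict:
    # Stub standing in for a real weather lookup (made-up data).
    return {"city": city, "temp_c": 15, "conditions": "cloudy"}

# Registry mapping tool names the model may request to local callables.
TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool_call(tool_call: dict) -> dict:
    """Execute one tool call and build the `tool` message to send back."""
    name = tool_call["function"]["name"]
    # Arguments arrive as a JSON string, not a dict, so they must be parsed.
    args = json.loads(tool_call["function"]["arguments"])
    result = TOOL_REGISTRY[name](**args)
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(result),
    }

# Simulated tool call in the shape of the API's JSON response.
example_call = {
    "id": "call_123",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"city": "London"}'},
}
tool_message = run_tool_call(example_call)
```

In a real exchange, the assistant message containing the tool call and this `tool` message would both be appended to `messages` before calling the API again.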
4. How does Streaming work?
Streaming sends tokens as they're generated instead of waiting for the full response. Set stream=True. Uses Server-Sent Events (SSE). Reduces time-to-first-token dramatically.
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)
for chunk in stream:
    # Some chunks carry no choices or an empty delta, so guard before printing.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
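Applications usually need the full text as well as the live output. The sketch below accumulates the deltas into one string; it runs against stand-in chunk objects built with `SimpleNamespace` (an assumption for offline illustration) that mimic the `chunk.choices[0].delta.content` attribute path.

```python
from types import SimpleNamespace

def make_chunk(text):
    # Stand-in object with the same attribute path as a stream chunk.
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

def collect_stream(stream):
    """Concatenate delta.content pieces, skipping empty or content-less chunks."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue
        piece = chunk.choices[0].delta.content
        if piece:
            parts.append(piece)
    return "".join(parts)

# Simulated stream: the middle chunk has no content, as can happen in practice.
fake_stream = [make_chunk("Once "), make_chunk(None), make_chunk("upon a time")]
full_text = collect_stream(fake_stream)
```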
5. What are Token Usage and Pricing?
Tokens are the pricing unit:
- Input tokens: Prompt + conversation history (cheaper)
- Output tokens: Generated response (more expensive)
- Batch API: 50% discount for 24hr async processing
- Usage is returned in response.usage: prompt_tokens, completion_tokens, total_tokens
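A per-request cost estimate follows directly from the usage counts. The sketch below assumes prices quoted in dollars per million tokens; the figures are placeholders, not actual prices, and `estimate_cost` is a hypothetical helper.

```python
def estimate_cost(prompt_tokens, completion_tokens, input_price_per_m, output_price_per_m):
    """Return the dollar cost of one request given per-million-token prices."""
    input_cost = prompt_tokens / 1_000_000 * input_price_per_m
    output_cost = completion_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost

# Placeholder prices in dollars per 1M tokens; check the provider's pricing page.
cost = estimate_cost(
    prompt_tokens=1_000_000,
    completion_tokens=500_000,
    input_price_per_m=2.0,
    output_price_per_m=8.0,
)
```

With these placeholder prices, the output side costs more than the input side even though it is half as many tokens.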
6. What is the OpenAI Python client structure?
from openai import OpenAI
client = OpenAI(api_key="sk-...") # Or env: OPENAI_API_KEY
# Key endpoints:
# client.chat.completions.create()
# client.embeddings.create()
# client.images.generate()
# client.audio.speech.create()
# client.audio.transcriptions.create()
# client.files.create()
# client.fine_tuning.jobs.create()
# client.moderations.create()
7. How do you handle rate limits?
OpenAI imposes RPM (requests per minute) and TPM (tokens per minute) limits:
- Implement exponential backoff with jitter
- Use the tenacity library for retries
- Monitor the x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens headers
- Use batch API for high-volume async processing
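from openai import RateLimitError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_random_exponential

# Retry only rate-limit errors; wait_random_exponential adds jitter to the backoff.
@retry(retry=retry_if_exception_type(RateLimitError),
       wait=wait_random_exponential(multiplier=1, max=60), stop=stop_after_attempt(5))
def call_openai(messages):
    return client.chat.completions.create(model="gpt-4o", messages=messages)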
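For readers who want to see what such a retry decorator does, here is a hand-rolled sketch of exponential backoff with "full jitter". It is an illustration under assumptions: `RateLimited` stands in for a provider's 429 error, and the `sleep` argument is injectable so the demo runs without waiting.

```python
import random
import time

class RateLimited(Exception):
    """Stand-in for a provider's 429 error (hypothetical)."""

def backoff_delays(base=1.0, cap=60.0, attempts=5):
    """Yield 'full jitter' delays: uniform in [0, min(cap, base * 2**i)]."""
    for i in range(attempts):
        yield random.uniform(0, min(cap, base * 2 ** i))

def call_with_backoff(fn, attempts=5, sleep=time.sleep):
    """Call fn, retrying on RateLimited for at most `attempts` retries."""
    delays = backoff_delays(attempts=attempts)
    while True:
        try:
            return fn()
        except RateLimited:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted, surface the original error
            sleep(delay)

# Demo: a function that is rate limited twice, then succeeds.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimited()
    return "ok"

slept = []  # record delays instead of actually sleeping
result = call_with_backoff(flaky, sleep=slept.append)
```

The jitter spreads retries from many clients over time, so they do not all hit the API again at the same instant.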
8. What is the system message role?
System messages set the assistant's behavior, personality, and constraints. They carry more weight than user messages. Best practices: be specific, use clear rules, set boundaries ("Only answer questions about Python programming").
9. What is the assistant role?
Assistant messages represent previous model responses in a conversation. They're crucial for multi-turn chats — they maintain context and demonstrate the desired response pattern.
10. What are finish reasons?
- stop: Natural completion or stop sequence hit
- length: Max tokens reached, response truncated
- tool_calls: Model requested function call
- content_filter: Content flagged by moderation
- function_call (deprecated): Use tool_calls instead
11. What is the max_tokens parameter?
max_tokens limits the response length. The model stops generating when it reaches this limit. Important: max_tokens + prompt tokens must be ≤ model's context window. Set reasonably to control costs and prevent overlong completions.
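The context-window constraint can be enforced before sending a request. A minimal sketch, assuming the prompt token count is already known (for example from a tokenizer); `safe_max_tokens` is a hypothetical helper.

```python
def safe_max_tokens(prompt_tokens, desired_output, context_window):
    """Clamp the requested output length to what still fits in the window."""
    remaining = context_window - prompt_tokens
    if remaining <= 0:
        raise ValueError("Prompt alone exceeds the context window")
    return min(desired_output, remaining)

# A 127,000-token prompt in a 128,000-token window leaves room for 1,000 tokens.
budget = safe_max_tokens(prompt_tokens=127_000, desired_output=4_096, context_window=128_000)
```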
12. What are stop sequences?
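Stop sequences are strings that terminate generation, for example [".", "END", "\n\n\n"]. The API accepts up to 4. Generation halts as soon as the model produces one, and the returned text does not include the stop sequence itself. Useful for structured outputs.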
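13. What is the n parameter?
n generates multiple completions from a single prompt; n=3 returns 3 different responses. The prompt is billed once, but output tokens are billed for every choice, so generation cost grows roughly n-fold. Useful for generating diverse options or for best-of-N ensembling.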
14. What are logprobs?
logprobs return the log probabilities of generated tokens. Used for:
- Understanding model uncertainty
- Confidence scoring
- Constrained generation
- Token-level analysis
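One common way to turn token logprobs into a confidence score is the geometric mean of the token probabilities, computed as exp of the mean logprob. The sketch below uses hypothetical logprob values; in practice they would come from the response when logprobs are requested.

```python
import math

def sequence_confidence(token_logprobs):
    """Return exp(mean logprob), i.e. the geometric mean of token probabilities."""
    if not token_logprobs:
        raise ValueError("Need at least one token logprob")
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Hypothetical logprobs for a three-token completion.
logprobs = [math.log(0.9), math.log(0.8), math.log(0.5)]
per_token_probs = [math.exp(lp) for lp in logprobs]  # back to probabilities
confidence = sequence_confidence(logprobs)
```

Using the geometric mean rather than the raw product keeps the score comparable across completions of different lengths.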
15. What is the seed parameter?
seed makes sampling reproducible on a best-effort basis. Requests with the same seed, the same parameters, and the same system_fingerprint should usually return the same output. Beta feature: determinism is not guaranteed, but variance drops significantly.
16. What is the response_format parameter?
response_format ensures structured output:
- {"type": "text"}: Default plain text
- {"type": "json_object"}: JSON output (must mention "JSON" in prompt)
- Structured Outputs (Beta): JSON Schema via response_format={"type": "json_schema", "json_schema": {...}}
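As an illustration of the JSON Schema option, the sketch below builds a `response_format` payload and defensively parses a reply. The schema name and fields (`weather_report`, `city`, `temp_c`) are hypothetical, and `parse_report` is a small client-side check rather than part of any SDK.

```python
import json

# Request side: a response_format payload for JSON Schema output.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "weather_report",  # hypothetical schema name
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "temp_c": {"type": "number"},
            },
            "required": ["city", "temp_c"],
            "additionalProperties": False,
        },
    },
}

def parse_report(raw: str) -> dict:
    """Parse the model's JSON text and check that the required keys exist."""
    data = json.loads(raw)
    required = response_format["json_schema"]["schema"]["required"]
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"Missing keys: {missing}")
    return data

# Simulated model output.
report = parse_report('{"city": "London", "temp_c": 15}')
```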
17. What is GPT-4 Vision?
Vision allows the model to process images. Pass images as base64 or URLs in user messages. Supports multiple images per request. Analyze charts, diagrams, screenshots, photographs.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
        ]
    }]
)
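For the base64 route, a local image is typically sent as a data URL in the same `image_url` field. A minimal sketch, assuming PNG input; `image_part_from_bytes` is a hypothetical helper and the bytes are placeholders.

```python
import base64

def image_part_from_bytes(data: bytes, mime_type: str = "image/png") -> dict:
    """Wrap raw image bytes as an image_url content part using a data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }

# Placeholder bytes; in practice read them with open(path, "rb").read().
part = image_part_from_bytes(b"fake-image-bytes")
```

The resulting dict can sit in the `content` list next to a text part, in place of the URL-based entry.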
18. What is the Embeddings API?
Embeddings API converts text to vectors. Models: text-embedding-3-small (1536d, cheap), text-embedding-3-large (3072d, better), text-embedding-ada-002 (legacy, 1536d).
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Your text here",
    encoding_format="float"
)
vector = response.data[0].embedding
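Embedding vectors are usually compared with cosine similarity. The sketch below ranks a few toy 3-dimensional vectors against a query; real embeddings have hundreds or thousands of dimensions, and the numbers here are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_match(query_vec, doc_vecs):
    """Return the index of the document vector most similar to the query."""
    return max(range(len(doc_vecs)), key=lambda i: cosine_similarity(query_vec, doc_vecs[i]))

# Toy vectors standing in for document embeddings.
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
best = top_match([0.9, 0.1, 0.0], docs)
```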
19. What is DALL-E API?
DALL-E generates images from text prompts. Models: dall-e-3 (highest quality), dall-e-2 (faster). Parameters: quality (standard/hd), size, style (vivid/natural).
response = client.images.generate(
    model="dall-e-3",
    prompt="A serene lake at sunset with mountains",
    size="1024x1024",
    quality="hd",
    n=1
)
image_url = response.data[0].url
20. What is the TTS (Text-to-Speech) API?
TTS converts text to natural-sounding speech. Models: tts-1 (faster), tts-1-hd (better quality). Voices: alloy, echo, fable, onyx, nova, shimmer.
21. What is the Whisper API?
Whisper transcribes audio to text. Supports multiple languages, translation to English. File size: max 25MB. Formats: mp3, mp4, mpeg, wav, webm.
22. What is the Moderation API?
The Moderation API checks content for safety violations. Free to use. Categories include hate, harassment, self-harm, sexual, and violence. Returns flagged status and category scores.
23. How do you manage conversation history with the API?
Client-side management: maintain a messages array, append user/assistant messages. Trim oldest messages when approaching context limit. Use tiktoken to count tokens.
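messages = [
    {"role": "system", "content": "You are helpful."}
]
while True:
    user_input = input("You: ")
    if user_input.strip().lower() in {"quit", "exit"}:  # let the user end the loop
        break
    messages.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    assistant_msg = response.choices[0].message.content
    print(f"Assistant: {assistant_msg}")
    # Store the reply so the next turn includes it as context.
    messages.append({"role": "assistant", "content": assistant_msg})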
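Trimming the oldest messages can be sketched as below. The token counter is a crude assumption (about 4 characters per token) so the example runs without extra dependencies; a real implementation would count with tiktoken. `trim_history` is a hypothetical helper that always keeps the system message.

```python
def rough_token_count(message):
    # Crude assumption: ~4 characters per token; swap in tiktoken for real counts.
    return max(1, len(message["content"]) // 4)

def trim_history(messages, max_tokens, count=rough_token_count):
    """Keep the system message plus the most recent messages that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count(m) for m in system)
    kept = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = count(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    return system + list(reversed(kept))

history = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "a" * 40},
    {"role": "assistant", "content": "b" * 40},
    {"role": "user", "content": "c" * 40},
]
# With a 25-token budget, the oldest user message no longer fits.
trimmed = trim_history(history, max_tokens=25)
```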
24. What are Parallel Function Calls?
The model can request multiple functions in a single response. Each function call is handled independently, and all results are returned together. Reduces round-trips for multi-step operations.
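When a response contains several tool calls, each one gets its own `tool` message carrying the matching `tool_call_id`. The sketch below runs the calls concurrently with a thread pool; the tool functions are stubs and the calls are plain dicts shaped like the API's JSON, both assumptions for illustration.

```python
import json
from concurrent.futures import ThreadPoolExecutor

def get_weather(city):
    return {"city": city, "temp_c": 15}  # stub data

def get_time(city):
    return {"city": city, "time": "12:00"}  # stub data

TOOLS = {"get_weather": get_weather, "get_time": get_time}

def run_one(tool_call):
    """Execute one tool call and wrap the result as a `tool` message."""
    fn = TOOLS[tool_call["function"]["name"]]
    args = json.loads(tool_call["function"]["arguments"])
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(fn(**args)),
    }

def run_all(tool_calls):
    """Run the calls concurrently; map() returns results in input order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_one, tool_calls))

# Two simulated tool calls from a single assistant response.
tool_calls = [
    {"id": "call_a", "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'}},
    {"id": "call_b", "function": {"name": "get_time", "arguments": '{"city": "Paris"}'}},
]
tool_messages = run_all(tool_calls)
```

All the resulting messages are appended after the assistant message before the next API call, so the model sees every result at once.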
25. What is error handling in OpenAI API?
Common errors:
- 401 Unauthorized: Invalid API key
- 429 Rate Limit: Too many requests
- 500 Server Error: OpenAI outage
- context_length_exceeded: Prompt too long
- Retry transient errors (429, 5xx) with exponential backoff; fix 401 and context_length_exceeded in the request instead of retrying
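Not every error benefits from a retry: an invalid key or an oversized prompt fails the same way each time. A minimal sketch of that split, assuming a simple status-code policy; the exact set of retryable codes is a common choice, not an official list.

```python
# Status codes treated as transient in this sketch (an assumption, not an official list).
RETRYABLE_STATUS = {429, 500, 502, 503, 504}

def should_retry(status_code: int) -> bool:
    """Return True for transient failures worth retrying with backoff."""
    return status_code in RETRYABLE_STATUS

decisions = {code: should_retry(code) for code in (400, 401, 429, 500)}
```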
26. What is the Batch API?
Batch API processes requests asynchronously at 50% discount. Submit a .jsonl file of requests, receive results within 24 hours. Batch traffic is counted against a separate rate-limit pool rather than your synchronous limits. Best for non-interactive, bulk workloads.
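Each line of the .jsonl input is one self-contained request. The sketch below builds such lines for chat completions; `build_batch_lines` and the `request-N` id scheme are hypothetical, and uploading the file and creating the batch are left out.

```python
import json

def build_batch_lines(prompts, model="gpt-4o"):
    """Return JSONL text with one chat completion request per line."""
    lines = []
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"request-{i}",  # used to match results back to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        lines.append(json.dumps(request))
    return "\n".join(lines)

jsonl = build_batch_lines(["Summarize A", "Summarize B"])
```

Results are not guaranteed to come back in input order, which is why each request carries its own `custom_id`.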
27. How does Anthropic's Claude API compare to OpenAI?
| Feature | OpenAI | Anthropic Claude |
|---|---|---|
| Model names | gpt-4o, gpt-4-turbo | claude-3-opus, sonnet, haiku |
| Context | 128K | 200K |
| Function calling | Native | Tool use |
| Safety | Moderation API | Constitutional AI |
| Streaming | SSE | SSE |
| Vision | Yes | Yes (claude-3) |
| API style | Parameter-focused | Message-focused |
28. What are the best practices for estimating token usage?
- Use tiktoken to count tokens before API calls
- Track cumulative usage with response.usage
- Set conservative max_tokens
- Use cheaper models for drafting (gpt-3.5-turbo for first pass, gpt-4o for final)
29. How do you implement caching for API calls?
- Exact match cache: Same prompt → cached response (works best at temp=0, where outputs are near-deterministic)
- Semantic cache: Similar prompt → cached response (embedding-based)
- Use Redis/Memcached with TTL
- Cache embeddings to avoid recomputation
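An exact-match cache can be sketched with a hash of the full request as the key. This version is in-memory with a TTL, standing in for Redis or Memcached; `ExactMatchCache` and `fake_llm` are hypothetical names, and the fake backend replaces a real API call so the example runs offline.

```python
import hashlib
import json
import time

class ExactMatchCache:
    """In-memory cache keyed by a hash of model, messages, and parameters."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}

    @staticmethod
    def key(model, messages, **params):
        # sort_keys gives a stable serialization, so equal requests hash equally.
        payload = json.dumps(
            {"model": model, "messages": messages, "params": params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, fn, model, messages, **params):
        k = self.key(model, messages, **params)
        hit = self.store.get(k)
        if hit is not None and self.clock() - hit[0] < self.ttl:
            return hit[1]  # fresh cached value
        value = fn(model, messages, **params)
        self.store[k] = (self.clock(), value)
        return value

calls = []

def fake_llm(model, messages, **params):
    # Stand-in for a real API call; records each invocation.
    calls.append(messages)
    return f"answer #{len(calls)}"

cache = ExactMatchCache(ttl_seconds=60)
msgs = [{"role": "user", "content": "What is REST?"}]
first = cache.get_or_call(fake_llm, "gpt-4o", msgs, temperature=0)
second = cache.get_or_call(fake_llm, "gpt-4o", msgs, temperature=0)
```

Including the model and sampling parameters in the key avoids serving a response generated under different settings.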
30. What is the difference between OpenAI and Azure OpenAI?
- OpenAI: Direct API. Global availability. Simpler integration.
- Azure OpenAI: Enterprise features: virtual network (VNet) integration, RBAC, private endpoints, compliance (SOC2, HIPAA), content filtering, SLAs. Same models.
31. What are the Anthropic Claude API specific features?
- Extended Thinking: Claude shows internal reasoning process
- Tool use: Native tool calling with the tools parameter
- System prompt: Passed as a top-level system parameter, separate from the messages array (unlike OpenAI, where it is a message)
- Usage: Shows cache creation/read tokens for prompt caching
32. What is prompt caching (Anthropic)?
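Claude can cache long system prompts and large contexts. Cache reads are billed at about 10% of the standard input price, while writing to the cache costs somewhat more than standard input. This dramatically reduces cost for applications that reuse a large, static context across requests.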
33. How do you use Google's Gemini API?
Google AI Studio + Vertex AI. Gemini models: gemini-1.5-flash (fast/cheap), gemini-1.5-pro (powerful). 1M token context. Native multimodality (text, image, audio, video). Safety settings configurable.
34. What is the Cohere API best for?
Cohere specializes in enterprise search and RAG: strong embedding models, built-in reranker, chat with RAG connector, multilingual models. Complementary to general-purpose LLM APIs.
35. What is Ollama?
Ollama runs open-source LLMs locally (Llama, Mistral, Phi, Gemma). Local API compatible with OpenAI client. No data leaves your machine. Use for development, privacy-sensitive, or offline scenarios.
36. What are API key management best practices?
- Never hardcode keys in source code
- Use environment variables or secret managers
- Rotate keys regularly
- Set usage limits per key
- Use separate keys for dev/staging/prod
- Monitor key usage in dashboard
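The first two points can be sketched as a small loader that reads the key from the environment and fails fast when it is missing. `load_api_key` and `mask` are hypothetical helpers; the masking function is for log output only.

```python
import os

def load_api_key(var_name="OPENAI_API_KEY"):
    """Read the key from the environment and fail fast if it is absent."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it or use a secret manager")
    return key

def mask(key):
    """Show only the last 4 characters, for safe logging."""
    return "*" * max(0, len(key) - 4) + key[-4:]
```

Failing at startup is usually easier to diagnose than a 401 surfacing later in the middle of a request.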
37. What is the Assistants API?
Assistants API provides persistent AI assistants with:
- Thread management (conversation state)
- Code Interpreter (Python sandbox)
- File Search (RAG with managed storage)
- Function calling (tool integration)
Higher-level than raw chat completions.
38. What is the difference between Open Source models (via API) and proprietary APIs?
- Open Source (Together AI, Anyscale, Groq): Cheaper, fine-tunable, self-hostable. Slightly lower quality.
- Proprietary (OpenAI, Anthropic): Highest quality, managed infrastructure, research investment. More expensive.
39. How do you implement fallback between API providers?
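# call_gpt4, call_claude, and call_llama are placeholder wrappers around each provider.
def call_with_fallback(prompt):
    last_error = None
    for call in (call_gpt4, call_claude, call_llama):  # call_llama is the last resort
        try:
            return call(prompt)
        except Exception as exc:  # narrow to provider-specific errors in production
            last_error = exc
    raise RuntimeError("All providers failed") from last_error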
40. What is the cost optimization strategy for LLM APIs?
- Use smaller/cheaper models for simple tasks
- Batch non-urgent requests
- Enable prompt caching where available
- Pre-compute and cache embeddings
- Use streaming to improve perceived speed (helps latency, not cost)
- Set aggressive max_tokens to prevent overspending
- Consider fine-tuned small models replacing large ones