OpenAI & LLM APIs Interview Questions

40 interview questions on OpenAI APIs, function calling, streaming, chat completions, vision, TTS, DALL-E, and multi-modal capabilities.

By TechCoder Team. Last updated: 2026-06-23

Production LLM applications rely heavily on API integration. These 40 questions cover OpenAI's Chat Completions API, function calling, streaming, vision, embeddings, rate limiting, error handling, and API design patterns for Anthropic, Google, Cohere, and open-source providers.

1. What is the OpenAI Chat Completions API?

The Chat Completions API (/v1/chat/completions) is OpenAI's primary endpoint for conversational AI. It accepts an array of messages (system, user, assistant) and optional parameters (model, temperature, max_tokens, tools). It returns a completion object with a list of choices (each holding a message and a finish_reason) and a usage summary.

from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain REST APIs briefly."}
    ],
    temperature=0.7,
    max_tokens=200
)
print(response.choices[0].message.content)

2. What is the difference between gpt-4o and gpt-4-turbo?

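Feature	gpt-4o	gpt-4-turbo
Speed	Faster (roughly 2× per OpenAI's launch figures)	Slower
Cost	Cheaper per token	More expensive per token
Vision	Image input in the same multimodal model	Image input supported in Chat Completions
Context	128K	128K
Audio	Supported through audio-capable gpt-4o variants	Not supported
Knowledge cutoff	October 2023	December 2023 (gpt-4-turbo-2024-04-09)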

3. What is Function Calling (Tool Use)?

Function calling allows the model to request external function execution. The model doesn't execute functions — it returns structured JSON with function name and arguments. Your code executes the function and returns results to the model.

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"]
        }
    }
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in London?"}],
    tools=tools,
    tool_choice="auto"
)
# Check response.choices[0].message.tool_calls
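
The round trip described above can be sketched without a network call. The snippet assumes the entries of response.choices[0].message.tool_calls have already been copied into plain dicts with id, name, and arguments (a JSON string). The get_weather body, TOOL_REGISTRY, and run_tool_calls are illustrative names, not part of the OpenAI SDK.

```python
import json

def get_weather(city: str) -> dict:
    # Placeholder implementation; a real version would call a weather service.
    return {"city": city, "forecast": "unknown"}

# Hypothetical registry mapping tool names to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool_calls(tool_calls: list) -> list:
    """Execute each requested tool and build one 'tool' message per call."""
    messages = []
    for call in tool_calls:
        func = TOOL_REGISTRY.get(call["name"])
        if func is None:
            result = {"error": f"unknown tool: {call['name']}"}
        else:
            try:
                args = json.loads(call["arguments"])  # arguments arrive as a JSON string
                result = func(**args)
            except (json.JSONDecodeError, TypeError) as exc:
                # Malformed JSON or wrong argument names: report back instead of crashing.
                result = {"error": str(exc)}
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],  # echoes the id from the model's request
            "content": json.dumps(result),
        })
    return messages
```

In a real loop, the assistant message containing tool_calls is appended to the history first, followed by these tool messages, and the API is called again so the model can write its final answer.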

4. How does Streaming work?

Streaming sends tokens as they are generated instead of waiting for the full response. Set stream=True; the API then delivers chunks over Server-Sent Events (SSE), each carrying a delta. Total generation time is unchanged, but time-to-first-token drops dramatically.

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)
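for chunk in stream:
    # Some chunks carry no choices or no text (for example, role-only or usage-only chunks).
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)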
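
Applications usually need the full text as well as the live output. Here is a minimal sketch of accumulating streamed deltas, exercised offline with stand-in chunk objects. collect_stream and fake_chunk are hypothetical helpers; only the choices / delta / content attribute shape mirrors the SDK's streamed chunks.

```python
from types import SimpleNamespace

def collect_stream(chunks) -> str:
    """Concatenate the text deltas from an iterable of streamed chunks."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:      # usage-only chunks carry an empty choices list
            continue
        delta = chunk.choices[0].delta
        if delta.content:          # content is None on role-only and final chunks
            parts.append(delta.content)
    return "".join(parts)

def fake_chunk(text):
    # Stand-in for a streamed chunk, used only to exercise collect_stream offline.
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

demo = [
    fake_chunk("Once "),
    fake_chunk(None),
    SimpleNamespace(choices=[]),
    fake_chunk("upon a time"),
]
print(collect_stream(demo))  # prints: Once upon a time
```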

5. What are Token Usage and Pricing?

Tokens are the pricing unit:

Input tokens: Prompt + conversation history (cheaper)
Output tokens: Generated response (more expensive)
Batch API: 50% discount for 24hr async processing
Usage returned in response.usage: prompt_tokens, completion_tokens, total_tokens
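
The arithmetic behind a per-request cost estimate is simple enough to sketch. The prices below are placeholders for illustration only, not current rates; estimate_cost is a hypothetical helper that would be fed the prompt_tokens and completion_tokens values from response.usage.

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  input_price_per_1m: float, output_price_per_1m: float) -> float:
    """Return the dollar cost of one request, given per-million-token prices."""
    input_cost = prompt_tokens / 1_000_000 * input_price_per_1m
    output_cost = completion_tokens / 1_000_000 * output_price_per_1m
    return input_cost + output_cost

# Placeholder prices; look up the current rates for the model you use.
cost = estimate_cost(prompt_tokens=1_200, completion_tokens=300,
                     input_price_per_1m=2.00, output_price_per_1m=8.00)
print(f"${cost:.4f}")  # prints: $0.0048
```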

6. What is the OpenAI Python client structure?

from openai import OpenAI

client = OpenAI(api_key="sk-...")  # Or env: OPENAI_API_KEY

# Key endpoints:
# client.chat.completions.create()
# client.embeddings.create()
# client.images.generate()
# client.audio.speech.create()
# client.audio.transcriptions.create()
# client.files.create()
# client.fine_tuning.jobs.create()
# client.moderations.create()

7. How do you handle rate limits?

OpenAI imposes RPM (requests per minute) and TPM (tokens per minute) limits:

Implement exponential backoff with jitter
Use tenacity library for retries
Monitor x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens headers
Use batch API for high-volume async processing

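from openai import RateLimitError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_random_exponential

# Retry only rate-limit errors, using jittered exponential backoff.
@retry(retry=retry_if_exception_type(RateLimitError),
       wait=wait_random_exponential(min=4, max=60), stop=stop_after_attempt(5))
def call_openai(messages):
    return client.chat.completions.create(model="gpt-4o", messages=messages)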

8. What is the system message role?

System messages set the assistant's behavior, personality, and constraints. Models are trained to give them priority over user messages, although that priority is not absolute. Best practices: be specific, use clear rules, set boundaries ("Only answer questions about Python programming").

9. What is the assistant role?

Assistant messages represent previous model responses in a conversation. They're crucial for multi-turn chats — they maintain context and demonstrate the desired response pattern.

10. What are finish reasons?

stop: Natural completion or stop sequence hit
length: Max tokens reached, response truncated
tool_calls: Model requested function call
content_filter: Content omitted because a content filter flagged it
function_call (deprecated): Use tool_calls instead

11. What is the max_tokens parameter?

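max_tokens caps the number of tokens generated in the response; generation stops when the cap is reached and finish_reason is length. Prompt tokens plus max_tokens must fit within the model's context window. Set it deliberately to control cost and prevent overlong completions. Newer OpenAI models use max_completion_tokens in place of max_tokens.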

12. What are stop sequences?

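Stop sequences are strings that terminate generation, for example [".", "END", "\n\n\n"]. The API accepts up to four, and the returned text does not include the sequence that triggered the stop. Useful for structured outputs where a delimiter marks the end.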

13. What is the n parameter?

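n generates multiple completions from a single prompt: n=3 returns three choices. Output tokens are billed for every choice while the prompt is billed once, so cost grows with n but is below 3× unless the prompt is very short. Useful for generating diverse options or for best-of-N selection.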

14. What are logprobs?

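logprobs returns the log probabilities of the generated tokens. Set logprobs=True, and optionally top_logprobs (0 to 20) to also get the most likely alternatives at each position. Used for: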

Understanding model uncertainty
Confidence scoring
Constrained generation
Token-level analysis
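
Confidence scoring from log probabilities comes down to exponentiation. The sketch below assumes the per-token values have already been read from response.choices[0].logprobs.content (each entry has a logprob field); the helper names are hypothetical.

```python
import math

def token_probabilities(token_logprobs: list) -> list:
    """Convert per-token log probabilities into probabilities in [0, 1]."""
    return [math.exp(lp) for lp in token_logprobs]

def sequence_probability(token_logprobs: list) -> float:
    """Joint probability of the whole sequence: exp of the summed log probabilities."""
    return math.exp(sum(token_logprobs))

def least_confident_index(token_logprobs: list) -> int:
    """Index of the token the model was least sure about."""
    return min(range(len(token_logprobs)), key=lambda i: token_logprobs[i])

sample = [0.0, math.log(0.5), math.log(0.9)]
print(token_probabilities(sample))
print(sequence_probability(sample))
print(least_confident_index(sample))
```

A low sequence probability or a single very uncertain token can be used as a signal to re-ask, route to a stronger model, or flag the answer for review.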

15. What is the seed parameter?

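seed requests best-effort reproducibility: repeated requests with the same seed, the same parameters, and the same system_fingerprint in the response should mostly return the same output. The feature is in beta and determinism is not guaranteed, but variance drops significantly. A changed system_fingerprint means the backend configuration changed and outputs may differ.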

16. What is the response_format parameter?

response_format controls the output format:

{"type": "text"}: Default plain text
{"type": "json_object"}: JSON mode. Output is valid JSON, but no schema is enforced (the prompt must mention "JSON")
{"type": "json_schema", "json_schema": {...}}: Structured Outputs. Output conforms to the supplied JSON Schema when strict is true
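
A minimal sketch of what a json_schema payload might look like, plus a defensive parse of the returned text. The ticket schema, its field names, and parse_ticket are hypothetical; the name / strict / schema layout follows the Structured Outputs request shape, where strict mode expects every property to be required and additionalProperties to be false.

```python
import json

# Hypothetical schema for a support-ticket summary; field names are illustrative.
ticket_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
    },
    "required": ["title", "priority"],
    "additionalProperties": False,
}

# Value to pass as response_format=... in a chat completion request.
response_format = {
    "type": "json_schema",
    "json_schema": {"name": "ticket_summary", "strict": True, "schema": ticket_schema},
}

def parse_ticket(raw: str) -> dict:
    """Parse the model's JSON text and check that the required keys are present."""
    data = json.loads(raw)
    missing = [key for key in ticket_schema["required"] if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data

print(parse_ticket('{"title": "Login fails", "priority": "high"}'))
```

The explicit check is mostly useful with plain JSON mode, where the schema is not enforced by the API.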

17. What is GPT-4 Vision?

Vision lets the model process images. Pass images as URLs or base64-encoded data URLs inside the content array of a user message. A request can include multiple images. Typical uses: analyzing charts, diagrams, screenshots, and photographs.

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
        ]
    }]
)
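
For local files, the base64 route mentioned above means wrapping the bytes in a data URL. A small sketch, where image_content_part is a hypothetical helper and the caller is assumed to know the image's MIME type:

```python
import base64

def image_content_part(image_bytes: bytes, mime_type: str = "image/jpeg") -> dict:
    """Build an image_url content part that embeds the image as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }

# Tiny stand-in payload; real code would use open(path, "rb").read().
part = image_content_part(b"abc", mime_type="image/png")
print(part["image_url"]["url"])  # prints: data:image/png;base64,YWJj
```

The returned dict can be placed in the content list next to a {"type": "text", ...} part.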

18. What is the Embeddings API?

Embeddings API converts text to vectors. Models: text-embedding-3-small (1536d, cheap), text-embedding-3-large (3072d, better), text-embedding-ada-002 (legacy, 1536d).

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Your text here",
    encoding_format="float"
)
vector = response.data[0].embedding
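
Embedding vectors are typically compared with cosine similarity. The following is a dependency-free sketch using toy 2-dimensional vectors in place of real embeddings; cosine_similarity and rank_by_similarity are illustrative helpers, and production code would more likely use NumPy or a vector database.

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec: list, doc_vecs: list) -> list:
    """Return document indices ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    return [i for _, i in sorted(scored, reverse=True)]

docs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(rank_by_similarity([1.0, 0.0], docs))  # prints: [1, 2, 0]
```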

19. What is DALL-E API?

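DALL-E generates images from text prompts. Models: dall-e-3 (highest quality), dall-e-2 (older, cheaper, and able to return several images per request). Parameters: size, plus quality (standard/hd) and style (vivid/natural), which apply to dall-e-3 only. dall-e-3 accepts only n=1.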

response = client.images.generate(
    model="dall-e-3",
    prompt="A serene lake at sunset with mountains",
    size="1024x1024",
    quality="hd",
    n=1
)
image_url = response.data[0].url

20. What is the TTS (Text-to-Speech) API?

TTS converts text to natural-sounding speech. Models: tts-1 (faster), tts-1-hd (better quality). Voices: alloy, echo, fable, onyx, nova, shimmer.

21. What is the Whisper API?

Whisper (model whisper-1) transcribes audio to text. It supports many languages, and a separate translations endpoint translates speech into English. Maximum file size is 25 MB. Formats: mp3, mp4, mpeg, mpga, m4a, wav, webm.

22. What is the Moderation API?

The Moderation API checks content for policy violations and is free to use. Categories include hate, harassment, self-harm, sexual, and violence. It returns an overall flagged status plus per-category flags and scores.

23. How do you manage conversation history with the API?

The Chat Completions API is stateless, so history is managed client-side: maintain a messages array and append each user and assistant message. Trim the oldest messages when approaching the context limit. Use tiktoken to count tokens.

messages = [
    {"role": "system", "content": "You are helpful."}
]
while True:
    user_input = input("You: ")
    messages.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    assistant_msg = response.choices[0].message.content
    messages.append({"role": "assistant", "content": assistant_msg})
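
The trimming step can be sketched as a function applied before each request. The token counter here is a crude character-based heuristic standing in for tiktoken, and both trim_history and rough_token_count are hypothetical helpers; the approach simply drops the oldest non-system messages until the history fits a budget.

```python
def rough_token_count(message: dict) -> int:
    # Crude heuristic (about 4 characters per token plus per-message overhead);
    # swap in a tiktoken-based counter for real budgets.
    return len(message["content"]) // 4 + 4

def trim_history(messages: list, max_tokens: int, count=rough_token_count) -> list:
    """Drop the oldest non-system messages until the history fits the token budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    total = sum(count(m) for m in system + rest)
    while rest and total > max_tokens:
        total -= count(rest.pop(0))  # oldest message goes first
    return system + rest

history = [
    {"role": "system", "content": "S" * 8},
    {"role": "user", "content": "a" * 40},
    {"role": "assistant", "content": "b" * 40},
    {"role": "user", "content": "c" * 40},
]
trimmed = trim_history(history, max_tokens=35)
print([m["role"] for m in trimmed])  # prints: ['system', 'assistant', 'user']
```

Summarizing the dropped turns into a single message is a common alternative when older context still matters.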

24. What are Parallel Function Calls?

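The model can request several function calls in a single response; they arrive as multiple entries in tool_calls, each with its own id. Your code runs each call independently and returns one tool message per call, matched by tool_call_id, before asking the model to continue. This reduces round-trips when the calls do not depend on each other.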

25. What is error handling in OpenAI API?

Common errors:

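401 Unauthorized: Invalid or missing API key; do not retry
429 Rate Limit: Too many requests or tokens, or quota exhausted
500/503 Server Error: Server-side failure or overload, usually transient
context_length_exceeded (HTTP 400): Prompt plus requested output exceeds the context window; shorten the input instead of retrying
Retry only transient errors (429 and 5xx), using exponential backoff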
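
A hand-rolled version of retry with exponential backoff and jitter, shown as a sketch. ApiCallError is a stand-in exception carrying an HTTP status code, not an SDK class, and the set of retryable codes is an assumption to adapt per provider. The sleep function is injectable so the logic can be tested without waiting.

```python
import random
import time

# Assumed set of transient statuses; adjust for the provider in use.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}

class ApiCallError(Exception):
    """Hypothetical error type carrying an HTTP status code."""
    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code

def call_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call func(), retrying transient failures with full-jitter exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ApiCallError as exc:
            is_last = attempt == max_attempts - 1
            if exc.status_code not in RETRYABLE_STATUS or is_last:
                raise  # permanent error, or retries exhausted
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))  # full jitter: anywhere in [0, cap]
```

Randomizing the delay keeps many clients that failed at the same moment from retrying in lockstep.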

26. What is the Batch API?

The Batch API processes requests asynchronously at a 50% discount. Submit a .jsonl file of requests and receive results within 24 hours. Batch traffic counts against a separate rate-limit pool rather than your synchronous limits. Best for non-interactive, bulk workloads.
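
Each line of the .jsonl input is one self-contained request. A sketch of generating that file content follows, where build_batch_lines and the request-N id scheme are illustrative choices; the custom_id, method, url, and body fields follow the batch input format.

```python
import json

def build_batch_lines(prompts: list, model: str = "gpt-4o") -> str:
    """Return JSONL text with one chat completion request per prompt."""
    lines = []
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"request-{i}",  # used to match results back to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        lines.append(json.dumps(request))
    return "".join(line + "\n" for line in lines)

content = build_batch_lines(["Summarize document A", "Summarize document B"])
print(content, end="")
```

The resulting file is uploaded with purpose="batch" and then referenced when creating the batch job. Results come back in arbitrary order, which is why custom_id matters.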

27. How does Anthropic's Claude API compare to OpenAI?

Feature	OpenAI	Anthropic Claude
Model names	gpt-4o, gpt-4-turbo	claude-3-opus, claude-3-sonnet, claude-3-haiku
Context	128K	200K
Function calling	Native	Tool use
Safety	Moderation API	Constitutional AI
Streaming	SSE	SSE
Vision	Yes	Yes (claude-3)
System prompt	Message with role system	Top-level system parameter

28. What are the best practices for estimating token usage?

Use tiktoken to count tokens before API calls
Track cumulative usage with response.usage
Set conservative max_tokens
Use cheaper models for drafting (gpt-3.5-turbo for first pass, gpt-4o for final)

29. How do you implement caching for API calls?

Exact match cache: Same prompt and parameters → same cached response (most appropriate at temperature=0, where output is nearly deterministic)
Semantic cache: Similar prompt → cached response (embedding-based)
Use Redis/Memcached with TTL
Cache embeddings to avoid recomputation
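
An exact-match cache can be sketched in memory before moving to Redis or Memcached. ExactMatchCache is a hypothetical class, and the key is a SHA-256 hash over the model, messages, and sampling parameters, so any change in the request produces a different key. The clock is injectable to make expiry testable.

```python
import hashlib
import json
import time

class ExactMatchCache:
    """In-memory exact-match cache with a time-to-live per entry."""

    def __init__(self, ttl_seconds: float = 3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}

    @staticmethod
    def make_key(model: str, messages: list, **params) -> str:
        # sort_keys makes the key independent of dict ordering.
        payload = json.dumps(
            {"model": model, "messages": messages, "params": params},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value

    def set(self, key: str, value) -> None:
        self._store[key] = (self._clock() + self._ttl, value)
```

Typical use: compute the key, return cache.get(key) on a hit, otherwise call the API and cache.set(key, text).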

30. What is the difference between OpenAI and Azure OpenAI?

OpenAI: Direct API. Global availability. Simpler integration.
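Azure OpenAI: The same model families deployed inside an Azure subscription, with enterprise features: virtual network (VNet) integration, private endpoints, RBAC, compliance (SOC 2, HIPAA), content filtering, and SLAs. Model availability varies by region and can lag the direct API.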

31. What are the Anthropic Claude API specific features?

Extended Thinking: Claude shows internal reasoning process
Tool use: Native tool calling with tools parameter
System prompt: Passed as a top-level system parameter rather than as a message (unlike OpenAI); max_tokens is required
Usage: Shows cache creation/read tokens for prompt caching

32. What is prompt caching (Anthropic)?

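Claude can cache long system prompts and other large, static context, marked with cache_control breakpoints. Cache reads are billed at about 10% of the standard input price, while cache writes cost somewhat more than standard input (25% more for the default 5-minute cache). This dramatically reduces cost for applications that resend the same large context.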

33. How do you use Google's Gemini API?

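Gemini is available through Google AI Studio (API key, quick prototyping) and Vertex AI (Google Cloud, enterprise controls). Models: gemini-1.5-flash (fast/cheap), gemini-1.5-pro (more capable). Context windows reach 1M tokens or more. Inputs are natively multimodal (text, image, audio, video), and safety settings are configurable.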

34. What is the Cohere API best for?

Cohere specializes in enterprise search and RAG: strong embedding models, built-in reranker, chat with RAG connector, multilingual models. Complementary to general-purpose LLM APIs.

35. What is Ollama?

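Ollama runs open-weight LLMs locally (Llama, Mistral, Phi, Gemma). It serves a local HTTP API, including an OpenAI-compatible endpoint (by default http://localhost:11434/v1), so the OpenAI client works by changing base_url. No data leaves your machine. Use it for development, privacy-sensitive, or offline scenarios.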

36. What are API key management best practices?

Never hardcode keys in source code
Use environment variables or secret managers
Rotate keys regularly
Set usage limits per key
Use separate keys for dev/staging/prod
Monitor key usage in dashboard
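
A small sketch of the first two practices: read the key from the environment, fail fast when it is missing, and never print it in full. load_api_key and mask_key are hypothetical helpers, and DEMO_API_KEY is a made-up variable name used only for the demonstration.

```python
import os

def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Read an API key from the environment, failing fast if it is absent."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it or load it from a secret manager"
        )
    return key

def mask_key(key: str, visible: int = 4) -> str:
    """Hide all but the last few characters so the key is safe to log."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]

# Demonstration with a throwaway value, not a real credential.
os.environ["DEMO_API_KEY"] = "abc12345"
print(mask_key(load_api_key("DEMO_API_KEY")))  # prints: ****2345
```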

37. What is the Assistants API?

Assistants API provides persistent AI assistants with:

Thread management (conversation state)
Code Interpreter (Python sandbox)
File Search (RAG with managed storage)
Function calling (tool integration)

It is higher-level than raw chat completions.

38. What is the difference between Open Source models (via API) and proprietary APIs?

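Open-weight models via hosted APIs (Together AI, Anyscale, Groq): Usually cheaper, fine-tunable, and self-hostable. Quality on the hardest tasks often trails the top proprietary models.
Proprietary (OpenAI, Anthropic): Typically the strongest general-purpose quality, with managed infrastructure and heavy research investment. More expensive.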

39. How do you implement fallback between API providers?

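from openai import APIError, RateLimitError

# call_gpt4, call_claude, and call_llama are placeholder wrappers around each provider's client.
try:
    response = call_gpt4(prompt)
except (RateLimitError, APIError):
    try:
        response = call_claude(prompt)
    except Exception:
        response = call_llama(prompt)  # Last resort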
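
The same idea generalizes to an ordered list of providers, which avoids nesting one try block per fallback. In this sketch call_with_fallback is a hypothetical helper and the provider callables are stand-ins; a real version would catch narrower exception types and log each failure.

```python
def call_with_fallback(prompt: str, providers: list):
    """Try each (name, callable) pair in order; return (name, response) on first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose in this sketch
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))

# Stand-in providers for demonstration; real ones would wrap API clients.
def failing_provider(prompt):
    raise TimeoutError("simulated outage")

def echo_provider(prompt):
    return f"echo: {prompt}"

name, response = call_with_fallback(
    "hello", [("primary", failing_provider), ("backup", echo_provider)]
)
print(name, response)  # prints: backup echo: hello
```

Returning the provider name alongside the response makes it possible to track how often fallbacks are actually used.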

40. What is the cost optimization strategy for LLM APIs?

Use smaller/cheaper models for simple tasks
Batch non-urgent requests
Enable prompt caching where available
Pre-compute and cache embeddings
Use streaming to improve perceived speed (it lowers perceived latency, not cost)
Set aggressive max_tokens to prevent overspending
Consider fine-tuned small models replacing large ones
