NLP Pipelines in Production

Building a prototype in a Jupyter Notebook is easy. Deploying it to serve millions of users with 99.9% uptime, latency under 100ms, and a cost below $0.001 per query is the real challenge. This chapter bridges the gap between research and production.

1. The Production Reality Check ⚠️

Research vs. Production

Aspect	Research	Production
Data	Clean CSV files	Messy streams, PDFs, APIs
Latency	It takes 30 seconds	Must be under 200ms or users leave
Errors	Try again	Five nines of uptime (99.999%)
Scale	100 examples	10M requests/day
Cost	Free GPU credits	$10,000/month GPU bill

The gap is enormous. Let's close it.

2. Data Ingestion Architecture 📥

Real-world data doesn't arrive in nice batches. It streams in from multiple sources.

Batch vs. Stream Processing

Batch:

Process 10GB of customer reviews every night
Run sentiment analysis, generate reports
Tools: Airflow, Prefect, Cron jobs

Stream:

Process live chatbot messages
Analyze social media mentions in real-time
Tools: Kafka, RabbitMQ, AWS Kinesis

Multi-Source Ingestion

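
One way to handle several sources is to convert each one into a shared record type as early as possible, so every later stage sees the same shape. The sketch below assumes two hypothetical sources (JSON payloads from an API and rows from a CSV export); the `Document` schema and all field names are illustrative assumptions, not part of any specific library.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Iterator


@dataclass
class Document:
    """Common record consumed by every downstream stage (hypothetical schema)."""
    source: str
    doc_id: str
    text: str
    received_at: str


def _make(source: str, doc_id: str, text: str) -> Document:
    # Stamp every record with a UTC arrival time for later debugging.
    now = datetime.now(timezone.utc).isoformat()
    return Document(source=source, doc_id=doc_id, text=text, received_at=now)


def from_api_payload(payload: str) -> Document:
    # Assumed JSON shape: {"id": ..., "message": ...}
    data = json.loads(payload)
    return _make("api", str(data["id"]), data["message"])


def from_csv_row(row: dict) -> Document:
    # Assumed columns: review_id, review_text
    return _make("csv", row["review_id"], row["review_text"])


def ingest(api_payloads: Iterable[str], csv_rows: Iterable[dict]) -> Iterator[Document]:
    """Yield documents from every source as one normalized stream."""
    for payload in api_payloads:
        yield from_api_payload(payload)
    for row in csv_rows:
        yield from_csv_row(row)


docs = list(ingest(
    ['{"id": 1, "message": "Where is my order?"}'],
    [{"review_id": "r-9", "review_text": "Great product"}],
))
print([(d.source, d.doc_id) for d in docs])
```

Adding a new source then means writing one more small adapter function, while validation, redaction, and the model stay unchanged.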

3. The Processing Pipeline ⚙️

Before hitting the model, data must flow through multiple stages.

Pipeline Stages

Stage 1: Validation & Sanitization

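
A minimal sketch of this stage might reject inputs that are not text, strip control characters, and enforce a length limit before anything reaches the model. The `MAX_CHARS` limit and the `ValidationError` class are assumptions chosen for illustration.

```python
import re

MAX_CHARS = 10_000  # assumed limit; tune to your model and traffic

# Control characters except tab (\x09), newline (\x0a), and carriage return (\x0d).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


class ValidationError(ValueError):
    """Raised when an input should be rejected before inference."""


def validate_and_sanitize(text: object) -> str:
    if not isinstance(text, str):
        raise ValidationError("input must be a string")
    # Remove control characters, then trim surrounding whitespace.
    cleaned = _CONTROL_CHARS.sub("", text).strip()
    if not cleaned:
        raise ValidationError("input is empty")
    if len(cleaned) > MAX_CHARS:
        raise ValidationError(f"input exceeds {MAX_CHARS} characters")
    return cleaned


print(validate_and_sanitize("  hi\x00 there "))
```

Rejecting bad input here is cheaper than discovering it as a model error or a malformed log entry later in the pipeline.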

Stage 2: PII Redaction (Critical for Privacy)

Personally Identifiable Information (PII) must be removed before sending to external APIs.

Email addresses: john@example.com → [EMAIL]
Phone numbers: 555-123-4567 → [PHONE]
Credit cards: 4532-1234-5678-9010 → [CARD]
SSNs, IDs: 123-45-6789 → [SSN]
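
The four replacements above can be sketched with regular expressions. This is a baseline under simplifying assumptions (US-style phone and SSN formats, 16-digit cards); it will miss names, addresses, and international formats, so production systems usually pair patterns like these with an NER-based detector.

```python
import re

# Order matters: cards and SSNs are replaced before the looser phone pattern.
PII_PATTERNS = [
    ("[EMAIL]", re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")),
    ("[CARD]", re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b")),
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")),
]


def redact_pii(text: str) -> str:
    """Replace each matched PII span with its placeholder token."""
    for placeholder, pattern in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


message = (
    "Mail john@example.com or call 555-123-4567. "
    "Card 4532-1234-5678-9010, SSN 123-45-6789."
)
print(redact_pii(message))
# Mail [EMAIL] or call [PHONE]. Card [CARD], SSN [SSN].
```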

Stage 3: Preprocessing

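
What preprocessing involves depends on the model, but a typical sketch normalizes Unicode, collapses whitespace, and truncates to the model's input budget. The `MAX_TOKENS` value and the whitespace-based token count below are simplifying assumptions; a real pipeline would count tokens with the model's own tokenizer.

```python
import re
import unicodedata

MAX_TOKENS = 512  # assumed input budget, counted here in whitespace-separated tokens


def preprocess(text: str, max_tokens: int = MAX_TOKENS) -> str:
    # NFKC folds compatibility forms (full-width letters, ligatures) into plain ones.
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of spaces, tabs, and newlines into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Truncate so the input fits the assumed budget.
    tokens = text.split(" ")
    return " ".join(tokens[:max_tokens])


print(preprocess("Ｈｅｌｌｏ   world\n\nagain", max_tokens=2))
# Hello world
```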

4. Model Serving Strategies 🚀

How do you expose your model to the world?

Option 1: REST API (Most Common)

Tools: FastAPI, Flask, Django
Pros: Simple, universal, works with any client
Cons: HTTP overhead, not the fastest

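
The shape of a prediction endpoint can be sketched with only the Python standard library, which keeps the example runnable anywhere. A framework such as FastAPI would replace the handler class with a declarative route, but the request logic stays the same. The `/predict` path, `handle_predict`, and the keyword-based `predict_sentiment` placeholder are all hypothetical names for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


def predict_sentiment(text: str) -> dict:
    """Placeholder model: a keyword rule standing in for real inference."""
    label = "positive" if "good" in text.lower() else "negative"
    return {"label": label}


def handle_predict(raw_body: bytes) -> Tuple[int, dict]:
    """Request logic kept separate from HTTP plumbing so it is easy to test."""
    try:
        text = json.loads(raw_body)["text"]
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "body must be JSON with a 'text' field"}
    if not isinstance(text, str) or not text.strip():
        return 422, {"error": "'text' must be a non-empty string"}
    return 200, predict_sentiment(text)


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_predict(self.rfile.read(length))
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8000) -> None:
    """Start the server; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()


print(handle_predict(b'{"text": "This is good"}'))
```

Calling `serve()` starts a local server that accepts `POST /predict` with a JSON body such as `{"text": "This is good"}`.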

Option 2: gRPC (High Performance)

Tools: gRPC, Protocol Buffers
Pros: Binary protocol over HTTP/2, streaming support, often several times faster than JSON over REST (the gain depends on payload size and network)
Cons: More complex, requires code generation

Option 3: Serverless (Event-Driven)

Tools: AWS Lambda, Google Cloud Functions
Pros: Auto-scaling, pay per request
Cons: Cold starts (200-500ms for small functions, longer when a model must be loaded), execution time limits (15 minutes on AWS Lambda)

Option 4: Specialized LLM Serving

Tools: vLLM, TGI (Text Generation Inference), Triton
Pros: Optimized for LLMs (PagedAttention, continuous batching)
Result: Up to 10-20x higher throughput than a naive serving loop, depending on model and workload

5. Scaling Strategies 📈

Horizontal Scaling

Run several identical, stateless copies of the model service behind a load balancer, and add or remove copies as traffic changes.

Caching Layer

Problem: Same queries repeated (e.g., "What's the weather?")
Solution: Redis cache with TTL
Result: At a 90% cache hit rate only 1 in 10 queries reaches the model, so inference cost drops by roughly 10x
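
The cache-aside pattern behind this can be sketched with an in-memory dictionary standing in for Redis. The `TTLCache` class, the query normalization, and the `answer` helper are illustrative assumptions; with Redis, the store and expiry would be handled by the server instead of this class.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for a Redis cache with per-key expiry."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, str]] = {}

    @staticmethod
    def _key(query: str) -> str:
        # Normalize case and whitespace so trivially different queries share a key.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, query: str) -> Optional[str]:
        key = self._key(query)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: drop the entry and report a miss
            return None
        return value

    def set(self, query: str, value: str) -> None:
        self._store[self._key(query)] = (self._clock() + self._ttl, value)


def answer(query: str, cache: TTLCache, model: Callable[[str], str]) -> str:
    """Return the cached answer if present; otherwise call the model and cache it."""
    cached = cache.get(query)
    if cached is not None:
        return cached
    result = model(query)
    cache.set(query, result)
    return result
```

The TTL bounds how stale an answer can be, so a query like "What's the weather?" needs a much shorter TTL than a query about a fixed fact.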

Batch Inference

Group requests together to maximize GPU utilization:

Single request: 1ms model time, 2% GPU usage (wasteful)
Batch of 32: 8ms total, 80% GPU usage (efficient)
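
Using the numbers above, a batch of 32 costs 8ms / 32 = 0.25ms of model time per request, a 4x throughput gain over 1ms per single request. The sketch below shows the grouping step and that arithmetic; `make_batches` and `per_request_ms` are hypothetical helpers, and a real server would also cap how long a request waits for its batch to fill.

```python
from typing import Iterator, List, Sequence


def make_batches(requests: Sequence[str], max_batch_size: int = 32) -> Iterator[List[str]]:
    """Split pending requests into consecutive batches of at most max_batch_size."""
    for start in range(0, len(requests), max_batch_size):
        yield list(requests[start:start + max_batch_size])


def per_request_ms(batch_latency_ms: float, batch_size: int) -> float:
    """Amortized model time per request within one batch."""
    return batch_latency_ms / batch_size


pending = [f"request-{i}" for i in range(70)]
print([len(batch) for batch in make_batches(pending)])  # [32, 32, 6]
print(per_request_ms(8.0, 32))  # 0.25
print(per_request_ms(1.0, 1))   # 1.0
```

The trade-off is latency: each request in the batch now waits for the full 8ms pass, plus any time spent waiting for the batch to fill.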

6. Monitoring & Observability 📊

You can't improve what you don't measure.

Key Metrics

Metric	Target	Alert If
Latency (p99)	under 200ms	over 500ms
Error Rate	under 0.1%	over 1%
Throughput	1000 req/s	under 500 req/s
Cost per 1K requests	$0.10	over $0.50
Model Drift	under 5% accuracy drop	over 10% drop
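
As a rough sketch, the latency and error-rate rows of this table can be turned into an automated check. The thresholds come from the "Alert If" column; the nearest-rank percentile method and the `check_alerts` helper are assumptions, and a real deployment would normally compute these in its metrics system instead of application code.

```python
import math
from typing import List, Sequence

# Thresholds taken from the "Alert If" column of the table above.
P99_LATENCY_ALERT_MS = 500.0
ERROR_RATE_ALERT = 0.01


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_alerts(latencies_ms: Sequence[float], error_count: int, request_count: int) -> List[str]:
    """Return a human-readable message for every threshold that is breached."""
    alerts: List[str] = []
    if request_count == 0 or not latencies_ms:
        return alerts  # no traffic in this window, nothing to evaluate
    p99 = percentile(latencies_ms, 99)
    if p99 > P99_LATENCY_ALERT_MS:
        alerts.append(f"p99 latency {p99:.0f}ms exceeds {P99_LATENCY_ALERT_MS:.0f}ms")
    error_rate = error_count / request_count
    if error_rate > ERROR_RATE_ALERT:
        alerts.append(f"error rate {error_rate:.2%} exceeds {ERROR_RATE_ALERT:.0%}")
    return alerts


window = [100.0] * 98 + [600.0] * 2
print(check_alerts(window, error_count=0, request_count=100))
```

Note that the average latency of this window is only 110ms; the p99 is what exposes the slow tail.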

Logging Best Practices

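
One common practice is to emit each log entry as a single JSON object so it can be searched and aggregated by field. The sketch below uses only the standard `logging` module; the `JsonFormatter` class, the `fields` attribute, and the chosen field names are assumptions for illustration.

```python
import json
import logging
import time
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Merge structured fields passed via extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


logger = logging.getLogger("nlp_pipeline")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)


def log_request(text: str, label: str, started: float) -> dict:
    fields = {
        "request_id": str(uuid.uuid4()),
        "input_chars": len(text),  # log the size, not the raw text, to keep PII out of logs
        "label": label,
        "latency_ms": round((time.perf_counter() - started) * 1000, 2),
    }
    logger.info("prediction_served", extra={"fields": fields})
    return fields


start = time.perf_counter()
log_request("Where is my order?", "negative", start)
```

Logging a request ID with every entry makes it possible to trace one request across ingestion, redaction, and inference.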

7. Cost Optimization 💰

NLP in production can be expensive. Here's how to optimize:

Strategy 1: Model Quantization

Convert FP32 → INT8 weights
Result: 4x smaller, 2-3x faster, and less than 2% accuracy loss
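
The 4x figure follows from storage size: an FP32 weight takes 4 bytes and an INT8 weight takes 1. The sketch below illustrates the idea with symmetric per-tensor quantization in plain Python. It is a toy under stated assumptions (one scale for the whole tensor, no calibration data); real deployments would rely on the quantization tooling of their framework.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric quantization: each weight w is stored as an integer q with w ≈ q * scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range used here.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers and the shared scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per weight is at most half a quantization step, which is why accuracy usually degrades only slightly.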

Strategy 2: Distillation

Train a small student model to mimic a large teacher
Example: Distill a large model such as GPT-4 into a 7B model, which can be on the order of 100x cheaper to serve

Strategy 3: Prompt Caching (for LLMs)

Cache common prefixes (system prompts)
Savings: 50-90% lower cost for cached input tokens on APIs such as OpenAI's, depending on the model

Quiz

What is the difference between Batch and Stream processing?

Batch is always faster

Batch processes data in scheduled chunks; Stream processes data in real-time as it arrives

There is no difference

Key Takeaways

✅ Production pipelines require validation, PII redaction, and monitoring—not just the model.
✅ Stream processing (Kafka) is essential for real-time applications.
✅ Specialized serving (vLLM, TGI) dramatically improves LLM efficiency.
✅ Caching is the easiest way to cut costs and improve latency.
✅ Monitoring with structured logs and metrics prevents disasters.

What's Next?

You've mastered text processing, semantic search, multilingual systems, and production deployment. Now it's time to unlock creativity. AI isn't just for understanding—it can create.

Next Module: Module 7 — Generative AI & Creative Models 🎨