Prerequisites
- ngrok account with AI Gateway access
- Ollama installed locally
- ngrok agent installed
Overview
Since Ollama runs locally on HTTP, you’ll expose it through an ngrok internal endpoint, then configure the AI Gateway to route requests to it.
Getting started
Expose Ollama with ngrok
Use the ngrok agent to create an internal endpoint:
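A minimal sketch of the command, assuming an ngrok v3 agent that can bind a .internal URL with the --url flag (verify the flag against your agent version; the hostname ollama.internal is just an example):

```bash
# Forward the local Ollama API (port 11434 by default) to a private internal endpoint.
# Any .internal URL on your account works; ollama.internal is an example.
ngrok http 11434 --url https://ollama.internal

# If Ollama rejects the forwarded Host header, rewriting it may help:
# ngrok http 11434 --url https://ollama.internal --host-header="localhost:11434"
```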
Internal endpoints (.internal domains) are private to your ngrok account. They’re not accessible from the public internet.
Advanced configuration
Restrict to Ollama only
Block requests to cloud providers and only allow Ollama:
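A hedged sketch of what such a policy.yaml could look like: the simplest way to express this is to declare Ollama as the only provider, so there is nothing else for the gateway to route to. The providers, url, and models field names below are assumptions, not the documented schema; see the Custom Providers reference for the real one.

```yaml
# Hypothetical sketch: only the Ollama provider is declared, so requests can
# never fall through to a cloud provider. Field names are illustrative.
providers:
  - id: ollama
    url: https://ollama.internal
    models:
      - id: llama3.2
```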
Failover to cloud provider
Use Ollama as primary with automatic failover to OpenAI:
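A rough policy.yaml sketch, assuming providers are tried in the order they are listed, with Ollama first and OpenAI as the fallback. The ordering semantics, field names, and the ${OPENAI_API_KEY} reference are all assumptions; check the Custom Providers reference for the actual schema.

```yaml
# Hypothetical sketch: Ollama listed first, OpenAI second as the fallback.
# Field names and ordering semantics are illustrative, not the documented schema.
providers:
  - id: ollama
    url: https://ollama.internal
    models:
      - id: llama3.2
  - id: openai
    api_key: ${OPENAI_API_KEY}
    models:
      - id: gpt-4o
```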
The first strategy that returns models wins. If Ollama has matching models, only those are tried. OpenAI is only used if no Ollama models match. For cross-provider failover when requests fail, have clients specify multiple models: models: ["ollama:llama3.2", "openai:gpt-4o"].
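As an illustration only, such a client request might look like the example below. The gateway URL is a placeholder, and the endpoint path and body shape assume an OpenAI-style chat completions API; check the gateway’s own API reference for the exact format.

```bash
# Hypothetical request: the gateway URL, path, and body shape are assumptions.
curl https://your-gateway.example.ngrok.app/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["ollama:llama3.2", "openai:gpt-4o"],
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```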
Increase timeouts
Local models can be slower, especially on first load. Increase timeouts as needed:
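A minimal policy.yaml sketch, assuming the per_request_timeout setting mentioned in the troubleshooting section below is set on the provider entry. Its placement, the duration format, and the other field names are assumptions.

```yaml
# Hypothetical sketch: give local models extra time to load on first use.
# Placement and duration format of per_request_timeout are assumptions.
providers:
  - id: ollama
    url: https://ollama.internal
    per_request_timeout: 300s
    models:
      - id: llama3.2
```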
Multiple Ollama instances
Load balance across multiple machines:
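A hedged policy.yaml sketch, assuming each Ollama machine is exposed as its own internal endpoint and declared as a separate provider entry. The endpoint names ollama-1.internal and ollama-2.internal and all field names are illustrative; how requests are spread across the entries depends on your model selection strategy.

```yaml
# Hypothetical sketch: two Ollama machines, each behind its own internal endpoint.
# Endpoint names and field names are illustrative, not the documented schema.
providers:
  - id: ollama-workstation
    url: https://ollama-1.internal
    models:
      - id: llama3.2
  - id: ollama-server
    url: https://ollama-2.internal
    models:
      - id: llama3.2
```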
Add model metadata
Track model details with metadata:
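A purely illustrative policy.yaml sketch, assuming a free-form metadata map can be attached to a model entry; the metadata key and everything under it are assumptions.

```yaml
# Hypothetical sketch: a free-form metadata map on a model entry.
# Keys and structure are illustrative, not the documented schema.
providers:
  - id: ollama
    url: https://ollama.internal
    models:
      - id: llama3.2
        metadata:
          family: llama
          parameters: 3b
          quantization: q4_K_M
          location: local
```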
Troubleshooting
Connection refused
Symptom: Requests fail with connection errors.
Solutions:
- Verify Ollama is running: curl http://localhost:11434/api/tags
- Verify the ngrok tunnel is running: check for https://ollama.internal in your ngrok dashboard
- Ensure the internal endpoint URL matches your config
Model not found
Symptom: Error saying the model doesn’t exist.
Solutions:
- List available models: ollama list
- Pull the model: ollama pull llama3.2
- Verify the model ID matches exactly (including tags like :1b)
Slow first response
Symptom: First request takes a very long time.
Cause: Ollama loads models into memory on first use.
Solutions:
- Increase per_request_timeout to allow for model loading
- Pre-warm the model: curl http://localhost:11434/api/generate -d '{"model":"llama3.2","prompt":""}'
- Keep the model loaded by sending periodic requests
Out of memory
Symptom: Ollama crashes or returns errors for large models.
Solutions:
- Use a smaller model or quantized version (for example, llama3.2:1b)
- Increase system RAM or use a machine with more VRAM
- Set OLLAMA_NUM_PARALLEL=1 to limit concurrent requests
Next steps
- Custom Providers - Learn about URL requirements and configuration options
- Model Selection Strategies - Route requests intelligently
- Multi-Provider Failover - Advanced failover patterns