Prerequisites
- ngrok account with AI Gateway access
- Ollama installed locally
- ngrok agent installed
Overview
Since Ollama runs locally on HTTP, you’ll expose it through an ngrok internal endpoint, then configure the AI Gateway to route requests to it.
Getting started
Expose Ollama with ngrok
Use the ngrok agent to create an internal endpoint:
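A minimal sketch of the command, assuming an ngrok v3 agent that can bind a .internal URL with the --url flag (verify the flag against your agent version; the hostname ollama.internal is just an example):

```bash
# Forward the local Ollama API (port 11434 by default) to a private internal endpoint.
# Any .internal URL on your account works; ollama.internal is an example.
ngrok http 11434 --url https://ollama.internal

# If Ollama rejects the forwarded Host header, rewriting it may help:
# ngrok http 11434 --url https://ollama.internal --host-header="localhost:11434"
```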
Internal endpoints (.internal domains) are private to your ngrok account. They’re not accessible from the public internet.
Advanced configuration
Restrict to Ollama only
Block requests to cloud providers and only allow Ollama:
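A hedged sketch of what such a policy.yaml could look like: the simplest way to express this is to declare Ollama as the only provider, so there is nothing else for the gateway to route to. The providers, url, and models field names below are assumptions, not the documented schema; see the Custom Providers reference for the real one.

```yaml
# Hypothetical sketch: only the Ollama provider is declared, so requests can
# never fall through to a cloud provider. Field names are illustrative.
providers:
  - id: ollama
    url: https://ollama.internal
    models:
      - id: llama3.2
```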
Failover to cloud provider
Use Ollama as primary with automatic failover to OpenAI:
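A rough policy.yaml sketch, assuming providers are tried in the order they are listed, with Ollama first and OpenAI as the fallback. The ordering semantics, field names, and the ${OPENAI_API_KEY} reference are all assumptions; check the Custom Providers reference for the actual schema.

```yaml
# Hypothetical sketch: Ollama listed first, OpenAI second as the fallback.
# Field names and ordering semantics are illustrative, not the documented schema.
providers:
  - id: ollama
    url: https://ollama.internal
    models:
      - id: llama3.2
  - id: openai
    api_key: ${OPENAI_API_KEY}
    models:
      - id: gpt-4o
```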
The first strategy that returns models wins. If Ollama has matching models, only those are tried. OpenAI is only used if no Ollama models match. For cross-provider failover when requests fail, have clients specify multiple models: models: ["ollama:llama3.2", "openai:gpt-4o"].
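As an illustration only, such a client request might look like the example below. The gateway URL is a placeholder, and the endpoint path and body shape assume an OpenAI-style chat completions API; check the gateway’s own API reference for the exact format.

```bash
# Hypothetical request: the gateway URL, path, and body shape are assumptions.
curl https://your-gateway.example.ngrok.app/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["ollama:llama3.2", "openai:gpt-4o"],
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```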
Increase timeouts
Local models can be slower, especially on first load. Increase timeouts as needed:
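A minimal policy.yaml sketch, assuming the per_request_timeout setting mentioned in the troubleshooting section below is set on the provider entry. Its placement, the duration format, and the other field names are assumptions.

```yaml
# Hypothetical sketch: give local models extra time to load on first use.
# Placement and duration format of per_request_timeout are assumptions.
providers:
  - id: ollama
    url: https://ollama.internal
    per_request_timeout: 300s
    models:
      - id: llama3.2
```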
Multiple Ollama instances
Load balance across multiple machines:
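A hedged policy.yaml sketch, assuming each Ollama machine is exposed as its own internal endpoint and declared as a separate provider entry. The endpoint names ollama-1.internal and ollama-2.internal and all field names are illustrative; how requests are spread across the entries depends on your model selection strategy.

```yaml
# Hypothetical sketch: two Ollama machines, each behind its own internal endpoint.
# Endpoint names and field names are illustrative, not the documented schema.
providers:
  - id: ollama-workstation
    url: https://ollama-1.internal
    models:
      - id: llama3.2
  - id: ollama-server
    url: https://ollama-2.internal
    models:
      - id: llama3.2
```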
Add model metadata
Track model details with metadata:
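A purely illustrative policy.yaml sketch, assuming a free-form metadata map can be attached to a model entry; the metadata key and everything under it are assumptions.

```yaml
# Hypothetical sketch: a free-form metadata map on a model entry.
# Keys and structure are illustrative, not the documented schema.
providers:
  - id: ollama
    url: https://ollama.internal
    models:
      - id: llama3.2
        metadata:
          family: llama
          parameters: 3b
          quantization: q4_K_M
          location: local
```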
Troubleshooting
Connection refused
Symptom: Requests fail with connection errors.
Solutions:
- Verify Ollama is running: curl http://localhost:11434/api/tags
- Verify the ngrok tunnel is running: check for https://ollama.internal in your ngrok dashboard
- Ensure the internal endpoint URL matches your config
Model not found
Symptom: Error saying the model doesn’t exist.
Solutions:
- List available models: ollama list
- Pull the model: ollama pull llama3.2
- Verify the model ID matches exactly (including tags like :1b)
Slow first response
Symptom: First request takes a very long time.
Cause: Ollama loads models into memory on first use.
Solutions:
- Increase per_request_timeout to allow for model loading
- Pre-warm the model: curl http://localhost:11434/api/generate -d '{"model":"llama3.2","prompt":""}'
- Keep the model loaded by sending periodic requests
Out of memory
Symptom: Ollama crashes or returns errors for large models.
Solutions:
- Use a smaller model or quantized version (for example, llama3.2:1b)
- Increase system RAM or use a machine with more VRAM
- Set OLLAMA_NUM_PARALLEL=1 to limit concurrent requests
Next steps
- Custom Providers - Learn about URL requirements and configuration options
- Model Selection Strategies - Route requests intelligently
- Multi-Provider Failover - Advanced failover patterns