#001 How to Use ChatGPT On-Prem (Local, Fully Private AI)?
Own your AI. Be local.
Running ChatGPT fully on-prem isn’t something OpenAI currently provides out-of-the-box. By default, ChatGPT is a cloud-hosted service.
There are three main paths if you want on-premise or self-hosted LLM capabilities:
1. Use OpenAI’s APIs with a Private Network Setup
Deploy ChatGPT through Azure OpenAI Service (Microsoft offers “data stays in tenant” guarantees); see the client sketch after this list.
You run everything in your private Azure subscription (effectively a VPC).
Pros: Access to latest GPT-4/GPT-4o models, enterprise security/compliance.
Cons: Still not strictly “on-prem,” since compute runs in Azure.
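For illustration, here is a minimal sketch of calling an Azure OpenAI deployment from your own network with the openai Python client. The endpoint, key, API version, and deployment name are placeholders you would replace with your tenant’s values.
# Minimal sketch: calling a GPT deployment in your private Azure subscription.
# Endpoint, key, API version, and deployment name are placeholders, not real values.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",  # your private endpoint
    api_key="YOUR-AZURE-OPENAI-KEY",
    api_version="2024-02-01",  # use the version your resource supports
)

resp = client.chat.completions.create(
    model="your-gpt4o-deployment",  # the *deployment* name, not the model family
    messages=[{"role": "user", "content": "Summarize our data-residency setup."}],
)
print(resp.choices[0].message.content)
Traffic stays inside your Azure tenant, but the compute itself is still Microsoft’s, which is the “not strictly on-prem” trade-off noted above.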
2. Run Open-Source ChatGPT Alternatives On-Prem
If you need full on-prem control, you can deploy open-source models that mimic ChatGPT’s functionality:
LLaMA-2 / LLaMA-3 (Meta) – widely used, supports fine-tuning.
Mistral / Mixtral – strong reasoning, open-weight models.
Falcon – optimized for enterprise inference.
GPT-J / GPT-NeoX – earlier open releases.
Typical Setup:
Run models with frameworks like the following (a minimal sketch follows this list):
vLLM (optimized inference server)
Hugging Face transformers + accelerate
Text Generation Inference (TGI)
Deploy inside Kubernetes / Docker clusters on your on-prem GPUs.
Expose a REST or gRPC API internally, so applications call it like ChatGPT.
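As a concrete example, here is a minimal local-inference sketch with Hugging Face transformers + accelerate. The model ID is only an example; use whichever instruct-tuned model your hardware and licence allow.
# Minimal local-inference sketch with transformers + accelerate.
# The model ID is an example; swap in whatever instruct model you are licensed to run.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # example model ID
    device_map="auto",  # accelerate places weights across available GPUs/CPU
)

prompt = "Explain on-prem LLM hosting in one sentence."
out = pipe(prompt, max_new_tokens=128, do_sample=False)
print(out[0]["generated_text"])
For production you would apply the model’s chat template and put this behind the REST/gRPC API mentioned above; the sketch just shows the raw inference path.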
3. Hybrid Approach (Governance + AI Gateway)
Keep inference on-prem (open-source LLM).
Use retrieval-augmented generation (RAG) with your enterprise data (see the sketch after this list).
Govern access through tools like LLM Gateway, Kong API Gateway, or custom proxy.
Add monitoring, logging, and prompt governance for compliance.
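To make the hybrid pattern concrete, here is a minimal RAG-style sketch, assuming an internal OpenAI-compatible endpoint behind a gateway. The GATEWAY_URL and the retrieve() helper are hypothetical placeholders; in practice retrieval would hit your vector store or search index.
# Minimal RAG-style sketch against an internal OpenAI-compatible endpoint.
# GATEWAY_URL and retrieve() are hypothetical placeholders for your own gateway and vector store.
import requests

GATEWAY_URL = "http://llm-gateway.internal:8000/v1/chat/completions"  # placeholder

def retrieve(query: str) -> list[str]:
    """Placeholder: query your vector store / search index and return relevant snippets."""
    return ["<doc snippet 1>", "<doc snippet 2>"]

def ask(question: str) -> str:
    context = "\n\n".join(retrieve(question))
    payload = {
        "model": "local-gguf",
        "messages": [
            {"role": "system", "content": "Answer only from the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    }
    # Auth, logging, and prompt-governance checks would live in the gateway layer.
    r = requests.post(GATEWAY_URL, json=payload, timeout=60)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

print(ask("What is our data-retention policy?"))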
✅ Decision Factors:
Strict regulatory needs? → Go open-source LLMs fully on-prem.
Want GPT-4 level reasoning? → Azure OpenAI (but not fully on-prem).
Cost-sensitive with GPUs? → Smaller open-source LLMs fine-tuned for tasks.
Here is a step-by-step on-prem deployment guide (with Docker/K8s + open-source ChatGPT-like model + API endpoint)
Dead-simple starter you can stand up locally: Docker + k3s (via k3d) + open-source “ChatGPT-like” model + OpenAI-style API endpoint. We’ll use a tiny, CPU-friendly stack so you don’t need GPUs to try it out.
What you’ll get
A single-node k3s cluster (running inside Docker via k3d)
A Kubernetes Deployment running llama.cpp in server mode (serves an OpenAI-compatible API)
A Service exposed on http://localhost:8000 → /v1/chat/completions, /v1/models, etc.
A one-shot initContainer that downloads a quantized .gguf instruct model at startup
0) Prereqs (quick)
Linux/macOS (Windows works with WSL2)
Docker installed and running
kubectl and k3d installed
Tip (optional installers):
# kubectl (Linux/x86_64)
curl -LO https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl
chmod +x kubectl && sudo mv kubectl /usr/local/bin/
# k3d
curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
1) Create a tiny k3s cluster and open port 8000
k3d cluster create llm -p "8000:8000@loadbalancer"
kubectl get nodes
This gives you a k3s cluster inside Docker, and maps the cluster LoadBalancer to your host’s localhost:8000.
2) Apply the “starter” manifest
Replace MODEL_URL with a direct link to a small instruct model in GGUF format (e.g., a Q4_K_M variant of Mistral-7B-Instruct or Llama-3 Instruct). You can also host the file internally and point at that URL. Aim for a 4–8 GB .gguf to keep memory light.
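If you would rather pre-fetch the file and host it on an internal server, one option is huggingface_hub. The repo ID and filename below are examples; check the exact names on the model page you pick.
# Optional: pre-download a GGUF locally so you can host it on an internal file server.
# Repo ID and filename are examples; verify them on the model's Hugging Face page.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",  # example repo
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",   # example Q4_K_M quant
    local_dir="./models",
)
print("Downloaded to:", path)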
Save the YAML below as llm-starter.yaml, edit the MODEL_URL env var, then apply.
apiVersion: v1
kind: Namespace
metadata:
  name: llm
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-server
  namespace: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-server
  template:
    metadata:
      labels:
        app: llama-server
    spec:
      volumes:
        - name: models
          emptyDir: {} # ephemeral; fine for a starter
      initContainers:
        - name: fetch-model
          image: curlimages/curl:8.8.0
          env:
            - name: MODEL_URL
              value: "https://YOUR-INTERNAL-OR-HF-DIRECT-LINK/model.Q4_K_M.gguf"
          command: ["sh", "-c"]
          args:
            - |
              set -e
              echo "Downloading model to /models/model.gguf ..."
              curl -L "$MODEL_URL" -o /models/model.gguf
              ls -lh /models
          volumeMounts:
            - name: models
              mountPath: /models
      containers:
        - name: llama
          image: ghcr.io/ggerganov/llama.cpp:full
          # Exposes an OpenAI-compatible API (chat/completions, etc.)
          command: ["llama-server"]
          args:
            [
              "-m", "/models/model.gguf",
              "-c", "4096",
              "--host", "0.0.0.0",
              "--port", "8000",
              "-ngl", "0",
              "--api-key", "changeme"
            ]
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: "1"
              memory: "4Gi"
            limits:
              cpu: "2"
              memory: "8Gi"
          volumeMounts:
            - name: models
              mountPath: /models
---
apiVersion: v1
kind: Service
metadata:
  name: llama-svc
  namespace: llm
spec:
  type: LoadBalancer
  selector:
    app: llama-server
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
Apply it:
kubectl apply -f llm-starter.yaml
kubectl -n llm rollout status deploy/llama-server
kubectl -n llm get pods,svc
When ready, your API should be at http://localhost:8000 (thanks to k3d’s port mapping).
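The initContainer may take several minutes to pull a multi-GB .gguf, so the pod can sit in Init for a while. A small poll like the sketch below (assuming the /v1/models route of llama.cpp’s OpenAI-compatible server) tells you when it is actually ready:
# Poll the local endpoint until the server is up and a model is loaded.
# Assumes llama.cpp's OpenAI-compatible /v1/models route; adjust if your build differs.
import time
import requests

URL = "http://localhost:8000/v1/models"
HEADERS = {"Authorization": "Bearer changeme"}  # matches --api-key in the manifest

for attempt in range(60):  # ~5 minutes at 5-second intervals
    try:
        r = requests.get(URL, headers=HEADERS, timeout=5)
        if r.status_code == 200:
            print("Server is ready:", r.json())
            break
    except requests.ConnectionError:
        pass
    time.sleep(5)
else:
    print("Server not ready yet; check: kubectl -n llm get pods")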
3) Smoke test with cURL
# List models (llama.cpp usually returns the one it loaded)
curl -s http://localhost:8000/v1/models -H "Authorization: Bearer changeme"
# Simple chat completion
curl -s http://localhost:8000/v1/chat/completions \
-H "Authorization: Bearer changeme" \
-H "Content-Type: application/json" \
-d '{
"model": "local-gguf",
"messages": [
{"role":"system","content":"You are a helpful assistant."},
{"role":"user","content":"Give me one sentence on why k3s is lightweight."}
],
"temperature": 0.7
}' | jq .
You should get a JSON response with an assistant message.
4) Use it like OpenAI from Python (OpenAI-style client)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="changeme")
resp = client.chat.completions.create(
model="local-gguf",
messages=[
{"role": "system", "content": "Be concise."},
{"role": "user", "content": "Summarize why k3s + Docker is nice for a dev laptop."},
],
temperature=0.3,
)
print(resp.choices[0].message.content)
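If you want tokens as they are generated, the same client can stream. Recent llama.cpp server builds support OpenAI-style streaming; if yours does not, this will simply error and you can fall back to the non-streaming call above.
# Streaming variant of the call above; prints tokens as they arrive.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="changeme")

stream = client.chat.completions.create(
    model="local-gguf",
    messages=[{"role": "user", "content": "Name one benefit of running LLMs on-prem."}],
    stream=True,
)
for chunk in stream:
    # Some chunks (e.g. the final one) carry no content delta.
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()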
Notes & tips
Model choice (GGUF, instruct-tuned): pick an instruct variant (e.g., “Instruct”, “Chat”, “Vn” styles). Quantized Q4_K_M is a good speed/quality tradeoff for CPU.
Memory: 8–12 GB RAM works for small quantized models; bump limits if OOM.
Security: change --api-key from changeme. For internal networks, you can add an Ingress + mTLS later; this starter keeps it simple.
Persistence: emptyDir wipes on redeploy. To keep the model cached, swap emptyDir for a PersistentVolumeClaim, or bake your own image with the model inside.
Speed: this is CPU-only to be beginner-friendly. If you have NVIDIA GPUs, you can swap to a GPU image or vLLM + Mistral/Llama and wire up the NVIDIA device plugin.

First post on Substack. Let’s go!