Secbez Docs

Bring Your Own Models

Run Secbez with any LLM: managed providers, OpenAI-compatible endpoints, or your own open-source serving stack.

Enterprise deployments are not tied to any single LLM provider. Secbez defines a structured contract for every reasoning step, and the routing layer can dispatch each step to whichever model you configure.

Provider integrations

Secbez integrates with the major hosted LLM providers and with the common open-source serving stacks. Exact wiring (keys, endpoints, regional details) is scoped per engagement.

Provider                           Notes
OpenAI                             Direct API or any OpenAI-compatible endpoint
Azure OpenAI                       Per-deployment routing supported
Anthropic                          Claude family
AWS Bedrock                        Bedrock-hosted model families
Google Vertex AI                   Vertex-hosted model families
vLLM                               Open-source serving
Text Generation Inference (TGI)    Open-source serving
Ollama                             Convenient for single-node and developer deployments
llama.cpp / llamafile              CPU and Apple Silicon
TensorRT-LLM (Triton)              NVIDIA-optimized
Custom HTTP                        Any internal model gateway with a documented contract

You can mix hosted models for one part of the pipeline with local open-source models for the rest. Fallback chains are also supported, so a primary model that times out can fall back to a secondary without interrupting the scan.
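To illustrate the idea of per-step routing with fallback chains, here is a minimal Python sketch. It is not Secbez's actual configuration format. The step names ("triage", "explain"), the provider labels, and the model identifiers are all hypothetical placeholders; real wiring is scoped per engagement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    provider: str  # illustrative label, e.g. "anthropic" or "vllm"
    model: str     # placeholder model identifier, not a real model name


# Hypothetical routing table: pipeline step -> ordered fallback chain.
# The first entry is the primary; later entries are tried in order.
ROUTES = {
    "triage": [Target("vllm", "local-model-a"), Target("ollama", "local-model-b")],
    "explain": [Target("anthropic", "hosted-model-a"), Target("vllm", "local-model-a")],
}

# Chain used for any step that has no explicit route.
DEFAULT_CHAIN = [Target("vllm", "local-model-a")]


def chain_for(step: str) -> list[Target]:
    """Return the ordered fallback chain for a step, or the default chain."""
    return ROUTES.get(step, DEFAULT_CHAIN)
```

In this sketch the "explain" step prefers a hosted model and falls back to a local one, while "triage" stays entirely on local serving.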

Choosing a model

Secbez's reasoning contracts are structured: models must reliably produce JSON in the requested schema. We don't publish a fixed model recommendation list, because the right choice depends on your region, compliance constraints, hardware, and cost profile.
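As a rough illustration of what "reliably produce JSON in the requested schema" means in practice, the sketch below checks a raw model response against a small schema using only the Python standard library. The field names in FINDING_SCHEMA are assumptions made up for this example and are not the real Secbez contract.

```python
import json
from typing import Optional

# Hypothetical schema for illustration: field name -> expected Python type.
FINDING_SCHEMA = {
    "finding_id": str,
    "verdict": str,
    "confidence": float,
    "rationale": str,
}


def parse_structured(raw: str, schema: dict) -> Optional[dict]:
    """Return the parsed object if it matches the schema exactly, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON at all
    if not isinstance(obj, dict) or set(obj) != set(schema):
        return None  # missing or unexpected fields
    for key, expected in schema.items():
        value = obj[key]
        if expected is float:
            # Accept ints where a number is expected, but reject booleans
            # (bool is a subclass of int in Python).
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                return None
        elif not isinstance(value, expected):
            return None
    return obj
```

A model that frequently causes this kind of check to return None would score poorly on JSON-mode adherence, whatever its raw reasoning quality.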

What we do instead: as part of an Enterprise engagement, we work with your team to select models that fit your environment, validate them against the Secbez evals harness (precision, recall, JSON-mode adherence, cost-per-confirmed-finding), and configure routing accordingly.
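The four metrics named above can be computed from simple counts. The sketch below is an assumption about how such a summary might look, not the Secbez evals harness itself. In particular, it treats a "confirmed finding" as a true positive, and the EvalRun fields are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class EvalRun:
    # All field names are hypothetical; they are plain counts from one eval run.
    true_positives: int          # findings the model reported that were confirmed
    false_positives: int         # findings the model reported that were not real
    false_negatives: int         # real findings the model missed
    schema_valid_responses: int  # responses that parsed and matched the schema
    total_responses: int
    total_cost_usd: float


def summarize(run: EvalRun) -> dict:
    """Compute precision, recall, JSON adherence, and cost per confirmed finding."""
    tp, fp, fn = run.true_positives, run.false_positives, run.false_negatives
    return {
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "json_adherence": (
            run.schema_valid_responses / run.total_responses
            if run.total_responses else 0.0
        ),
        # Assumption: with no confirmed findings the cost is reported as infinite.
        "cost_per_confirmed_finding": (
            run.total_cost_usd / tp if tp else float("inf")
        ),
    }
```

Comparing these summaries across candidate models on the same eval set gives a like-for-like basis for the routing decision.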

Latency, throughput, and degradation

  • The scan pipeline is concurrent. Per-call latency dominates total scan time only for steps that sit on the critical path.
  • The model gateway implements per-step timeouts and fallback chains. If your primary model is unavailable, a secondary picks up; if all are unavailable, the scan degrades to deterministic fallback explanations and the gate decision is unaffected.
  • We recommend tracking cost per confirmed finding rather than raw call count.
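The timeout and degradation behavior described in the list above can be sketched as follows. This is a simplified illustration under stated assumptions, not the gateway's real implementation: call_with_fallback, its parameters, and the "model[i]" source labels are hypothetical names, and each model is represented as a plain Python callable.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

ModelCall = Callable[[str], str]  # hypothetical: prompt in, explanation out


def call_with_fallback(
    prompt: str,
    chain: Sequence[ModelCall],
    timeout_s: float,
    deterministic_fallback: Callable[[str], str],
) -> tuple[str, str]:
    """Try each model in order; return (explanation, source label).

    If every model times out or fails, return the deterministic explanation.
    Only the explanation text is produced here; the gate decision is assumed
    to be computed elsewhere and does not depend on this result.
    """
    for index, call in enumerate(chain):
        pool = ThreadPoolExecutor(max_workers=1)
        future = pool.submit(call, prompt)
        try:
            return future.result(timeout=timeout_s), f"model[{index}]"
        except Exception:
            # Timeout or provider error: move on to the next model in the chain.
            continue
        finally:
            # Do not block on a slow call. Note that the worker thread is not
            # killed; it keeps running in the background until it finishes.
            pool.shutdown(wait=False)
    return deterministic_fallback(prompt), "deterministic"
```

A real gateway would more likely enforce timeouts at the HTTP client level so that abandoned calls are actually cancelled; threads are used here only to keep the sketch self-contained.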

See BYO GPU for compute guidance.
