Bring Your Own Models

Run Secbez with any LLM — managed providers, OpenAI-compatible endpoints, or your own open-source serving stack.

Enterprise deployments are not tied to any single LLM provider. Secbez defines a structured contract for every reasoning step, and the routing layer can dispatch each step to whichever model you configure.

Provider integrations

Secbez integrates with the major hosted LLM providers and with the common open-source serving stacks. Exact wiring (keys, endpoints, regional details) is scoped per engagement.

Provider	Notes
OpenAI	Direct API or any OpenAI-compatible endpoint
Azure OpenAI	Per-deployment routing supported
Anthropic	Claude family
AWS Bedrock	Bedrock-hosted model families
Google Vertex AI	Vertex-hosted model families
vLLM	Open-source serving
Text Generation Inference (TGI)	Open-source serving
Ollama	Convenient for single-node and developer deployments
llama.cpp / llamafile	CPU and Apple Silicon
TensorRT-LLM (Triton)	NVIDIA-optimized
Custom HTTP	Any internal model gateway with a documented contract

You can mix hosted models for one part of the pipeline with local open-source models for the rest. Fallback chains are also supported, so a primary model that times out can hand off to a secondary without interrupting the scan.
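To make the idea of mixed routing concrete, here is a minimal sketch of a per-step routing table in Python. It is not Secbez's configuration format: the step names (`triage`, `explain_finding`), the model names, and the `ROUTES` and `ModelTarget` structures are all hypothetical, chosen only to show hosted and local models sharing one pipeline with an ordered fallback chain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTarget:
    provider: str  # a provider label such as those in the table above
    model: str     # hypothetical model or deployment name


# Hypothetical routing table: each step maps to an ordered chain,
# primary model first, fallbacks after it.
ROUTES: dict[str, list[ModelTarget]] = {
    "triage": [
        ModelTarget("vllm", "local-small"),
    ],
    "explain_finding": [
        ModelTarget("anthropic", "hosted-primary"),
        ModelTarget("vllm", "local-fallback"),
    ],
}


def chain_for(step: str) -> list[ModelTarget]:
    """Return the ordered model chain for a step, or raise if it is unrouted."""
    try:
        return ROUTES[step]
    except KeyError:
        raise ValueError(f"no route configured for step {step!r}") from None


def providers_used() -> set[str]:
    """Collect every provider referenced anywhere in the routing table."""
    return {target.provider for chain in ROUTES.values() for target in chain}
```

In this sketch, `chain_for("explain_finding")` returns the hosted model first and the local model second, which is the order a gateway would try them in.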

Choosing a model

Secbez's reasoning contracts are structured — models must reliably produce JSON in the requested schema. We don't publish a fixed model recommendation list, because the right choice depends on your region, compliance constraints, hardware, and cost profile.
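The following sketch illustrates what "reliably produce JSON in the requested schema" means in practice. It uses only the Python standard library, and the schema shown (`finding_id`, `severity`, `confirmed`) is an invented example, not a Secbez contract. A candidate model that wraps its JSON in prose, omits a field, or returns the wrong type fails the check.

```python
import json
from typing import Optional

# Hypothetical contract: required field name -> expected Python type.
FINDING_SCHEMA = {"finding_id": str, "severity": str, "confirmed": bool}


def parse_structured(raw: str, schema: dict) -> Optional[dict]:
    """Return the parsed object if it satisfies the schema, otherwise None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON at all (e.g. prose around the payload)
    if not isinstance(obj, dict):
        return None  # valid JSON, but not an object
    for key, expected_type in schema.items():
        if key not in obj or not isinstance(obj[key], expected_type):
            return None  # missing field or wrong type
    return obj
```

A production contract would more likely be expressed in a full schema language; the flat type map here is a simplification to keep the example self-contained.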

What we do instead: as part of an Enterprise engagement, we work with your team to select models that fit your environment, validate them against the Secbez evals harness (precision, recall, JSON-mode adherence, cost-per-confirmed-finding), and configure routing accordingly.
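The four metrics named above can be computed from labeled evaluation records roughly as follows. This is a sketch under stated assumptions, not the Secbez evals harness: the `EvalRecord` fields are hypothetical, and a "confirmed finding" is assumed here to mean a true positive (the model flagged it and the ground truth agrees).

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    predicted: bool   # model reported the finding as real
    actual: bool      # ground-truth label for the finding
    valid_json: bool  # output parsed and matched the requested schema
    cost_usd: float   # spend attributed to this call


def summarize(records: list) -> dict:
    """Compute precision, recall, JSON adherence, and cost per confirmed finding."""
    tp = sum(r.predicted and r.actual for r in records)
    fp = sum(r.predicted and not r.actual for r in records)
    fn = sum(not r.predicted and r.actual for r in records)
    total_cost = sum(r.cost_usd for r in records)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "json_adherence": (
            sum(r.valid_json for r in records) / len(records) if records else 0.0
        ),
        # Assumption: confirmed findings are the true positives.
        "cost_per_confirmed_finding": total_cost / tp if tp else float("inf"),
    }
```

Dividing total spend by confirmed findings, rather than by call count, means a cheap model that produces many false positives can still score worse than a more expensive, more precise one.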

Latency, throughput, and degradation

The scan pipeline is concurrent, so per-call latency dominates total scan time only when a step sits on the critical path.
The model gateway implements per-step timeouts and fallback chains. If your primary model is unavailable, a secondary picks up; if all are unavailable, the scan degrades to deterministic fallback explanations and the gate decision is unaffected.
Cost per confirmed finding is the metric we recommend tracking — not raw call count.
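The degradation behavior described above can be sketched as follows. Everything here is illustrative rather than the actual gateway: the `ModelCall` signature, the function names, and the severity-based gate rule are assumptions. The sketch assumes each model client enforces the timeout itself and raises `TimeoutError` or `ConnectionError` when it cannot answer, and that the gate decision is computed without any model call, which is why it is unaffected when every model is down.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical client signature: (prompt, timeout_s) -> explanation text.
ModelCall = Callable[[str, float], str]


def explain_with_fallback(
    prompt: str,
    chain: Sequence[ModelCall],
    timeout_s: float,
    deterministic_explanation: str,
) -> Tuple[str, str]:
    """Try each model in order; return (explanation, source)."""
    for index, call in enumerate(chain):
        try:
            return call(prompt, timeout_s), f"model[{index}]"
        except (TimeoutError, ConnectionError):
            continue  # this model is unavailable; try the next one
    # Every model failed: degrade to the deterministic text.
    return deterministic_explanation, "deterministic"


def run_step(finding: dict, chain: Sequence[ModelCall], timeout_s: float = 10.0) -> dict:
    """Compute the gate decision first, then attach a best-effort explanation."""
    # Hypothetical gate rule; no model is involved in this decision.
    blocked = finding["severity"] in {"high", "critical"}
    explanation, source = explain_with_fallback(
        prompt=f"Explain finding {finding['id']}",
        chain=chain,
        timeout_s=timeout_s,
        deterministic_explanation=f"Rule {finding['rule']} matched.",
    )
    return {"blocked": blocked, "explanation": explanation, "source": source}
```

With a chain in which every client raises, `run_step` still returns the same `blocked` value; only `explanation` and `source` change.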

See BYO GPU for compute guidance.
