Public cloud — AI & Machine Learning

AI infrastructure, three ways in.

Build models from scratch on the AI Platform. Call ready-made models through AI Endpoints. Or rent the raw GPU. Same operator, same SLA, same audit trail — whichever altitude you pick.

Schedule consultation ⟶ See the tools

NVIDIA GPUs

H200, H100, L40S, and L4 — current generation, hosted on intSignal infrastructure.

Three altitudes

Build from scratch, call an API, or rent the GPU. Pick what fits.

Encrypted by default

At rest and in transit on every layer of the stack.

24/7 operations

On-call rotation with incident response and post-incident reviews.

Three ways in

Build, call, or rent. You decide the altitude.

Most AI buyers fit one of three shapes. Pick the tier that matches where your team actually is — and slide between them as the project changes.

TIER 01

YOU HAVE A DATA SCIENCE TEAM

AI Platform

End-to-end workspace for building models from scratch. Notebooks, training jobs, deployment, inference — all under one operator, with versioned artifacts and reproducible runs.

AI Notebooks
AI Training
AI Deploy
Inference Solutions
Cluster on Demand

See the platform →

TIER 02

YOU JUST NEED THE MODEL

AI Endpoints

Pre-built model APIs for embeddings, generation, classification, and translation. Pick a model from the catalog, call the endpoint, ship to production. No training, no GPU management.

AI API Services
Model Catalogue
Base · Fast · Batch APIs

Browse endpoints →

TIER 03

YOU WANT THE RAW HARDWARE

GPU Cloud

Dedicated NVIDIA GPUs on intSignal infrastructure. H200 and H100 for large model training and inference. L40S for mixed AI and visual workloads. L4 for cost-efficient inference.

NVIDIA H200
NVIDIA H100
NVIDIA L40S
NVIDIA L4

Compare GPUs →

Tier 01 · AI Platform

The whole lifecycle, under one operator.

Standard frameworks. Standard storage. Standard tracking. The platform handles the things that bog data science teams down — environment setup, GPU scheduling, artifact versioning, model serving — so you can focus on the model itself.

AI Notebooks

EXPLORE · PREPARE · PROTOTYPE

Managed JupyterLab environments with pre-built images for PyTorch, TensorFlow, and JAX. GPU-backed when you need them, CPU-only when you don't. Persistent volumes mean your work survives session restarts.

JupyterLab with pre-built ML images
Attach GPU on demand, detach when idle
Persistent storage per notebook environment
Direct access to object storage and managed databases
Shared environments for team collaboration

python · notebook.ipynb

# Standard imports. GPU available if attached.
import torch
import pandas as pd
from intsignal import storage

# Pull training data from intSignal storage
df = storage.read_parquet("s3://datasets/orders-2026.parquet")
print(f"Loaded {len(df):,} rows")

# Check what GPU we're on
if torch.cuda.is_available():
    gpu = torch.cuda.get_device_name(0)
    print(f"GPU: {gpu}")
# → GPU: NVIDIA H100 80GB HBM3

# Prototype model on a subsample
sample = df.sample(10_000, random_state=42)
# ... build / train / iterate ...

AI Training

SCHEDULED JOBS · MULTI-GPU · CHECKPOINTED

Submit training jobs that get scheduled on real GPUs without you managing the cluster. Distributed training across multiple GPUs or nodes for large models. Automatic checkpointing so a preempted job doesn't lose hours of progress.

Submit jobs via CLI, SDK, or notebook
Single-GPU, multi-GPU, or multi-node configurations
Automatic checkpointing to object storage
Experiment tracking with metrics and artifacts
Spot-priced training for cost-flexible workloads
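
To make the checkpointing idea concrete, here is a minimal checkpoint-and-resume sketch. It is not the AI Training implementation: it uses only the Python standard library and a local JSON file so it runs anywhere, the function names are hypothetical, and the "training step" is a stand-in. In a real job the state would be model and optimizer weights and the path would point at object storage.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(path: Path, state: dict) -> None:
    """Write atomically: temp file in the same directory, then rename over the old file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_name, path)  # a crash mid-write never leaves a half-written checkpoint


def load_checkpoint(path: Path) -> dict:
    """Return the last saved state, or a fresh state if no checkpoint exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {"step": 0, "loss_sum": 0.0}


def train(path: Path, total_steps: int, every: int,
          preempt_at: Optional[int] = None) -> dict:
    """Run (or resume) a toy training loop, checkpointing every `every` steps."""
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if preempt_at is not None and step == preempt_at:
            return state  # simulated preemption: exit without saving
        state["loss_sum"] += 1.0 / (step + 1)  # stand-in for one real training step
        state["step"] = step + 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

A run preempted at step 57 with `every=10` restarts from step 50 on the next call, so at most one checkpoint interval of work is repeated.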

yaml · harden.yml

# CIS hardening role applied to all production nodes
- name: Apply CIS baseline
  hosts: production
  become: true
  vars:
    cis_level: "2"
    audit_log_retention: "90d"

  roles:
    - role: intsignal.cis_hardening
      tags: [hardening, compliance]
    - role: intsignal.osquery
      tags: [endpoint, monitoring]
    - role: intsignal.log_forwarder
      vars:
        target: logs.intsignal.io
      tags: [logging]

  handlers:
    - name: restart auditd
      service: { name: auditd, state: restarted }

AI Deploy

CONTAINERIZED · VERSIONED · ROLLBACK-READY

Package your trained model as a container and deploy it to an inference endpoint. Built-in versioning means rollbacks are one command. Traffic-splitting lets you canary new model versions against production safely.

Containerized model packaging
Version pinning with rollback to any prior version
Canary deployments with traffic splitting
Auto-scaling based on request volume
Built-in monitoring of latency and errors
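
As a rough illustration of auto-scaling on request volume, the sketch below computes a replica count from observed traffic. It assumes the scaler aims for a fixed request rate per replica and clamps the result between a minimum and a maximum; the function and parameter names are hypothetical, and a production autoscaler would typically add smoothing and cooldown periods that are not shown here.

```python
import math


def desired_replicas(observed_qps: float, target_qps: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Replicas needed so each one serves at most `target_qps`, clamped to the bounds."""
    if target_qps <= 0:
        raise ValueError("target_qps must be positive")
    needed = math.ceil(observed_qps / target_qps)
    return max(min_replicas, min(max_replicas, needed))


# Illustrative numbers: 50 requests/second per replica, between 1 and 8 replicas.
for qps in (0, 50, 120, 1000):
    print(qps, "qps ->", desired_replicas(qps, target_qps=50, min_replicas=1, max_replicas=8))
```

With these numbers, idle traffic stays at the floor of 1 replica, 120 requests/second needs 3, and anything beyond 400 requests/second is capped at 8.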

yaml · deploy.yaml

# Deploy a model version with canary traffic split
apiVersion: ai.intsignal.io/v1
kind: ModelDeployment
metadata:
  name: order-classifier
spec:
  model:
    source: s3://models/classifier/v4.2/
    runtime: pytorch:2.4-cuda12
  resources:
    gpu: l40s
    count: 1
  autoscale:
    min: 1
    max: 8
    target_qps: 50
  traffic:
    - version: v4.1
      weight: 90
    - version: v4.2
      weight: 10   # canary
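
One way to picture what a 90/10 traffic split does is deterministic, hash-based routing. The sketch below is an assumption about how such a split could work, not a description of the platform's router: `pick_version` and the request-ID scheme are hypothetical. Hashing the request ID means the same request always lands on the same version, which makes canary results easier to reproduce.

```python
import hashlib


def pick_version(request_id: str, traffic: list[tuple[str, int]]) -> str:
    """Map a request ID to a model version according to percentage weights."""
    if sum(weight for _, weight in traffic) != 100:
        raise ValueError("traffic weights must sum to 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    upper = 0
    for version, weight in traffic:
        upper += weight
        if bucket < upper:
            return version
    raise RuntimeError("unreachable when weights sum to 100")


TRAFFIC = [("v4.1", 90), ("v4.2", 10)]  # same weights as the deployment spec

counts = {"v4.1": 0, "v4.2": 0}
for i in range(10_000):
    counts[pick_version(f"req-{i}", TRAFFIC)] += 1
print(counts)  # roughly a 90/10 split
```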

Inference Solutions

SERVE · SCALE · OBSERVE

Production-grade inference serving with the runtime tuned to the model — Triton for general PyTorch/TensorFlow, vLLM for LLMs, ONNX Runtime where it fits. Latency and throughput are observable. Failures are paged, not ignored.

Triton, vLLM, ONNX Runtime — pick what fits
Dynamic batching for higher throughput
p50/p95/p99 latency dashboards out of the box
Cold-start optimization for elastic workloads
Audit log of every request for compliance reviews
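
For readers less familiar with p50/p95/p99, the sketch below computes them from raw latency samples using the nearest-rank method. It is only an illustration of what the numbers mean: the dashboards may use a different estimator (for example, histogram buckets), and the function names are hypothetical.

```python
def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[rank - 1]


def latency_summary(samples_ms: list[float]) -> dict:
    """Summarize latencies the way a p50/p95/p99 panel would."""
    return {f"p{pct}": percentile(samples_ms, pct) for pct in (50, 95, 99)}


# One slow request barely moves the median but dominates the tail.
print(latency_summary([12, 15, 11, 250, 14]))
```

This is why tail percentiles matter for interactive workloads: the median here stays at 14 ms while p95 and p99 both report the 250 ms outlier.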

python · client.py

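# Call your deployed model from any client
import os

import requests

# Assumption: the API token is exported as INTSIGNAL_KEY in the environment
TOKEN = os.environ["INTSIGNAL_KEY"]

resp = requests.post(
    "https://inference.intsignal.io/order-classifier/v4.2/predict",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "inputs": [
            {"text": "shipping address is in Quebec"},
            {"text": "customer requested gift wrap"},
        ]
    },
    timeout=10,  # fail fast instead of hanging on a stalled connection
)
resp.raise_for_status()  # surface HTTP errors before parsing the body

result = resp.json()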
# {
#   "predictions": [
#     {"label": "international", "confidence": 0.94},
#     {"label": "preference_note", "confidence": 0.87}
#   ],
#   "latency_ms": 32,
#   "model_version": "v4.2"
# }

Cluster on Demand

EPHEMERAL · MULTI-NODE · TEAR-DOWN

Spin up an entire GPU cluster for a specific workload, run it, and tear it down. Useful for one-off large training runs, hyperparameter sweeps, or research bursts where standing capacity doesn't make sense.

Provision a multi-node GPU cluster in minutes
SLURM or Kubernetes scheduler — your choice
Shared high-throughput storage attached automatically
Tear-down releases all resources cleanly
Billed by the hour for the actual cluster lifetime

shell · cluster-up.sh

# Spin up a 32-GPU H100 cluster for a 6-hour sweep
$ intsignal-ai cluster create \
    --name hyperparam-sweep-may \
    --gpu-type h100 \
    --node-count 4 \
    --gpus-per-node 8 \
    --scheduler slurm \
    --shared-storage 10TB \
    --ttl 6h

→ cluster: cl-9b2c4f provisioning
→ ready: ~3 min
→ ssh: cl-9b2c4f.clusters.intsignal.io
→ teardown: 2026-05-21 14:00 UTC (auto)

# Run your sweep via SLURM
$ ssh cl-9b2c4f.clusters.intsignal.io
$ sbatch sweep.slurm

→ submitted 64 jobs across 32 GPUs
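
The arithmetic behind a run like this is worth doing before you provision. The sketch below works it through for the example cluster. It assumes billing is proportional to GPU-hours over the cluster lifetime and that each sweep job uses one GPU; the rate is a placeholder parameter, not a published price, and the function names are hypothetical.

```python
def gpu_hours(node_count: int, gpus_per_node: int, hours: float) -> float:
    """Total GPU-hours consumed over the cluster lifetime."""
    return node_count * gpus_per_node * hours


def cluster_cost(node_count: int, gpus_per_node: int, hours: float,
                 rate_per_gpu_hour: float) -> float:
    """Cost under the assumption of a flat per-GPU-hour rate (placeholder value)."""
    return gpu_hours(node_count, gpus_per_node, hours) * rate_per_gpu_hour


def sweep_waves(jobs: int, total_gpus: int, gpus_per_job: int = 1) -> int:
    """How many sequential waves are needed to run all jobs on the available GPUs."""
    concurrent = total_gpus // gpus_per_job
    if concurrent < 1:
        raise ValueError("a single job needs more GPUs than the cluster has")
    return -(-jobs // concurrent)  # integer ceiling


total = gpu_hours(node_count=4, gpus_per_node=8, hours=6)
waves = sweep_waves(jobs=64, total_gpus=32)
print(f"{total} GPU-hours, {waves} waves, up to {6 / waves} hours per wave")
```

For the example above that is 192 GPU-hours and two waves of 32 jobs, so each job has to finish within about three hours to fit inside the 6-hour TTL.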

Tier 02 · AI Endpoints

Don't train a model. Call one.

AI Endpoints is a catalog of pre-built model APIs — embeddings, classification, generation, translation, vision, speech. Pick the model, get an API key, ship it. Three response patterns for three latency profiles.

CATALOG

AI Endpoints Model Catalogue

browse · select · subscribe

Curated catalog of open-source and licensed models for common tasks. Each model lists its size, license, recommended use cases, and the GPU it runs on. Pin a version so updates don't break your application.

Model types: LLM · embed · vision · ASR
Versioning: pinned per consumer
Licensing: OSS + commercial

BASE

AI Endpoints Base API

standard request/response

Synchronous request/response for most use cases. Predictable latency, fair-share scheduling, supports all models in the catalog. The default unless you specifically need fast or batch.

Pattern: sync request
Use for: most workloads
Quota: per-key rate limit

FAST

AI Endpoints Fast API

low-latency · reserved capacity

Same models, but with dedicated capacity and priority scheduling. Pick this for interactive use cases — search-as-you-type, voice assistants, anything where p99 latency matters more than per-call cost.

Pattern: sync · prioritized
Use for: interactive UX
Capacity: reserved

BATCH

AI Endpoints Batch API

offline · throughput-optimized

Submit a list of jobs, get back a results file. Costs less per call because we batch and run when capacity is available. Best for nightly enrichment, bulk classification, or backfill jobs.

Pattern: async · file in/out
Use for: offline jobs
SLA: completion windows

One auth header, every model.

OpenAI-compatible request shape so existing client libraries work without modification. Switch models by changing one parameter.

# Same shape, switch models freely
$ curl https://endpoints.intsignal.io/v1/chat/completions \
    -H "Authorization: Bearer $INTSIGNAL_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "llama-3.3-70b-instruct",
      "messages": [
        {"role": "user", "content": "Summarize this in 3 bullets..."}
      ]
    }'

# → response in ~400ms (Fast API)
# → switch model: "mixtral-8x22b-instruct" — same shape
# → switch to embed: /v1/embeddings — same auth
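
The same call can be sketched in Python with only the standard library. This is an illustration of the "same shape, same auth" idea rather than an official client: the helper names are hypothetical, the embeddings payload follows the usual OpenAI-style `model` and `input` fields on the assumption that compatibility extends to that route, and the requests are built but not sent.

```python
import json
import urllib.request

BASE_URL = "https://endpoints.intsignal.io/v1"


def build_request(path: str, payload: dict, api_key: str) -> urllib.request.Request:
    """Build a POST request; every route shares the same auth header."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def chat_request(model: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Chat completion: switching models means changing only the `model` field."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return build_request("/chat/completions", payload, api_key)


def embedding_request(model: str, text: str, api_key: str) -> urllib.request.Request:
    """Embeddings: different route and payload, identical auth."""
    return build_request("/embeddings", {"model": model, "input": text}, api_key)


# To send one for real:
#   with urllib.request.urlopen(chat_request(...), timeout=30) as resp:
#       body = json.load(resp)
```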

Tier 03 · GPU Cloud Infrastructure

Pick your card. We run the rest.

Four current-generation NVIDIA GPUs on intSignal infrastructure. We handle the host, the network, the storage, the driver stack, the firmware. You handle your code.

Spec	NVIDIA H200	NVIDIA H100	NVIDIA L40S	NVIDIA L4
Architecture	Hopper	Hopper	Ada Lovelace	Ada Lovelace
GPU memory	141 GB HBM3e	80 GB HBM3	48 GB GDDR6 ECC	24 GB GDDR6
Memory bandwidth	~4.8 TB/s	~3.35 TB/s	~864 GB/s	~300 GB/s
Interconnect	NVLink + NVSwitch	NVLink + NVSwitch	PCIe Gen4	PCIe Gen4
MIG support	Yes (up to 7)	Yes (up to 7)	No	No
Best for	100B+ parameter LLM training and inference. Long-context generation. Memory-bound workloads that no longer fit in the H100's 80 GB.	Production LLM training and inference up to ~70B parameters. The current default for serious AI workloads.	Mixed AI and visual workloads: 7B–13B LLMs, diffusion models, vision transformers, multimodal inference, rendering.	Cost-efficient inference for small to mid-size models. Video transcoding, encoding, and content generation pipelines.

Specs reflect NVIDIA's published reference figures. Actual provisioned configurations may vary by node type and region — we'll walk through what fits your workload in the consultation.
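
A quick back-of-the-envelope check helps when reading the memory row: model weights alone need roughly parameter count times bytes per parameter. The sketch below applies that to the four cards using the memory figures from the table. It is a lower bound only, since KV cache, activations, and (for training) gradients and optimizer state all add to it, and the function names are hypothetical.

```python
# GPU memory in GB, taken from the comparison table above.
GPU_MEMORY_GB = {"H200": 141, "H100": 80, "L40S": 48, "L4": 24}


def weights_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Memory for the weights alone: 1e9 params x bytes, expressed in GB (1e9 bytes)."""
    return params_billion * bytes_per_param


def gpus_that_fit(params_billion: float, bytes_per_param: float = 2.0) -> list[str]:
    """Cards whose memory can hold the weights on a single GPU (weights only)."""
    need = weights_gb(params_billion, bytes_per_param)
    return [name for name, mem in GPU_MEMORY_GB.items() if need <= mem]


for size in (7, 13, 70):
    print(f"{size}B @ 16-bit: {weights_gb(size):.0f} GB ->", gpus_that_fit(size))
print("70B @ 8-bit:", gpus_that_fit(70, bytes_per_param=1.0))
```

A 70B model at 16-bit precision needs about 140 GB for weights, which is why it only just fits on a single H200 and must be sharded or quantized on smaller cards.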

What we operate, every day.

Hardening and operating practices aligned to the frameworks your assessors recognize. intSignal is not the certified entity for most of these — we deliver the controls and evidence that make your audit possible. Where required, we partner with FedRAMP-authorized providers for federal scoping.

HARDENING

CIS Benchmark

Engine hardening with documented exceptions.

SOC 2

Aligned to Type II

Controls and evidence cadence ready for audit.

ISO

Aligned to 27001 / 27017

Cloud-services control narratives.

HIPAA

HIPAA-compliant ops

Encryption, access, audit; BAA via partner.

FEDERAL

FedRAMP via partner

Authorized hyperscaler regions integrated.

DATACENTER

Compliant facility

Hosting facility carries its own attestations.

FAQ

Questions data and AI teams ask before signing.

If yours isn't here, ask in the consultation — we'd rather flag the awkward bits early than discover them in production.

AI Platform vs AI Endpoints vs GPU Cloud — which one is right for us?

If you have a data science team building custom models, start with the AI Platform. If you just need to call an existing model (chat, embeddings, vision, etc.), use AI Endpoints. If you want raw GPU access and bring your own software stack, use GPU Cloud. Most customers use two — for example, GPU Cloud for training runs plus AI Endpoints for the parts where a generic model is good enough.

Which GPU should we pick?

For large LLM training (70B+) and long-context inference, H200. For most production LLM and serious AI training, H100. For mixed AI and visual workloads at smaller scale (7B–13B models, diffusion, vision), L40S. For cost-efficient inference and video pipelines, L4. We'll size it with you during onboarding, because a GPU that is too large or too small for the workload wastes budget.

Can we bring our own model to AI Endpoints?

Yes — but that's actually AI Deploy on the AI Platform, not the Endpoints catalog. If you have a model you've trained or fine-tuned, deploy it via AI Deploy and call it from your application. The Endpoints catalog is for pre-curated models we operate on shared infrastructure.

What about training data — does it leave our control?

Training data stays in your storage. We don't train against customer data, we don't pool data across customers, and we don't share data with model providers. Encryption at rest and access control work the same way as they do for the storage and database services.

Do you support distributed training across many GPUs?

Yes. AI Training supports multi-GPU and multi-node configurations with NVLink/NVSwitch on H100/H200 nodes. PyTorch DDP, FSDP, DeepSpeed, and Megatron-LM are all supported. For very large runs, Cluster on Demand gives you an ephemeral SLURM or Kubernetes cluster sized to the workload.

How does pricing work?

GPU Cloud is billed per GPU-hour. AI Platform jobs are billed by the GPU resources they consume. AI Endpoints are billed per request or per token depending on the model. Reserved capacity and committed-use discounts are available for predictable workloads. We'll model the cost with you against your actual workload shape before you commit.

What frameworks do you support?

PyTorch, TensorFlow, JAX, and Hugging Face Transformers are first-class with maintained container images. For inference, Triton Inference Server, vLLM, ONNX Runtime, and TensorRT-LLM are supported. If you're using something we don't maintain, you can bring your own container image: we operate the GPU and the host, you operate the runtime.

Stop fighting with GPUs. Start shipping models.

Tell us what you're trying to build — fine-tuning, inference at scale, an application powered by a pre-trained model, or pure research. We'll propose the right tier and walk you through the cost model.

Schedule consultation ⟶