ai-ta

Autonomous Infrastructure Triage Agent

AI investigates. Humans decide. Infrastructure learns.

Platform ops runs 8-5. Infrastructure runs 24/7.

The 15-hour gap means overnight incidents go undetected until morning — often after they've cascaded into application outages that wake up the wrong team. NIS2 Article 20 makes the board personally liable for incident response capability you don't have outside business hours.

Alert → Investigate → Resolve → Learn → Repeat

Alert: Grafana / any webhook
Investigate: SSH, metrics, logs, Docker
Resolve or Escalate: destructive = human approval
Learn: knowledge accumulates
Repeat: 24/7
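
The "destructive = human approval" step can be pictured as a gate in front of every proposed action. The following is a minimal Go sketch under stated assumptions: `Action`, `Approver`, `Execute`, and `ErrNotApproved` are hypothetical names chosen for illustration and are not taken from ai-ta's code.

```go
package main

import (
	"errors"
	"fmt"
)

// Action is a hypothetical remediation step proposed by the agent.
type Action struct {
	Command     string
	Destructive bool // true when the command changes or deletes state
}

// Approver is a hypothetical hook that asks a human for a decision.
type Approver func(a Action) bool

// ErrNotApproved is returned when a destructive action is rejected.
var ErrNotApproved = errors.New("destructive action not approved")

// Execute runs read-only actions directly and routes destructive ones
// through the approver before anything is executed.
func Execute(a Action, approve Approver, run func(string) error) error {
	if a.Destructive && !approve(a) {
		return ErrNotApproved
	}
	return run(a.Command)
}

func main() {
	run := func(cmd string) error { fmt.Println("running:", cmd); return nil }
	deny := func(Action) bool { return false } // stands in for a human saying no

	fmt.Println(Execute(Action{Command: "df -h"}, deny, run))
	fmt.Println(Execute(Action{Command: "docker system prune -f", Destructive: true}, deny, run))
}
```

In this sketch the investigation command runs without asking anyone, while the cleanup command is refused until the approver returns true.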
Without ai-ta vs. with ai-ta

3am disk alert
  • Without ai-ta: Engineer wakes up, SSHes in, investigates
  • With ai-ta: Agent investigates, requests approval, cleans up. Morning summary.

Recurring issue
  • Without ai-ta: Reinvestigated from scratch every time
  • With ai-ta: Auto-resolved in milliseconds from learned knowledge, with no LLM cost

NIS2 audit
  • Without ai-ta: Scramble to reconstruct timeline
  • With ai-ta: Timestamped trail: detection, investigation, action, approval

New team member
  • Without ai-ta: Weeks of shadowing for tribal knowledge
  • With ai-ta: Queries accumulated knowledge from any AI tool
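
One way to read "auto-resolved from learned knowledge, no LLM cost" is a lookup keyed by an alert fingerprint that runs before any model call. The sketch below assumes exactly that. The `Alert` fields, the `KnowledgeBase` type, and the sample alert are hypothetical and do not describe ai-ta's actual data model.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Alert is a hypothetical, simplified incoming alert.
type Alert struct {
	Name string
	Host string
}

// Fingerprint derives a stable key so recurring alerts map to the same entry.
func Fingerprint(a Alert) string {
	sum := sha256.Sum256([]byte(a.Name + "\x00" + a.Host))
	return hex.EncodeToString(sum[:])
}

// KnowledgeBase maps fingerprints to previously learned resolutions.
type KnowledgeBase map[string]string

// Triage returns a learned resolution when one exists. Otherwise it calls
// investigate (the expensive path) and remembers the result for next time.
func Triage(kb KnowledgeBase, a Alert, investigate func(Alert) string) (resolution string, fromKnowledge bool) {
	key := Fingerprint(a)
	if r, ok := kb[key]; ok {
		return r, true
	}
	r := investigate(a)
	kb[key] = r
	return r, false
}

func main() {
	kb := KnowledgeBase{}
	calls := 0
	investigate := func(a Alert) string {
		calls++ // stands in for the LLM-backed investigation
		return "rotate logs on " + a.Host
	}
	alert := Alert{Name: "DiskUsageHigh", Host: "web-01"}
	for i := 0; i < 2; i++ {
		res, known := Triage(kb, alert, investigate)
		fmt.Printf("resolution=%q fromKnowledge=%v\n", res, known)
	}
	fmt.Println("investigations run:", calls)
}
```

Storing every investigation result unconditionally is a simplification. A real system would presumably decide which outcomes are safe to reuse.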
Dashboard — situation briefing, knowledge auto-resolve, triage history
Settings — integrations, governance, single YAML config

Real data from production

251 triages in the last 30 days (fully autonomous)
84 auto-resolved (zero human effort)
~55s average investigation (vs 30-45 min manual)
~$52 monthly LLM cost (all-in, 251 triages)
99.91% uptime YTD (production SLA)

NIS2 & GDPR ready

Incident handling (Art. 21.2b): Full lifecycle: detect, investigate, respond, document
Continuous monitoring (Art. 21.2a): 24/7 autonomous triage with trend detection
Incident reporting (Art. 23): 24h early warning, 72h detail, monthly summary
Supply chain oversight (Art. 21.2d): Independent monitoring of MSP-managed infra
Board accountability (Art. 20): AI policy version-controlled, commit-stamped
GDPR data minimization: Configurable retention + automated purge per data type
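
"Configurable retention + automated purge per data type" could look roughly like the Go sketch below. Everything in it is an assumption made for illustration: the `Record` and `RetentionPolicy` types, the data type names, and the retention periods are placeholders, not ai-ta's configuration or defaults.

```go
package main

import (
	"fmt"
	"time"
)

// Record is a hypothetical stored item tagged with a data type.
type Record struct {
	Type      string
	CreatedAt time.Time
}

// RetentionPolicy maps a data type to how long its records are kept.
type RetentionPolicy map[string]time.Duration

// Days converts a day count into a time.Duration.
func Days(n int) time.Duration { return time.Duration(n) * 24 * time.Hour }

// Purge returns the records that are still inside their type's retention
// window. Types without a configured retention are kept, which is a choice
// made for this sketch rather than a documented behavior.
func Purge(records []Record, policy RetentionPolicy, now time.Time) []Record {
	kept := make([]Record, 0, len(records))
	for _, r := range records {
		limit, ok := policy[r.Type]
		if ok && now.Sub(r.CreatedAt) > limit {
			continue // older than the retention window: drop it
		}
		kept = append(kept, r)
	}
	return kept
}

func main() {
	now := time.Now()
	// Placeholder periods, not recommendations.
	policy := RetentionPolicy{"ssh_output": Days(30), "audit_trail": Days(365)}
	records := []Record{
		{Type: "ssh_output", CreatedAt: now.Add(-Days(45))},
		{Type: "audit_trail", CreatedAt: now.Add(-Days(45))},
	}
	for _, r := range Purge(records, policy, now) {
		fmt.Println("kept:", r.Type)
	}
}
```

Keeping the policy as a plain map per data type means short-lived investigation output and long-lived audit records can age out on different schedules.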

Works with what you have

Monitor: Prometheus, Grafana, Loki, Healthchecks.io
Infra: Docker, Kubernetes, OpenShift
Notify: Slack, Teams, ServiceNow, ntfy, Email
AI: MCP protocol, Code mode, MCP elicitation
Auth: OIDC, Bearer tokens, Approval gates

Software stops being the constraint

The operator's job becomes steering processes, owning accountability, and managing risk — not navigating dashboards. Three shifts are converging:

Machine-to-machine by default

Monitoring talks to triage, triage talks to remediation, remediation talks to approval gates. Humans intervene at decision points, not execution points. Contracts and schemas enforce determinism between autonomous systems.

Frontends become contextual projections

What matters isn't how the UI looks — it's that the right information reaches the right person at the right moment. A morning email, a mobile approval, an MCP query from another agent — all valid interfaces.

Governance becomes the product

When AI acts autonomously, the value is the audit trail, the approval gates, the policy enforcement, the explainability. The strictest compliance environments aren't obstacles — they're the reason it must be built this way.

Governance-first autonomous operations

Contract-driven
MCP protocol: any AI agent queries and acts on infrastructure knowledge through structured contracts. Not screen scrapes. Not API wrappers.
Process-native
The same triage result renders as an HTML email, an approval webhook, an MCP response, or a terminal session — whatever the process demands.
Self-learning
Every triage cycle feeds the next. Known patterns resolved faster. Cross-run trends flagged before alerts fire. Knowledge survives team turnover.
Compliance-first
Every action timestamped. Every decision traceable. Every destructive command gated. Every AI execution governed by version-controlled organizational policy.
Cross-industry
Declarative, pluggable governance. A maritime operator, a hospital, and a fintech run the same agent with different policy files.

ai-ta contains incidents. It never makes irreversible decisions.

Observes across every layer. Contains with human approval. The boundary is reversibility and blast radius.

What ai-ta does
  • Observe: L7 edge (Cloudflare WAF), L3/L4 perimeter (firewall), L2 network, host & container; all layers, no limits
  • Contain: Block at edge, isolate at network, restart services; every action human-approved, auto-expiring
  • Learn: Accumulate knowledge, detect patterns, skip known-good investigations, flag trends
  • Document: Full audit trail from detection to resolution: timestamped, traceable, exportable

What ai-ta never touches
  • Data: No database operations, backup restores, or storage modifications
  • Identity: No credential rotation, access policy, or permission changes
  • DNS: Slow to reverse, high blast radius; stays with the human
  • Hypervisor: Bare metal and VM lifecycle are last-resort human decisions
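
The boundary described above (containment is human-approved and auto-expiring, and some domains are never touched) can be expressed as a small policy check. This is a sketch under assumptions: the `Domain` values mirror the lists above, but the types and functions are hypothetical and not ai-ta's API.

```go
package main

import (
	"fmt"
	"time"
)

// Domain labels the part of the stack an action targets.
type Domain string

const (
	Edge       Domain = "edge"
	Network    Domain = "network"
	Service    Domain = "service"
	Data       Domain = "data"
	Identity   Domain = "identity"
	DNS        Domain = "dns"
	Hypervisor Domain = "hypervisor"
)

// forbidden lists the domains that stay with humans in this sketch.
var forbidden = map[Domain]bool{Data: true, Identity: true, DNS: true, Hypervisor: true}

// Containment is a hypothetical containment action.
type Containment struct {
	Domain    Domain
	Approved  bool      // set once a human has approved the action
	ExpiresAt time.Time // zero value means no expiry was set
}

// Allowed reports whether a containment may be applied at all: the domain
// must be permitted, a human must have approved, and an expiry must exist.
func Allowed(c Containment) bool {
	return !forbidden[c.Domain] && c.Approved && !c.ExpiresAt.IsZero()
}

// Active reports whether an allowed containment is still in force at now.
func Active(c Containment, now time.Time) bool {
	return Allowed(c) && now.Before(c.ExpiresAt)
}

func main() {
	now := time.Now()
	block := Containment{Domain: Edge, Approved: true, ExpiresAt: now.Add(time.Hour)}
	dns := Containment{Domain: DNS, Approved: true, ExpiresAt: now.Add(time.Hour)}

	fmt.Println("edge block allowed:", Allowed(block))
	fmt.Println("edge block active after 2h:", Active(block, now.Add(2*time.Hour)))
	fmt.Println("dns change allowed:", Allowed(dns))
}
```

Requiring a non-zero expiry in `Allowed` means a containment without an end time is rejected up front, so it cannot silently become permanent.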
ai-ta component topology — single binary, single config, zero orchestration dependencies

Single container

  • One Go binary, one YAML config, one SQLite database
  • Deploys in minutes — Docker, Kubernetes, or OpenShift
  • Your data stays in your network
  • AI governance from a git repo your CISO controls
  • Helm chart ready for enterprise orchestrators

Deploy in minutes. No sales call.

ai-ta is source-available and free to evaluate. Pull the image, point it at your infrastructure, get your first triage report. Subscribe when you're ready for production support.

AITA for letting an AI handle my on-call?

NTA. Your infra, your rules.

Live demo on running infrastructure. Not slides — the real system investigating real alerts.

Download pitch deck (PDF)

Let's talk

Ready to see autonomous infrastructure triage on your infrastructure?