Nov 1, 2025

Network Investigation Agentic Platform

An AI-native operational intelligence platform that consolidates network diagnostics, telemetry intelligence, incident management and AI agents into a single workspace — collapsing 30–90 min investigations into 1–3 min.

AI Agents · MCP · FastAPI · React · TypeScript · Telemetry

The problem

In large enterprise infrastructure environments, when a customer hits a routing, BGP or WAN issue, engineers have to manually pivot across many database clusters, disconnected telemetry systems, runbooks and dashboards — often spending 30 to 90 minutes correlating signal across systems before they can even form a hypothesis.

What I built

A full-stack, AI-native operational platform that brings network diagnostics, telemetry intelligence, incident management, analytics and AI troubleshooting into a single workspace.

I architected and led engineering on the platform end-to-end:

  • Autonomous diagnostic agent (GPT-5 + Claude Opus) — understands natural-language troubleshooting requests, selects the right database clusters, chains diagnostic workflows in sequence, correlates telemetry across systems and produces structured TSG-grade summaries.
  • MCP tooling layer — 200+ diagnostic tools organized as a structured, AI-callable hierarchy spanning every operational domain.
  • Operational dashboards — incident analytics, risk assessment, topology intelligence, incident-quality metrics, SLA monitoring and executive summary views.
  • Conversational ops bot — multi-turn conversations, adaptive cards, change-request automation, HTML-rendered operational responses.
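
To make the idea of an AI-callable tool hierarchy and chained diagnostic workflows concrete, here is a minimal sketch in plain Python. It does not use the MCP SDK and is not the platform's code: the `Tool` and `ToolRegistry` classes, the dotted tool names, the cluster names and the stub results are all hypothetical, chosen only to show how tools might be grouped by domain, tied to a cluster and run in sequence over a shared context.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable

# Hypothetical names throughout: none of these tools, clusters or fields
# come from the real platform.


@dataclass(frozen=True)
class Tool:
    name: str          # dotted path, e.g. "bgp.peer_status"
    cluster: str       # cluster the tool would query
    description: str   # text an agent reads when choosing a tool
    run: Callable[[dict], dict]


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        if tool.name in self._tools:
            raise ValueError(f"duplicate tool: {tool.name}")
        self._tools[tool.name] = tool

    def list_domain(self, domain: str) -> list[str]:
        """Return the sorted tool names under one domain, e.g. 'bgp'."""
        prefix = domain + "."
        return sorted(n for n in self._tools if n.startswith(prefix))

    def run_chain(self, names: list[str], context: dict) -> dict:
        """Run tools in order, merging each result into a shared context."""
        ctx = dict(context)
        trace = []
        for name in names:
            tool = self._tools[name]
            ctx.update(tool.run(ctx))
            trace.append(f"{tool.name}@{tool.cluster}")
        ctx["trace"] = trace
        return ctx


def peer_status(ctx: dict) -> dict:
    # Stub: a real tool would query the routing cluster.
    return {"peer_state": "Idle" if ctx["device"] == "edge-01" else "Established"}


def recent_flaps(ctx: dict) -> dict:
    # Depends on peer_state, so it must run after peer_status.
    return {"flap_count": 4 if ctx["peer_state"] == "Idle" else 0}


registry = ToolRegistry()
registry.register(Tool("bgp.peer_status", "routing-cluster",
                       "Current BGP session state", peer_status))
registry.register(Tool("bgp.recent_flaps", "telemetry-cluster",
                       "Session flaps in the last hour", recent_flaps))
registry.register(Tool("wan.link_loss", "telemetry-cluster",
                       "Packet loss per WAN link", lambda ctx: {"loss_pct": 0.0}))

result = registry.run_chain(["bgp.peer_status", "bgp.recent_flaps"],
                            {"device": "edge-01"})
```

In this sketch the agent's job reduces to picking names from `list_domain(...)` and ordering them. The `trace` entry records which tool ran against which cluster, which is the kind of raw material a structured summary could be built from.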

Technical highlights

| Metric | Value |
| --- | --- |
| Production code | 100K+ lines |
| Database / telemetry clusters | 19 |
| Validated diagnostic tools | 180+ |
| MCP tools surfaced to agents | 200+ |
| React components | 130+ |
| Operational APIs | 70+ |
| Multi-tier Redis caching | yes |
| Parallel cross-cluster querying | yes |
| OpenTelemetry tracing | yes |
| Multi-tenant SSO / OIDC auth | yes |

Stack

Python · FastAPI · React 18 · TypeScript · GPT-5 · Claude Opus · MCP · Database clusters · Redis · Serverless Functions · Cloud Identity · Docker · OpenTelemetry.

Business impact

  • Troubleshooting time: 30–90 min → 1–3 min.
  • 10–30× faster incident diagnostics.
  • Thousands of engineering hours saved annually.
  • Standardized troubleshooting methodology across the org.
  • Institutional operational knowledge encoded into AI workflows.
  • Significantly faster onboarding for junior engineers.

Engineering philosophy

The platform succeeds because the agents are given what they need to succeed: well-shaped MCP tools, clean cluster routing and encoded runbooks. Most enterprise AI bets fail because the tools given to the agents are weak, not because the models are.
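
One possible reading of "well-shaped" is a tool with typed, validated input and a structured result that an agent can branch on, rather than free text. The sketch below illustrates that reading only. The `route_lookup` tool, its fields, the status values and the stub routing table are all hypothetical and are not taken from the platform.

```python
from __future__ import annotations

import ipaddress
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class RouteLookupInput:
    device: str
    prefix: str  # CIDR notation, e.g. "10.0.0.0/24"

    def __post_init__(self) -> None:
        if not self.device:
            raise ValueError("device must be non-empty")
        # Raises ValueError if the prefix is not a valid network.
        ipaddress.ip_network(self.prefix, strict=False)


@dataclass(frozen=True)
class RouteLookupResult:
    status: str              # "found", "not_found" or "invalid_input"
    next_hop: Optional[str]
    summary: str             # short human-readable explanation


# Stub routing table standing in for a real cluster query.
_ROUTES = {("edge-01", "10.0.0.0/24"): "192.0.2.1"}


def route_lookup(raw: dict) -> dict:
    """Validate the arguments, look up the route, return a structured dict."""
    try:
        args = RouteLookupInput(**raw)
    except (TypeError, ValueError) as exc:
        return asdict(RouteLookupResult("invalid_input", None, str(exc)))

    # Normalize host addresses such as 10.0.0.5/24 to their network.
    prefix = str(ipaddress.ip_network(args.prefix, strict=False))
    next_hop = _ROUTES.get((args.device, prefix))
    if next_hop is None:
        return asdict(RouteLookupResult(
            "not_found", None, f"no route for {prefix} on {args.device}"))
    return asdict(RouteLookupResult("found", next_hop, f"{prefix} via {next_hop}"))


ok = route_lookup({"device": "edge-01", "prefix": "10.0.0.5/24"})
bad = route_lookup({"device": "edge-01", "prefix": "bogus"})
missing = route_lookup({"device": "edge-02", "prefix": "10.0.0.0/24"})
```

Because every outcome, including bad input, comes back in the same shape, the calling agent can decide its next step from `status` instead of parsing an error message.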