🚧 Work In Progress - Some content is placeholder/dummy data 🚧

Naman Varshney

Principal Architect · AI Systems & Infrastructure

I design resilient AI platforms that scale — from real-time event pipelines to cost-aware LLM planners

12+ years experience
99.99% uptime targets
AI/Infra scaling expert
₹15+ Cr cost optimizations
Naman Varshney - Professional headshot

🏃‍♂️ HYROX Athlete

"Systems that endure — in code and in sport."

Bangalore Solo – April 2026

👨‍💻

About

Professional

I'm Naman Varshney, Principal Architect with 12+ years building distributed systems. Currently leading AI & infra scaling at TripFactory, where I've reduced costs by ₹15+ Cr and maintained 99.99% uptime across 15+ microservices serving 500K+ users.

My expertise spans from high-frequency trading systems to AI-powered platforms. I believe in choosing boring technology that scales, then optimizing the hell out of it.

Personal

Naman at HYROX race

🏃‍♂️ HYROX Athlete

Next race: Bangalore Solo – April 2026

👨‍👩‍👧‍👦 Father of two

Teaching prioritization & efficiency

🏸 Sports

HYROX, Badminton, Cricket, Running

"The same grit and discipline I bring to competition, I bring to designing reliable systems."

Skills

🏗️ Production Stack Architecture

Apple Watch → Kafka → Feature Store → LLM Router → Dashboard

Real-time event processing with intelligent routing
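
For illustration, a minimal Python sketch of the first hop of this pipeline: consuming watch events from Kafka and rolling them up into simple features before they reach the feature store. The topic name, message fields, and feature logic are assumptions, not the production code.

```python
import json
from statistics import mean

from confluent_kafka import Consumer  # assumes a reachable Kafka broker

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "workout-features",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["watch.workout.events"])  # hypothetical topic name

heart_rates: list[int] = []
try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())      # e.g. {"hr": 148, "ts": 1714480000}
        heart_rates.append(event["hr"])
        if len(heart_rates) >= 60:           # roll up roughly a minute of samples
            features = {"avg_hr": mean(heart_rates), "max_hr": max(heart_rates)}
            print(features)                  # in production this lands in the feature store
            heart_rates.clear()
finally:
    consumer.close()
```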

💻 Languages

Python · TypeScript · Go · Swift · Java · Rust

⚙️ Infra/Streaming

Kafka · Redis · PostgreSQL · Docker · Kubernetes · Microservices · Event Sourcing

🤖 AI/LLM

CrewAI · LangChain · LLMs · RAG Systems · OpenAI API · AI Agents · Function Calling · Chat Bots · Prompt Engineering

☁️ Cloud/DevOps

AWS · GCP · Azure · Terraform · Grafana · Prometheus · ELK Stack

💼

Experience

TripFactory — Principal Architect

2023–Present

Context:

Large travel commerce codebase (search, pricing, booking, payments) with latency & cost issues.

Actions:

  • Carved domain services (search/pricing/payments)
  • Introduced BFF (Next.js) + caching strategies; added observability (SLOs, tracing)
  • Piloted LLM assistants: log triage & catalog normalization (RAG); see the sketch below

Results:

P95 latency ↓ ~35% · Infra cost ↓ ₹15+ Cr · Incidents ↓ ~60%

Stack:

LLMs (OpenAI, Gemini, Anthropic) · CrewAI · LangGraph · LangChain · Java · Python · Postgres · Redis · Kafka · AWS/GCP

Artifacts: Before/after latency chart, service map, SLO dashboard
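
For a flavour of the log-triage pilot, here is a heavily simplified RAG sketch in Python: embed a handful of runbook snippets, retrieve the closest one for an incoming error log, and ask an LLM for a first triage step. The runbook text, model names, and helper functions are illustrative assumptions, not the production implementation behind the stack listed above.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

# Tiny stand-in corpus; the real system indexes full runbooks and service docs.
runbooks = {
    "payment-timeout": "If the payment gateway times out, check the retry queue and ...",
    "cache-stampede": "When Redis evictions spike, enable request coalescing and ...",
}
runbook_vecs = {name: embed(text) for name, text in runbooks.items()}

def triage(log_line: str) -> str:
    q = embed(log_line)
    best = max(
        runbook_vecs,
        key=lambda n: float(np.dot(q, runbook_vecs[n]))
        / (np.linalg.norm(q) * np.linalg.norm(runbook_vecs[n])),
    )
    prompt = (
        f"Error log:\n{log_line}\n\nRelevant runbook ({best}):\n{runbooks[best]}\n\n"
        "Suggest the first triage step."
    )
    answer = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return answer.choices[0].message.content
```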

Vedantu — Payments & Reliability

2020–2022

Context:

Scale for live-class peaks; sensitive payments/refunds.

Actions:

  • Split monolith into idempotent Spring services; release trains + trunk-based dev (see the idempotency sketch below)
  • Event analytics (Kafka → BigQuery); operational runbooks
  • Built payment reconciliation and fraud detection systems

Results:

Peak-day stability (2–3× traffic) · Chargebacks ↓ ~40% · Lead time to prod: hours → minutes

Stack:

Spring Boot · Postgres · Redis · Kafka · BigQuery · Grafana · Datadog · MongoDB

Artifacts: Idempotent payment flow diagram; dashboard red→green story
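
The "idempotent services" point boils down to an idempotency-key pattern. A minimal Python/Redis sketch of the idea follows; the real services are Spring Boot, and the key format, TTL, and charge step here are assumptions.

```python
import json

import redis

r = redis.Redis()  # assumes a local Redis instance

def process_payment(idempotency_key: str, order_id: str, amount: int) -> dict:
    key = f"payments:{idempotency_key}"
    # Reserve the key first; only the first caller with this key gets to charge.
    if not r.set(key, "pending", nx=True, ex=86400):
        previous = r.get(key)
        return {"status": "duplicate", "previous": previous.decode()}
    # Placeholder for the actual gateway charge call.
    result = {"order_id": order_id, "amount": amount, "status": "charged"}
    r.set(key, json.dumps(result), ex=86400)  # later retries return this result
    return result

print(process_payment("req-123", order_id="ORD-9", amount=49_900))
print(process_payment("req-123", order_id="ORD-9", amount=49_900))  # safe retry
```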

Via.com / EbixCash — Mobile + B2B Modules

2014–2020

Context:

Multi-brand travel apps and B2B agent portal.

Actions:

  • Built Android & iOS apps from scratch, modularized for multi-brand rollouts
  • Implemented dual authentication for B2B portal; incentive engine for agents
  • Led mobile team scaling from 2 to 8 engineers

Results:

Faster market launches · Fraud reduction ~25% · Higher agent activation rates

Stack:

Android (Java/Kotlin) · iOS (Objective-C) · Spring · Postgres · Redis · Kafka

Artifacts: App flows video; incentives ERD; API documentation

Shoppoke — First Employee, Full-Stack

Zero-to-one startup experience
2013–2014

Context:

Zero-to-one marketplace matching shopper requests to nearby retailers.

Actions:

  • Shaped product with founder; built Android app + server APIs end-to-end
  • Led a small engineering team; shipped consulting work (AxisRooms, Manipal Hospital) to bootstrap
  • Established engineering practices and deployment pipelines

Results:

First live pilots · Validated local-retail messaging loop · Early B2B revenue via consulting

Stack:

Android (Java) · REST APIs (Java/Spring) · Postgres · Ruby on Rails

Artifacts: Early architecture sketch, first-release screenshots

🚀

Flagship Case Studies

Real production systems, each with the problem, the architecture, and measurable results.

🎯 AI Orchestration Platform

Multi-Model LLM Governance & Routing

🎯 The Challenge

Teams were building AI features in silos with inconsistent models, costs, and governance. No centralized way to route requests, manage fallbacks, or ensure compliance across multiple LLM providers.

🏗️ Architecture

API Gateway → Model Router → Fallback Chain → Cost Optimizer → Analytics Dashboard
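
A hedged Python sketch of the router and fallback chain: pick the cheapest model that meets the request's requirements, then walk the remaining candidates on failure. The model catalogue, prices, and call_provider() are illustrative assumptions, not the platform's actual configuration.

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    provider: str
    cost_per_1k_tokens: float
    supports_tools: bool

CATALOG = [
    Model("small-fast", "provider-a", 0.15, supports_tools=False),
    Model("mid-tier", "provider-b", 0.60, supports_tools=True),
    Model("frontier", "provider-c", 3.00, supports_tools=True),
]

def call_provider(model: Model, prompt: str) -> str:
    """Hypothetical vendor call; the real gateway wraps each provider SDK."""
    raise NotImplementedError

def route(prompt: str, needs_tools: bool = False) -> str:
    # Cheapest eligible model first (the "cost optimizer" in its simplest form).
    candidates = sorted(
        (m for m in CATALOG if m.supports_tools or not needs_tools),
        key=lambda m: m.cost_per_1k_tokens,
    )
    last_error: Exception | None = None
    for model in candidates:  # fallback chain on timeout, rate limit, outage
        try:
            return call_provider(model, prompt)
        except Exception as exc:
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```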

📊 Results

Cost Reduction
-40%
Uptime
99.9%
Providers
5+
Dev Speed
+50%

🛠️ Tech Stack

Python · FastAPI · Redis · PostgreSQL · OpenAI · Anthropic · LangChain · Docker

🏃‍♂️ HYROX Coach AI

Intelligent Training Platform

💡 Built from my own HYROX journey: design custom HYROX-style simulations and workouts, track them live on your Apple Watch, and get AI-driven training plans.

🎯 The Challenge

Athletes needed training plans that adapt in real time. Existing solutions were one-size-fits-all and couldn’t capture the unique demands of HYROX — let alone track custom simulations and workouts straight from the watch.

🏗️ Architecture

Apple Watch → Kafka Streams → Feature Store → LLM Planner → Web/Mobile
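
To make the planner hop concrete, here is a small Python sketch that turns a week of aggregated watch features into a constrained planning prompt. The feature fields, the 10% weekly-load ramp cap, and the prompt wording are assumptions, not the app's real logic.

```python
from dataclasses import dataclass

@dataclass
class WeeklyFeatures:
    avg_zone2_minutes: float   # daily average of Zone 2 time
    total_load: float          # arbitrary training-load units
    sled_push_pace_s: float    # seconds per 25 m, HYROX-style station

def build_planner_prompt(f: WeeklyFeatures) -> str:
    max_next_load = f.total_load * 1.10  # cap the weekly ramp at ~10%
    return (
        "You are a HYROX coach. Last week the athlete averaged "
        f"{f.avg_zone2_minutes:.0f} min/day in Zone 2, total load {f.total_load:.0f}, "
        f"sled push pace {f.sled_push_pace_s:.0f}s/25m.\n"
        f"Plan next week with total load <= {max_next_load:.0f}, "
        "one full rest day, and one race simulation."
    )

# The prompt is then handed to the LLM planner; printing it keeps the sketch runnable.
print(build_planner_prompt(WeeklyFeatures(avg_zone2_minutes=45, total_load=320, sled_push_pace_s=38)))
```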

📊 Results

Latency
-65%
Token Cost
-42%
Zone 2 Accuracy
+14%
Watch Tracking
Custom

🛠️ Tech Stack

Swift · TypeScript · Kafka · OpenAI · HealthKit · Prisma

🚌 Vehicle Routing & Allocation

AI-Powered Multi-Modal Transport Optimization

🎯 The Challenge

Daily ground operations on airport ↔ hotel routes ran with 45% dead kilometers, 25% service delays, and ₹2L+ in daily fuel waste. The team needed real-time optimization that honors time windows, vehicle constraints, and live traffic updates.

🏗️ Architecture

Booking API → Normalizer → Pooler → Route Optimizer → Vehicle Allocator
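
The route-optimizer stage maps naturally onto OR-Tools. Below is a toy Python sketch under assumed data: a 4-stop travel-time matrix and a 2-vehicle fleet, minimising driven time. The production system additionally handles time windows, vehicle capacities, and live traffic.

```python
from ortools.constraint_solver import pywrapcp, routing_enums_pb2

# Travel times in minutes between the depot (0) and three pickup/drop points (toy data).
matrix = [
    [0, 18, 25, 30],
    [18, 0, 12, 22],
    [25, 12, 0, 15],
    [30, 22, 15, 0],
]
num_vehicles, depot = 2, 0

manager = pywrapcp.RoutingIndexManager(len(matrix), num_vehicles, depot)
routing = pywrapcp.RoutingModel(manager)

def transit(from_index: int, to_index: int) -> int:
    return matrix[manager.IndexToNode(from_index)][manager.IndexToNode(to_index)]

transit_cb = routing.RegisterTransitCallback(transit)
routing.SetArcCostEvaluatorOfAllVehicles(transit_cb)

params = pywrapcp.DefaultRoutingSearchParameters()
params.first_solution_strategy = routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC

solution = routing.SolveWithParameters(params)
for v in range(num_vehicles):
    index, route = routing.Start(v), []
    while not routing.IsEnd(index):
        route.append(manager.IndexToNode(index))
        index = solution.Value(routing.NextVar(index))
    print(f"vehicle {v}: stops {route}")
```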

📊 Results

On-Time Pickup
+8%
Customer Wait
-25%
Dead Kilometers
-12%
Gross Margin
+4%

🛠️ Tech Stack

Python · OR-Tools · PostgreSQL · Redis · Kafka · Google Maps API

⚡ AI Supplier Negotiation Engine

TripFactory Cost Optimization Platform

🎯 The Challenge

Manual supplier negotiations were time-intensive and inconsistent, leading to suboptimal pricing.

🏗️ Architecture

Event Stream → ML Models → Negotiation Logic → Supplier APIs → Analytics Dashboard
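
A deliberately simple Python sketch of the negotiation-logic stage: compare a supplier quote with a model-predicted fair price and decide whether to accept, counter, or escalate to a human buyer. The thresholds and the predicted price are illustrative assumptions; in the real engine these inputs come from the ML models upstream.

```python
def negotiate(quote: float, predicted_fair_price: float, floor_margin: float = 0.02) -> dict:
    target = round(predicted_fair_price * (1 + floor_margin), 2)
    if quote <= target:
        return {"action": "accept", "price": quote}           # already at or below target
    if quote <= predicted_fair_price * 1.15:
        return {"action": "counter", "price": target}         # auto-counter small gaps
    return {"action": "escalate", "suggested_price": target}  # large gaps go to a buyer

print(negotiate(quote=10_500, predicted_fair_price=10_000))
```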

📊 Results

Margins
+23%
Cost Savings
₹8+ Cr
Cycle Time
-85%
Automation
70%

🛠️ Tech Stack

Python · Kafka · TensorFlow · FastAPI · Redis · PostgreSQL

🎯 Enterprise-scale cost optimization

🌊 Real-time Travel Recommendation Engine

Personalized Travel Discovery at Scale

🎯 The Challenge

Static recommendation systems couldn't adapt to real-time user behavior and market dynamics.

🏗️ Architecture

User Events → Kafka → Feature Engineering → ML Pipeline → Recommendation API
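
A small Python sketch of the feature-engineering and scoring hops: fold user events into a per-user interest profile in Redis and rank candidate destinations by a simple dot product. Event fields, key names, and the category list are assumptions; the production pipeline uses richer features and a trained model.

```python
import redis

r = redis.Redis()  # assumes a local Redis instance
CATEGORIES = ["beach", "mountain", "city", "wildlife"]

def update_profile(user_id: str, event: dict) -> None:
    # event example: {"category": "beach", "weight": 1.0} from a click or booking
    r.hincrbyfloat(f"profile:{user_id}", event["category"], event.get("weight", 1.0))

def recommend(user_id: str, candidates: dict[str, dict[str, float]]) -> list[str]:
    raw = r.hgetall(f"profile:{user_id}")
    profile = {k.decode(): float(v) for k, v in raw.items()}
    def score(features: dict[str, float]) -> float:
        return sum(profile.get(c, 0.0) * features.get(c, 0.0) for c in CATEGORIES)
    return sorted(candidates, key=lambda d: score(candidates[d]), reverse=True)

update_profile("u42", {"category": "beach", "weight": 2.0})
print(recommend("u42", {"Goa": {"beach": 1.0}, "Manali": {"mountain": 1.0}}))
```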

📊 Results

Conversion
+40%
Events/Day
10M+
Latency
<100ms
Users
500K+

🛠️ Tech Stack

Python · Kafka · Spark · Redis · Elasticsearch · Docker

🎯 Production system at TripFactory scale

📊 Kafka Streaming Infrastructure

Enterprise Event Processing Platform

🎯 The Challenge

Legacy batch processing couldn't handle real-time analytics and event-driven architecture needs.

🏗️ Architecture

Multi-DC Kafka → Stream Processing → Real-time Analytics → Monitoring Dashboard
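
Much of the "zero data loss" claim comes down to delivery-guarantee settings. A minimal Python sketch with confluent-kafka: an idempotent producer with acks=all, and a consumer that commits offsets only after processing. Broker addresses and topic names are assumptions.

```python
from confluent_kafka import Consumer, Producer

producer = Producer({
    "bootstrap.servers": "broker-1:9092,broker-2:9092",
    "enable.idempotence": True,   # retries cannot create duplicates
    "acks": "all",                # wait for all in-sync replicas
})
producer.produce("events.bookings", b'{"id": 1}')
producer.flush()

consumer = Consumer({
    "bootstrap.servers": "broker-1:9092,broker-2:9092",
    "group.id": "analytics",
    "enable.auto.commit": False,  # commit manually, only after processing succeeds
})
consumer.subscribe(["events.bookings"])
msg = consumer.poll(5.0)
if msg is not None and not msg.error():
    # ... process the record / write to the analytics sink ...
    consumer.commit(message=msg)  # at-least-once: the offset advances only on success
consumer.close()
```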

📊 Results

Uptime
99.99%
Throughput
10M+/day
Latency
<1s
Data Loss
0%

🛠️ Tech Stack

Kafka · Kafka Streams · Confluent · Grafana · Prometheus · Kubernetes

🎯 Foundation for all real-time systems

📚

Research Publications

Developing and Testing the Automated Post-Event Earthquake Loss Estimation and Visualisation (APE-ELEV) Technique

Anthony Astoul, Christopher Filliter, Eric Mason, Andrew Rau-Chaplin, Kunal Shridhar, Blesson Varghese and Naman Varshney

Natural Hazards and Earth System Sciences, 2013. DOI: 10.5194/nhess-13-1885-2013

An automated, real-time earthquake loss model and visualiser that draws on multiple sensor data sources and is globally applicable to post-event earthquake analysis. The system supports rapid data ingestion, loss estimation, and integration of data from multiple sources, with rapid visualisation at multiple geographic levels.

Keywords:

Earthquake Modelling · Post-Event Analysis · Insured Loss Estimation · Loss Visualisation · Real-time Systems

Impact: Real-time earthquake loss estimation system validated for ten global earthquakes using industry loss data

A Framework for Real-time Earthquake Loss Estimation and Visualisation

Naman Varshney, Anthony Astoul, Christopher Filliter, Eric Mason, Andrew Rau-Chaplin, Kunal Shridhar, Blesson Varghese

Proceedings of the 15th World Conference on Earthquake Engineering, 2012

A comprehensive framework for automated post-event earthquake loss estimation combining multiple data sources with real-time visualisation capabilities. The system demonstrates feasibility using the 2011 Tohoku earthquake case study.

Keywords:

Earthquake Engineering · Real-time Systems · Loss Estimation · Data Integration · Visualisation

Impact: Framework for rapid post-earthquake response with multi-source data integration

📧

Let's Work Together

Ready to Build Something Amazing?

Whether you need to scale AI systems, optimize infrastructure costs, or build reliable distributed platforms, I'd love to discuss how we can work together.

© 2025 Naman Varshney. Built with Next.js, Tailwind, and the same attention to detail I bring to production systems.

🏃‍♂️ Currently training for Bangalore Solo – April 2026