Sep 2024 - present
Senior Site Reliability Engineer (SRE)
TrueFoundry · India
Own platform reliability, observability, incident response, and customer success for a multi-tenant SaaS platform serving enterprise ML workloads across AWS, GCP, Azure, and on-prem. Partner directly with global enterprise customers on onboarding and long-term operational health.
- Built a modular Terraform framework that cut client onboarding time by 70% across AWS, Azure, GCP, and on-prem; complemented it with AWS, GCP, and Azure Marketplace listings that drove a 40% increase in self-service customer acquisition
- Migrated the platform's logging stack from Grafana Loki to VictoriaLogs, achieving 94% lower query latency, ~40% smaller storage footprint, 3x ingestion throughput, and ~50% lower CPU/RAM usage; published the results as a public engineering benchmark
- Led a platform-wide monitoring overhaul: split alerts into P0/P1 severity tiers and migrated critical components to New Relic after a successful PoC, improving signal quality while keeping observability spend flat
- Architected a severity-tiered incident management and on-call system integrating Sentry, Grafana, and New Relic with Zenduty and Slack, with team-wise routing across five functional domains, materially reducing MTTR and on-call noise
- Drove infrastructure standardization across the fleet (security contacts, K8s labels/annotations, resource conventions) and hardened multi-tenant SaaS via tighter tenant isolation, namespace-scoped RBAC, and resource governance
- Serve as the escalation point for complex multi-cloud production debugging across EKS/GKE/AKS (Karpenter, EFS/CSI, GPU node scheduling, IAM/IRSA, networking, air-gapped artifact registries) while keeping enterprise SLAs intact
- Routinely support late-IST and weekend windows for onboarding, cluster upgrades, and live debugging with global enterprise customers (e.g., Riot Games, Zscaler)
#kubernetes #terraform #aws #gcp #azure #victorialogs #new-relic #karpenter
Infrastructure Lead (Founding Engineer)
Primetrace (Kutumb Crafto) · India
Architected and operated AWS-based infrastructure on self-hosted Kubernetes supporting 4M DAU at 1M RPM peak. Built monitoring on RED/USE methodologies with Prometheus, Grafana, Elastic-APM, Pyroscope, Loki, and Robusta.
- Reduced annual cloud spend by $200K via self-hosted services, efficient pod/node binpacking, spot-instance utilization, and trimming network flow logs and metrics volume
- Designed and maintained highly available multi-broker Apache Kafka clusters and a multi-node ELK Stack on spot instances, balancing data resilience with cost
- Hardened cluster security through comprehensive network policies and RBAC configurations, materially reducing the attack surface
- Cut deployment times to under one minute using self-hosted GitHub Runners on spot instances integrated with ArgoCD and Devtron
- Expanded global reach by rolling out IPv6 across the stack and integrating AWS Global Accelerator to reduce latency for international markets
- Achieved 99.99% uptime for stateless workloads on 95% spot instances via time-based scaling and node hibernation during low-traffic windows
- Led integration of a full observability stack: centralized monitoring, APM, distributed tracing, logging, profiling, and alerting
#aws #kubernetes #kafka #elk #prometheus #grafana #argocd #spot-instances
DevOps Engineer
smallcase · India
Managed core infrastructure (OpenVPN, Heartbeat, Elastalert, Grafana) and integrated developer tooling (Wiki.js, Sentry, JFrog Artifactory). Built Ansible automation for deployments, log rotation, backups, and monitoring agents; tuned CI/CD on Jenkins and AWS CodeDeploy.
- Drove infrastructure cost optimization that delivered a 90% reduction in operational expenses while maintaining service quality
- Implemented HAProxy load balancing to improve network performance and reduce application response times
- Improved developer productivity by integrating open-source observability tools: Elastic-APM, Kafka-manager, and Kafka-topics-ui
- Designed fault-tolerant data infrastructure on multi-broker Apache Kafka clusters and a distributed ELK Stack running on spot instances
#ansible #jenkins #aws #haproxy #kafka #elk
Apr 2018 - present
Technical Mentor
Udacity · Remote
- Mentor students in Android and Python development; review project submissions and provide feedback to improve technical depth and code quality
Open Source Contributor
Utopian.io · Remote
- Performed bug reviews and QA across multiple open-source projects, contributing to improved reliability and security alongside a global contributor community