Site Reliability Engineer
Building resilient systems that scale.
Automating everything in between.
SRE with 3+ years building and scaling distributed systems. Passionate about reducing toil, improving reliability, and making on-call less painful.
Open-source tools and platforms built to improve reliability at scale.
ML-powered incident response system for Kubernetes clusters. Automatically detects anomalies and executes remediation playbooks.
Real-time SLO tracking with budget burn alerts. Integrates with Prometheus and PagerDuty for automated escalation workflows.
Internal Infrastructure-as-Code module library with automated testing, versioning, and documentation generation.
Automated failure injection framework for distributed systems. Supports network partitions, latency injection, and resource exhaustion scenarios.
Industry-recognized credentials in cloud infrastructure, container orchestration, and security.
Certified Kubernetes Administrator
Certified Kubernetes Security Specialist
Solutions Architect Professional
HashiCorp Terraform Associate
DevOps Engineer Professional
Professional Cloud Architect
A journey through scaling systems and reducing incidents.
→ Raised platform availability SLO from 99.9% to 99.99%, cutting allowed downtime by 90%
→ Built auto-remediation engine that reduced MTTR from 45 min to 3 min
→ Led migration of 200+ microservices to Kubernetes
→ Designed observability stack processing 2M+ events/sec with Prometheus & Grafana
→ Reduced infrastructure costs by 40% through right-sizing and spot instance automation
→ Established SLO framework across 50+ services, improving reliability culture
→ Implemented CI/CD pipelines with Jenkins & GitLab CI for 30+ services
→ Automated infrastructure provisioning with Terraform, reducing deploy time by 75%
→ Built centralized logging on the ELK stack, handling 500 GB/day of log data
Incident postmortems, SRE practices, tool reviews, and tutorials.
How a misconfigured connection pool caused cascading failures across 12 services, and the monitoring gaps that let it happen.
Moving beyond theoretical SLOs to actionable error budgets that engineering teams will actually respect and use for decision-making.
A hands-on comparison after running both in production for a year. Spoiler: it depends on your team size and deployment complexity.
Step-by-step guide to integrating chaos experiments into your CI/CD pipeline with LitmusChaos and GitHub Actions.