AVAILABLE FOR HIRE

Cuong Tran

Site Reliability Engineer

Building resilient systems that scale.
Automating everything in between.

SRE with 3+ years building and scaling distributed systems. Passionate about reducing toil, improving reliability, and making on-call less painful.

// CURRENT_ROLE SRE at GTEL.
Ha Noi, VN · Open to work
3+ years experience
Distributed Systems · SRE · Platform
INFRASTRUCTURE
Kubernetes · Docker · Terraform · Ansible
CLOUD
AWS · GCP · Azure
OBSERVABILITY
Prometheus · Grafana · ELK · OpenTelemetry · Jaeger · Tempo
CI/CD
GitLab CI · FluxCD · GitHub Actions
LANGUAGES
JavaScript · Python · Go · Bash · HCL

Featured Projects

Open-source tools and platforms built to improve reliability at scale.

K8s Auto-Remediation Engine

ML-powered incident response system for Kubernetes clusters. Automatically detects anomalies and executes remediation playbooks.

Go · Kubernetes · Prometheus
SLO Dashboard Platform

Real-time SLO tracking with budget burn alerts. Integrates with Prometheus and PagerDuty for automated escalation workflows.

Python · Grafana · PostgreSQL
Terraform Module Registry

Internal infrastructure-as-code module library with automated testing, versioning, and documentation generation.

Terraform · AWS · GitHub Actions
Chaos Engineering Toolkit

Automated failure injection framework for distributed systems. Supports network partitions, latency injection, and resource exhaustion scenarios.

Go · Docker · gRPC

Professional Certifications

Industry-recognized credentials in cloud infrastructure, container orchestration, and security.

CKA

Certified Kubernetes Administrator

The Linux Foundation 2026
CKS

Certified Kubernetes Security Specialist

The Linux Foundation 2026
AWS SA Pro

Solutions Architect Professional

AWS 2023
Terraform Assoc.

HashiCorp Terraform Associate

HashiCorp 2023
AWS DevOps Pro

DevOps Engineer Professional

AWS 2023
GCP Pro Architect

Professional Cloud Architect

Google 2022

Where I've Built Reliability

A journey through scaling systems and reducing incidents.

CloudScale Inc. 2022 — Present
Senior Site Reliability Engineer

→ Raised platform availability from 99.9% to 99.99%, cutting downtime by 90%

→ Built auto-remediation engine that cut MTTR from 45 min to 3 min

→ Led migration of 200+ microservices to Kubernetes

DataStream Labs 2020 — 2022
Site Reliability Engineer

→ Designed observability stack processing 2M+ events/sec with Prometheus & Grafana

→ Reduced infrastructure costs by 40% through right-sizing and spot instance automation

→ Established SLO framework across 50+ services, improving reliability culture

NexGen Systems 2018 — 2020
DevOps Engineer

→ Implemented CI/CD pipelines with Jenkins & GitLab CI for 30+ services

→ Automated infrastructure provisioning with Terraform, reducing deploy time by 75%

→ Built centralized logging on the ELK stack, ingesting 500 GB/day of log data

Writing & Thoughts

Incident postmortems, SRE practices, tool reviews, and tutorials.