AVAILABLE FOR HIRE

Cuong Tran

Site Reliability Engineer

Building resilient systems that scale.
Automating everything in between.

// ABOUT

SRE with 3+ years of experience building and scaling distributed systems. Passionate about reducing toil, improving reliability, and making on-call less painful.

// CURRENT_ROLE

SRE at GTEL.

Ha Noi, VN · Open to work

3+ years experience

Distributed Systems · SRE · Platform


// SKILLS_MATRIX

INFRASTRUCTURE

Kubernetes · Docker · Terraform · Ansible

CLOUD

AWS · GCP · Azure

OBSERVABILITY

Prometheus · Grafana · ELK · OpenTelemetry · Jaeger · Tempo

CI/CD

GitLab CI · FluxCD · GitHub Actions

LANGUAGES

JavaScript · Python · Go · Bash · HCL

// PROJECTS

Featured Projects

Open-source tools and platforms built to improve reliability at scale.

K8s Auto-Remediation Engine

ML-powered incident response system for Kubernetes clusters. Automatically detects anomalies and executes remediation playbooks.

Go · Kubernetes · Prometheus

SLO Dashboard Platform

Real-time SLO tracking with budget burn alerts. Integrates with Prometheus and PagerDuty for automated escalation workflows.

Python · Grafana · PostgreSQL

Terraform Module Registry

Internal Infrastructure-as-Code module library with automated testing, versioning, and documentation generation.

Terraform · AWS · GitHub Actions

Chaos Engineering Toolkit

Automated failure injection framework for distributed systems. Supports network partitions, latency injection, and resource exhaustion scenarios.

Go · Docker · gRPC

// CERTIFICATIONS

Professional Certifications

Industry-recognized credentials in cloud infrastructure, container orchestration, and security.

CKA

Certified Kubernetes Administrator

The Linux Foundation 2026

CKS

Certified Kubernetes Security Specialist

The Linux Foundation 2026

AWS SA Pro

Solutions Architect Professional

AWS 2023

Terraform Assoc.

HashiCorp Terraform Associate

HashiCorp 2023

AWS DevOps Pro

DevOps Engineer Professional

AWS 2023

GCP Pro Architect

Professional Cloud Architect

Google 2022

// WORK_TIMELINE

Where I've Built Reliability

A journey through scaling systems and reducing incidents.

CloudScale Inc. 2022 — Present

Senior Site Reliability Engineer

→ Improved platform SLO from 99.9% to 99.99%, reducing downtime by 90%

→ Built auto-remediation engine reducing MTTR from 45 min to 3 min

→ Led migration of 200+ microservices to Kubernetes

Kubernetes · Terraform · Prometheus · Go

DataStream Labs 2020 — 2022

Site Reliability Engineer

→ Designed observability stack processing 2M+ events/sec with Prometheus & Grafana

→ Reduced infrastructure costs by 40% through right-sizing and spot instance automation

→ Established SLO framework across 50+ services, improving reliability culture

AWS · Prometheus · Grafana · Python

NexGen Systems 2018 — 2020

DevOps Engineer

→ Implemented CI/CD pipelines with Jenkins & GitLab CI for 30+ services

→ Automated infrastructure provisioning with Terraform, reducing deploy time by 75%

→ Built centralized logging on the ELK stack, handling 500 GB/day of log data

Jenkins · Terraform · ELK · Docker

// BLOG

Writing & Thoughts

Incident postmortems, SRE practices, tool reviews, and tutorials.

2024-12-15 POSTMORTEM

The Day Our Database Went on Vacation: A 4-Hour Outage Story

How a misconfigured connection pool caused cascading failures across 12 services, and the monitoring gaps that let it happen.

PostgreSQL · Incident Response

2024-11-02 SRE_PRACTICES

SLO Budgets That Actually Work: A Practical Framework

Moving beyond theoretical SLOs to actionable error budgets that engineering teams will actually respect and use for decision-making.

SLO · Reliability

2024-09-18 TOOL_REVIEW

ArgoCD vs Flux: Which GitOps Tool Fits Your Team?

A hands-on comparison after running both in production for a year. Spoiler: it depends on your team size and deployment complexity.

GitOps · Kubernetes

2024-07-25 TUTORIAL

Building a Chaos Engineering Pipeline From Scratch

Step-by-step guide to integrating chaos experiments into your CI/CD pipeline with LitmusChaos and GitHub Actions.

Chaos Engineering · CI/CD