Results-driven Site Reliability Engineer with 4+ years of experience ensuring high availability and performance for mission-critical payment platforms on AWS. At DXC Technology:
- Reduced MTTR by 30% through Python-based automation and structured incident response workflows
- Cut alert noise by 40% via systematic Datadog monitor optimization, directly improving on-call quality and MTTD
- Sustained 99.9%+ uptime across 50+ microservices processing millions of financial transactions daily
- Reduced P1/P2 repeat incidents by 25% through RCA-driven root cause elimination and permanent fixes
Deep expertise in incident command, Kubernetes, CI/CD pipelines, Terraform IaC, and production Java/Spring Boot systems.
- Cloud & Infrastructure: AWS (EC2, S3, VPC, IAM, Auto Scaling), Kubernetes, Docker, Terraform
- Observability & Monitoring: Datadog (APM, Logs, SLOs, Monitors), Splunk, Grafana, New Relic, Dynatrace
- SRE Practices: Incident Management, P1/P2 War Rooms, RCA, SLI/SLO, Error Budgets, Alerting, On-Call
- Programming: Python (automation, log analysis, alerting scripts), Java
- CI/CD & DevOps: Azure DevOps, Jenkins, GitHub Actions, Git, Maven
- Frameworks & Databases: Spring Boot, Spring MVC, Spring Data JPA, Spring Cloud | MySQL, PostgreSQL
- Ticketing Tools: Jira, ServiceNow
- ITIL Practices: Incident, Change, Major Incident, and Problem Management
Proficient in the end-to-end administration of a comprehensive APM and monitoring stack, including:
Tools:
Datadog | Grafana | Kibana | New Relic | Dynatrace | Splunk
- Datadog Administration: Onboarding services, configuring agents, tuning metrics collection, and managing monitors end-to-end.
- Visualization: Designing Datadog dashboards and SLO tracking for real-time visibility across logs, metrics, and APM traces.
- Alerting: Optimizing monitor thresholds to reduce alert noise by 40%, improving MTTD and on-call quality.
- Service Management: Skilled in managing SLOs, SLIs, and SLAs to align IT services with business goals.
- ITIL Practices: Well-versed in ITIL frameworks for Incident, Change, Major Incident, and Problem Management.
| Achievement | Impact |
|---|---|
| Reduced MTTR by 30% | Python automation scripts for alert triage, log correlation & incident response at Qatar Airways |
| Cut alert noise by 40% | Systematic Datadog monitor tuning; improved on-call quality & MTTD |
| Sustained 99.9%+ uptime | Mission-critical payment infrastructure handling millions of daily international transactions |
| Reduced repeat incidents by 25% | RCA-driven root cause elimination with permanent corrective fixes |
Client: Qatar Airways, Payments Platform | AWS · Datadog · Kubernetes · Python · Java/Spring Boot · Microservices
- Managed fault-tolerant AWS infrastructure (EC2, VPC, IAM, S3, Auto Scaling) underpinning 50+ microservices processing high-volume international payment transactions.
- Maintained 99.9%+ uptime for mission-critical financial services, consistently meeting all SLO targets across production environments.
- Reduced MTTR by 30% by engineering Python automation scripts for alert triage, log correlation, and incident response workflows, eliminating repetitive manual investigation steps.
- Optimized Datadog monitors and alerting thresholds, reducing alert noise and false positives by 40%, enabling faster and more accurate incident detection.
- Designed and owned Datadog dashboards and SLO tracking for end-to-end system visibility spanning logs, metrics, and APM traces.
- Led P1/P2 incident war rooms and post-incident root cause analysis (RCA); implemented permanent corrective actions that cut repeat incidents by 25%.
- Deployed and managed containerized workloads on Kubernetes: resolved CrashLoopBackOff failures, tuned resource limits/requests, and implemented HPA for cost-effective auto-scaling.
- Built and maintained CI/CD pipelines via Azure DevOps (Git, Maven), enabling reliable zero-downtime deployments with significantly reduced rollback rates.
- Provisioned and managed AWS resources using Terraform (IaC), improving environment consistency, reducing provisioning errors, and accelerating deployment velocity.
- Partnered with development teams to troubleshoot Java/Spring Boot applications by analyzing JVM metrics, heap dumps, GC logs, and API latency data to resolve production performance bottlenecks.
- Created and escalated Jira & ServiceNow tickets to development teams for faster incident resolution and tracking.
- Prepared structured incident runbooks and playbooks, shared with clients and business stakeholders for operational clarity.
Domain: Enterprise Solutions | Critical Transaction Platforms | Datadog · Grafana · Python · AWS · Java/Spring Boot
- Supported mission-critical AWS environments for international enterprise clients; drove SLO/SLI optimization using Datadog and Grafana.
- Built Python automation scripts for alert validation and monitoring health checks, improving team efficiency and reducing noise-driven false escalations.
- Analyzed system logs and cloud deployment patterns to identify recurring failure modes; implemented targeted fixes reducing incident recurrence.
- Coordinated production readiness reviews for new payment services; improved cross-team onboarding documentation and operational runbooks.
- Built an end-to-end observability stack with custom Datadog dashboards, SLO tracking, log pipelines, and APM traces for a personal microservices environment.
- Replicated production-grade alerting patterns to validate and refine monitor configurations, achieving a 40% reduction in alert noise.
- Authored runbooks and incident playbooks as part of an open learning initiative, publicly available at iamdinesh.xyz.
Master of Business Administration (MBA), JNTU Anantapur (2017-2019)
Transitioned into Site Reliability Engineering through self-directed cloud study, hands-on Java/SQL lab work, and professional on-the-job experience.
- AWS Certified Solutions Architect - Associate (In Progress; Exam Scheduled 2026)