📬 Contact Me

Site Reliability Engineer • Observability • Java App Support • DevOps Engineer

🎯 Professional Summary

Results-driven Site Reliability Engineer with 4+ years of experience ensuring high availability and performance for mission-critical payment platforms on AWS. At DXC Technology:

🔻 Reduced MTTR by 30% through Python-based automation and structured incident response workflows
🔕 Cut alert noise by 40% via systematic Datadog monitor optimization — directly improving on-call quality and MTTD
✅ Sustained 99.9%+ uptime across 50+ microservices processing millions of financial transactions daily
🔄 Reduced repeat P1/P2 incidents by 25% through post-incident RCA and permanent corrective fixes

Deep expertise in incident command, Kubernetes, CI/CD pipelines, Terraform IaC, and production Java/Spring Boot systems.

🔑 Key Skills

Cloud & Infrastructure: AWS (EC2, S3, VPC, IAM, Auto Scaling), Kubernetes, Docker, Terraform
Observability & Monitoring: Datadog (APM, Logs, SLOs, Monitors), Splunk, Grafana, New Relic, Dynatrace
SRE Practices: Incident Management, P1/P2 War Rooms, RCA, SLI/SLO, Error Budgets, Alerting, On-Call
Programming: Python (automation, log analysis, alerting scripts), Java
CI/CD & DevOps: Azure DevOps, Jenkins, GitHub Actions, Git, Maven
Frameworks & Databases: Spring Boot, Spring MVC, Spring Data JPA, Spring Cloud | MySQL, PostgreSQL
Ticketing Tools: Jira, ServiceNow
ITIL Practices: Incident, Change, Major Incident, and Problem Management

Monitoring & Observability

Proficient in the end-to-end administration of a comprehensive APM and monitoring stack, including:

Tools: Datadog | Grafana | Kibana | New Relic | Dynatrace | Splunk

Datadog Administration: Onboarding services, configuring agents, tuning metrics collection, and managing monitors end-to-end.
Visualization: Designing Datadog dashboards and SLO tracking for real-time visibility across logs, metrics, and APM traces.
Alerting: Optimizing monitor thresholds to reduce alert noise by 40% — improving MTTD and on-call quality.

Process & Framework

Service Management: Skilled in managing SLOs, SLIs, and SLAs to align IT services with business goals.
ITIL Practices: Well-versed in ITIL frameworks for Incident, Change, Major Incident, and Problem Management.

🏆 Key Achievements

Achievement	Impact
🔻 Reduced MTTR by 30%	Python automation scripts for alert triage, log correlation & incident response at Qatar Airways
🔕 Cut alert noise by 40%	Systematic Datadog monitor tuning — improved on-call quality & MTTD
✅ Sustained 99.9%+ uptime	Mission-critical payment infrastructure handling millions of daily international transactions
🔄 Reduced repeat incidents by 25%	Post-incident RCA followed by permanent corrective fixes

💼 Professional Experience

DXC Technology, Bangalore — Site Reliability Engineer (Dec 2022 – Present)

Client: Qatar Airways — Payments Platform | AWS · Datadog · Kubernetes · Python · Java/Spring Boot · Microservices

Managed fault-tolerant AWS infrastructure (EC2, VPC, IAM, S3, Auto Scaling) underpinning 50+ microservices processing high-volume international payment transactions.
Maintained 99.9%+ uptime for mission-critical financial services, consistently meeting all SLO targets across production environments.
Reduced MTTR by 30% by engineering Python automation scripts for alert triage, log correlation, and incident response workflows — eliminating repetitive manual investigation steps.
Optimized Datadog monitors and alerting thresholds, reducing alert noise and false positives by 40%, enabling faster and more accurate incident detection.
Designed and owned Datadog dashboards and SLO tracking for end-to-end system visibility spanning logs, metrics, and APM traces.
Led P1/P2 incident war rooms and post-incident root cause analysis (RCA); implemented permanent corrective actions that cut repeat incidents by 25%.
Deployed and managed containerized workloads on Kubernetes — resolved CrashLoopBackOff failures, tuned resource limits/requests, and implemented HPA for cost-effective auto-scaling.
Built and maintained CI/CD pipelines via Azure DevOps (Git, Maven), enabling reliable zero-downtime deployments and fewer rollbacks.
Provisioned and managed AWS resources using Terraform (IaC), improving environment consistency, reducing provisioning errors, and accelerating deployment velocity.
Partnered with development teams to troubleshoot Java/Spring Boot applications by analyzing JVM metrics, heap dumps, GC logs, and API latency data to resolve production performance bottlenecks.
Raised Jira and ServiceNow tickets and escalated them to development teams to speed up incident resolution and tracking.
Prepared structured incident runbooks and playbooks and shared them with clients and business stakeholders for operational clarity.

Wipro Ltd — Site Reliability Engineer (Apr 2022 – Nov 2022)

Domain: Enterprise Solutions | Critical Transaction Platforms | Datadog · Grafana · Python · AWS · Java/Spring Boot

Supported mission-critical AWS environments for international enterprise clients; drove SLO/SLI optimization using Datadog and Grafana.
Built Python automation scripts for alert validation and monitoring health checks, improving team efficiency and reducing noise-driven false escalations.
Analyzed system logs and cloud deployment patterns to identify recurring failure modes; implemented targeted fixes reducing incident recurrence.
Coordinated production readiness reviews for new payment services; improved cross-team onboarding documentation and operational runbooks.

🚀 Personal Projects

Real-Time Observability Platform (Datadog)

Built an end-to-end observability stack with custom Datadog dashboards, SLO tracking, log pipelines, and APM traces for a personal microservices environment.
Replicated production-grade alerting patterns to validate and refine monitor configurations — achieving a 40% reduction in alert noise.
Authored runbooks and incident playbooks as part of an open learning initiative — publicly available at iamdinesh.xyz.
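
The SLO tracking mentioned in this project comes down to simple error-budget arithmetic. The snippet below is a minimal illustrative sketch of that calculation, not code from the project itself: the function names, the 30-day window, and the sample figures are all assumptions chosen for the example.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over the window.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent.

    The result goes negative once downtime exceeds the budget,
    which means the SLO has been breached for the window.
    """
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # Assumed inputs: a 99.9% target and 10.8 minutes of downtime so far.
    print(f"budget: {error_budget_minutes(0.999):.1f} min")
    print(f"remaining: {budget_remaining(0.999, 10.8):.0%}")
```

Under these assumed inputs, a 99.9% target over 30 days allows roughly 43 minutes of downtime, so 10.8 minutes of downtime leaves about three quarters of the budget unspent.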