dineshc227/Profile


Site Reliability Engineer • Observability • Java App Support • DevOps Engineer

Email Mobile Portfolio

🎯 Professional Summary

Results-driven Site Reliability Engineer with 4+ years of experience ensuring high availability and performance for mission-critical payment platforms on AWS. At DXC Technology:

  • 🔻 Reduced MTTR by 30% through Python-based automation and structured incident response workflows
  • 🔕 Cut alert noise by 40% via systematic Datadog monitor optimization, directly improving on-call quality and MTTD
  • ✅ Sustained 99.9%+ uptime across 50+ microservices processing millions of financial transactions daily
  • 🔄 Reduced P1/P2 repeat incidents by 25% through RCA-driven permanent fixes

Deep expertise in incident command, Kubernetes, CI/CD pipelines, Terraform IaC, and production Java/Spring Boot systems.


🔑 Key Skills

  • Cloud & Infrastructure: AWS (EC2, S3, VPC, IAM, Auto Scaling), Kubernetes, Docker, Terraform
  • Observability & Monitoring: Datadog (APM, Logs, SLOs, Monitors), Splunk, Grafana, New Relic, Dynatrace
  • SRE Practices: Incident Management, P1/P2 War Rooms, RCA, SLI/SLO, Error Budgets, Alerting, On-Call
  • Programming: Python (automation, log analysis, alerting scripts), Java
  • CI/CD & DevOps: Azure DevOps, Jenkins, GitHub Actions, Git, Maven
  • Frameworks & Databases: Spring Boot, Spring MVC, Spring Data JPA, Spring Cloud | MySQL, PostgreSQL
  • Ticketing Tools: Jira, ServiceNow
  • ITIL Practices: Incident, Change, Major Incident, and Problem Management

Monitoring & Observability

Proficient in the end-to-end administration of a comprehensive APM and monitoring stack, including:

Tools: Datadog | Grafana | Kibana | New Relic | Dynatrace | Splunk

  • Datadog Administration: Onboarding services, configuring agents, tuning metrics collection, and managing monitors end-to-end.
  • Visualization: Designing Datadog dashboards and SLO tracking for real-time visibility across logs, metrics, and APM traces.
  • Alerting: Optimizing monitor thresholds to reduce alert noise by 40%, improving MTTD and on-call quality.

Process & Framework

  • Service Management: Skilled in managing SLOs, SLIs, and SLAs to align IT services with business goals.
  • ITIL Practices: Well-versed in ITIL frameworks for Incident, Change, Major Incident, and Problem Management.
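
As a small illustration of the SLO arithmetic behind a target such as 99.9% uptime, the sketch below converts an availability target into an error budget (allowed downtime) and reports how much of that budget remains. The function names and the 30-day window are assumptions chosen for illustration; they are not taken from any production tooling.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return 1 - downtime_minutes / budget


if __name__ == "__main__":
    # A 99.9% target over an assumed 30-day window allows about 43.2 minutes of downtime.
    print(round(error_budget_minutes(99.9), 1))
    # Hypothetical example: 10.8 minutes of downtime leaves about 75% of the budget.
    print(round(budget_remaining(99.9, 10.8), 2))
```

A rolling 30-day window is a common convention, but the window length should match whatever the SLO document for the service actually specifies.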

🏆 Key Achievements

| Achievement | Impact |
|---|---|
| 🔻 Reduced MTTR by 30% | Python automation scripts for alert triage, log correlation & incident response at Qatar Airways |
| 🔕 Cut alert noise by 40% | Systematic Datadog monitor tuning; improved on-call quality & MTTD |
| ✅ Sustained 99.9%+ uptime | Mission-critical payment infrastructure handling millions of daily international transactions |
| 🔄 Reduced repeat incidents by 25% | RCA-driven root cause elimination with permanent corrective fixes |

💼 Professional Experience


DXC Technology, Bangalore | Site Reliability Engineer (Dec 2022 – Present)

Client: Qatar Airways, Payments Platform | AWS · Datadog · Kubernetes · Python · Java/Spring Boot · Microservices

  • Managed fault-tolerant AWS infrastructure (EC2, VPC, IAM, S3, Auto Scaling) underpinning 50+ microservices processing high-volume international payment transactions.
  • Maintained 99.9%+ uptime for mission-critical financial services, consistently meeting all SLO targets across production environments.
  • Reduced MTTR by 30% by engineering Python automation scripts for alert triage, log correlation, and incident response workflows, eliminating repetitive manual investigation steps.
  • Optimized Datadog monitors and alerting thresholds, reducing alert noise and false positives by 40%, enabling faster and more accurate incident detection.
  • Designed and owned Datadog dashboards and SLO tracking for end-to-end system visibility spanning logs, metrics, and APM traces.
  • Led P1/P2 incident war rooms and post-incident root cause analysis (RCA); implemented permanent corrective actions that cut repeat incidents by 25%.
  • Deployed and managed containerized workloads on Kubernetes: resolved CrashLoopBackOff failures, tuned resource limits/requests, and implemented HPA for cost-effective auto-scaling.
  • Built and maintained CI/CD pipelines via Azure DevOps (Git, Maven), enabling reliable zero-downtime deployments with significantly reduced rollback rates.
  • Provisioned and managed AWS resources using Terraform (IaC), improving environment consistency, reducing provisioning errors, and accelerating deployment velocity.
  • Partnered with development teams to troubleshoot Java/Spring Boot applications by analyzing JVM metrics, heap dumps, GC logs, and API latency data to resolve production performance bottlenecks.
  • Created and escalated Jira & ServiceNow tickets to development teams for faster incident resolution and tracking.
  • Prepared structured incident runbooks and playbooks, shared with clients and business stakeholders for operational clarity.

Wipro Ltd | Site Reliability Engineer (Apr 2022 – Nov 2022)

Domain: Enterprise Solutions | Critical Transaction Platforms | Datadog · Grafana · Python · AWS · Java/Spring Boot

  • Supported mission-critical AWS environments for international enterprise clients; drove SLO/SLI optimization using Datadog and Grafana.
  • Built Python automation scripts for alert validation and monitoring health checks, improving team efficiency and reducing noise-driven false escalations.
  • Analyzed system logs and cloud deployment patterns to identify recurring failure modes; implemented targeted fixes reducing incident recurrence.
  • Coordinated production readiness reviews for new payment services; improved cross-team onboarding documentation and operational runbooks.

🚀 Personal Projects

Real-Time Observability Platform (Datadog)

  • Built an end-to-end observability stack with custom Datadog dashboards, SLO tracking, log pipelines, and APM traces for a personal microservices environment.
  • Replicated production-grade alerting patterns to validate and refine monitor configurations, achieving a 40% reduction in alert noise.
  • Authored runbooks and incident playbooks as part of an open learning initiative, publicly available at iamdinesh.xyz.

🛠️ Technical Stack

📊 Monitoring & Observability

Datadog Grafana Kibana New Relic Dynatrace Splunk

🎫 Ticketing Systems

JIRA ServiceNow

☁️ Cloud & Infrastructure

AWS Kubernetes Docker Terraform

💻 Programming Languages

Python Java

🗄️ Databases

MySQL PostgreSQL

🔄 CI/CD

Jenkins GitHub Actions Azure DevOps

📋 Practices & Frameworks

SLOs SLIs SLAs ITIL Agile SRE Incident Management Problem Management Change Management Error Budgets

🎯 Java Ecosystem

Spring Boot Spring MVC Spring Data JPA Spring Cloud

🖥️ Operating Systems

Windows Ubuntu


🎓 Education

Master of Business Administration (MBA), JNTU Anantapur (2017 – 2019)

Transitioned into Site Reliability Engineering through self-directed cloud study, hands-on Java/SQL lab work, and professional on-the-job experience.

📜 Certifications

  • 🔄 AWS Certified Solutions Architect – Associate (In Progress; Exam Scheduled 2026)

📬 Contact Me

If you'd like to collaborate, ask a question, or just say hello, feel free to drop a message!

Email Mobile Location

📊 GitHub Stats

[Stats badges: 🏆 Total Contributions, 📂 Languages Used, ⭐ Total Stars]
