╔═══════════════════════════════════════════════════════╗
║ AAYUSHI GUPTA · Software Development Engineer ║
║ Backend Systems · Distributed Architecture · AWS ║
╚═══════════════════════════════════════════════════════╝
Backend engineer with 3+ years building production systems across Amazon, Pine Labs, and Niyo — distributed architectures, event-driven pipelines, cross-region migrations, payment infrastructure, and real-time notification systems at scale.
I design APIs, own systems end-to-end, mentor engineers, and ship things that don't break at 3AM.
Engineer engineer = Engineer.builder()
.experience(List.of("Amazon", "Pine Labs", "Niyo"))
.focus("Distributed Systems · Event-Driven Architecture · Cloud Infrastructure")
.languages(List.of("Java", "Go", "TypeScript", "Python"))
.cloud("AWS — ECS, Lambda, SQS/SNS, DynamoDB, OpenSearch, CDK")
.ai(List.of("LangChain", "RAG pipelines", "LLM APIs", "Vector DBs"))
.certifications("AWS Certified")
.dsaSolved(500)
.build();
Real work. Real scale. No toy projects.
Owned end-to-end migration of a stateful ECS service from eu-west-1 → eu-south-2 as part of Amazon's Regional Flex Planning initiative. No prior playbook existed for this in the org — I wrote it.
- Designed the migration architecture from scratch: phased approach — per-tenant S3 cross-region replication + OpenSearch snapshot-restore, followed by gradual WebLab traffic ramp (5% → 25% → 50% → 100%), followed by formal decommission via change management
- Solved the stateful data problem: validated per-tenant S3 object counts and OpenSearch document counts before touching a single traffic percentage — data parity was the go/no-go gate, not a timer
- Separated "zero traffic" from "delete resources" by a full 7-day monitoring window — decommission is irreversible; I treated it accordingly
- Owned the decommission sequence: endpoint removal → compute deletion → data deletion — order matters because running compute without data causes cascading failures
- Coordinated multiple client teams across Alpha and Prod accounts; wrote the runbook so the next engineer doesn't start from zero
Zero downtime · Zero data loss · 1M+ customers · Full org-level playbook created
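The data-parity gate above boils down to a per-tenant count comparison between source and target region. A minimal sketch (the `parityOk` helper and its maps are illustrative, not the internal tooling):

```java
import java.util.Map;

// Illustrative go/no-go gate: traffic ramp proceeds only when every
// tenant's S3 object count / OpenSearch document count matches exactly
// between source and target region.
class ParityGate {
    // Returns true only if both regions report the same tenant set and
    // identical counts per tenant; any mismatch or missing tenant fails.
    static boolean parityOk(Map<String, Long> source, Map<String, Long> target) {
        if (!source.keySet().equals(target.keySet())) return false;
        return source.entrySet().stream()
                .allMatch(e -> e.getValue().equals(target.get(e.getKey())));
    }
}
```

A timer-based gate would pass silently on partial replication; a count-based gate cannot.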
Designed and built a compliance-grade, event-driven customer data deletion system across 3+ microservices. The correctness bar here is non-negotiable — a missed deletion is a compliance violation.
SNS (deletion event) → SQS (buffered queue) → Lambda ─┬→ DynamoDB tables
                              ↓                       └→ External state API
                      DLQ (failure capture + alarm)
- Idempotency by design: deletion state tracked in a DynamoDB tracking table keyed by customerId + requestId. SQS at-least-once delivery means duplicate processing is guaranteed to happen; the system handles it safely
- Solved the distributed transaction gap: DynamoDB delete succeeds but external API call fails → message returns to queue → DynamoDB delete is a no-op on retry → external API gets called again. Chose data-deletion-first ordering deliberately: missing a state update is recoverable; data existing after a deletion request is a compliance violation
- DLQ + CloudWatch alarm ensures no deletion silently fails — every failure is captured, alerted, and replayable after root cause fix
- Chose Lambda over persistent ECS: deletion is bursty and infrequent — Lambda scales to zero and only costs on execution
GDPR right-to-erasure · 100K+ customers · At-least-once safe · Zero silent failures
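The idempotency guard above can be sketched with a concurrent set standing in for the DynamoDB tracking table (class and method names are illustrative):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the idempotency guard. The real system uses a DynamoDB
// tracking table; a concurrent set keyed on customerId + requestId
// stands in for it here.
class DeletionTracker {
    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    // Returns true only for the first delivery of a given
    // (customerId, requestId) pair; SQS redeliveries become no-ops.
    boolean markIfFirst(String customerId, String requestId) {
        return processed.add(customerId + "#" + requestId);
    }
}
```

In DynamoDB terms, `processed.add` corresponds to a `PutItem` with an `attribute_not_exists` condition on the key: the first delivery wins the conditional write, and every redelivery sees the condition fail and skips re-processing.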
Built a multi-region change data capture pipeline to make operational DynamoDB data queryable for analytics without impacting production read capacity.
- Designed schema transformation layer: DDB's typed JSON format ({S: "val"}, {BOOL: true}) → flat TSV rows compatible with the columnar store; column order and type mapping owned entirely by me
- Solved the backfill problem: DDB Streams only captures future writes. For 1.3M existing records (1.7GB), used a temp-table strategy: copy prod data to a temp DDB table with streams enabled, run a parallel DataCraft pipeline into the same Andes destination, then activate the prod stream pipeline. Live writes and backfill converge safely because stream events carry timestamps
- Debugged a production row count mismatch (expected 1255, found 1254) — ruled out data loss by querying Andes directly, identified root cause as a manifest generation bug in pipelines created before a certain DataCraft version flag became default. Fix: recreate pipeline with flag first, then recreate Datashare — order matters because recreating Datashare before fixing the pipeline reads from the same broken manifest
- Deployed across EU, NA, and FE regions with consistent schema
Multi-region · 1.3M records backfilled · Production bug debugged and fixed · Downstream Redshift unblocked
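The transformation layer above can be sketched as a small unwrap-and-flatten step: strip DynamoDB's type wrapper, then emit values in a fixed column order (the column list and helper names are illustrative, not the production pipeline):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the schema transformation: DynamoDB's typed attribute
// format ({"S": "val"}, {"N": "42"}, {"BOOL": true}) flattened into a
// TSV row whose column order matches the columnar store's schema.
class DdbToTsv {
    // Unwraps one typed attribute, e.g. {"S": "alice"} -> "alice".
    static String unwrap(Map<String, Object> typed) {
        return String.valueOf(typed.values().iterator().next());
    }

    // Fixed column order keeps every row aligned with the destination
    // schema; attributes absent from an item become empty cells.
    static String toTsvRow(Map<String, Map<String, Object>> item, List<String> columns) {
        return columns.stream()
                .map(c -> item.containsKey(c) ? unwrap(item.get(c)) : "")
                .collect(Collectors.joining("\t"));
    }
}
```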
Converted synchronous IT workflows that were timing out under peak load into an event-driven async model.
- Identified root cause: synchronous call chains where upstream service waited on downstream completion — under load, downstream slowness caused cascading timeouts across the entire chain
- Redesigned to publish-and-forget: services emit events to SQS, downstream consumers process at their own pace with idempotency guards for at-least-once delivery
- Added DLQ + visibility timeout tuning to prevent message loss under sustained load spikes
- Execution time: 50 minutes → 25 minutes. Production timeouts: eliminated.
Led API design and technical ownership for Pine Labs' multi-gateway payment integration layer, with junior engineers implementing individual gateway integrations under my design.
- Designed the unified gateway abstraction: a single internal API contract that normalized heterogeneous external gateway interfaces — each gateway had different auth schemes, retry semantics, error codes, and idempotency models. Abstraction layer hid all of this from callers
- Owned the resilience contract: defined how retries, timeouts, and idempotency keys worked at the abstraction layer — individual gateway implementations had to conform, not invent their own retry logic
- Led code reviews with a specific focus on failure modes: "what happens if this gateway returns a 200 but the transaction is actually pending?", "how does this handle a network timeout mid-request?" — taught juniors to think in failure paths, not happy paths
- Drove reconciliation flow design for failed or ambiguous transactions — financial systems need a recovery path, not just error logging
5 gateways integrated · Unified abstraction owned · Junior engineers mentored on production-grade error handling
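The abstraction described above reduces to one internal contract that every gateway adapter must conform to. A minimal sketch (interface, enum, and parameter names are illustrative, not Pine Labs' actual API):

```java
// Sketch of a unified gateway contract: one internal interface, one
// normalized status enum, and an idempotency key owned by the
// abstraction layer rather than by individual gateway adapters.
interface PaymentGateway {
    enum Status { SUCCESS, PENDING, FAILED }

    // The idempotency key is supplied by the abstraction layer so that
    // retry semantics stay uniform across all five gateways; adapters
    // must map ambiguous responses (HTTP 200 with a pending body) to
    // PENDING, never SUCCESS.
    Status charge(String idempotencyKey, long amountMinorUnits, String currency);
}
```

Normalizing to PENDING rather than SUCCESS is what makes the reconciliation flow possible: ambiguous transactions get a recovery path instead of a false positive.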
Built a Kafka-based real-time notification pipeline delivering push, SMS, and email alerts to users on transaction events — replacing a batch-based approach that introduced unacceptable delivery delays.
Transaction Event → Kafka Topic → Notification Consumer Service
                                          ├→ Push (FCM/APNs)
                                          ├→ SMS (provider)
                                          └→ Email (provider)
- Designed consumer group configuration for fault-tolerant, ordered processing — partition assignment ensured per-user event ordering was preserved across notification channels
- Handled the fan-out routing problem: a single Kafka message needed to trigger multiple notification channels based on user preferences and event type — built a routing layer inside the consumer that dispatched to the right provider without duplicating event consumption
- Implemented offset commit strategy carefully: committed offsets only after all notification dispatches succeeded — a failed SMS dispatch would not silently drop the message, it would retry from the last committed offset
- Reduced notification delivery latency from batch-cycle delays to near real-time
Kafka · Multi-channel fan-out · Ordered delivery · Fault-tolerant offset management
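The offset-commit discipline above can be modeled without the Kafka client: advance the committed offset only when every channel dispatch succeeds (a sketch; a plain counter stands in for Kafka's consumer offset API):

```java
import java.util.List;
import java.util.function.Predicate;

// Sketch of commit-after-dispatch: the committed offset advances only
// when push, SMS, and email dispatches all succeed for a message.
class NotificationConsumer {
    long committedOffset = -1;

    // Each dispatcher returns true on successful delivery. On any
    // failure the offset is NOT committed, so the message is
    // re-consumed from the last committed offset rather than dropped.
    void process(long offset, String event, List<Predicate<String>> dispatchers) {
        boolean allOk = dispatchers.stream().allMatch(d -> d.test(event));
        if (allOk) {
            committedOffset = offset;
        }
    }
}
```

One consequence of this strategy: a retry re-dispatches channels that already succeeded, so it pairs with per-channel idempotency (e.g. a dispatch-ID dedupe at the provider boundary).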
Built a scheduled reporting pipeline from scratch because it was toil that should not exist. Nobody asked me to — I identified it and eliminated it.
- EventBridge cron → Lambda → DynamoDB scan → in-memory CSV generation (not /tmp: Lambda's /tmp is scoped to the execution environment and can persist across warm invocations, so it needs explicit cleanup; in-memory is faster and has no cleanup cost) → S3 archive + SES email delivery
- Made the CDK stack generic: accepts a LambdaConfig interface so any future scheduled report reuses the same construct; no copy-paste infrastructure
- When the stack was accidentally deleted during the ZAZ migration decommission, I rebuilt it and encoded 4 production lessons directly into CDK: RemovalPolicy.RETAIN on S3 (survives stack deletion), in-memory CSV, generic stack, CloudWatch error alarm (original had zero observability; failures were invisible until someone noticed a missing email)
Toil eliminated · Infrastructure made deletion-proof · Observability added · CDK construct reusable
AI isn't a line on my resume — it's part of how I build and how I work.
Building with AI — flat/flatmates (side project, working prototype):
Stack: Next.js · Java Spring Boot · LLM API
Architecture:
├── Preference intake → structured user profile (Spring Boot)
├── Compatibility scoring via LLM API — prompt engineered for
│ deterministic structured output (JSON), not free-form text
├── RAG layer: user-generated descriptions embedded + retrieved
│ at match time to give LLM relevant context per query
├── Cold-start strategy: new users with no history get rule-based
│ scoring until enough signal exists to switch to LLM scoring
└── Cost control: LLM called only at match-time, not on every
profile update — cached embeddings, selective inference
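The cold-start branch in the tree above is a simple switch: rule-based scoring until a user has accumulated enough signal, then LLM-backed scoring. A minimal sketch (the threshold and both score inputs are illustrative, not the project's actual values):

```java
// Sketch of the cold-start strategy: new users with little history get
// a deterministic rule-based score; once enough signal exists, the
// LLM-derived score takes over.
class MatchScorer {
    // Illustrative threshold; the real cutoff would be tuned.
    static final int MIN_SIGNALS_FOR_LLM = 5;

    static double score(int userSignals, double ruleScore, double llmScore) {
        return userSignals >= MIN_SIGNALS_FOR_LLM ? llmScore : ruleScore;
    }
}
```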
Using AI in daily engineering:
Tools: Cursor · GitHub Copilot
What for: Boilerplate elimination · Test case generation
Debugging hypothesis generation · PR description drafting
What not: Architecture decisions · Production incident RCA
Anything where I need to own the reasoning
Actively studying: LangChain internals · Pinecone / Weaviate · RAG vs fine-tuning decision framework
┌─────────────────┬──────────────────────────────────────────────────┐
│ Languages │ Java · Go · TypeScript · Python · C++ │
│ Cloud (AWS) │ ECS · Lambda · SQS · SNS · DynamoDB · S3 │
│ │ OpenSearch · EventBridge · CDK · CloudWatch │
│ Messaging │ Kafka · SNS/SQS · Event-driven architecture │
│ Databases │ DynamoDB · MongoDB · MySQL · Redis │
│ Frameworks │ Spring Boot · Next.js · React │
│ Infrastructure │ Docker · AWS CDK · CloudFormation │
│ Observability │ CloudWatch · Log Insights · Alarms · DLQ │
│ AI/ML │ LangChain · LLM APIs · RAG · Vector embeddings │
│ Testing │ Cypress · Parallel sharding · Integration tests │
└─────────────────┴──────────────────────────────────────────────────┘
// Principles I've developed from shipping real systems
1. Separate data migration from traffic cutover — never do both simultaneously
2. Decommission is irreversible; treat it differently from deployment
3. Idempotency is not optional when your delivery guarantee is at-least-once
4. A metadata bug is not data corruption — identify which one before alerting anyone
5. Design for the failure path first; the happy path usually works
6. Payment systems fail in creative ways — build reconciliation in, not as an afterthought
7. When forced to rebuild, encode what production taught you directly into the infrastructure
8. Every technical decision should have a stated "what breaks and when"
9. Teach engineers to think in failure modes, not just correct behavior
10. If something is toil, eliminate it — don't document a workaround
$ curl -X GET https://linkedin.com/in/guptaaayushi09
$ curl -X GET https://leetcode.com/code_buddy21
$ echo "aayushi09023@gmail.com"