AWS Backend and Cloud Infrastructure Engineer
We are looking for a senior AWS backend engineer for a focused MVP delivery engagement. The app and backend exist and are functional. The core problem is in the Lambda orchestration layer: Lambda invocations block synchronously while waiting for SageMaker, and there is no async job ID pattern, no queue, and no retry logic, which causes timeouts at ~4 concurrent users. Your primary job is to fix this. In parallel, you will work on security hardening, observability, IaC, and cost hygiene. SageMaker knowledge is useful context, but the real expertise needed is Lambda architecture and AWS async patterns.
Project Context
The platform is a video-based AI analysis product. Users submit videos through a mobile app; the backend processes them through a machine learning inference pipeline (SageMaker) and returns structured PDF reports. The app and backend are fully in place and functional - this is not a greenfield build. The immediate problem is in the Lambda orchestration layer. Currently, Lambda invocations block synchronously while waiting for SageMaker to complete inference. There is no job ID pattern, no queue, and no retry logic. This causes Lambda timeouts at ~4 concurrent users, making the app unusable under any real load. Importantly, SageMaker itself is working well and handles concurrency and scaling correctly - the bottleneck is entirely upstream of it.
The first task is an architecture review to validate whether the current Lambda + SageMaker structure can be fixed in place, or whether it needs to be restructured. Everything else follows from that assessment. In parallel, security hardening (protection layers) runs as an independent workstream from day one.
What You Will Work On
1 · Security Hardening (P0 - runs in parallel from day one)
• Enforce server-side encryption (SSE-KMS) on all S3 buckets; verify HTTPS/TLS across the full request path
• Conduct a full IAM least-privilege audit across Lambda, SageMaker, S3, and DynamoDB roles
• Validate and harden the API Gateway custom authoriser (ApiKeyValidator + Secrets Manager); ensure key rotation
• Enable AWS WAF on API Gateway with rate limiting and common threat mitigations
• Enable CloudTrail (12-month retention) and GuardDuty for audit trails and threat detection
• Implement GDPR controls: right-to-erasure scripts, user data tagging, DynamoDB TTL, and S3 object expiration
• Audit GitHub repositories for exposed secrets; enforce 2FA and secret scanning
• Verify all resources are deployed in the designated EU region; add guardrails to prevent drift
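For illustration, the SSE-KMS and HTTPS/TLS item could be backed by a bucket policy along the following lines. This is a minimal sketch, not the project's actual policy: the function name `build_bucket_policy` and the bucket name are hypothetical, and the statements use the standard `aws:SecureTransport` and `s3:x-amz-server-side-encryption` condition keys.

```python
import json


def build_bucket_policy(bucket_name):
    """Return a bucket policy that denies non-TLS access and non-KMS uploads.

    bucket_name is a placeholder; the real bucket names are not given in
    this posting.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Reject any request that does not arrive over HTTPS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                # Reject uploads that do not request SSE-KMS explicitly.
                "Sid": "DenyNonKmsUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"{bucket_arn}/*",
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "aws:kms"
                    }
                },
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_bucket_policy("example-video-uploads"), indent=2))
```

Under a policy like this, uploaders would generally need to send the encryption header explicitly, so whether to rely on it or on default bucket encryption alone is a design decision for the audit.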
2 · Concurrency & Async Architecture (P0 - critical path)
• Architecture review first: assess whether the existing Lambda + SageMaker structure is salvageable or needs restructuring before writing a single line of fix code
• Implement the correct async job submission pattern in Lambda: accept request, persist job ID, return ID to caller immediately, terminate Lambda invocation, let SageMaker process in background
• This eliminates Lambda timeouts entirely - SageMaker already handles concurrency and scaling correctly
• Evaluate whether an SQS queue between Lambda and SageMaker is warranted, or if the job ID pattern alone is sufficient for MVP
• Add retry logic, idempotency checks, and request deduplication to prevent duplicate job submissions
• Ensure job status is trackable: client polls or receives a webhook/SNS notification on completion or failure
• Implement parallel processing for multi-video jobs (Lambda fan-out per video/model rather than sequential calls)
• Right-size Lambda memory and timeout settings now that invocations are non-blocking and short-lived
• Validate the fix under load: run concurrent user tests to confirm the ~4-user ceiling is resolved
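As a rough illustration of the job submission pattern and idempotency check described in this section, the sketch below accepts a request, persists a job record, hands the work off, and returns the job ID without waiting for inference. Everything in it is an assumption for illustration: `InMemoryJobStore` stands in for a DynamoDB table with conditional writes, `enqueue` stands in for whichever hand-off the architecture review selects (SQS or a SageMaker async invocation), and the request fields `user_id` and `video_keys` are hypothetical.

```python
import hashlib
import json
import uuid


class InMemoryJobStore:
    """In-memory stand-in for a DynamoDB table (hypothetical)."""

    def __init__(self):
        self.jobs = {}
        self.by_idempotency_key = {}

    def put_if_absent(self, idempotency_key, job):
        # Mimics a conditional write: if the key was already seen,
        # return the existing job instead of creating a new one.
        existing_id = self.by_idempotency_key.get(idempotency_key)
        if existing_id is not None:
            return self.jobs[existing_id], False
        self.jobs[job["job_id"]] = job
        self.by_idempotency_key[idempotency_key] = job["job_id"]
        return job, True


def idempotency_key(user_id, video_keys):
    """Derive a stable key so a retried request maps to the same job."""
    payload = json.dumps({"user": user_id, "videos": sorted(video_keys)})
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def submit_job(event, store, enqueue):
    """Persist a job record, hand off the work, and return immediately."""
    body = json.loads(event["body"])
    key = idempotency_key(body["user_id"], body["video_keys"])
    job = {
        "job_id": str(uuid.uuid4()),
        "status": "Pending",
        "video_keys": body["video_keys"],
    }
    job, created = store.put_if_absent(key, job)
    if created:
        # Hand-off only: the handler never waits for inference to finish.
        enqueue(job)
    return {
        "statusCode": 202,
        "body": json.dumps({"job_id": job["job_id"], "status": job["status"]}),
    }


if __name__ == "__main__":
    store, queued = InMemoryJobStore(), []
    event = {"body": json.dumps({"user_id": "u1", "video_keys": ["a.mp4"]})}
    print(submit_job(event, store, queued.append))
    print(submit_job(event, store, queued.append))  # same job ID, no second enqueue
```

In this sketch a duplicate submission returns the original job ID and is not enqueued again; the client would then poll a status endpoint or wait for a notification, as the list above describes.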
3 · Infrastructure Hygiene & IaC (P0/P1)
• Migrate the manually provisioned stack to Terraform (or CDK): API Gateway, Lambdas, S3, DynamoDB, SageMaker, IAM, CloudWatch
• Set up isolated dev/staging and production environments; parameterise all environment-specific values
• Apply consistent naming conventions and resource tagging (Project, Environment, Owner, ContainsPII)
• Add input validation at the API layer for all metadata fields (licence plate, VIN, video count)
• Ensure atomic metadata creation in DynamoDB with proper status tracking (Pending → Processing → Done / Error)
• Implement DLQ handling for async failures; add orphan-data cleanup Lambda for stale S3 objects
• Configure S3 lifecycle rules and DynamoDB TTL for data minimisation and GDPR compliance
• Decommission unused resources (old test Lambdas, inactive MediaConvert integration, etc.)
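The status tracking item above can be pictured as a small state machine. The sketch below is an assumption about how the four statuses might relate: the posting names the statuses but not the exact transition rules, so the `ALLOWED_TRANSITIONS` table (including a direct Pending-to-Error path) is hypothetical. In DynamoDB, a check of this kind would typically be expressed as a condition on the update rather than in application memory.

```python
# Hypothetical transition table; the statuses come from the list above,
# the allowed moves between them are assumed for illustration.
ALLOWED_TRANSITIONS = {
    "Pending": {"Processing", "Error"},
    "Processing": {"Done", "Error"},
    "Done": set(),
    "Error": set(),
}


class InvalidTransition(Exception):
    """Raised when a job is moved to a status it cannot reach."""


def advance_status(job, new_status):
    """Return a copy of the job with the new status, or raise if not allowed."""
    current = job["status"]
    if new_status not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {new_status} is not allowed")
    return {**job, "status": new_status}


if __name__ == "__main__":
    job = {"job_id": "j1", "status": "Pending"}
    job = advance_status(job, "Processing")
    job = advance_status(job, "Done")
    print(job)
```

Rejecting out-of-order updates in one place is one way to keep a late or retried worker from overwriting a finished job.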
4 · Observability & Alerting (P0/P1)
• Enforce structured logging with correlation IDs (job ID, user ID) across all Lambda functions and SageMaker containers
• Build a CloudWatch dashboard: upload throughput, Lambda duration/errors, SageMaker latency, SQS depth, DynamoDB capacity
• Configure CloudWatch Alarms for Lambda error counts, API Gateway 5XXs, SageMaker queue depth, and concurrency exhaustion
• Enable AWS X-Ray for end-to-end distributed tracing across API Gateway and Lambda
• Enable S3 access logging and API Gateway access logs for security auditing
• Implement a synthetic canary Lambda that runs a scheduled round-trip test and alerts on failure
• Set up DLQs for all async integrations to prevent silent job drops
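To make the structured logging item concrete, here is a minimal sketch using only the Python standard library. The names `JsonFormatter`, `build_logger`, and `with_correlation` are hypothetical, and the field set (level, message, job_id, user_id) is an assumed minimum rather than the project's logging schema.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Correlation IDs are attached by the adapter below.
            "job_id": getattr(record, "job_id", None),
            "user_id": getattr(record, "user_id", None),
        }
        return json.dumps(entry)


def build_logger(stream, name="pipeline"):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def with_correlation(logger, job_id, user_id):
    """Bind job and user IDs so every message carries them."""
    return logging.LoggerAdapter(logger, {"job_id": job_id, "user_id": user_id})


if __name__ == "__main__":
    log = with_correlation(build_logger(sys.stdout), "job-123", "user-9")
    log.info("inference started")
```

Binding the IDs once per invocation means individual log calls cannot forget them, which keeps entries for the same job searchable across functions.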
5 · Cost Optimisation (P1/P2)
• Configure S3 lifecycle transitions/deletion for processed videos (e.g., 7-day retention) and long-term reports
• Review and right-size Lambda memory/timeout settings across all functions
• Ensure SageMaker endpoints scale to zero when idle (Serverless or autoscaling to 0)
• Set AWS Budget alerts and identify spend outliers via Cost Explorer
• Identify and release idle resources (unattached EBS volumes, dev instances, always-on services)
Required Skills & Experience
Must-Have
• 4+ years of hands-on AWS engineering experience in production environments
• Deep expertise in: Lambda, API Gateway, SageMaker (async inference), S3, DynamoDB, SQS, IAM, CloudWatch
• Demonstrated experience debugging Lambda concurrency, timeout, and async invocation issues
• Infrastructure as Code - Terraform or AWS CDK (not just CloudFormation console)
• Ability to import and codify existing manually provisioned resources
• Security engineering on AWS: IAM policy auditing, KMS, WAF, CloudTrail, GuardDuty, Secrets Manager
• Familiarity with GDPR data residency, encryption-at-rest, and right-to-erasure requirements
• Python - production-quality Lambda function development; familiarity with FastAPI a plus
• Observability tooling - CloudWatch Alarms, Dashboards, X-Ray, structured JSON logging
• CI/CD pipelines - GitHub Actions to ECR to SageMaker model deployment
• Ability to work independently on a scoped task list with clear P0/P1/P2 prioritisation
Nice-to-Have
• Experience with SageMaker Serverless Inference or Step Functions for orchestration
• Familiarity with ML inference containers (Docker, ECR, model artefact management)
• Experience with AWS Elemental MediaConvert or video processing pipelines
• Exposure to AWS Savings Plans, Reserved Instances, or FinOps practices
• Prior work on GDPR-compliant SaaS platforms
Ideal Candidate Profile
You are a hands-on AWS backend engineer who immediately recognises the Lambda timeout pattern described above and knows exactly how to fix it. You understand async distributed systems from first principles - job ID patterns, SQS decoupling, idempotency, retry logic, DLQs - and can tell the difference between a Lambda architecture problem and a SageMaker problem without needing to be told. You start every engagement with an architecture review before writing code, and you can accurately scope whether a fix is a config change (SSE-KMS, WAF rule) or a genuine rework (Lambda invocation pattern, IaC migration). You communicate clearly on what you find, especially when the answer is 'we need to restructure this before fixing it.' You treat security and observability as parallel workstreams, not things to address after the main work is done.
Required languages
| English | B2 - Upper Intermediate |
| Ukrainian | B1 - Intermediate |