What you'll do
- Lead design and delivery of next-generation AWS platforms focused on AI/ML and HPC workloads.
- Own and proactively improve server system reliability, testability, and diagnosability using hardware and software expertise.
- Collaborate cross-functionally with engineers, TPMs, and managers across AWS Hardware Engineering and other AWS services.
- Develop and implement tactical solutions to complex architectural problems impacting cloud-scale AI training and inference.
- Drive continuous price-performance improvements for multi-billion-parameter large language model (LLM) training infrastructure.
What you should know
- This role requires a self-starter mindset with strong organizational and communication skills.
- Candidates should be comfortable working in a fast-paced, growing, and collaborative team environment.
- The role offers direct impact on AWS product improvements and the bottom line through ownership of deliverables.
- Applicants should expect to work onsite in Seattle, collaborating with global teams across disciplines.
- Ideal for those passionate about cloud-scale systems and the intersection of hardware and software in AI infrastructure.
About the company
- Amazon Web Services (AWS) is the world’s most comprehensive and broadly adopted cloud platform with a focus on innovation.
- AWS Hardware Engineering emphasizes frugality, operational excellence, and architectural soundness in server design.
- The company values an inclusive culture with employee-led affinity groups and diversity initiatives like CORE and AmazeCon.
- AWS supports a work-life harmony culture promoting flexibility and employee well-being.
- Amazon is a large, global technology leader with a strong commitment to mentorship, career growth, and diverse experiences.
Key required skills
- Strong knowledge of systems engineering fundamentals including networking, storage, and operating systems.
- Proficiency in at least one modern programming language such as C++, Python, Java, Golang, or PowerShell.
- Experience with designing or architecting scalable and reliable systems.
- Hands-on experience with server hardware and software stack debugging and diagnostics.
- Familiarity with Agile methodologies and Scrum preferred.