What you'll do
- Lead design, delivery, and operation of next-generation infrastructure for AI/ML and HPC workloads at cloud scale.
- Collaborate cross-functionally with software, hardware, and network engineers and with operations teams to ensure high reliability and scalability.
- Decompose complex server system problems into manageable tasks and lead their implementation using a combination of hardware and software expertise.
- Drive quality and reliability improvements in AWS accelerated server solutions through system design, testing, and diagnostics.
- Act as a technical leader with strong organizational and communication skills, owning solutions from conception through production.
What you should know
- Ideal candidates are innovative self-starters with deep knowledge across the full technical stack from hardware to userland software.
- The role requires strong skills in systems debugging, diagnosis, and performance optimization in complex server environments.
- Applicants should be comfortable working in a fast-paced, cross-disciplinary team involving engineers, TPMs, and managers.
- This position offers opportunities to impact the future of Generative AI infrastructure at cloud scale.
- Candidates should expect to engage in complex problem-solving and lead efforts to improve system reliability and scalability.
About the company
- Amazon Web Services (AWS) is the world’s most comprehensive and broadly adopted cloud platform, pioneering cloud computing innovation.
- AWS fosters an inclusive culture that values diversity, encourages bold ideas, and supports employee resource groups and learning events.
- The company emphasizes work-life balance and flexibility to support success both at work and at home.
- AWS is committed to mentorship and continuous career growth, providing resources to develop well-rounded professionals.
- Amazon is a large, global technology leader with a strong focus on customer trust, security, and operational excellence.
Key required skills
- 4+ years of professional software development experience with modern languages (C++, Java, Python, Golang, etc.)
- Experience deploying and operating systems in Linux/Unix environments
- Strong background in systems development, design, and operations in IT or data center environments
- Proven ability to design and architect reliable, scalable software and hardware systems
- Experience debugging and validating complex AI/ML and cloud computing servers