What you'll do
- Lead design, development, and operation of next-generation infrastructure for AI/ML and HPC workloads at cloud scale.
- Collaborate cross-functionally with software, hardware, and network engineers and with operations teams to ensure high reliability and performance of AWS accelerator servers.
- Decompose complex server system issues into manageable tasks and deliver solutions using a combination of hardware, software, and system design expertise.
- Drive quality and reliability improvements throughout server conception, design, testing, launch, and operations phases.
- Act as a technical leader, using strong organizational and communication skills to influence and deliver scalable, performant software solutions.
What you should know
- Ideal candidates should be innovative self-starters with deep knowledge across the full technical stack from hardware to userland software.
- The role involves working in a fast-paced, collaborative environment with diverse teams across hardware engineering and cloud services.
- Applicants must be comfortable with complex debugging and diagnosing issues in large-scale server systems.
- This position offers opportunities to influence the future of Generative AI infrastructure and cloud computing at massive scale.
- Candidates should be prepared to lead and deliver high-impact, reliable, and scalable solutions in production environments.
About the company
- Amazon Web Services (AWS) is the world’s most comprehensive and broadly adopted cloud platform, serving customers from startups to Global 500 companies.
- AWS fosters an inclusive culture that values diversity, curiosity, and employee-led affinity groups promoting equity and belonging.
- The company emphasizes work-life harmony and flexibility to support employees’ success both professionally and personally.
- AWS is committed to mentorship and continuous career growth, providing resources for knowledge sharing and professional development.
- Amazon is a large, global technology leader known for pioneering cloud computing and continuous innovation in AI and infrastructure.
Key required skills
- Systems development and operations experience in Linux/Unix environments
- Proficiency in programming with modern languages such as C++, Java, Python, or Golang
- Experience in designing and architecting scalable and reliable systems
- Strong knowledge of hardware/software integration and x86 architecture
- Proven ability to lead complex software or infrastructure projects from design through deployment