What you'll do
- Lead full-stack debugging of AI infrastructure, focusing on Reliability, Availability, and Serviceability (RAS) features on AMD GPU platforms.
- Drive root cause analysis and resolution of customer issues using internal debug tools and hardware setups in data center environments.
- Provide technical leadership and mentorship to cross-functional teams, fostering a culture of knowledge sharing and collaboration.
- Collaborate with application and infrastructure architects to define and deliver technical architecture and ensure system operability.
- Communicate and document system bring-up, boot-up, and initialization flows, and lead technical presentations to stakeholders.
What you should know
- This role requires hands-on experience with hardware debugging and a deep understanding of SoC architecture and server GPU platforms.
- Candidates should be comfortable working onsite in Bellevue, WA, with direct customer interaction and cross-team collaboration.
- Applicants will face the challenge of resolving complex multi-layer system issues involving hardware, firmware, and software.
- The position offers opportunities to lead technical teams, influence architecture decisions, and work on cutting-edge AI infrastructure.
- Strong communication skills are essential for customer-facing problem solving and coordinating with diverse technical teams.
About the company
- AMD is a leading semiconductor company focused on next-generation computing including AI, data centers, gaming, and embedded systems.
- The company culture emphasizes innovation, collaboration, humility, and inclusivity with a passion for solving important global challenges.
- AMD values execution excellence and encourages bold ideas and diverse perspectives to drive progress.
- It is a large, established tech company with a global presence and a strong commitment to equal opportunity employment.
- AMD actively integrates AI technologies responsibly in its hiring and operational processes, reflecting a forward-thinking approach.
Key required skills
- SoC architecture
- RAS debugging
- Python
- C
- C++
- PCIe (protocol)
- Firmware
- Hardware debugging
- Git (version control)