What you'll do
- Lead full-stack debugging of AI infrastructure, focusing on Reliability, Availability, and Serviceability (RAS) features on AMD GPU platforms.
- Drive root cause analysis and resolution of customer issues using internal debug tools and hardware probing in data center environments.
- Provide technical leadership and mentorship to cross-functional debug teams and collaborate with architects on system design and operability.
- Communicate and document system bring-up, boot-up, and initialization flows, and deliver technical presentations to stakeholders.
- Apply hands-on expertise in SoC architecture, server CPU/GPU microcode, and PCIe protocols to troubleshoot complex hardware/software issues.
What you should know
- This role is onsite in Oregon and requires hands-on experience with hardware in data center environments.
- Applicants should be prepared for a customer-facing, technical leadership role involving complex debugging and cross-team collaboration.
- Strong communication skills are essential for working with multiple stakeholders and resolving critical issues under pressure.
- Candidates will benefit from experience with server architectures, firmware, microcode, and accelerator software workflows.
- This position offers opportunities to work on cutting-edge AI GPU platforms and influence product reliability and performance at scale.
About the company
- AMD is a leading semiconductor company focused on next-generation computing, including AI, data centers, gaming, and embedded systems.
- The company culture emphasizes innovation, collaboration, humility, and inclusivity with a passion for solving important global challenges.
- AMD values execution excellence and diverse perspectives, fostering an environment where bold ideas and human ingenuity thrive.
- As a large, established player in the technology and semiconductor industry, AMD drives advancements in AI and computing hardware.
- AMD is committed to equal opportunity employment and inclusive hiring practices, supporting applicants throughout the recruitment process.
Key required skills
- SoC architecture
- RAS debugging
- Python
- C
- C++
- PCIe protocol
- Firmware
- Microcode
- Debug tools
- Hardware troubleshooting