AMD·Oregon, United States·onsite

Principal AI Systems Engineer – GPU Reliability & Customer Debug

$204,000 - $306,000 · Posted 15 days ago
Ruby optional · Mid-Senior

About the role

What you'll do

  • Lead full-stack debug of AI infrastructure focusing on Reliability, Availability, and Serviceability (RAS) features on AMD GPU platforms.
  • Drive root cause analysis and resolution of customer issues using internal debug tools and hardware probing in data center environments.
  • Provide technical leadership and mentorship to cross-functional debug teams and collaborate with architects on system design and operability.
  • Communicate and document system bring-up, boot-up, and initialization flows, and deliver technical presentations to stakeholders.
  • Apply hands-on expertise in SoC architecture, server CPU/GPU microcode, and PCIe protocols to troubleshoot complex hardware/software issues.

What you should know

  • This role is onsite in Oregon and requires hands-on experience with hardware in data center environments.
  • Applicants should be prepared for a customer-facing, technical leadership role involving complex debugging and cross-team collaboration.
  • Strong communication skills are essential for working with multiple stakeholders and resolving critical issues under pressure.
  • Candidates will benefit from experience with server architectures, firmware, microcode, and accelerator software workflows.
  • This position offers opportunities to work on cutting-edge AI GPU platforms and influence product reliability and performance at scale.

About the company

  • AMD is a leading semiconductor company focused on next-generation computing including AI, data centers, gaming, and embedded systems.
  • The company culture emphasizes innovation, collaboration, humility, and inclusivity with a passion for solving important global challenges.
  • AMD values execution excellence and diverse perspectives, fostering an environment where bold ideas and human ingenuity thrive.
  • As a large, established player in the technology and semiconductor industry, AMD drives advancements in AI and computing hardware.
  • AMD is committed to equal opportunity employment and inclusive hiring practices, supporting applicants throughout the recruitment process.

Key required skills

SoC architecture · RAS debugging · Python · C · C++ · PCIe protocol · Firmware · Microcode · Debug tools · Hardware troubleshooting

Summary generated from the original posting.