AMD · Bellevue, United States · onsite

Principal AI Systems Engineer – GPU Reliability & Customer Debug

$237,200 - $355,800 · Posted 12 days ago
Ruby optional · Mid-Senior

About the role

What you'll do

  • Lead full-stack debug of AI infrastructure focusing on Reliability, Availability, and Serviceability (RAS) features on AMD GPU platforms.
  • Drive root cause analysis and resolution of customer issues using internal debug tools and hardware setups in data center environments.
  • Provide technical leadership and mentorship to cross-functional teams, fostering a culture of knowledge sharing and collaboration.
  • Collaborate with application and infrastructure architects to define and deliver technical architecture and ensure system operability.
  • Communicate and document system bring-up, boot-up, and initialization flows, and lead technical presentations to stakeholders.

What you should know

  • This role requires hands-on experience with hardware debugging and a deep understanding of SoC architecture and server GPU platforms.
  • Candidates should be comfortable working onsite in Bellevue, WA, with direct customer interaction and cross-team collaboration.
  • Applicants will face the challenge of resolving complex multi-layer system issues involving hardware, firmware, and software.
  • The position offers opportunities to lead technical teams, influence architecture decisions, and work on cutting-edge AI infrastructure.
  • Strong communication skills are essential for customer-facing problem solving and coordinating with diverse technical teams.

About the company

  • AMD is a leading semiconductor company focused on next-generation computing including AI, data centers, gaming, and embedded systems.
  • The company culture emphasizes innovation, collaboration, humility, and inclusivity with a passion for solving important global challenges.
  • AMD values execution excellence and encourages bold ideas and diverse perspectives to drive progress.
  • It is a large, established tech company with a global presence and a strong commitment to equal opportunity employment.
  • AMD actively integrates AI technologies responsibly in its hiring and operational processes, reflecting a forward-thinking approach.

Key required skills

  • SoC architecture
  • RAS debugging
  • Python
  • C
  • C++
  • PCIe (protocol)
  • Firmware
  • Hardware debugging
  • Git (version control)

Summary generated from the original posting.