Site Reliability Engineer

23 hours ago


Taipei City, Taiwan GMI Cloud Full time NT$900,000 - NT$1,200,000 per year

Role Overview

We are seeking a skilled Site Reliability Engineer to join the GMI Global Infrastructure team. This hands-on role is critical to ensuring the stability, efficiency, and reliability of the large-scale, high-performance AI/ML clusters in our data center. The ideal candidate will bring expertise in system-level troubleshooting, AI cluster maintenance, and operational excellence to keep our infrastructure performing at its best. Experience with large-scale infrastructure automation is a strong plus.

Responsibilities

  1. Design, implement and maintain scalable AI/ML infrastructure solutions.
  2. Proactively monitor GPU cluster health and performance, and troubleshoot issues across compute, accelerator, and storage systems.
  3. Automate deployment, configuration and management of infrastructure resources.
  4. Manage GPU node lifecycle workflows, including provisioning, scaling, maintenance, upgrades and decommissioning.
  5. Implement CI/CD pipelines for infrastructure deployment and orchestration.
  6. Ensure security, compliance and best practices across infrastructure.
  7. Manage incident response for infrastructure resources (GPU, CPU, storage, network).
  8. Handle customer provisioning requests for GPU resources, including onboarding, configuration and troubleshooting; resolve customer service requests related to GPU infrastructure, ensuring high customer satisfaction.
  9. Stay current with emerging GPU hardware and software technologies, integrating improvements as appropriate.
  10. Travel regionally and internationally to GMI data center locations.

Qualifications

  1. Bachelor's degree in Computer Science or related field.
  2. 3+ years of experience in data center operations, infrastructure, or systems engineering.
  3. Proven experience in site reliability engineering and infrastructure automation (e.g. Ansible, Terraform).
  4. Familiarity with container orchestration platforms (e.g. Kubernetes, NVIDIA GPU Operator, NVIDIA Network Operator, CNI, CSI) and job scheduling systems (e.g. Slurm).
  5. Familiarity with Linux system administration and scripting (Python, Bash).
  6. Familiarity with logging and monitoring tools such as Prometheus, Grafana, Loki.
  7. Good knowledge of GPU architecture, NVIDIA CUDA, NCCL, or related AI/ML frameworks is an added advantage.
  8. Strong troubleshooting skills and ability to analyze system logs and performance metrics.
  9. Excellent communication and teamwork abilities.

Meeting every qualification is not required. If you're excited about this role, we'd love to hear from you. We believe diverse perspectives and experiences strengthen our team.


  • Taipei, Taiwan BTSE Full time NT$900,000 - NT$1,200,000 per year

    About BTSE: 彼特思方舟 is a specialized service provider dedicated to delivering a full spectrum of front-office and back-office support solutions, each tailored to the unique needs of global financial technology firms. 彼特思方舟 is engaged by BTSE Group to offer several key positions, enabling the delivery of cutting-edge technology...


  • Taipei, Taipei City, Taiwan Ubiquiti Inc. Full time $90,000 - $120,000 per year

    About Ubiquiti: At Ubiquiti Inc., we create technology platforms for Businesses, Smart Homes, and Internet Service Providers, driven by our goal to connect everyone, everywhere. To date, Ubiquiti has shipped over 100 million devices worldwide, from ISP networking products to the next generation of IT solutions. Our growth is made possible by the dedicated team of...


  • Bade District, Taoyuan City, Taiwan Supermicro Full time NT$280,000 - NT$1,440,000 per year

    Job Req ID: 26896. About Supermicro: Supermicro is a Top Tier provider of advanced server, storage, and networking solutions for Data Center, Cloud Computing, Enterprise IT, Hadoop/Big Data, Hyperscale, HPC and IoT/Embedded customers worldwide. We are the #5 fastest growing company among the Silicon Valley Top 50 technology firms. Our unprecedented global...


  • Taipei City, Taiwan Snap Inc. Full time $100,000 - $150,000 per year

    Snap Inc is a technology company. We believe the camera presents the greatest opportunity to improve the way people live and communicate. Snap contributes to human progress by empowering people to express themselves, live in the moment, learn about the world, and have fun together. The Company's three core products are Snapchat, a visual messaging app that...


  • Hsinchu City, Taiwan Ayar Labs Full time NT$120,000 - NT$180,000 per year

    Principal Engineer - Photonics Reliability. Location: Hsinchu (on-site). We are seeking an experienced Photonics Reliability Engineer to join our dynamic team. In this role, you will be responsible for reliability engineering, conducting failure analysis investigations, identifying root causes, and implementing corrective actions. You will collaborate closely...


  • Banqiao District, New Taipei City, Taiwan Google Full time NT$900,000 - NT$1,200,000 per year

    Google welcomes people with disabilities. Minimum qualifications: Bachelor's degree in Reliability Engineering, Electrical Engineering, Mechanical Engineering, or a relevant Engineering field, or equivalent practical experience. 3 years of experience in lab testing or related experience with hardware evaluation methodologies, including lab measurements,...


  • Zhubei City, Taiwan Nordic Semiconductor Full time NT$1,200,000 - NT$2,400,000 per year

    About the job: Nordic Semiconductor seeks an experienced Product Reliability Engineer to join our Product Qualification Team in Taiwan. In this role, you will execute product-level qualification and reliability testing programs, support failure analysis efforts, and act as the main point of contact for foundry suppliers, wafer manufacturers, OSATs, and other...


  • Zhubei City, Taiwan Nordic Semiconductor Full time $60,000 - $120,000 per year

    About the job: Nordic Semiconductor seeks an experienced Product Reliability Engineer to join our Product Qualification Team in Taiwan. In this role, you will execute product-level qualification and reliability testing programs, support failure analysis efforts, and act as the main point of contact for foundry suppliers, wafer manufacturers, OSATs, and other...


  • Taichung City, Taichung City, Taiwan Micron Technology Full time NT$1,200,000 - NT$2,400,000 per year

    Our vision is to transform how the world uses information to enrich life for all. Micron Technology is a world leader in innovating memory and storage solutions that accelerate the transformation of information into intelligence, inspiring the world to learn, communicate and advance faster than ever. Our vision is to transform how the world uses information to...


  • Hsinchu City, Taiwan Arm Full time NT$900,000 - NT$1,200,000 per year

    Date posted: 19/09/2025. Location: Hsinchu, Taiwan; Taipei, Taiwan. Category: Quality. Job Overview: As a Staff Quality and Reliability Engineer at ARM, you will ensure the high standards and reliability of our products. Your responsibilities include developing quality control processes, conducting reliability tests, and performing failure analysis. You will...