Site Reliability Engineer
23 hours ago
Role Overview
We are seeking a skilled Site Reliability Engineer to join the GMI Global Infrastructure team. This role is hands-on and critical to ensuring the stability, efficiency, and reliability of the large-scale, high-performance AI/ML clusters in our data center. The ideal candidate will bring expertise in system-level troubleshooting, AI cluster maintenance, and operational excellence to ensure maximum performance for our infrastructure. Experience with large-scale infrastructure automation is considered a strong plus.
Responsibilities
- Design, implement and maintain scalable AI/ML infrastructure solutions.
- Proactively monitor GPU cluster health and performance, and troubleshoot issues across compute, accelerator, and storage systems.
- Automate deployment, configuration and management of infrastructure resources.
- Manage GPU node lifecycle workflows, including provisioning, scaling, maintenance, upgrades, and decommissioning.
- Implement CI/CD pipelines for infrastructure deployment and orchestration.
- Ensure security, compliance and best practices across infrastructure.
- Manage incident response for infrastructure resources (GPU, CPU, storage, network).
- Handle customer provisioning requests for GPU resources, including onboarding, configuration and troubleshooting; resolve customer service requests related to GPU infrastructure, ensuring high customer satisfaction.
- Stay current with emerging GPU hardware and software technologies, integrating improvements as appropriate.
- Travel regionally and internationally to GMI data center locations.
Qualifications
- Bachelor's degree in Computer Science or related field.
- 3+ years of experience in data center operations, infrastructure, or systems engineering.
- Proven experience in site reliability engineering and infrastructure automation (e.g. Ansible, Terraform).
- Familiarity with container orchestration platforms (e.g. Kubernetes, NVIDIA GPU Operator, NVIDIA Network Operator, CNI, CSI) and job scheduling systems (e.g. Slurm).
- Familiarity with Linux system administration and scripting (Python, Bash).
- Familiarity with logging and monitoring tools such as Prometheus, Grafana, Loki.
- Good knowledge of GPU architecture, NVIDIA CUDA, NCCL, or related AI/ML frameworks is an added advantage.
- Strong troubleshooting skills and ability to analyze system logs and performance metrics.
- Excellent communication and teamwork abilities.
- Meeting every qualification is not required—if you're excited about this role, we'd love to hear from you. We believe diverse perspectives and experiences strengthen our team.